Finding duplicate SSNs (or other things)

I attended the Intermediate PM Reports class in Boston this week, and met some great people (learned a lot, too!). Everyone came with their own list of problems they wanted to solve. One of my classmates wanted to identify duplicate SSNs in the patient data. While she left with a working solution, I think we could take it one more step and learn a bit more about SQL in the process.

It occurs to me that I tend to assume a relatively high level of SQL knowledge when writing for this blog, and perhaps I should slow things down a bit and try to address a wider range of skill levels. Better to be inclusive and help more people, right? So in this post we’re going to learn about subqueries, and we’re going to look at two different ways of doing them.

What is a subquery? Well, if you’ve worked with SQL at all, you know what a query is. SELECT these items FROM these tables WHERE these conditions are true. Very basic.

A subquery is a query within a query, and more often than not, we’ll use it just like we would a table. That’s why the two different variations we’re going to look at are called Derived Tables and Common Table Expressions. I’m more a fan of the second, but we’re going to start with the first because it’s just easier to take them in this order.

For the purposes of what we’re trying to accomplish here, wouldn’t it be great if there was a table that held all the SSNs that appeared multiple times in the PatientProfile table? Of course there’s no reason a table like that would exist, but we can dream! Or…we can make it happen, sort of.

Let’s think about what a query that returned that information would look like:

This is a grouping query. The GROUP BY clause says I want one record for every SSN. And so the first thing I’m listing in my SELECT list is SSN. After that, everything else I ask for has to be an aggregate… either a SUM() or a MAX() or a MIN()… or in this case, a COUNT(). And the thing I’m counting is PatientProfileId. I selected this field arbitrarily; I could have used any field in PatientProfile. I could have used COUNT(PId) or COUNT(*), I could even have said COUNT(1). All of these would give me the same result. I’m only using PatientProfileId for readability purposes, giving myself a little reminder as to what it is I’m counting. If I look at this query next week, I’m going to understand immediately that I’m counting patients. Plus, I’ve given it an alias of “How Many Patients”, which helps too.

A key element here is the HAVING clause. Ready for an analogy? HAVING is to GROUP BY as WHERE is to SELECT. When you use the WHERE clause, you’re telling SQL to SELECT these columns, but only from records WHERE a certain condition is true. Now you’re telling SQL to take those records and GROUP BY SSN, but only give me the groups HAVING a certain aggregate value. In this case, you only want the groups HAVING a COUNT(PatientProfileId) greater than 1. We want these because we’re looking for SSNs that appear on more than one patient’s record, get it? We don’t need to see the SSNs that appear on only one patient.

So let’s run that and see what we get:

Hmmm. Well, I see one thing I don’t like. I have over 81,000 patients who don’t have a SSN in the system. You may or may not see that depending on whether your organization requires SSNs. Clearly mine does not, but I don’t want to include those in my project here. They’re irrelevant to me. So I’m going to make a little tweak to my query:

Much better.

So if you recall, we were fantasizing about how great it would be if this was a table. If this was a table full of all the duplicate SSNs, we could do an INNER JOIN with PatientProfile ON PatientProfile.PatientProfileID = [this non-existent table].PatientProfileId. That would let us return all the information we wanted on patients who had duplicate SSNs. Well, just because it’s not a table doesn’t mean we can’t pretend…

Do you see how we joined PatientProfile to our query, rather than to a table? What we created there is called a Derived Table. It works like a table, but we derived it from a query. Now I have the demographic data from PatientProfile for every patient who has a duplicate SSN. And by ordering the results by SSN, the matching results are grouped together. Easy? Easy.

I want to take a moment to talk about INNER JOIN vs. LEFT JOIN. If you’re experienced with SQL you can skip this paragraph. A LEFT JOIN will give you everything from Table A, whether or not there are any matching records in Table B. Nothing in Table A will be filtered out. An INNER JOIN will give you only those records from Table A where there is at least one matching record in Table B. Any Table A records that do not have a match will be filtered out. So what we’re really doing with our Derived Table is we’re using it as a filter. There’s no useful information in it apart from the list of SSNs that have duplicate patients. By INNER JOINing that Derived Table to PatientProfile, we’re telling SQL to give us demographics only on those patients who have one of those SSNs. An INNER JOIN acts as a filter – that’s what I want you to take away from this.

Two more quick notes before we move on: One, I removed the COUNT() from the Derived Table’s SELECT statement because we really don’t need it here. We don’t care about how many there are, only that there are more than one, and the HAVING clause takes care of that. Two, I made sure to alias PatientProfile differently in the derived table than in the main query. This is because SQL can easily get confused when you have the same table inside and outside of a derived table.

Okay, so that works, but I think that’s a bit difficult to read. Just as you’re in the middle of figuring out what your query is doing, you have to stop and figure out what a completely different query is doing. I would rather get my head around the subquery first, then see how it fits into the larger query. So I’m going to use a Common Table Expression, or CTE. Let me throw it out there and see if you can figure out what it does before I walk you through it:

By setting up our Common Table Expression first, we can understand what it does before we get into the meat of the main query. Either of these approaches will work, and in many cases you won’t see a significant performance difference between them. However, if you code it one way and it takes a long time to run, try it the other way and see if it makes a difference. You never know. And if neither gives you the performance you need, you may need to consider Table Variables or Temporary Tables, which are a topic for a different post.

You should be able to apply this logic to any kind of duplicates you want to look for. Just swap out the relevant fields and tables as needed. And if you need any help, don’t be afraid to ask!

Leave a Reply