Scheduling lag (or, “Fun With Recursive CTEs!”)

Hey, who wants to learn about recursive common table expressions? Today we’re going to look at a scenario where one of our practices wants to know how far out they’re scheduling appointments. For example, the user wants to know how many appointments have a lag of at least thirty days between being scheduled, and taking place. A recursive common table expression can get us there quickly and easily.

In our example, we want to see the lag between scheduling and appointment date starting with zero days and then incrementing in steps of five. In other words, first we want to see how many appointments were scheduled out zero days or more; obviously that’s going to be all of them. Then we want to know how many were five days or more, and then ten days, then fifteen days, and so on. We’re going to accomplish this by using one of the more powerful, yet overlooked, tools in the SQL arsenal: the recursive common table expression (CTE).

First let’s look at the basics of recursive CTEs. Actually, maybe we should start by looking at regular CTEs. Skip down a bit if you’re already familiar with this.

A CTE is just a subquery or derived table, but with a few superpowers. Like other subqueries, it exists within the context of a single query and then it’s gone. You can’t refer to it later in the script like you can with a table variable, for example. But unlike a regular subquery, you can refer to it in subsequent CTEs. Let’s look at a couple of examples before we get too far ahead of ourselves. We’ll start with the most useless CTE you’ll ever see, but at least it’s illustrative…

This example pulls all observation data (from OBS) for all patients with the last name Smith. The opening block sets up the CTE, then the bottom block is the actual query we’re running. This is the exact equivalent of…

Which is also useless but again, we’re only trying to illustrate what a CTE does. If I wanted to get really silly, I could use two CTEs, the second of which references the first…

Now we’re getting all observation data for every patient named David Smith. Clearly there are better ways of doing this, but now we know how to do it with CTEs, and that’s the point.

One quick note: You may notice that I stuck a semicolon before the WITH on the first CTE. This isn’t strictly necessary, but it’s a good habit to get into. If there is a statement in your batch that precedes your WITH, that statement must be terminated with a semicolon. If it isn’t, you’ll get an error. Since we don’t always remember to terminate our lines (or at least I don’t), you’ll save yourself a lot of headaches if you simply get into the habit of prefacing your WITH statements with a semicolon. Now you don’t have to worry about it.

Okay, now you know your CTE basics. Let’s get advanced and look at making them recursive.

The word “recursive” means “self-referential”. The CTE refers not only to tables in our database (and possibly to previous CTEs), it also refers to itself, like a self-join on steroids. This results in a looping behavior (called “iteration”) that’s going to come in very handy. Let’s start simple…

You can probably tell we didn’t really need a CTE for this, but let’s roll with it for now. We’re pulling all future non-canceled appointments for facility 665. Along with the AppointmentsId field, we’re also taking the difference in days between the date the appointment was scheduled ( Created) and the date of the appointment itself ( ApptStart), and called that Lag. We’ve also included a hard-coded zero and called it LagGroup. What this means is that every appointment returned by this query has a zero-days-or-greater lag between being scheduled and taking place. And like we said before, that describes all of the appointments.

Now let’s get recursive:

At the top of the CTE is our original query. This is called the anchor statement. The bottom half, which we’ve added by way of a UNION ALL, references the CTE itself. We call this the recursive statement, and it acts upon the most recent output of the CTE itself. Let’s take this in steps and see how it works:

  1. The anchor statement returns all future appointments for this facility, along with their Lag value. The zero LagGroup value indicates that all of these appointments have a zero-or-greater Lag value.
  2. The recursive statement acts on the results of step 1. All of those results have a LagGroup value of zero. So based on our WHERE clause, we’re looking for any records with a Lag value of five or greater. This result set gets a LagGroup value of LagGroup + 5, which in this case is five.
  3. The recursive statement runs again, this time acting on the results of step 2. Now LagGroup + 5 is ten, so it returns any records where Lag >= 10, and assigns them a LagGroup value of ten.
  4. The recursive statement runs again, acting on the results of step 3. Now LagGroup + 5 is fifteen, so it returns any records where Lag >= 15, and assigns them a LagGroup value of fifteen.
  5. This iterative process continues, with the recursive statement acting upon the previous result set, until the recursive statement returns zero records. In other words, LagGroup + 5 reaches a limit where there are no appointments with a greater Lag value. At this point the CTE terminates and returns its results to the main query.

So what do we end up with? First, our result set includes a list of all future appointments, with a LagGroup of zero. It also includes a list of future appointments with a Lag of five days or more, which all have a LagGroup of five. It includes a list of appointments with a lag of ten days or more, which all have a LagGroup of ten. This continues for as many LagGroup values as are needed.

What can we do with it? Well, we can group on LagGroup and get a count of appointments for each group. We can tell the user how many appointments have a lag of 30 days or more, for example. We can compare this count to the total appointment count ( LagGroup = 0, remember?) and come up with a percentage for each group…

This gives us the following (partial) result set:

LagGroup Appointments ApptPercentage
0 2103 100
5 2077 98.7636709462672
10 1958 93.1050879695673
15 1853 88.112220637185
20 1802 85.6871136471707
25 1681 79.9334284355682
30 1451 68.9966714217784
35 1338 63.623395149786

By the way, what if we didn’t want the “zero” group, the one that contains every appointment? Maybe we’re only interested in appointments that are scheduled thirty days out or more? Simple, just edit the anchor statement like so…

One quick caveat: The recursive CTE has a built-in safety feature called MAXRECURSION. This limits the number of times the recursive statement will iterate, thereby preventing infinite loops, or scenarios where the CTE runs for an excessive amount of time. The default value for MAXRECURSION is 100; if the CTE iterates more than 100 times, it throws an error. But you can override that with a hint in the main query:

This example allows the recursive statement to run up to 200 times. If it exceeds 200, you’ll get the error. Be careful with this option. If the CTE is running that many times, it’s possible you didn’t code the query so that the recursive statement will eventually return an empty result set.

I hope this post was useful to you in some way. Recursive CTEs don’t have a wide array of uses, but when they’re applicable, they’re extremely powerful.

Leave a Reply