In our previous post, we explained how SQL window functions work by example. We started with some very simple, basic functions. Let’s extend it by explaining subclauses in window functions.
SQL window functions are a great way to compute results from a set of rows rather than a single row. As you know from our first article, the “window” in window function refers to the set of rows. We showed you some examples of simple window functions like
LEAD. Today, we’re going to use racecars to show you how to use the
PARTITION BY and
ORDER BY subclauses in window functions.
That’s right, we said racecars. Everyone, start your SQL engines!
Window functions can only be used in the
ORDER BY clauses of a SQL query. In a window function, all the rows returned by the
SELECT are divided into windows (or frames). The criteria for deciding which rows belong to which window is defined by putting a
PARTITION BY subclause inside the window function invocation. The window function then computes one result based on all the rows in one window.
Let’s see an example.
The following table is a subset of Formula 1 racing results for 2016 and 2017.
Suppose we want to create a report with the pilots’ names, their time in every race (where they finished), the average time for that race, and the historical average time for that circuit. The next query will get this data for us:
SELECT pilot_name, circuit_name, year, time, AVG(time) OVER (PARTITION BY circuit_name) avg_this_circuit, AVG(time) OVER (PARTITION BY circuit_name,year) avg_this_race FROM races WHERE finish = true ORDER BY year desc, circuit_name, time
To understand how the
avg_this_circuit column was obtained, let’s look at what PARTITION BY did. It generated two windows: one for Monte Carlo and the other for Monza. Then the average was calculated based on all the rows in each set. Note that there are only two different values in the
avg_this_race column was generated in a similar way, except that the
PARTITION BY divided the data into four windows: (“Montecarlo”, 2017) (“Monza”, 2017), (“Montecarlo”, 2016) and (“Monza”, 2016).
Below, you can see the complete frames for each circuit and for each circuit-year:
First Shall Also Be Last?
For some window functions, it is important to order the rows within the window. You can do this using the
ORDER BY subclause.
Let’s see how this works. First, we’ll execute a query to obtain every pilot name, their time, their position in the race (via the
RANK function), the time of the race winner (via the
FIRST_VALUE function) and the delta time between this pilot and the winner. Here is the query:
SELECT pilot_name, circuit_name, year, time pilot_time, RANK() OVER (PARTITION BY circuit_name, year ORDER BY time) as position, FIRST_VALUE(time) OVER (PARTITION BY circuit_name, year ORDER BY time) winner_time, time - FIRST_VALUE(time) OVER (PARTITION BY circuit_name, year ORDER BY time) difference_time FROM races WHERE finish = true ORDER BY year desc, circuit_name, time
Suppose we now want to add the winner’s name and the name of the last pilot to finish the race. This is easy: we just add two expressions to the query. (Note: We’ve shown these expressions in bold text).
SELECT pilot_name, circuit_name, year, time pilot_time, RANK() OVER (PARTITION BY circuit_name, year ORDER BY time) as position, FIRST_VALUE(pilot_name) OVER (PARTITION BY circuit_name, year ORDER BY time) winner_pilot, LAST_VALUE(pilot_name) OVER (PARTITION BY circuit_name, year ORDER BY time) last_pilot FROM races WHERE finish = true ORDER BY year desc, circuit_name, time
And here are the results:
If we analyze the query results, we will find that the winner column is correct, but the
last_pilot column is not. In fact, they show the same name: Vatel. How did this happen?
Remember, the window function is calculated over a set of rows. It is dynamic, changing with every row returned by the
SELECT. At every moment when the query is working, there is a current row in the result set and the window is formed by all the rows in the partition except any rows yet to be returned. When we calculate the
last_pilot column, we are taking the last value in the window, which corresponds to the current row being added to the frame.
The following image visualizes this scenario:
Fixing the Problem
The previous issue with
last_pilot column is due to the
NTH_VALUE functions. These consider only the rows within the window frame, which only contains rows up to the current row. Thus,
LAST_VALUE does not return the correct value. In order to fix this, we need to add a frame specification (
ROWS) to the
We can select a different range in the partition by using another subclause. Some of the most important options are:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (default)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
We would add this subclause to the ORDER BY clause. Let’s see how the first option works:
SELECT pilot_name, circuit_name, year, time pilot_time, rank() OVER (PARTITION BY circuit_name, year ORDER BY time) as position, FIRST_VALUE(pilot_name) OVER (PARTITION BY circuit_name, year ORDER BY time) winner_pilot, LAST_VALUE(pilot_name) OVER (PARTITION BY circuit_name, year ORDER BY time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) last_pilot FROM races WHERE finish = true ORDER BY year desc, circuit_name, time
Learning More About SQL Window Functions
This was just a short example of how we can use subclauses to power up SQL window functions. These might be one of the least-known parts of SQL, but they are powerful, flexible, and worth spending time to master. You can learn more about SQL window functions on our blog, or sign up for the Vertabelo Academy course to really dig into the subject!