Positional SQL window functions deal with data’s location in the set. In this post, we explain LEAD, LAG, and other positional functions.
SQL window functions allow us to aggregate data while still using individual row values. We’ve already dealt with ranking functions and the use of partitions. In this post, we’ll examine positional window functions, which are extremely helpful in reporting and summarizing data. Specifically, we’ll look at
It is worthwhile mentioning that
LAST_VALUE. In some ways, these pairs can be used interchangeably.
We’re getting into a more complex area of SQL window functions now. As we go through this post, you’ll notice that some problems are solved more easily using window functions (as compared to using nested complex subquery combinations). Don’t worry if you feel a little bit overwhelmed at first. The topic isn’t easy and it takes time to master it. Just keep learning and practicing!
Introducing SQL’s Positional Windows Functions
Once again, we’ll be working with the
sales table. This table tracks basic sales data using the following columns:
id” – The primary key of the table.
salesman_id– The ID of the person who made the sale.
sales_item– The ID of the item sold.
sales_num– The number of items sold.
sales_price– The price per individual item.
datetime– The date of the sale.
Now, let’s start with the
FIRST VALUE function.
First, we are going to look at the analytical function called
FIRST VALUE. As the name suggests, this functions returns the first value in an ordered list of values.
Suppose we want to extract the first six months of data from our
sales table, but we want to list the salesman with the biggest sale volume first for each month. To accomplish this, we use the
FIRST VALUE function with the following arguments:
first_value(salesman_id) over (partition by extract(MONTH FROM datetime) order by sales_num desc)
We are feeding the
FIRST VALUE function with the
salesman_id argument over the window of a month where the data is sorted in descending order based on the
sales_num” column. We expect the salesman ID with the largest associated sale volume to be listed first for each month.
SELECT salesman_id, sales_item, sales_num, sales_price, extract(MONTH FROM datetime) month, first_value(salesman_id) over (partition by extract(MONTH FROM datetime) order by sales_num desc) as TOP_SALESMAN FROM sales WHERE extract(MONTH FROM datetime) <= 6 ORDER BY extract(MONTH FROM datetime);
Does the result live up to the expectation?
Try It Yourself: You can simulate the results of using
FIRST_VALUE with other functions. See if you can find any without using the
LAST VALUE is the opposite of
FIRST VALUE; it returns the last value of an ordered list of values.
Let’s take the previous example and simply replace
FIRST VALUE with the
SELECT salesman_id, sales_item, sales_num, sales_price, extract(MONTH FROM datetime) month, last_value(salesman_id) over (partition by extract(MONTH FROM datetime) order by sales_num desc) as TOP_SALESMAN FROM sales WHERE extract(MONTH FROM datetime) <= 6 ORDER BY extract(MONTH FROM datetime);
We get this result:
Now the values in
salesman_id that have the smallest associated
sales_number are listed first, just as we’d expect.
Try It Yourself:
LAST VALUE and
FIRST_VALUE are interchangeable – you can receive the same result you expect from
LAST VALUE using
FIRST VALUE. I will leave it up to you to figure out how.
There, that wasn’t too bad. Let’s move on to something more complex: the
LAG functions. You should note that, like
FIRST VALUE and
LAST VALUE, these have an opposite yet interchangeable relationship. Also, know that these are usually used in calculations as helper functions that show some difference between current value and the leading or lagging value.
We’ll start with the
LAG lets you access the information in another row without using a
SELF JOIN command. You simply specify the column name (the value) and how many rows back the data is. Once again, we are using this function with a window of data.
LAG returns the value from a row located before the current row in the specified sorting order.
The syntax of the function looks like this:
LAG(value, offset, default_value)
valueis the column we want to access
offsetis the number of previous rows we skip to get to the value
default_valueis what the function returns if there is no value in the row
LAG functions is usually used in SQL queries as a “helper” function in calculations that show some difference between current value and leading/lagging value.
Let’s try a simple
LAG function in our sales dataset:
SELECT salesman_id, sales_item, sales_num, sales_price, datetime as DATE, lag(sales_num,1,0) over (partition by salesman_id order by datetime desc) as LAG_SALES FROM sales WHERE extract(MONTH FROM datetime) <= 6 ORDER BY salesman_id,datetime desc;
LAG_SALES column contains sales numbers that are lagging based on the datetime column. The column is reset for every salesman.
We can try a more complex example. We know who are our top salesmen, in terms of volume; let’s explore the difference between the sales volume of the current salesman and that of the top salesman. We’ll use a six-month timeframe again.
We’ll compute this in two steps.
In this step, we are calculating
LAG sales numbers. We are gathering
sales_numbers values, offset by 1. If we do not find any row preceding the value, we put in the default value of 0.
SELECT salesman_id, sales_item, sales_num, sales_price, extract(MONTH FROM datetime) month, leg(sales_num,1,0) over (order by sales_num desc) as LAG_SALES FROM sales WHERE extract(MONTH FROM datetime) <= 6 ORDER BY sales_num desc;
Here is the result:
Now we will compute the difference between the volume of the current row and the volume of the previous row:
SELECT salesman_id, sales_item, sales_num, sales_price, extract(MONTH FROM datetime) month, ABS(leg(sales_num,1,0) over (order by sales_num desc)-sales_num) as DIFFERENCE_SALES FROM sales WHERE extract(MONTH FROM datetime) <= 6 ORDER BY sales_num desc;
The result is:
Now we can continue our analysis on the new column and see some interesting trends. We see that the top performers in sales volume are better then the rest by couple of orders of magnitude. Where we find zero we know that the salesman are selling the same amount of goods. See the resulting data-set and try to come up with some of your own insights.
LEAD is an analytical function that returns the value of an argument column that is offset-ed by an inputted number or rows in the partition window. The function is extracting the offset downstream from the supplied execution order.
LEAD function returns the value from some row after the current row (for example: the next row, or the row two rows after current), in the specified sorting order.
So the difference is in the direction in which we take the offset-ed value.
The syntax of the function looks the same as with the
LAG function (except the keyword):
LEAD(value, offset, default_value)
valueis the column we are calculating upon
offsetis the number of rows lagging we skip to get to the value
default_valueis the value the function returns if there is no value left in the lead
To demonstrate this function let’s get the same result we got from the
LAG example. To achieve this we will change one small thing besides the
This is the query:
SELECT salesman_id, sales_item, sales_num, sales_price, extract(MONTH FROM datetime) month, ABS(lead(sales_num,1,0) over (order by sales_num desc)-sales_num) as DIFFERENCE_SALES FROM sales WHERE extract(MONTH FROM datetime) <= 6 ORDER BY sales_num desc;
Can you spot the difference?
(Hint: This was the challenge in the
On your learning path, I encourage you to practice, practice, practice 😉 There is no better place for that than Vertabelo Academy.
The new functions are best mastered and practiced through the excellent hand-on courses of Vertabelo Academy. Practice coding as much as you can and use this article as a side companion, this is the best way to learn advanced SQL.