Common SQL Window Functions: Positional Functions

Window Functions, analytics, SQL for advanced, aggregate functions, SQL clauses, positional functions

Positional SQL window functions deal with data’s location in the set. In this post, we explain LEAD, LAG, and other positional functions.

SQL window functions allow us to aggregate data while still using individual row values. We’ve already dealt with ranking functions and the use of partitions. In this post, we’ll examine positional window functions, which are extremely helpful in reporting and summarizing data. Specifically, we’ll look at LAG, LEAD, FIRST_VALUE and LAST_VALUE.

It is worthwhile mentioning that LEAD mirrors LAG and FIRST_VALUE mirrors LAST_VALUE. In some ways, these pairs can be used interchangeably.

We’re getting into a more complex area of SQL window functions now. As we go through this post, you’ll notice that some problems are solved more easily using window functions (as compared to using nested complex subquery combinations). Don’t worry if you feel a little bit overwhelmed at first. The topic isn’t easy and it takes time to master it. Just keep learning and practicing!

Introducing SQL’s Positional Windows Functions

Once again, we’ll be working with the sales table. This table tracks basic sales data using the following columns:

  • id” – The primary key of the table.
  • salesman_id – The ID of the person who made the sale.
  • sales_item – The ID of the item sold.
  • sales_num – The number of items sold.
  • sales_price – The price per individual item.
  • datetime – The date of the sale.

Now, let’s start with the FIRST VALUE function.

FIRST VALUE

First, we are going to look at the analytical function called FIRST VALUE. As the name suggests, this functions returns the first value in an ordered list of values.

Suppose we want to extract the first six months of data from our sales table, but we want to list the salesman with the biggest sale volume first for each month. To accomplish this, we use the FIRST VALUE function with the following arguments:

first_value(salesman_id) over (partition by extract(MONTH FROM datetime) order by sales_num desc)

We are feeding the FIRST VALUE function with the salesman_id argument over the window of a month where the data is sorted in descending order based on the sales_num” column. We expect the salesman ID with the largest associated sale volume to be listed first for each month.

SELECT 
  salesman_id,
  sales_item,
  sales_num,
  sales_price,
  extract(MONTH FROM datetime) month,
  first_value(salesman_id) over (partition by extract(MONTH FROM datetime) order by sales_num desc) as TOP_SALESMAN
FROM sales
WHERE extract(MONTH FROM datetime) <= 6
ORDER BY extract(MONTH FROM datetime);

Does the result live up to the expectation?

SALESMAN
_ID
SALES
_ITEM
SALES
_NUM
SALES
_PRICE
MONTH TOP_
SALESMAN
10 8 22 160.5 1 10
70 80 2 60.5 1 10
10 6 550 16.1 2 10
30 25 50 26.1 2 10
30 25 330 12.1 3 30
10 10 50 55.1 3 30
40 200 40 10.5 3 30
70 200 22 10.5 3 30
10 200 20 2.1 3 30
10 200 20 1.1 3 30
50 200 15 110 3 30
60 200 8 30.5 3 30
30 200 7 50 3 30
10 100 4 50.5 3 30
10 40 40 5.1 4 10
30 25 30 2.1 5 30
50 600 215 10 6 50

Try It Yourself: You can simulate the results of using FIRST_VALUE with other functions. See if you can find any without using the RANK function.

LAST VALUE

LAST VALUE is the opposite of FIRST VALUE; it returns the last value of an ordered list of values.

Let’s take the previous example and simply replace FIRST VALUE with the LAST VALUE:

SELECT 
  salesman_id,
  sales_item,
  sales_num,
  sales_price,
  extract(MONTH FROM datetime) month,
  last_value(salesman_id) over (partition by extract(MONTH FROM datetime) order by sales_num desc) as TOP_SALESMAN
FROM sales
WHERE extract(MONTH FROM datetime) <= 6
ORDER BY extract(MONTH FROM datetime);

We get this result:

SALESMAN
_ID
SALES
_ITEM
SALES
_NUM
SALES
_PRICE
MONTH TOP_
SALESMAN
10 8 22 160.5 1 10
70 80 2 60.5 1 70
10 6 550 16.1 2 10
30 25 50 26.1 2 30
30 25 330 12.1 3 30
10 10 50 55.1 3 10
40 200 40 10.5 3 40
70 200 22 10.5 3 70
10 200 20 2.1 3 10
10 200 20 1.1 3 10
50 200 15 110 3 50
60 200 8 30.5 3 60
30 200 7 50 3 30
10 100 4 50.5 3 10
10 40 40 5.1 4 10
30 25 30 2.1 5 30
50 600 215 10 6 50

Now the values in salesman_id that have the smallest associated sales_number are listed first, just as we’d expect.

Try It Yourself: LAST VALUE and FIRST_VALUE are interchangeable – you can receive the same result you expect from LAST VALUE using FIRST VALUE. I will leave it up to you to figure out how.

There, that wasn’t too bad. Let’s move on to something more complex: the LEAD and LAG functions. You should note that, like FIRST VALUE and LAST VALUE, these have an  opposite yet interchangeable relationship. Also, know that these are usually used in calculations as helper functions that show some difference between current value and the leading or lagging value.

We’ll start with the LAG function.

LAG

LAG lets you access the information in another row without using a SELF JOIN command. You simply specify the column name (the value) and how many rows back the data is. Once again, we are using this function with a window of data.

Important: LAG returns the value from a row located before the current row in the specified sorting order.

The syntax of the function looks like this:

LAG(value, offset, default_value)

  • value is the column we want to access
  • offset is the number of previous rows we skip to get to the value
  • default_value is what the function returns if there is no value in the row

LEAD / LAG functions is usually used in SQL queries as a “helper” function in calculations that show some difference between current value and leading/lagging value.

Let’s try a simple LAG function in our sales dataset:

 SELECT 
  salesman_id,
  sales_item,
  sales_num,
  sales_price,
  datetime as DATE,
  lag(sales_num,1,0) over (partition by salesman_id order by datetime desc) as LAG_SALES
FROM sales
WHERE extract(MONTH FROM datetime) <= 6
ORDER BY salesman_id,datetime desc;

Here the LAG_SALES column contains sales numbers that are lagging based on the datetime column. The column is reset for every salesman.

SALESMAN
_ID
SALES
_ITEM
SALES
_NUM
SALES
_PRICE
DATE LAG_
SALES
10 40 40 5.1 14-APR-17 0
10 200 20 2.1 20-MAR-17 40
10 10 50 55.1 14-MAR-17 20
10 200 20 1.1 07-MAR-17 50
10 100 4 50.5 01-MAR-17 20
10 6 550 16.1 07-FEB-17 4
10 8 22 160.5 15-JAN-17 550
30 25 30 2.1 25-MAY-17 0
30 25 330 12.1 25-MAR-17 30
30 200 7 50 01-MAR-17 330
30 25 50 26.1 03-FEB-17 7
40 200 40 10.5 02-MAR-17 0
50 600 215 10 03-JUN-17 0
50 200 15 110 03-MAR-17 215
60 200 8 30.5 04-MAR-17 0
70 200 22 10.5 05-MAR-17 0
70 80 2 60.5 05-JAN-17 22

We can try a more complex example. We know who are our top salesmen, in terms of volume; let’s explore the difference between the sales volume of the current salesman and that of the top salesman. We’ll use a six-month timeframe again.

We’ll compute this in two steps.

Step 1

In this step, we are calculating LAG sales numbers. We are gathering sales_numbers values, offset by 1. If we do not find any row preceding the value, we put in the default value of 0.

SELECT 
  salesman_id,
  sales_item,
  sales_num,
  sales_price,
  extract(MONTH FROM datetime) month,
	leg(sales_num,1,0) over (order by sales_num desc) as LAG_SALES
FROM sales
WHERE extract(MONTH FROM datetime) <= 6
ORDER BY sales_num desc;

Here is the result:

SALESMAN
_ID
SALES
_ITEM
SALES
_NUM
SALES
_PRICE
MONTH LAG
_SALES
10 6 550 16.1 2 0
30 25 330 12.1 3 550
50 600 215 10 6 330
30 25 50 26.1 2 215
10 10 50 55.1 3 50
40 200 40 10.5 3 50
10 40 40 5.1 4 40
30 25 30 2.1 5 40
10 8 22 160.5 1 30
70 200 22 10.5 3 22
10 200 20 2.1 3 22
10 200 20 1.1 3 20
50 200 15 110 3 20
60 200 8 30.5 3 15
30 200 7 50 3 8
10 100 4 50.5 3 7
70 80 2 60.5 1 4

Step 2

Now we will compute the difference between the volume of the current row and the volume of the previous row:

SELECT 
  salesman_id,
  sales_item,
  sales_num,
  sales_price,
  extract(MONTH FROM datetime) month,
	ABS(leg(sales_num,1,0) over (order by sales_num desc)-sales_num) as DIFFERENCE_SALES
FROM sales
WHERE extract(MONTH FROM datetime) <= 6
ORDER BY sales_num desc;

The result is:

SALESMAN
_ID
SALES
_ITEM
SALES
_NUM
SALES
_PRICE
MONTH DIFFERENCE
_SALES
10 6 550 16.1 2 550
30 25 330 12.1 3 220
50 600 215 10 6 115
30 25 50 26.1 2 165
10 10 50 55.1 3 0
40 200 40 10.5 3 10
10 40 40 5.1 4 0
30 25 30 2.1 5 10
10 8 22 160.5 1 8
70 200 22 10.5 3 0
10 200 20 2.1 3 2
10 200 20 1.1 3 0
50 200 15 110 3 5
60 200 8 30.5 3 7
30 200 7 50 3 1
10 100 4 50.5 3 3
70 80 2 60.5 1 2

Now we can continue our analysis on the new column and see some interesting trends. We see that the top performers in sales volume are better then the rest by couple of orders of magnitude. Where we find zero we know that the salesman are selling the same amount of goods. See the resulting data-set and try to come up with some of your own insights.

LEAD

LEAD is an analytical function that returns the value of an argument column that is offset-ed by an inputted number or rows in the partition window. The function is extracting the offset downstream from the supplied execution order.

Like LAG, the LEAD function returns the value from some row after the current row (for example: the next row, or the row two rows after current), in the specified sorting order.

So the difference is in the direction in which we take the offset-ed value.

The syntax of the function looks the same as with the LAG function (except the keyword):

LEAD(value, offset, default_value)

  • value is the column we are calculating upon
  • offset is the number of rows lagging we skip to get to the value
  • default_value is the value the function returns if there is no value left in the lead

To demonstrate this function let’s get the same result we got from the LAG example. To achieve this we will change one small thing besides the LAG keyword.

This is the query:

SELECT 
  salesman_id,
  sales_item,
  sales_num,
  sales_price,
  extract(MONTH FROM datetime) month,
	ABS(lead(sales_num,1,0) over (order by sales_num desc)-sales_num) as DIFFERENCE_SALES
FROM sales
WHERE extract(MONTH FROM datetime) <= 6
ORDER BY sales_num desc;

Can you spot the difference?

(Hint: This was the challenge in the FIRST_VALUE / LAST_VALUE demonstration.)

Conclusion

On your learning path, I encourage you to practice, practice, practice 😉 There is no better place for that than Vertabelo Academy.

The new functions are best mastered and practiced through the excellent hand-on courses of Vertabelo Academy. Practice coding as much as you can and use this article as a side companion, this is the best way to learn advanced SQL.

Data Warehouse Architect

GET ACCESS TO EXPERT SQL CONTENT!