Common SQL Window Functions: Using Partitions with Ranking Functions
You’ve started your mastery of SQL window functions by learning RANK, NTILE, and other basic functions. In this article, we will explain how to use SQL partitions with ranking functions.
Mastering SQL window (or analytical) functions is a bumpy road, but it helps to break the journey into logical stages that build on each other. In the previous Common SQL Functions article, you learned about the various rank functions, which are the most basic form of window functions. In this article, we will build on that knowledge, using the concept of partitions. This will expand the scope and complexity of the statements from the previous article.
We will start by explaining the PARTITION BY
clause, and then we will show how to use it with ranking functions.
What Is PARTITION BY?
Partitioning is optional part of a SQL window function command. Most analytical functions, including RANGE
, can be written with a PARTITION BY
clause:
<AnalyticFunction> ([<column-list>]) over ([<partition by>])
What does a PARTITION do? It defines a partition on which the window function does some work (e.g. executes).
It is important to note the similarities between GROUP BY
and PARTITION BY
parts of the SQL statements. But be careful not to confuse the two. GROUP BY defines aggregation groups on the level of the SQL statement and PARTITION BY
defines virtual PARTITIONS that are needed for analytical function output. This may seem confusing at first but after the examples I am sure that things will clear up.
Note:
Window functions do not change the number of rows you get with your select statement. This is one obvious difference between them and group by statement.
Ranking Functions
As stated in the previous article ranking analytical functions “rank” data based on the “ordering” you specify in the function specification. Now we expand this statement, so ranking analytical functions rank data based on the ordering you specify in the function specification inside the specified rank partition.
To demonstrate various rank analytical functions with partition we will revisit our simple sales table.
In this table we are tracking sales. The columns are:
id
: primary key of the salessalesman_id
: id of the sales man how made the salesmansales_item
: id of the sold itemsales_num
: number of items soldsales_price
: price of the individual itemdatetime
: date of the sale
Rank SQL
We can now redefine RANK
as a window function that returns a rank or a number based on the ordering of the rows by some condition inside a predefined partition.
Let’s say we want to rank sales items for every salesman from most sold to least sold. Take into account the wording ‘for … salesman’. “For something” usually defines the partition keys that we need in our SQL statement. In all our simple examples we will look at sales form March (the third month).
SELECT id, salesman_id, sales_item, sales_num, sales_price, rank() over (partition by salesman_id order by sales_num desc) as RANK FROM sales WHERE EXTRACT(MONTH FROM datetime) = 3 ORDER BY salesman_id,sales_num desc;
And the result set is displayed as :
ID | SALESMAN_ID | SALES_ITEM | SALES_NUM | SALES_PRICE | RANK |
---|---|---|---|---|---|
8 | 10 | 10 | 50 | 55.1 | 1 |
7 | 10 | 200 | 20 | 1.1 | 2 |
1 | 10 | 100 | 4 | 50.5 | 3 |
9 | 30 | 25 | 330 | 12.1 | 1 |
2 | 30 | 200 | 7 | 50 | 2 |
3 | 40 | 200 | 40 | 10.5 | 1 |
4 | 50 | 200 | 15 | 110 | 1 |
5 | 60 | 200 | 8 | 30.5 | 1 |
6 | 70 | 200 | 22 | 10.5 | 1 |
As you can see from our result set the salesmen with the id of 10 and 30 have more than one sale in the table currently. We have ordered their sails with the decreasing function so that the larger sales have a smaller rank.
If we where to change the order to asc or just leave it empty, ORDER BY
is mandatory, or larger sales would have a larger rank.
SELECT id, salesman_id, sales_item, sales_num, sales_price, rank() over (partition by salesman_id order by sales_num ) as RANK FROM sales WHERE EXTRACT(MONTH FROM datetime) = 3 ORDER BY salesman_id,sales_num ;
ID | SALESMAN_ID | SALES_ITEM | SALES_NUM | SALES_PRICE | RANK |
---|---|---|---|---|---|
1 | 10 | 100 | 4 | 50.5 | 1 |
7 | 10 | 200 | 20 | 1.1 | 2 |
8 | 10 | 10 | 50 | 55.1 | 3 |
2 | 30 | 200 | 7 | 50 | 1 |
9 | 30 | 25 | 330 | 12.1 | 2 |
3 | 40 | 200 | 40 | 10.5 | 1 |
4 | 50 | 200 | 15 | 110 | 1 |
5 | 60 | 200 | 8 | 30.5 | 1 |
6 | 70 | 200 | 22 | 10.5 | 1 |
Let’s take a look at a more complex example. Let’s say that we need to see all the sales statistics of the best salesman of the month for the last year. So how do we define the best salesman?
The best salesman is the salesman who had the “biggest” sales (sales amount times sales price) by dollar amount in the month.
So if we look at our example:
SELECT * FROM (SELECT id, salesman_id, sales_item, sales_num, sales_price, rank() over (partition by extract(month from datetime) order by sales_num*sales_price ) as salesman_rank_list, extract(month from datetime) sales_month, sales_num*sales_price sales FROM sales ORDER BY sales_month,salesman_rank_list) ranked WHERE salesman_rank_list = 1;
And filter in the outer WHERE
clause the first ranked salesman we get our result:
ID | SALESMAN _ID |
SALES _ITEM |
SALES _NUM |
SALES _PRICE |
SALESMAN_ RANK_LIST |
SALES_ MONTH |
SALES |
---|---|---|---|---|---|---|---|
15 | 70 | 80 | 2 | 60.5 | 1 | 1 | 121 |
24 | 30 | 25 | 50 | 26.1 | 1 | 2 | 1305 |
7 | 10 | 200 | 20 | 1.1 | 1 | 3 | 22 |
17 | 10 | 40 | 40 | 5.1 | 1 | 4 | 204 |
18 | 30 | 25 | 30 | 2.1 | 1 | 5 | 63 |
13 | 50 | 600 | 215 | 10 | 1 | 6 | 2150 |
12 | 40 | 2 | 40 | 17.5 | 1 | 7 | 700 |
26 | 10 | 50 | 54 | 50.5 | 1 | 8 | 2727 |
14 | 60 | 50 | 8 | 302.5 | 1 | 9 | 2420 |
19 | 70 | 80 | 2 | 60.5 | 1 | 10 | 121 |
21 | 10 | 40 | 40 | 5.1 | 1 | 11 | 204 |
22 | 30 | 25 | 30 | 2.1 | 1 | 12 | 63 |
As you saw in this example the RANK
function can be used not only for reporting but for constructing more powerful (and complex) SQL statements.
ROW NUMBER
We have covered the specifics of ROW NUMBER
analytical function in the previous article, now let’s look at how it handles the partition clause.
Although it is not deterministic this function has it’s uses, one of the best know uses is the “elimination of duplicate rows”. So let’s say that, by some mistake we insert the same double sales for the salesman 40:
ID | SALESMAN _ID |
SALES _ITEM |
SALES _NUM |
SALES _PRICE |
DATETIME | R_FLAG |
---|---|---|---|---|---|---|
3 | 40 | 200 | 40 | 10.5 | 02-MAR-17 | 1 |
27 | 40 | 200 | 40 | 10.5 | 02-MAR-17 | 2 |
The ROW_NUM
still flags every for, column R_FLAG
, with a different number. We can use this information to delete duplicate rows.
The query for deleting duplicate rows is :
DELETE FROM sales WHERE rowid IN (SELECT rid FROM (SELECT rowid rid, id, salesman_id, sales_item, sales_num, sales_price, DATETIME, row_number() over (partition BY salesman_id,sales_item,sales_num,sales_price order by id) r_flag FROM sales WHERE EXTRACT(MONTH FROM DATETIME) = 3 AND SALESMAN_ID = 40 ) WHERE r_flag != 1 );
Note:
In this query we are using the Oracle ROWID
pseudocolumn. For each row in the database, the ROWID pseudocolumn returns the address of the row. Other databases have similar pseudocolumns (PostgreSQL CTID, MSSQL RID, etc.)
NTILE
Let’s look how the final ranking analytical function handles partitions. Quick recap NTILE
function divides rows of data into equal groups and returns the rank of those groups.
A simple and common use case would be if we specified the business value of particular sales volume. So we could divide the sales volume, defined with sales_num
, into four groups by every product:
- High
- Medium-High
- Medium-Low
- Low
We will take our example with the sales and expand it using NTILE
function with the function argument of 4. With this we are dividing sales of each partition (one salesman is one partition) into 4 equal parts. Equal in number of rows.
SELECT id, salesman_id, sales_item, sales_num, sales_price, case when NTILE = 1 THEN 'High' when NTILE = 2 THEN 'Middle-High' when NTILE = 3 THEN 'Middle-Low' when NTILE = 4 THEN 'Low' end sales_group, ntile FROM (SELECT id, salesman_id, sales_item, sales_num, sales_price, NTILE(4) over (partition by sales_item order by sales_num desc) as ntile FROM sales WHERE EXTRACT(MONTH FROM datetime) = 3) ORDER BY sales_item,sales_num desc;
The result we generate is:
ID | SALESMAN _ID |
SALES _ITEM |
SALES _NUM |
SALES _PRICE |
SALES _GROUP |
NTILE |
---|---|---|---|---|---|---|
8 | 10 | 10 | 50 | 55.1 | High | 1 |
9 | 30 | 25 | 330 | 12.1 | High | 1 |
1 | 10 | 100 | 4 | 50.5 | High | 1 |
3 | 40 | 200 | 40 | 10.5 | High | 1 |
6 | 70 | 200 | 22 | 10.5 | High | 1 |
10 | 10 | 200 | 20 | 2.1 | Middle-High | 2 |
7 | 10 | 200 | 20 | 1.1 | Middle-High | 2 |
4 | 50 | 200 | 15 | 110 | Middle-Low | 3 |
5 | 60 | 200 | 8 | 30.5 | Middle-Low | 3 |
2 | 30 | 200 | 7 | 50 | Low | 4 |
As we see from the result set NTILE
divided the partition” of sales_item
ID 200 into four parts, with the last group being the smallest.
This article is expanding your knowledge of window functions to a new level of complexity. For practice, you could solve the complex examples in this article using only aggregation and self-joins – give it a try! Regarding window functions this is not the end, there are more analytical functions and even SQL patterns to come in upcoming articles. In the meanwhile be sure to practice your new knowledge of partition on Vertabelo Academy.