Useful SQL Patterns: Conditional Summarization with CASE

As you start coding in SQL, you will use some statements and techniques over and over again. We call these “SQL patterns”. This series will look at the most common SQL patterns and consider how to use them.

Previously, we looked at the SQL pattern of matching NULLs. This is important when you are comparing columns containing NULL values. Today, we’re going to consider another SQL practice: conditional summarization with CASE operator.

What is Conditional Summarization?

When you are using aggregate functions to create report queries, you’ll frequently find yourself using conditional summarization with the CASE operator. Remember that CASE returns a value based on defined criteria. (For more about the CASE expression, see this post and this one.) When you perform a summarization with CASE, you’re simply adding up (summarizing) values that meet the CASE expression.

Obviously, you’ll be using the SUM part of the query to aggregate these values. I know this sounds complicated, but it is not. Let’s use a simple example to explain it.

We’ll start by considering a basic transactions table that contains transactional data from a small company.

transactions table

The transactions table has these columns:

  • id – A unique identifier for each transaction
  • datetime – The timestamp for the transaction
  • customer – The customer’s ID
  • creditcard – The ID of the type of credit card used
  • amount – The transaction amount, in dollars
  • account – The customer’s account number
  • type – The transaction type

Here’s the data we’d find in a table like this:

datetime customer creditcard amount account type
2017-01-01 00:00:00.000000 1 1 100 1 type_1
2017-03-01 00:00:00.000000 2 1 350 1 type_1
2017-05-01 00:00:00.000000 3 1 10 1 type_3
2017-02-01 00:00:00.000000 2 1 10 1 type_2
2017-05-01 00:00:00.000000 2 1 10 1 type_1
2017-04-01 00:00:00.000000 3 1 600 1 type_3
2017-01-01 00:00:00.000000 3 1 350 1 type_3
2017-03-01 00:00:00.000000 1 1 150 1 type_1
2017-04-01 00:00:00.000000 1 1 200 1 type_1
2017-02-01 00:00:00.000000 1 1 50 1 type_2
2017-05-01 00:00:00.000000 1 1 210 1 type_2
2017-04-01 00:00:00.000000 2 1 600 1 type_3
2017-01-01 00:00:00.000000 2 1 100 1 type_1

We want to find the sum of transaction amounts and the number of transactions completed before April 1st. Furthermore, we wanted these listed by individual customer. We could achieve this using the following query :

	SELECT
 customer,
 SUM(
 CASE WHEN datetime >= TIMESTAMP '2017-04-01'
 THEN amount
 ELSE 0
 END
 )        AS sum_amount_after,
 SUM(CASE WHEN datetime >= TIMESTAMP '2017-04-01'
 THEN 1
 ELSE 0
 END) AS transaction_count_after,
 SUM(CASE WHEN datetime < TIMESTAMP '2017-04-01'
 THEN amount
 ELSE 0
 END) AS sum_amount_prior,
 SUM(CASE WHEN datetime < TIMESTAMP '2017-04-01'
 THEN 1
 ELSE 0
 END) AS transaction_count_prior
 FROM transactions
 WHERE datetime BETWEEN TIMESTAMP '2017-01-01'  AND '2017-05-01'
 GROUP BY customer

The result of this query is:

customer sum_amount
_after
transaction
_count_after
sum
_amount_prior
transaction
_count_prior
2 610 2 460 3
1 410 2 300 3
3 610 2 350 1

Wait! How Did That Work?

This long query can be confusing, so let’s break it down a little. We will concentrate first on the part dealing with the April 1 cutoff (2017-04-01).

Below, we are looking at the transaction amount for customer ‘1’. Any transaction amounts that posted before 01.04.2017 will be set to “0”. We will name this column amount_after.

SELECT
 customer,
 datetime,
 CASE WHEN datetime >= TIMESTAMP '2017-04-01'
  THEN amount
 ELSE 0
 END
  AS amount_after,
 amount
FROM transactions
WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01'
      AND customer = '1'
ORDER BY datetime;
customer datetime amount_after amount
1 2017-01-01 00:00:00.000000 0 100
1 2017-02-01 00:00:00.000000 0 50
1 2017-03-01 00:00:00.000000 0 150
1 2017-04-01 00:00:00.000000 200 200
1 2017-05-01 00:00:00.000000 210 210

Alternately, we can replace the zeros shown in the results with a NULL in the ELSE statement:

SELECT
 customer,
 datetime,
 CASE WHEN datetime >= TIMESTAMP '2017-04-01'
  THEN amount
 ELSE null
 END
  AS amount_after,
 amount
FROM transactions
WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01'
      AND customer = '1'
ORDER BY datetime;
customer datetime amount_after amount
1 2017-01-01 00:00:00.000000 100
1 2017-02-01 00:00:00.000000 50
1 2017-03-01 00:00:00.000000 150
1 2017-04-01 00:00:00.000000 200 200
1 2017-05-01 00:00:00.000000 210 210

Now if we summarize these columns, we will get:

  1. The total of all transactions for sum(amount).
  2. The sum of all transactions that posted after 01.04. Any transactions that posted before 01.04 are set to zero (or NULL) for sum(amount_after).

If we want to count how many transactions were posted after 01.04, we can modify the query and create a COUNT statement that uses the same CASE with NULL in the ELSE.

SELECT
  customer,
  count(CASE WHEN datetime >= TIMESTAMP '2017-04-01'
    THEN amount
        ELSE NULL
        END)
    AS count_after
FROM transactions
WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01'
GROUP BY customer
ORDER BY customer;
customer count_after
1 2
2 2
3 2

Note: This query is really fast, as the RDBMS only needs to access one table. Building aggregation queries that use just one table is a good way to get results quickly.

Try using CASE with a zero in the COUNT statement. What is the result and why? Tell us in the comments section below.

Data Warehouse Architect

GET ACCESS TO EXPERT SQL CONTENT!