Aggregate functions are powerful SQL tools that compute numerical calculations on data, allowing the query to return summarized information about a given column or result set. These functions can be used in conjunction with the GROUP BY statement. Let’s see how they work using some easy examples.
SQL Aggregate Functions
Suppose we have users residing in a city, and we store their information in two tables. These tables and their relationship are shown below:
Let’s put some data into this model:
INSERT INTO `cities` VALUES (1,'Miami'),(2,'Orlando'), (3,'Las Vegas'),(4,'Coyote Springs'); INSERT INTO `users` VALUES (1,1,'Jhon','Doe',22), (2,1,'Albert','Thomson',15),(3,2,'Robert','Ford',65),(4,3,'Samantha','Simpson',9),(5,2,'Carlos','Bennet',42), (6,2,'Mirtha','Lebrand',81),(7,3,'Alex','Gomez',31);
Now we have enough material to explain the basic use of SQL aggregate functions. Let’s get started with a simple one.
This returns the smallest value in a given column. For example, let’s get the minimum age from our group of users:
SELECT MIN(age) FROM users;
This returns a “9”.
You can use this function to find alphabetical information, too. Let´s try it with the “
SELECT MIN(last_name) FROM users;
This returns “Bennet”, which is first in alphabetical order.
Note: The collation used to sort your data will impact the results of this query. For example, in the Danish collation, “A” is treated as “A” – the last letter in the alphabet. The Latin collation, of course, treats “A” as the first letter of the alphabet.
In a similar yet opposite fashion to
MAX returns the largest value in a column. Let’s get the maximum age from our user list:
SELECT MAX(age) FROM users;
This will return an “81”.
Let´s try the same with the last name column:
SELECT MAX(last_name) FROM users;
It returns “Thomson”, which is the last one in alphabetical order. Remember, this may change depending on the collation you are using.
This function calculates the sum of all the numeric values in a column. Let’s use it to retrieve the sum of all ages in the table:
SELECT SUM(age) FROM users;
It will return “265”.
This is used to calculate the average value for a column. Let’s see it in action, retrieving the average age of our users:
SELECT AVG(age) FROM users;
It returns a “27.75”.
This returns the number of non-NULL values in a given column. If we wanted to know how many users have told us their age, we would write:
SELECT COUNT(age) FROM users;
It will return a “7”. All records in the “
users” table have an age value. If a record didn’t have an age value, it would have been NULL (and not included in the COUNT result). If you want to count the actual number of table rows regardless of the column value, then the COUNT(*) function is what you need. Instead of specifying a column name as the function argument, we use an asterisk:
SELECT COUNT(*) FROM users;
In our case, this still returns a “7” because the table has seven records.
NULL values can be confusing, but don’t worry. We’ll show how you can deal with NULL values in SQL aggregate functions later in this article.
Now that you know what these aggregate functions do, let’s see how to make them even more useful.
Using GROUP BY with Aggregate Functions
GROUP BY statement lets us perform aggregations on a group of values based on given columns. You’ll use
GROUP BY with aggregate functions frequently, so these examples will be a little more complicated and realistic than the simple ones we used before.
GROUP BY (column_x)” means “put all records that share the same value in “
column_x” into one group”. Let’s see how this works with each of the functions we’ve already discussed.
MIN + GROUP BY
Suppose that we want to know our youngest user’s age in each city. We can see that the “
users” table has a column called “
city_id” that identifies the city where each user lives. We can use this column with a
GROUP BY statement to see the youngest age in each city:
SELECT city_id, MIN(age) FROM users GROUP BY city_id;
To better understand what is going on, look at the raw data in the “
GROUP BY will group records by the values in the “
city_id” column before computing the aggregate for each group. If you could see the GROUPed values, it would look something like this:
Then the query will grab the lowest ages within each group. If we could see this step in action, it would look like this:
The highlighted values represent the
MIN() computed values for each group.
Finally, the query will show the following results:
MAX + GROUP BY
As you may have already guessed, using MAX combined with
GROUP BY works in the same way as
MIN. It simply returns the greatest value for each group. We could compute the maximum age for each city in a similar fashion:
SELECT city_id, MAX(age) FROM users GROUP BY city_id;
This query will group users based on their “
city_id” field and then grab the maximum age value for each group. It gives us the following results:
SUM + GROUP BY
We could also calculate the sum of user ages in each city. For that, we can run the following query…
SELECT city_id, SUM(age) FROM users GROUP BY city_id;
… which will retrieve:
COUNT + GROUP BY
We may also want to calculate the number of users in each city. It’s easy to do this using the
SELECT city_id, COUNT(age) FROM users GROUP BY city_id;
Based on the last two examples, we could calculate the average age in each city by dividing the sum of all ages in a city group by the number of users for that city. Here’s how we do it:
SELECT city_id, SUM(age), COUNT(age), SUM(age) / COUNT(age) as average FROM users GROUP BY city_id;
Note: We could have also used
COUNT(*) here, since there are no records with NULL values in the “
age” column. In this case,
COUNT(age) works the same as
COUNT(*). Otherwise the value would differ, as we will explain later in the “Dealing with NULLs” section.
AVG + GROUP BY
In the previous example, we “manually” calculated the average age for each city. We could use the
AVG() function to perform this operation for us, as shown below:
SELECT city_id, AVG(age) FROM users GROUP BY city_id;
The value returned by
AVG(age) is identical to the result from the math operation we performed earlier.
Filtering Grouped Results
There are times when you need to further filter results based on conditions generated by aggregated results. Adding any conditions in the
WHERE would fail. Don’t believe me? See what happens when we try to retrieve the number of users only in cities where the average user age is greater than 20:
SELECT city_id, COUNT(age), AVG(age) FROM users WHERE AVG(age) >= 20 GROUP BY city_id;
This will make the engine (in my case, MySQL) complain. It will say something like this:
Error Code: 1111. Invalid use of group function
To filter results in this way, we need to use the
HAVING will filter summarized
GROUP BY results; the
WHERE clause only applies to individual records. If any group doesn’t meet the criteria in the
HAVING clause, they will not be returned.
So if we wanted to get the
COUNT for each city with an average age of at least 20, the proper way to handle the filtering would be like this:
SELECT city_id, COUNT(age), AVG(age) FROM users GROUP BY city_id HAVING AVG(age) >= 20
Note: The group with
city_id = 1 is discarded, as its
AVG(age) is 18.5.
HAVING can be used simultaneously without any issues (if it makes sense in your query, of course). Notice how this works in the next example. We’ll perform the calculation again, but this time we’ll exclude any user whose last name is “Simpson”:
SELECT city_id, COUNT(age), AVG(age) FROM users WHERE last_name <> 'Simpson' GROUP BY city_id HAVING AVG(age) >= 20
Notice that the
AVG calculation differs for
city_id = 3. There is a user with the last name “Simpson” for
city_id = 3, and that record was discarded due to the
WHERE last_name <> 'Simpson’ condition.
Dealing with NULLs
So far all our columns have been filled with data, but that may not always be the case. Let’s insert some records with NULL ages to explain some edge cases. The following two
INSERTs will add two new users, with an ID of 8 and 9 respectively, that have a NULL in the age column:
INSERT INTO `users` (`id`, `city_id`, `first_name`, `last_name`, `age`) VALUES ('8', '2', 'Frederic', 'Scott',NULL); INSERT INTO `users` (`id`, `city_id`, `first_name`, `last_name`, `age`) VALUES ('9', '4', 'Stacy', 'Roberts',NULL);
Records with null values are not considered in
COUNT(column). Those records are simply ignored in the calculation. A general rule of thumb is if you don’t expect to have null values or if you expect to treat NULLs as “0”, then you should set the columns as NOT NULL and set the NULL columns to “0” or whatever value makes sense for your use case.
COUNT(column) vs COUNT(*)
COUNT(column) function will not count any records with NULL age values. Let’s see this in action:
SELECT COUNT(age) FROM users;
This query returns a “7” again; the two records we added have NULL values in the “
age” column, so they are ignored. If we wanted to count every record regardless of the value, we use the
SELECT COUNT(*) FROM users;
This returns the expected result of “9”.
Let’s see how NULLs impact the
AVG + NULL
Due to the fact that
AVG will ignore records with NULLs in the specified column, the resulting value may not make sense. Here’s why.
AVG(age) function will add only the users with a non-null value in the “
age” column and divide that number against the
COUNT of users that also have a non-null “
age” value. This is 7 in our example. If you consider that the two users we just added with NULL ages should be considered for the average age calculation, then the value you will get with
AVG(age) would be wrong. The query below will show the difference in the calculations:
SELECT SUM(age), COUNT(age), AVG(age), SUM(age) / COUNT(age), COUNT(*), SUM(age) / COUNT(*) FROM users;
|SUM(age)||COUNT(age)||AVG(age)||SUM(age) / COUNT(age)||COUNT(*)||SUM(age) / COUNT(*)|
Notice how the
AVG(age) value matches the value calculated using
SUM(age)/COUNT(age); both functions don’t consider records with NULL values. But take a look at how the average value changes when using
COUNT(*), which includes all records.
We have seen that NULL values will not get computed in
COUNT(column_name) functions. If you are going to use these functions and you expect some NULL fields, make sure to set NULLs as some specific value.
Also, set the column types consistently to the value you are storing. For instance, numbers stored in a VARCHAR column will not be treated as numerical and may cause undesired results in your aggregated function computations.
Finally, I’d like to emphasize the importance of putting your filtering conditions in the right places inside your queries. Use HAVING if you need conditions based on aggregated values.
What do you think about SQL aggregate functions? Do you have any experiences or examples about their use? Please share your thoughts with our community. And don’t miss the second part of this article, where I will explain how to integrate SQL aggregate functions with JOINs. We’ll be posting it on the Vertabelo blog soon!