GROUP BY is an important part of the SQL SELECT statement. But new SQL coders can run into some problems when this clause is used incorrectly. Here’s how to avoid those issues.
You’re learning SQL. You know how to SELECT some data FROM a table and how to filter the data with a WHERE clause. You can process the data using aggregate functions (MIN, MAX, SUM, AVG, and others). But when you’re dealing with a lot of data, you may need to narrow it down even more. That’s where the GROUP BY clause comes in – it lets you organize information based on parameters you set.
In this article, we’ll explain how you can use GROUP BY. We’ll also talk about a couple of possible problems and how to avoid or fix them.
Using GROUP BY
To understand how to use GROUP BY, we first need to define a table to practice on. How about one describing people?
| ID | Name | Gender | Height | Weight | Eye_color | -------------------------------------------------------- | 1 | Mark | Male | 180 | 78 | Blue | | 2 | Susan | Female | 172 | 59 | Brown | | 3 | Thomas | Male | 179 | 92 | Brown | | 4 | Katarina | Female | 164 | 53 | Green | | 5 | Mindy | Female | 170 | 58 | Blue | --------------------------------------------------------
So we have our practice table. Now we get into the statistics. For example, what is the average height of all our people? To find out, we type:
SELECT AVG(Height) FROM People
+-------------+ | AVG(Height) | +-------------+ | 173.0000 | +-------------+
Now suppose we want to know the average height by gender. That appears to be quite easy; we simply add a WHERE clause. So we type:
SELECT AVG(Height) FROM People WHERE Gender = ‘Male’
But what if we introduce other genders into our table? In that case, we’d have to write additional queries and manually collect the data we need. It’s easier to GROUP our data BY
Gender and then compute the average height for each group, as shown below.
SELECT Gender, AVG(Height) FROM People GROUP BY Gender
+--------+-------------+ | Gender | AVG(Height) | +--------+-------------+ | Female | 168.6667 | | Male | 179.5000 | +--------+-------------+
Grouping seems easy, right? All you need to do is add the GROUP BY clause to your SQL statement. Suppose, though, we need to have two parameters in our search. In that case, we’d need to group by two columns. Let’s say we want to know how many men and women have blue, brown, or green eyes. We’d type:
SELECT Gender, Eye_color, COUNT(*) FROM People GROUP BY Gender, Eye_color
+--------+-----------+----------+ | Gender | Eye_color | COUNT(*) | +--------+-----------+----------+ | Female | Blue | 1 | | Female | Brown | 1 | | Female | Green | 1 | | Male | Blue | 1 | | Male | Brown | 1 | +--------+-----------+----------+
This is just an introduction to GROUP BY. You can use it in many ways. Why not try experimenting with different aggregate functions (like AVG and COUNT) to get a better grasp of GROUP BY?
Common GROUP BY Errors
Although GROUP BY seems fairly easy to use, it’s common for SQL newbies to find themselves confronted by some confusing error messages. Here are some we come across quite often:
1. Selecting Multiple Values
The most important thing to remember when using GROUP BY is that whatever you’re going to SELECT must be a single value. That’s why we need to use an aggregate function: it takes multiple values and produces a single one to return.
To make it a bit clearer, let’s take a look at our groups:
+--------+--------+ | Gender | Height | +--------+--------+ | Male | 180 | | | 179 | | Female | 172 | | | 164 | | | 170 | +--------+--------+
When we ask for Height grouped by
Gender, we want to get a single value. But here
Male has two values for height and
Female has three. Which should we choose?
SELECT Gender, Height FROM People GROUP BY Gender;
(MYSQL) ERROR 1055 (42000): Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'vertabelo.People.Height' which is not functionally dependent on columns in GROUP BY clause;
That’s why we need to use an aggregate function (such as AVG, which we used earlier) to get a single value.
2. Using WHERE to Filter Results
Suppose we only want to see eye-color results for groups of people taller than 170 cm. If we try to put that into the WHERE clause, as shown below:
SELECT Gender, Eye_color, COUNT(*) FROM People WHERE AVG(Height) > 170 GROUP BY Gender, Eye_color
We get the error message shown below:
(MYSQL) ERROR 1111 (HY000): Invalid use of group function
That’s because the database is grouping our records after filtering them. We want them to filter the result of the GROUP BY statement. To do this, we use another clause called HAVING. It comes after GROUP BY and works like WHERE. The difference is you can use aggregate functions in it. To get the desired result set, we’d write this code:
SELECT Gender, Eye_color, COUNT(*) FROM People GROUP BY Gender, Eye_color HAVING AVG(Height) > 170
Remember that in the database, an SQL SELECT statement always executes in this order:
- GROUP BY
- ORDER BY
We take a data source, filter the records, group them, filter the groups, select the columns we want and then sort them.
— Vertabelo (@Vertabelo) January 5, 2017
3. Multiple Entity Errors
This is a little more advanced. When we’ve JOINed several tables, it’s quite possible that some entities will have the same names. So we’ll often want results grouped by entity ID rather than the entity’s name.
For example, suppose you are looking at city data for Warsaw. There’s a Warsaw that’s the capital of Poland and a Warsaw that’s in the state of Indiana, USA.
So let’s say we have a table defined like this:
Table City: --------------------- | ID | Name | Award | ---------------------
This table describes cities that were given one or more awards. A city is identified by its ID and it can have many awards.
If we wanted to see the number of awards received by a city and have this information grouped by city name, perhaps we would use this query:
SELECT City.Name, COUNT(Award) FROM City GROUP BY City.Name
The problem is that by using
City.Name, the awards for Warsaw, Poland and Warsaw, Indiana would be added together. After all, both are Warsaw! However they are different places, and as such have different
City.ID values. When you think about a city as a database entity, it is identified by its ID rather than its attributes (like
Name). If we group results by ID, we’ll get the correct information. Since we still want to display the name, we’ll use something like this:
SELECT City.Name, COUNT(Award) FROM City GROUP BY City.ID
The result will have separate entries for the different Warsaws with the values we want. (Normally, there would be another column, like “country” or “state”, to differentiate between those two cities. But for the sake of example, let’s say there isn’t.)
4. Using Values That Are “Not Aggregated”
In the above example, we’ve selected the
City.Name attribute and grouped results by the
City.ID attribute. In our table, every record with the same ID also has the same city name. Some databases will have no problem with this and will return the expected results, but others will give you an error saying that
City.Name is not in the
GROUP BY clause and that it is not aggregated. I’ve reproduced the MySQL error message below:
(MYSQL) ERROR 1055 (42000): Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'vertabelo.City.Name' which is not functionally dependent on columns in GROUP BY clause;
But it is a single value! How do we fix this error? Well, since one
City.ID means exactly one
City.Name, we can put both of them in the
GROUP BY clause:
SELECT City.Name, COUNT(*) FROM City JOIN Person ON (Person.CityID = City.ID) GROUP BY City.ID, City.Name
This should fix the problem.
Practice Makes Perfect
If you’re still not getting how or when to use the GROUP BY clause or if you want to practice using it, check out our SQL Queries course. It covers all the basics as well as the more advanced stuff, so you’ll have lots of opportunities to perfect your SQL skills.