What is an SQL self-join and how does it work? When should it be used? We’ll provide answers to those questions!
In SQL, we can combine data from multiple tables by using a
JOIN has several variants; we’ve already discussed CROSS JOIN, INNER JOIN, and OUTER JOIN. Most of the time, these operators join data from two or more different tables. In this article, however, we will explain how to join records from the same table.
A self-join joins data from the same table. In other words, it joins a table with itself. Records taken from the table are matched to other records from the same table. Why would you do this? You may need to compare a value with another value from the same row. You can’t do that unless you join the table to itself and compare the values as if they were in two separate records.
There is no dedicated operator for self-joins. Instead, self-joins use the INNER JOIN operator, one of the
OUTER JOIN operators, or a
CROSS JOIN. The difference is that a single table is listed as both the left and right table in the join.
Let’s look at an example based on the
employee table, shown below:
employee table stores employees’ ID numbers, first names, last names, salaries, and their manager’s ID number. With the exception of the boss (whose
ID = 1), everyone has a supervisor. Only the boss can have a NULL value in the
Let’s try a query that returns the immediate superior of each employee:
SELECT e.id, e.first_name, e.last_name, e.salary, m.first_name AS fname_boss, m.last_name AS lname_boss FROM employee e JOIN employee m ON e.manager_id = m.id ;
Note that we used the
employee table as both the left and right table in the join. To join data from the same table, we had to assign two aliases to the table name.
You must use aliases when performing self-joins. Because you’re joining columns from the same table, they will have the same names. The alias renames the columns so the database engine can execute your query. In the query above, we used the letter
e as the alias for the left table. In this case, we are using the
employee table for its employee records. We used the letter
m as the alias for the right table, which used the
employee table for its manager records. Even though we are using one table, SQL treats it as two different tables.
We selected employee data by selecting the
last_name columns in the
e “table”. The managers’ data was selected by referring to the
m “table”. Notice that the employee records were prefixed with an “e” and the manager records were prefixed with an “m”. We could now join them because we used an appropriate joining condition: the
manager_id column from the
e table was compared to the
id column from the
m. Records from both tables were matched using a
JOIN operator (or
INNER JOIN; remember, the “INNER” is implied).
Did you notice that the boss is not shown in the results? They have no match in the right “manager” table, so no data about the boss’s superior could be found.
The image below shows how records are matched during this query execution.
Let’s analyze the data in the output table:
In our table, the
JOIN did not return a record for John Watson. He does not have an identifier pointing to his superior’s id, so he is the boss. We would have to use a
LEFT JOIN instead of an
INNER JOIN if we wanted to return data for all employees, including the boss. In that case, records from the left table with no match in the right table would be shown as well:
SELECT e.id, e.first_name, e.last_name, e.salary, m.first_name AS fname_boss , m.last_name AS lname_boss FROM employee e LEFT JOIN employee m ON e.manager_id = m.id ;
When Should We Use Self-Joins?
Self-joins are commonly used in the following areas:
- Hierarchical relationships
- Sequential relationships
- Graph data
We’ll look at each of these individually.
Processing hierarchical data is one of the most frequent applications of self-joins. This occurs when there is an additional column pointing to an identifier in the same table, such as in our
employee table. In our case, the
manager_id column refers to (has the same value as) the
The example given above (the employee-manager relationship) is not the only situation where hierarchical data is used. A bill of materials for a car shares a similar structure. Each car consists of multiple components, such as the engine, the brake system, and the electrical system. Each of these components consists of smaller parts. A car’s electrical system can be broken into its components, such as the battery and the alternator (which can be broken down into even more car parts). This means that car parts constitute a group of hierarchical data. The
car table presented below shows some car parts data.
Yet another example of hierarchical data is the parent-child relationship. By storing this relationship, we can use a single table to house an entire family tree. We can then use a self-join to easily retrieve data about the ancestors of a given person.
The table below can help us quickly identify the eldest ancestor(s) in a family. These people do not have ancestor data in their records, which means they form the root of the family tree.
person table presented below illustrates this:
The query below retrieves the first and last name of each person along with the first and last name of their mother and father.
SELECT c.first_name, c.last_name, m.first_name AS fname_mother, m.last_name AS lname_mother FROM family c LEFT JOIN person m ON c.mather_id = m.id LEFT JOIN person f ON c.father_id = f.id ;
Note that we had to use a self-join three times (i.e. on three “tables” or aliases) to get the data about both the father and the mother. John Watson, Anne Brown, and Jacob Miller do not have data pointing to their ancestors.
The examples of hierarchical structures discussed above store data using a superior-inferior record approach. This enables us to present data as a tree structure. You’ll learn more about tree structures as you continue to learn about relational databases.
Sequential data can also benefit from self-joining techniques. For example, suppose you have records that describe the consecutive steps required to prepare a dish. All of the steps can be placed in a single table. Their order is determined based on the columns that point to the IDs of the previous and next records in the same table.
Let’s illustrate this using the
|1||Preheat an oven to 220 degrees C.||NULL||2|
|2||Peel four potatoes.||1||4|
|3||Toss sliced potatoes with oil.||4||6|
|4||Cut potatoes into slices.||2||3|
|5||Season the hot slices with salt and pepper.||6||NULL|
|6||Bake in the preheated oven for 20 minutes.||3||5|
As you see, in the current order these instructions do not make sense. But when we use a self-join to show the sequential relationship for each step in the recipe, we get:
- Preheat an oven to 220 degrees C.
- Peel four potatoes.
- Cut potatoes into slices.
- Toss sliced potatoes with oil.
- Bake in the preheated oven for 20 minutes.
- Season the hot slices with salt and pepper.
We also get some tasty oven fries!
Self-joins can also be used to show the relationships needed for graphs. A graph is a structure consisting of nodes connected to each other with edges (relations). One example of a graph is the road network between multiple cities.
Take a look at the drawing below.
This graph represents five cities that are connected with each other. Each arrow shows a road from one city to another. In this case, the cities are the nodes and the roads between them are the edges. We’re using two tables to store this data. The
city table stores the ID number and the name of each city. The
route table contains the route ID number, the starting city (the
from_city_id column) and the target city (the
Here is the “city” table:
And this is the “route” table:
We can use a self-join on the
city table, along with an
INNER JOIN of the two tables, to find out what routes exist between cities.
Take a look at the query.
SELECT c1.name AS from_city, c2.name AS to_city FROM city c1 JOIN route r ON c1.id = r.from_city_id JOIN city c2 ON c2.id = r.to_city_id ;
route tables were joined using the
id column from
city and the
from_city_id column from
route. At this point, we could only retrieve the name of the start city. In order to retrieve the name of the target city, we used a self-join on the
city table. This time, we compared the
id from the aliased
city table with the
to_city_id column in the
Here is the result:
Self-joins can be used in recursive subqueries that store graphs. More information about this topic can be found in our new course, “Recursive Queries”.
Using Self-Joins to Find Duplicate Values
Self-joins can also be used to identify duplicate values in a table. Let’s introduce an example table called
Each record in the table is different because of the
id column, which must always be unique. But this doesn’t prevent two rows from storing the same color name. We want to identify such cases and find the IDs of the duplicate color names. Let’s try this:
SELECT c1.id AS id1, c1.name AS color1, c2.id AS id2, c2.name AS color2 FROM color c1 JOIN color c2 ON c1.name = c2.name AND c1.id < c2.id ;
We were able to find duplicate color names because we self-joined records based on the color name. The second condition is used to skip identical records from both tables as well as the same pairs of records in reverse order.
Take a look at the query result:
It is now easy to see that there are duplicate values for blue and yellow.
If you would like to learn more about JOINs than we covered in this illustrated guide, check out our SQL Basic course. Remember, the best way to learn SQL JOINs – or anything else – is to learn by doing!