An Illustrated Guide to the SQL Self-Join

What is an SQL self-join and how does it work? When should it be used? We’ll provide answers to those questions!

self join in sql

In SQL, we can combine data from multiple tables by using a JOIN operator. JOIN has several variants; we’ve already discussed CROSS JOIN, INNER JOIN, and OUTER JOIN. Most of the time, these operators join data from two or more different tables. In this article, however, we will explain how to join records from the same table.

A self-join joins data from the same table. In other words, it joins a table with itself. Records taken from the table are matched to other records from the same table. Why would you do this? You may need to compare a value with another value from the same row. You can’t do that unless you join the table to itself and compare the values as if they were in two separate records.

There is no dedicated operator for self-joins. Instead, self-joins use the INNER JOIN operator, one of the OUTER JOIN operators, or a CROSS JOIN. The difference is that a single table is listed as both the left and right table in the join.

Let’s look at an example based on the employee table, shown below:

id first_name last_name salary manager_id
1 John Watson 7550 NULL
2 Anne Brown 3500 1
3 James Black 3000 1
4 Scarlett Miller 2500 3
5 Ethan Davis 1200 3
6 Jacob Smith 2000 3

The employee table stores employees’ ID numbers, first names, last names, salaries, and their manager’s ID number. With the exception of the boss (whose ID = 1), everyone has a supervisor. Only the boss can have a NULL value in the manager_id column.

Let’s try a query that returns the immediate superior of each employee:

SELECT e.id, e.first_name, e.last_name,  e.salary, 
  m.first_name AS fname_boss, m.last_name AS lname_boss
FROM employee e
JOIN employee m ON  e.manager_id = m.id ;

Note that we used the employee table as both the left and right table in the join. To join data from the same table, we had to assign two aliases to the table name.

You must use aliases when performing self-joins. Because you’re joining columns from the same table, they will have the same names. The alias renames the columns so the database engine can execute your query. In the query above, we used the letter e as the alias for the left table. In this case, we are using the employee table for its employee records. We used the letter m as the alias for the right table, which used the employee table for its manager records. Even though we are using one table, SQL treats it as two different tables.

We selected employee data by selecting the id, first_name and last_name columns in the e “table”. The managers’ data was selected by referring to the m “table”. Notice that the employee records were prefixed with an “e” and the manager records were prefixed with an “m”. We could now join them because we used an appropriate joining condition: the manager_id column from the e table was compared to the id column from the m. Records from both tables were matched using a JOIN operator (or INNER JOIN; remember, the “INNER” is implied).

Did you notice that the boss is not shown in the results? They have no match in the right “manager” table, so no data about the boss’s superior could be found.

The image below shows how records are matched during this query execution.

Let’s analyze the data in the output table:

id first_name last_name salary fname_boss lname_boss
2 Anne Brown 3500 John Watson
3 James Black 3000 John Watson
4 Scarlett Miller 2500 James Black
5 Ethan Davis 1200 James Black
6 Jacob Smith 2000 James Black

In our table, the JOIN did not return a record for John Watson. He does not have an identifier pointing to his superior’s id, so he is the boss. We would have to use a LEFT JOIN instead of an INNER JOIN if we wanted to return data for all employees, including the boss. In that case, records from the left table with no match in the right table would be shown as well:

SELECT e.id, e.first_name, e.last_name,  e.salary, m.first_name 
   AS fname_boss  , m.last_name  AS lname_boss
FROM employee e
LEFT JOIN employee m ON  e.manager_id = m.id ;

When Should We Use Self-Joins?

Self-joins are commonly used in the following areas:

  • Hierarchical relationships
  • Sequential relationships
  • Graph data

We’ll look at each of these individually.

sql self join

Hierarchical Data

Processing hierarchical data is one of the most frequent applications of self joins. This occurs when there is an additional column pointing to an identifier in the same table, such as in our employee table. In our case, the manager_id column refers to (has the same value as) the id column.

The example given above (the employee-manager relationship) is not the only situation where hierarchical data is used. A bill of materials for a car shares a similar structure. Each car consists of multiple components, such as the engine, the brake system, and the electrical system. Each of these components consists of smaller parts. A car’s electrical system can be broken into its components, such as the battery and the alternator (which can be broken down into even more car parts). This means that car parts constitute a group of hierarchical data. The car table presented below shows some car parts data.

id name element_id
1 car NULL
2 electrical system 1
3 engine 1
4 battery 2
5 alternator 2

Yet another example of hierarchical data is the parent-child relationship. By storing this relationship, we can use a single table to house an entire family tree. We can then use a self-join to easily retrieve data about the ancestors of a given person.

The table below can help us quickly identify the eldest ancestor(s) in a family. These people do not have ancestor data in their records, which means they form the root of the family tree.

The person table presented below illustrates this:

id first_name last_name birth mother_id father_id
1 John Watson 1945 NULL NULL
2 Anne Brown 1950 NULL NULL
6 Scarlett Miller 1985 2 1
7 Jacob Miller 1982 NULL NULL
8 David Miller 2015 6 7

The query below retrieves the first and last name of each person along with the first and last name of their mother and father.

SELECT c.first_name, c.last_name, m.first_name AS fname_mother, m.last_name  AS lname_mother
FROM family c
LEFT JOIN person m ON  c.mather_id = m.id 
LEFT JOIN person f ON  c.father_id = f.id ;

Note that we had to use a self-join three times (i.e. on three “tables” or aliases) to get the data about both the father and the mother. John Watson, Anne Brown, and Jacob Miller do not have data pointing to their ancestors.

sql basics, sql queries, sql basics online,sql self join example

The examples of hierarchical structures discussed above store data using a superior-inferior record approach. This enables us to present data as a tree structure. You’ll learn more about tree structures as you continue to learn about relational databases.

Sequential Data

Sequential data can also benefit from self-joining techniques. For example, suppose you have records that describe the consecutive steps required to prepare a dish. All of the steps can be placed in a single table. Their order is determined based on the columns that point to the IDs of the previous and next records in the same table.

Let’s illustrate this using the instruction table:

id content previous_id next_id
1 Preheat an oven to 220 degrees C. NULL 2
2 Peel four potatoes. 1 4
3 Toss sliced potatoes with oil. 4 6
4 Cut potatoes into slices. 2 3
5 Season the hot slices with salt and pepper. 6 NULL
6 Bake in the preheated oven for 20 minutes. 3 5

As you see, in the current order these instructions do not make sense. But when we use a self-join to show the sequential relationship for each step in the recipe, we get:

  1. Preheat an oven to 220 degrees C.
  2. Peel four potatoes.
  3. Cut potatoes into slices.
  4. Toss sliced potatoes with oil.
  5. Bake in the preheated oven for 20 minutes.
  6. Season the hot slices with salt and pepper.

We also get some tasty oven fries!

Graphs

Self joins SQL can also be used to show the relationships needed for graphs. A graph is a structure consisting of nodes connected to each other with edges (relations). One example of a graph is the road network between multiple cities.

Take a look at the drawing below.

This graph represents five cities that are connected with each other. Each arrow shows a road from one city to another. In this case, the cities are the nodes and the roads between them are the edges. We’re using two tables to store this data. The city table stores the ID number and the name of each city. The route table contains the route ID number, the starting city (the from_city_id column) and the target city (the to_city_id column).

Here is the “city” table:

id name
1 Laredo
2 San Antonio
3 Austin
4 Waco
5 Houston

And this is the “route” table:

id from_city to_city_id
1 4 1
2 4 3
3 4 2
4 1 4
5 2 3
6 2 5
7 5 3

We can use a self-join on the city table, along with an INNER JOIN of the two tables, to find out what routes exist between cities.

Take a look at the query.

SELECT c1.name AS from_city, c2.name AS to_city 
FROM city c1
JOIN route r ON c1.id = r.from_city_id
JOIN city c2  ON c2.id = r.to_city_id ;

The city and route tables were joined using the id column from city and the from_city_id column from route. At this point, we could only retrieve the name of the start city. In order to retrieve the name of the target city, we used a self-join on the city table. This time, we compared the id from the aliased city table with the to_city_id column in the route table.

Here is the result:

from_city to_city
Waco Laredo
Waco Austin
Waco San Antonio
Laredo Waco
San Antonio Austin
San Antonio Houston
Houston Austin

Self-joins can be used in recursive subqueries that store graphs. More information about this topic can be found in our new course, “Recursive Queries”.

Using Self-Joins to Find Duplicate Values

Self-joins can also be used to identify duplicate values in a table. Let’s introduce an example table called color:

id name
1 blue
2 green
3 yellow
4 blue
5 yellow

Each record in the table is different because of the id column, which must always be unique. But this doesn’t prevent two rows from storing the same color name. We want to identify such cases and find the IDs of the duplicate color names. Let’s try this:

SELECT c1.id AS id1, c1.name  AS color1, c2.id AS id2, c2.name AS color2
FROM color c1
JOIN color c2 ON c1.name = c2.name AND c1.id < c2.id   ; 

We were able to find duplicate color names because we self-joined records based on the color name. The second condition is used to skip identical records from both tables as well as the same pairs of records in reverse order.

Take a look at the query result:

id1 color1 id2 color2
1 blue 4 blue
3 yellow 5 yellow

It is now easy to see that there are duplicate values for blue and yellow.

Learn More

If you would like to learn more about JOINs than we covered in this illustrated guide, check out our SQL Basic course. Remember, the best way to learn SQL JOINs – or anything else – is to learn by doing!

sql basics, sql queries, sql basics online, sql examples, what is sql, sql commands, sql pdf, sql tutorial for beginners, basic sql statements

Data Science Writer @ Vertabelo

GET ACCESS TO EXPERT SQL CONTENT!