Converting Subqueries to Joins

join subquery, join subquery mysql, left join subquery mysql, sql server join subquery, join subquery oracle, how to join two subqueries in sql, mysql inner join subquery, left join subquery oracle, how to join a table with a subquery

Not all queries are alike, especially in terms of performance. In this article, we’ll look at how you can convert SQL subqueries to joins for improved efficiency.

When should I use SQL subqueries?

Great question! Unfortunately, there’s no concrete answer. SQL beginners tend to overuse subqueries. Typically, once they find that SQL construction works in one situation, they try to apply that same approach to other situations. It’s only natural.

However, in some cases a subquery can be replaced with a more efficient JOIN. If you can avoid a subquery and replace it with a JOIN clause, you should do so without hesitation. But of course, in some cases, using a subquery is the only way to solve a data question. In this article, I’ll show examples of both cases: when a subquery is a must and when a subquery should be avoided and replaced by a JOIN.

The data

Before getting to the examples, let’s briefly look at the sample database we will use. The database has two tables representing the production statistics of a fictional apple farm company named EverRed. The company has three farms.

The first table is current_year_production, which contains information about the number of apples produced in the current year by each farm, as well as the area and number of trees on each farm. The second table, production_history, stores past production information for each farm. Below are some sample data from these two tables.

current_year_production

farm_id area_m2 farm_name number_of_trees production_in_kg
100 10000 The Paradise 240 4400
101 15000 Evergreen 300 6200
102 20000 Red Delicious 580 9300

production_history

farm_id year production_in_kg price_ton
100 2017 4100 1200
101 2017 5800 1200
102 2017 9400 1200
100 2016 3900 1300
101 2016 6400 1300
102 2016 9100 1300

Example: replacing a subquery with a JOIN

Suppose you’re an SQL data analyst working at EverRed. The owner of the company wants you to obtain the names of the farms where the company is producing more apples in the current year than in the previous year (2017).

Solution using a subquery:

SELECT farm_name, 
FROM    current_year_production CYP
WHERE  production_in_kg > ( SELECT production_in_kg 
                                   FROM production_history PH  
                                   WHERE PH.farm_id = CYP.farm_id
                                   AND year = 2017
                            ) 

Solution using a JOIN clause:

SELECT farm_name, 
FROM    current_year_production CYP 
        JOIN  production_history PH  ON  PH.farm_id = CYP.farm_id 
WHERE  PH.year = 2017
AND       CYP.production_in_kg > PH.production_in_kg 

The difference between these two approaches is in performance. While the JOIN clause in the second example needs to be executed only once, the subquery in the first example will be executed once per farm. In this case, we only have three farms, so the difference is negligible. But what if you worked for a larger company that has 10,000 global farms? The subquery would need to be executed 10,000 times. Clearly, a subquery is inefficient for our purposes here.

Moreover, in a test database with only a few farms, both queries execute with an acceptable response time; however, when we move to a productive database, (where the data volume is usually much higher), the response time of the subquery approach will increase significantly, while response time of JOIN approach will remain stable.

The result of both previous equivalent queries is:

farm_name
The Paradise
Evergreen

Example: when subqueries are the only way to go

Let’s now suppose the owner of the company, after reading the results you delivered in the previous query, asks you to obtain the names of the farms that are producing more apples per square meter this year than the historical average. This sounds complex, but it’s easier than it seems.

Solution using a subquery:

SELECT farm_name, 
               production_in_kg / area AS “production_per_meter”
FROM    Current_year_production 
WHERE  production_in_kg / area 
                > ( SELECT AVG(PH.production_in_kg / CYP.area) 
                    FROM    production_history PH 
                        JOIN  Current_year_production CYP  
                        ON  PH.farm_id = CYP.farm_id 
                    ) 

We can’t replace this subquery with a JOIN because we don’t have a table with the average previously calculated. In other words, we need to first calculate the historical average. And to do that, we need a GROUP BY, which can break the one-to-one relation needed for a JOIN. Another point to note is that the metric “apples per square meter” is obtained with the following expression:

	production_in_kg / area

We used the “apples per square meter” metric because we need some way to compare the productivity of the different farms and rank them. The total “production_in_kg” of a farm is not a comparable value—because it’s likely, for instance, that the biggest farm will have a better production_in_kg. So we divide “production_in_kg” by the area of each farm to standardize the values and create a comparable metric.

We find that the historical average production per square meter is 0.42. Then the result of the previous query is:

farm_name production_per_meter
The Paradise 0.44
Red Delicious 0.47

Example: when JOINs and subqueries are equally efficient

As a last data question, let’s try to obtain the years when the company produced fewer apples than in the current year. We can write this query using two different approaches.

Solution using a subquery:

SELECT year , sum(production_in_kg) 
FROM    production_history PH
GROUP by year
HAVING sum(production_in_kg) < ( SELECT sum(production_in_kg)
                                                          FROM current_year_production
                                                       )

Solution using a JOIN clause:

SELECT year , sum(PH.production_in_kg) 
FROM production_history PH 
           JOIN current_year_production CYP ON PH.farm_id = CYP.farm_id
GROUP BY year
HAVING sum(PH.production_in_kg) < sum(CYP.production_in_kg)

You can see that both of these queries are really similar; the main difference is in the JOIN clause and the subquery. In this case, both queries are equally efficient—the subquery is executed one time in the HAVING clause, so there is no performance issue.

Try it yourself!

To wrap things up, it’s important to note that subqueries and JOINs are both really important resources for an SQL developer. We saw examples where we can replace a subquery with a JOIN and examples where we cannot do such a replacement. And sometimes, subqueries and JOINs are equally efficient.

But how do you know when to use a subquery and when to use a JOIN? In all honesty, the only way to develop your intuition is to regularly solve SQL exercises. If you’re looking to sharpen your SQL skills, our SQL Practice Set offers 88 practice problems for a comprehensive review.

Stay tuned for more articles!

Ignacio is a database consultant from Buenos Aires, Argentina. He’s worked for 15 years as a database consultant for IT companies like Informix and IBM. These days, he teaches databases at Sarmiento University and works as a PostgreSQL independent SQL consultant. A proud father of four kids with 54 years in his backpack, Ignacio plays soccer every Saturday afternoon, enjoying every match as if it’s his last one.

GET ACCESS TO EXPERT SQL CONTENT!