Get to Know the Power of SQL Recursive Queries

recursive queries

Level: Advanced

Most commonly, the SQL queries we run on a database are quite simple. Well, that depends on your role, of course. Analysts in data warehouses retrieve completely different sorts of information using (very often) much more complicated queries than software engineers creating CRUD applications.

However, sometimes it’s simpler or more elegant to run a query that is a little bit more sophisticated without needing further data processing in the code. One way to accomplish this is with a SQL feature called recursive queries.

Let’s take a real-life example. Let’s assume we’ve got a database with a list of nodes and a list of links between them (you can think of them as cities and roads). Our task is to find the shortest path from node 1 to node 6.

graph

 

Well, in fact, it’s nothing more than graph traversal. The very first idea an average software engineer may have would be to get all rows from both tables and implement a DFS (Depth-First Search) or BFS (Breadth-First Search) algorithm in his/her favorite programming language. It’s not a bad idea (if you like coding 😉 ) but you can do it with a single SQL query!

WITH Clause – Common Table Expressions to the Rescue!

Note: all examples are written for PostgreSQL 9.3; however, it shouldn’t be hard to make them usable with a different RDBMS.

If your RDBMS is PostgreSQL, IBM DB2, MS SQL Server or Oracle (only from 11g release 2), you can use WITH queries, known as Common Table Expressions (CTEs). Generally speaking, they allow you to split complicated queries into a set of simpler ones which makes a query easier to read. The structure of a WITH clause is as follows:

WITH [cte_name] AS (
	[cte_term])
SELECT ... FROM [cte_name];

For example, we might want to get at most 3 nodes, whose total length of outgoing links is at least 100 and at least one single outgoing link has a length bigger than 50. You don’t have to fully understand the following example, just look at the query structure. Instead of writing a query like this:

SELECT * FROM node 
WHERE id IN (
    SELECT distinct node_from_id 
    FROM link
    WHERE length > 50 
      AND node_from_id IN (
        SELECT node_from_id
        FROM link 
        GROUP BY node_from_id 
        HAVING SUM(length) >= 100 
        ORDER BY SUM(length) DESC
        LIMIT 3));

we can write it like this:

WITH 
longest_outgoing_links AS (
    SELECT node_from_id
    FROM link 
    GROUP BY node_from_id 
    HAVING SUM(length) >= 100 
    ORDER BY SUM(length) DESC
    LIMIT 3),
nodes_from_longest_outgoing_links AS (
    SELECT distinct node_from_id 
    FROM link
    WHERE length > 50 
      AND node_from_id IN (SELECT * FROM longest_outgoing_links)
)
SELECT * FROM node 
WHERE id IN (SELECT * FROM nodes_from_longest_outgoing_links);

The queries are defined separately, which makes it a lot easier to understand when implementing much more complicated examples.

An important point: CTEs may also have a recursive structure:

WITH RECURSIVE [cte_name] (column, ...) AS (
    [non-recursive_term]
    UNION ALL
    [recursive_term])
SELECT ... FROM [cte_name];

How does it work?

It’s quite simple. In the first step a non-recursive term is evaluated. Next, for every result row of the previous evaluation, a recursive term is evaluated and its results are appended to the previous ones. The recursive term has access to results of the previously evaluated term.

An Easy Example #1

Let’s take a look at a simple example – multiplication by 2:

WITH RECURSIVE x2 (result) AS ( 
    SELECT 1 
    UNION ALL 
    SELECT result*2 FROM x2) 
SELECT * FROM x2 LIMIT 10; 

 result 
-------- 
      1 
      2 
      4 
      8 
     16 
     32 
     64 
    128 
    256 
    512 
(10 rows)

In the first step, the only result row is “1.” Then, there is UNION ALL with a recursive term. 1 is multiplied by 2, which results in one result row – “2”. For now, there are two result rows: 1, 2. However, the last term evaluation produced only one row – “2” – and it will be passed to the next recursive step. 2 is multiplied by 2, which returns “4,” and so on… In this example, recursion would be infinite if we didn’t specify the LIMIT clause.

An Easy Example #2

Let’s do another quick (typically academic) example – the Fibonacci sequence. It’s defined as follows:

F(n) = 0 , n = 0
1 , n = 1
F(n-1) + F(n-2) , n > 1
F(0) F(1) F(2) F(3) F(4) F(5) F(6) F(7) F(8) F(9)
0 1 1 2 3 5 8 13 21 34

Such a function can be defined in SQL using the WITH clause:

WITH RECURSIVE fib(f1, f2) AS ( 
    SELECT 0, 1 
    UNION ALL 
    SELECT f2, (f1+f2) FROM fib ) 
SELECT f1 FROM fib LIMIT 10;
 f1 
---- 
  0 
  1 
  1 
  2 
  3 
  5 
  8 
 13 
 21 
 34 
(10 rows)

I hope the concept is now clear.

Graph Traversal

Let’s go back to our example with a graph traversal. What we want to do is to find the shortest path between two nodes. This is how DB structure looks like:

 

Just to make our SQL more readable, let’s define a simple view node_links_view joining node with link and with node again:

CREATE VIEW node_links_view AS 
SELECT 
    n1.id AS node_id, 
    n1.name AS node_name, 
    n2.id AS destination_node_id, 
    n2.name AS destination_node_name, 
    l.length AS link_length 
FROM 
    node n1 
    JOIN link l ON (n1.id = l.node_from_id) 
    JOIN node n2 ON (l.node_to_id = n2.id);

Now, our model structure looks as follows:

 

What do we need as a result of the query? We want an exact path between the nodes and its entire length. In order to exclude any cycles in the graph, we also need a flag to identify if the last node was already visited. So, the first part of CTE definition will look like this:

WITH RECURSIVE search_path (path_ids, length, is_visited) AS 

In the first step we have to get all links from the beginning node:

    SELECT 
        ARRAY[node_id, destination_node_id], 	-- path_ids
        link_length,  				-- length
        node_id = destination_node_id  		-- is_visited
    FROM 
        node_links_view;

It’s our non-recursive term.

Now, we’ll go recursively starting from the last visited node, which is the last element in an array:

    SELECT 
        path_ids || d.destination_node_id,  	-- path_ids
        f.length + d.link_length, 		-- length
        d.destination_node_id = ANY(f.path_ids)	-- is_visited
    FROM 
        node_links_view d, 
        search_path f 
    WHERE 
        f.path_ids[array_length(path_ids, 1)] = d.node_id 
        AND NOT f.is_visited;

How does it work? Look at the FROM and WHERE clauses. The query gets the next rows from node_link_view which start at the last node of the previous evaluation that didn’t finish with a cycle. It returns an array extended with a destination node of the link, a sum of lengths and a flag determining if this node was previously visited. This recursive part of the query will be executed as long as there are any links to non-visited nodes.

baner sql creating tables

So, here is a complete SQL query retrieving all paths from the node with id=1 to the node with id=6:

WITH RECURSIVE search_path (path_ids, length, is_visited) AS 
( 
    SELECT 
        ARRAY[node_id, destination_node_id], 
        link_length, 
        node_id = destination_node_id 
    FROM 
        node_links_view 

    UNION ALL 
    SELECT 
        path_ids || d.destination_node_id, 
        f.length + d.link_length, 
        d.destination_node_id = ANY(f.path_ids) 
    FROM 
        node_links_view d, 
        search_path f 
    WHERE 
    f.path_ids[array_length(path_ids, 1)] = d.node_id 
    AND NOT f.is_visited 
) 
SELECT * FROM search_path 
WHERE path_ids[1] = 1 AND path_ids[array_length(path_ids, 1)] = 6 
ORDER BY length;

As a result we get all paths from node 1 to node 6 ordered by total path length:

   path_ids    | length | is_visited 
---------------+--------+------------ 
 {1,3,2,5,6}   |    140 | f 
 {1,2,5,6}     |    150 | f 
 {1,3,4,5,6}   |    150 | f 
 {1,3,4,6}     |    190 | f 
 {1,2,3,4,5,6} |    200 | f 
 {1,2,3,4,6}   |    240 | f 
(6 rows) 

The shortest path is the first one, so we could add a LIMIT clause to get just one result.

Remember that we created the external view – node_links_view – to make the SQL easier to read? We may do the same with a CTE:

WITH RECURSIVE search_path (path_ids, length, is_visited) AS 
( 
    SELECT 
        ARRAY[node_id, destination_node_id], 
        link_length, 
        node_id = destination_node_id 
    FROM node_links_select 

    UNION ALL 
    SELECT 
        path_ids || d.destination_node_id, 
        f.length + d.link_length, 
        d.destination_node_id = ANY(f.path_ids) 
    FROM node_links_select d, 
        search_path f 
    WHERE 
        f.path_ids[array_length(path_ids, 1)] = d.node_id 
        AND NOT f.is_visited 
), 

node_links_select AS ( 
    SELECT 
        n1.id AS node_id, 
        n1.name AS node_name, 
        n2.id AS destination_node_id, 
        n2.name AS destination_node_name, 
        l.length AS link_length 
    FROM 
        node n1 
        JOIN link l ON (n1.id = l.node_from_id) 
        JOIN node n2 ON (l.node_to_id = n2.id) 
) 

SELECT * FROM search_path 
WHERE path_ids[1] = 1 AND path_ids[array_length(path_ids, 1)] = 6 
ORDER BY length;

Note: this example is by no means optimized! Its purpose is just to show you how to use recursive CTEs.

Oracle (Prior to 11g Release 2) – Hierarchical Queries

Up to Oracle 11g release 2, Oracle databases didn’t support recursive WITH queries. In Oracle SQL these kinds of queries are called hierarchical queries and they have completely different syntax, but the idea is quite the same. You can read more about hierarchical queries in the Oracle documentation.

sql cheat sheet

MySQL – 33 Months of Silence…

Bad news for MySQL users. It doesn’t support WITH clause though there were many feature requests asking for it. Probably the first one was this one which had been ignored for 33 months and hasn’t been resolved since January 2006…

I hope the idea of recursive queries is now clear to you. Enjoy recursively enjoying recursive queries!

Chief Technology Officer @ e-point

  • Vic

    Hi Michal, great article and good explanation. You note that the example is written for postgreSQL but ‘it shouldn’t be hard to make them usable with a different RDBMS’. I am trying to implement the graph traversal in SQLite (unfortunately for this application I don’t have the liberty of changing to a different database), which does not support the ARRAY object. Do you know how it would be possible to overcome this limitation?

    • Michał Kołodziejski

      Hi Vic. SQLite supports very few datatypes. I think you could simulate ARRAY with a TEXT type.
      If you add something like ‘:id:’ to the path_ids, you’d end up with such a “list”: ‘:1::3::2::5::6:’
      Then you can test if a node was already visited like this: path_ids like ‘%:’ || d.destination_node_id || ‘:%’.
      You can specify the first and the last node on the list in the same way.
      This will probably be much slower than using ARRAY type in PostgreSQL, but it’s just a workaround. I believe that for small datasets it should work quite well.
      Hope this helps.

  • jmadblast kira

    Hi there, this is a great post, I wonder if it could also be used in this situation.. I have a table which has two columns named FromCol and ToCol, and this is the following sample data: (1,2), (1,3), (3,4), (5,10), (5,6), (7,5), (1,7), (2,8), (8,9), (9,11), (9,10), (10,12), (10,14), (12,13), (14,15)… I would like to get the shortest possible path from FromCol=1 to ToCol=15.. with the given sample data, I expect the output to be: (1,7),(7,5),(5,10),(10,14),(14,15)…
    Thank you in advance.

    • Michał Kołodziejski

      Hi,
      To accomplish your task, you just need to slightly modify the code provided in the article. All you need to change is to assume the distance between nodes to be always 1 and modify the expected output format.

      For PostgreSQL the code would look like this:
      WITH RECURSIVE search_path (path_ids, expected_output, length, is_visited) AS
      (
      SELECT
      ARRAY[fromid, toid],
      ‘(‘ || fromid || ‘, ‘ || toid || ‘)’,
      1,
      fromid = toid
      FROM tmpTbl

      UNION ALL
      SELECT
      path_ids || d.toid,
      f.expected_output || ‘, ‘ || ‘(‘ || d.fromid || ‘, ‘ || d.toid || ‘)’,
      f.length + 1,
      d.toid = ANY(f.path_ids)
      FROM tmpTbl d,
      search_path f
      WHERE
      f.path_ids[array_length(path_ids, 1)] = d.fromid
      AND NOT f.is_visited
      )
      SELECT * FROM search_path
      WHERE path_ids[1] = 1 AND path_ids[array_length(path_ids, 1)] = 15
      ORDER BY length;

      This produces the output:

      path_ids | expected_output | length | is_visited
      ——————–+—————————————————–+——–+————
      {1,7,5,10,14,15} | (1, 7), (7, 5), (5, 10), (10, 14), (14, 15) | 5 | f
      {1,2,8,9,10,14,15} | (1, 2), (2, 8), (8, 9), (9, 10), (10, 14), (14, 15) | 6 | f
      (2 rows)

      As you can see, there’s your expected result on the first position. If you change the last SELECT to gather only ‘expected_output’ column and add ‘LIMIT 1’ clause, you’ll get just what you need.

  • Vinay Kumar

    The table structure is not visible in this page, how can we get the graph data into a table?
    node_links_view = ?

GET ACCESS TO EXPERT SQL CONTENT!