How to improve poor performance with two joined CTE expressions on hierarchical data?

https://dba.stackexchange.com/questions/281851

12-03-2021
|

Question

I have a table of hierarchical data where I want to select entries that, 1) only include matches that are descendants of a given parent, and 2) returns the full hierarchical path of the matched entries.

I am doing this with two CTEs: one that gets all recursive children from the specified parent, and another that selects matches and gets all the recursive parents of those matches. The main query then joins and groups these two CTEs.

However, the performance is not good. Running either CTE on its own (i.e., only getting enties that are descendants, or only getting path hierarchies) returns the results immediately, but both CTEs together (running on a table with millions of entries) can take minutes to hours, depending on the number of string matches, even a single match. It also won't return the first found entry until all are found.

Is it possible to improve the performance of the below query, or at the very least, make it so it starts returning found results immediately?

The table has this structure:

CREATE TABLE [Files]
(  [fileId]       INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL
,  [name]         TEXT    NOT NULL
,  [parentFileId] INTEGER NULL
,  UNIQUE
   (  [name]
   ,  [parentFileId]
   )
);
CREATE INDEX [Files_parentFileId] ON [Files]([parentFileId]);
CREATE INDEX [Files_name]         ON [Files]([name]);

And the fastest query I've been able to produce is:

-- The [@Context] CTE lists all the fileIds that are descendants
--  of the input @rootFileId. The query should only return
--  matches that are in this list of fileIds
WITH           [@Context]
AS             (
   -- Starting with the ID of the root entry..
   SELECT      @rootFileId           [fileId]
   UNION
   -- ..build up all the IDs of the children
   SELECT      [Files].[fileId]
   FROM        [Files]
   JOIN        [@Context]
      ON       [Files].[parentFileId] = [@Context].[fileId]
               )
-- The [@FilePath] CTE selects each entry that matches the given
--  input @name and all entries that are its parents
,              [@FilePath]
AS             (
   -- Starting with the entries that match the given @name..
   SELECT      -1                    [depth]
   ,           [Files].[fileId]
   ,           [Files].[parentFileId]
   ,           [Files].[name]
   FROM        [Files]
   WHERE       [name]                LIKE @name
   UNION
   -- ..build up all the IDs of the parents
   SELECT      [@FilePath].[depth] - 1
   ,           [@FilePath].[fileId]  -- for grouping entries by the matched entry's id
   ,           [Files].[parentFileId]
   ,           [Files].[name]
   FROM        [Files]
   JOIN        [@FilePath]
      ON       [Files].[fileId]      = [@FilePath].[parentFileId]
               )
-- Group all the parent entries into one column with each found entry
--  and exclude any entry that wasn't on the [@Context] list
SELECT         [@FilePath].[fileId]
,              [@FilePath].[name]
, GROUP_CONCAT([@FilePath].[parentFileId]) [pathFileId]
, GROUP_CONCAT([@FilePath].[name], '\')    [pathname]
FROM           [@FilePath]
JOIN           [@Context]
   ON          [@Context].[fileId]   = [@FilePath].[fileId]
GROUP BY       [@FilePath].[fileId]
ORDER BY       [@FilePath].[depth];

Solution

Welcome to the forum, a good first question with included DDL that makes it easy to reproduce the scenario. For future posts, also include some sample data that one can use. You may even use DB<>Fiddle or a similar site. I have added a slightly modified version of your question at Your Fiddle. The site would not let me add more than ~90000 nodes in the tree, but the answer is returned within seconds for that amount of data at least.

I have removed quoted identifiers and changed them slightly. There are several reasons to avoid quotes, they make the code harder to read, and also make it case sensitive. Since SQL is case insensitive I also avoid camel-case. I also changed your index. Enough about that, here is a walkthrough of the solution:

CREATE TABLE Files
(  file_id       INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL
,  file_name         TEXT    NOT NULL
,  parent_file_id INTEGER NULL
,  UNIQUE
   (  file_name
   ,  parent_file_id
   )
);

CREATE UNIQUE INDEX files_ix1 ON Files (parent_file_id, file_id, file_name);

First step, get the descendants of a certain node. Similar to what you do in your query:

with rec (depth, file_id, parent_file_id, file_name) as ( 
    select 0, file_id, parent_file_id, file_name 
    from files where file_id = 57
    union all
    select depth+1, f.file_id, r.file_id, f.file_name
    from files f
    join rec r
        on f.parent_file_id = r.file_id
)

Second, get the ancestors of the root from previous question. I have added a depth attribute that makes it easy to identify this node:

), root_path (depth, file_id, parent_file_id, file_name) as (
    select depth, file_id, parent_file_id, file_name 
    from rec where depth = 0
    union all
    select r.depth-1, f.file_id, f.parent_file_id, f.file_name
    from root_path r
    join files f
        on r.parent_file_id = f.file_id
)

Your counter-part is likely to blame for performance issues since you traverse the whole tree, which you then have to join.

For the third step, we combine the ancestors with the descendants. I have used a union to get rid of the node that exists in both, you can probably increase efficiency by filtering one of the legs and use union all instead:

tree (depth, file_id, parent_file_id, file_name) as (        
    select depth, file_id, parent_file_id, file_name from root_path
    union
    select depth, file_id, parent_file_id, file_name from rec
)

Now tree contains all relevant nodes, so we can recursively construct the transitive closure of the tree:

closure (file_id, parent_file_id, file_name, file_path) as (
    select file_id, parent_file_id, file_name, '/'||file_name 
    from tree
    where depth = (select min(depth) from tree)
    union all
    select t.file_id, t.parent_file_id, t.file_name
         , c.file_path || '/' || t.file_name 
    from tree t
    join closure c
        on t.parent_file_id = c.file_id
)
select * from closure

If traversing and aggregating information over the tree, is something you do a lot, you may consider some kind of Incremental Evaluation System (IES). The idea is that the recursive part above is already pre-calculated when you need it. There are basically 3 different versions of this that you can google for more info

Nested Sets, though not the inventor, Joe Celko has written a lot of articles and books about it. The basic idea is that you have an upper and lower bound on each node, any node that embraces another node is an ancestor of that node. The main criticism of the method is that small changes of the tree may be expensive since it can cause a renumbering of large parts of the tree.
Materialized path, you basically store the ancestor part in each node and use like predicate to find ancestors/descendants. Choose token separator wisely or you may end up with false results.
Use a separate ancestors table, often referred to as Transitive Closure (TC) table. You can maintain the ancestor's table with triggers on the tree table. I have used this for trees with millions of rows without problems. You need to be aware that the ancestor table typically is 5-10 times the cardinality of the tree table.

EDIT: FWIW I tried the query against my local Db2 11.5.5 with 9 million rows. The exact same query returns 29529 rows in ~20 seconds. If I reduce the result by choosing a start further down the tree I get:

file_id = 57      -> 29529 rows, 20 seconds
file_id = 5703    -> 1106 rows,  0.38 seconds
file_id = 57036   -> 132 rows,   0.15 seconds
file_id = 571036  -> 30 rows,    0.08 seconds

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange