How to aggregate over a “multi source” with recursive, in postgres?
-
02-03-2021 - |
Question
Imagine I have a table "departments" like this:
name TEXT
parent_department TEXT (nullable)
I also have a "budgets" table like this:
department_name TEXT
budget INTEGER
I need to "traverse" this departments table from "leaf to root", following the parent department column. The end result is that I need to sum up the budgets of each department, but the twist is that the "budgets" table that I'll be using at any given time is very sparse (10s to 100s of rows), while departments is very dense (100m rows).
Normally, I would approach this by creating a "with recursive" CTE that "flattens" the department relationship into a temporary table of just "root department", "leaf department". Then I would join that table on leaf department to budgets, and group by root. The problem is that the "department" table is FAR too big now (the analogy to departments breaks down here a little bit I admit, no company has >100m departments).
Concretely, if I had departments like this (child -> parent):
C -> B
B -> A
A -> NULL
E -> D
D -> NULL
F -> G
G -> NULL
And budgets like this:
A 1
C 3
E 5
I would want to get as output:
A 4
D 5
And ideally, F and G, for example, would never be touched, because there are hundreds of millions of rows like F and G that shouldn't come up. But in my current query, they do. How can I fix this so that I only "traverse" the departments that I actually need to?
Solution
It took a while, but I figured it out:
WITH RECURSIVE
initial AS (
SELECT
department_name,
department_name,
FALSE
FROM
budgets
GROUP BY department_name
),
tmp (child, parent, root) AS (
SELECT
*
FROM
initial
UNION ALL
SELECT
tmp.child,
CASE WHEN
departments.parent_department IS NULL THEN tmp.parent
ELSE departments.parent_department
END,
departments.parent_department IS NULL
FROM
tmp
INNER JOIN departments
ON tmp.parent = departments.name
WHERE
NOT tmp.root
)
SELECT
tmp.parent,
SUM(budget)
FROM
tmp
INNER JOIN budgets
ON tmp.child = budgets.department_name
WHERE
tmp.root
GROUP BY tmp.parent