Multiple aggregates with cross (lateral) join
-
06-02-2021 - |
Question
Following on from Combining array_agg and unnest, given this data:
key | a | b | c
------------------
1 | 0 | 1 | {1,2}
1 | 1 | 2 | {3}
1 | -1 | 3 | {2}
1 | 2 | 4 | {}
Executing this query:
SELECT
d.key,
min(d.a) AS a,
sum(d.b) AS b,
array_agg(DISTINCT x.c) AS c
FROM data AS d
CROSS JOIN LATERAL unnest(d.c) AS x(c)
GROUP BY d.key
Gives the unexpected result:
key | a | b | c
------------------
1 | -1 | 7 | {1,2,3}
What's going on here, and how to get the correct sum?
Answer performance
On my actual data (16642 rows, 1942 keys, 6 aggregates), I get these cost estimates for each suggested solution.
- a_horse_with_no_name option 1:
1761.12..49370.52
- a_horse_with_no_name option 2:
0.57..89214.72
- Erwin Brandstetter:
1761.12..49370.61
Solution
The unnesting generates 2 rows for (key = 1, a = 0, b = 1), and the cross join removes the row with the empty array.
So your group by operates on the following set:
key | a | b | c
----+----+---+--
1 | 0 | 1 | 1
1 | 0 | 1 | 2
1 | 1 | 2 | 3
1 | -1 | 3 | 2
One solution is to combine two group by queries, each one grouping n a different level:
select *
from (
select d1."key", min(d1.a), sum(d1.b)
from data d1
group by d1."key"
) m
join (
select "key", array_agg(DISTINCT x.c) AS c
from data d2
left join lateral unnest(d2.c) as x(c) on true
where x.c is not null
group by "key"
) a on a."key" = m."key";
Another approach is to only include the "first row" for each "unnested" group in the aggregates:
select d."key",
min(d.a) filter (where idx = 1 or idx is null),
sum(d.b) filter (where idx = 1 or idx is null),
array_agg(distinct x.c)
from data AS d
left join lateral unnest(d.c) with ordinality AS x(c,idx) on true
group by d."key";
with ordinality
returns the position of the unnested element in the original array. For the row with the empty array, that will be null.
OTHER TIPS
One problem with your query is that the CROSS JOIN
eliminates rows where unnest()
produces no rows (happens for the empty array {}
).
You could fix that with LEFT JOIN .. ON true
, but the other problem is that rows are multiplied where unnest()
returns multiple rows (happens for {1,2}
).
Thats how you get 7
for the sum: 1 + 1 + 2 + 3
.
Joining two separate subqueries might be simplest / fastest:
SELECT *
FROM (
SELECT key, min(a) AS min_a, sum(b) AS sum_b
FROM data
GROUP BY 1
) x
JOIN (
SELECT key, array_agg(DISTINCT e) AS dist_c
FROM data, unnest(c) e
GROUP BY 1
) y USING (key);
The plain join works because both subqueries have grouped by the same column - and as long as key
is NOT NULL
.
db<>fiddle here
Related: