Why do aggregates in subqueries have to be aggregated again?

https://stackoverflow.com/questions/20642944

19-09-2022
|

Question

I hope I've put the query together correctly below. I've simplified from something much, much bigger. But my question should be clear enough from the example.

In the innermost query, nest1, let's say the results come back with three entries for invoice 123. The invoice was created (1), then two cash deposits were made (2, 3), the second of which (3), paid off the invoice. From that result set, I'm aggregating the sum. I'm also getting the paid_off_date, which only one of the rows will have.

I'm happy to be schooled on better ways to design this query, but what I don't understand is why I have to select the aggregates all the way out. Why, for example, when I select SUM(cash_deposit) AS 'cash_deposit' in nest2, do I also have to select SUM(cash_deposit) in the outermost query? Once it is aggregated in a subquery, why isn't the aggregate passed along? And why do I have to keep grouping by id at each level?

SELECT id, SUM(cash_deposit), MAX(paid_off_date), MAX(job_phase), MAX(cust_id)
FROM
(
    SELECT id AS id, SUM(cash_deposit) AS 'cash_deposit', MAX(paid_off_date) AS 'paid_off_date', MAX(job_phase) AS 'job_phase'
    FROM
    (
         SELECT id, cash_deposit, paid_off_date
         FROM invoice
         GROUP BY id
    ) nest1
    JOIN    job j
            ON nest1.id = j.id
    GROUP BY id
) nest2
JOIN    customer c
        ON c.invoice = nest2.id
GROUP BY id

Clarification

Thanks for any posts with improved versions of the query. I'm working with an existing query that produces unexpected results when I start trying to turn the nesting into joins. I'm going to keep playing with it, but I'd also really appreciate if someone could answer the question about aggregates and why they have to be repeated when nesting. In C#, I can sum up three numbers in one function and then pass the resulting value back. I don't have to keep summing the results. I need some help understanding how SQL subqueries function differently in that respect.

Possible Answer

One of my coworkers pointed out that in this case, It's because of my GROUP BY clauses. Since cash_deposit and job_phase, for example, aren't in the GROUP BY clause, they have to be aggregated.

Solution

It's forcing you to re-run the aggregates, because you are putting group by in each step of the query. If you remove group by id from all but the innermost query, you won't have to re-do the aggregation.

Try writing it like this:

SELECT id, cash_deposit, paid_off_date, job_phase, cust_id
FROM
(
    SELECT id AS id, SUM(cash_deposit) AS 'cash_deposit', MAX(paid_off_date) AS 'paid_off_date', job_phase
    FROM
    (
         SELECT id, cash_deposit, paid_off_date
         FROM invoice
         GROUP BY id
    ) nest1
    JOIN    job j
            ON nest1.id = j.id
) nest2
JOIN    customer c
        ON c.invoice = nest2.id

You could also do the two joins in one step now with the same exact result set, but I wanted to show you the minimum amount of changes necessary.

OTHER TIPS

SELECT i.id, SUM(i.cash_deposit), MAX(i.paid_off_date), MAX(j.job_phase), MAX(c.cust_id)
FROM invoice i
JOIN job j ON j.id = i.id
JOIN customer c ON c.invoice = i.id
GROUP BY id

select id, 
    sum(cash_deposit) as [cash_deposit],
    max(paid_off_date) as [paid_off_date],
    jp.[job_phase],
    cid.[cust_id]
from invoice i
cross apply (
    select max(job_phase) as [job_phase]
    from job j
    where j.id = i.id) jp
cross apply (
    select max(cust_id) as [cust_id]
    from customer c
    where i.id = c.invoice) cid

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow