Sum all unique values from previous dates
04-03-2021
Question
Let's say, for simplicity's sake, I have the following table:
id  amount  p_id  date
------------------------------------------------
1   5       1     2020-01-01T01:00:00
2   10      1     2020-01-01T01:10:00
3   15      2     2020-01-01T01:20:00
4   10      3     2020-01-01T03:30:00
5   10      4     2020-01-01T03:50:00
6   20      1     2020-01-01T03:40:00
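To reproduce the numbers, the sample data can be loaded with a setup along these lines. This is only a sketch: the table name tbl matches the queries in the answer below, but the column types are assumptions, since the question does not state them.

```sql
-- Assumed schema: the question does not give column types.
CREATE TABLE tbl (
    id     integer   PRIMARY KEY,
    amount integer   NOT NULL,
    p_id   integer   NOT NULL,
    date   timestamp NOT NULL
);

-- Sample rows exactly as listed in the table above.
INSERT INTO tbl (id, amount, p_id, date) VALUES
    (1,  5, 1, '2020-01-01T01:00:00'),
    (2, 10, 1, '2020-01-01T01:10:00'),
    (3, 15, 2, '2020-01-01T01:20:00'),
    (4, 10, 3, '2020-01-01T03:30:00'),
    (5, 10, 4, '2020-01-01T03:50:00'),
    (6, 20, 1, '2020-01-01T03:40:00');
```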
Here's a sample response I want:
{
"2020-01-01T01:00:00": 25, -- this is from adding records with ids: 2 and 3
"2020-01-01T03:00:00": 55 -- this is from adding records with ids: 3,4,5 and 6
}
I want to get the total (sum(amount)) of all unique p_id's grouped by the hour.
The row chosen per p_id is the one with the latest date. So for example, the first value in the response above doesn't include id 1 because the record with id 2 has the same p_id and the date on that row is later.
The one tricky thing is I want to include the summation of all the amount per p_id if their date is before the hour presented. So for example, in the second value of the response (with key "2020-01-01T03:00:00"), even though id 3 has a timestamp in a different hour, it's the latest for that p_id 2 and therefore gets included in the sum for "2020-01-01T03:00:00". But the row with id 6 overrides id 2 with the same p_id 1.
In other words: always take the latest amount for each p_id so far, and compute the sum for every distinct hour found in the table.
Solution
Simple case
Getting the sum for a single, given hour is relatively simple:
SELECT timestamp '2020-01-01 03:00:00', sum(amount)
FROM  (
    SELECT DISTINCT ON (p_id) amount
    FROM   tbl
    WHERE  date < timestamp '2020-01-01 03:00:00' + interval '1h'
    ORDER  BY p_id, date DESC
    ) sub;
DISTINCT ON is typically substantially faster than a subquery or CTE with row_number().
For many rows per p_id, there is potential to optimize some more with smart indexing and a corresponding query.
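As one possible reading of "smart indexing and a corresponding query" (an assumption, since the linked material is not reproduced here): a multicolumn index on (p_id, date DESC) can let Postgres fetch the latest row per p_id with one index lookup each. The index name and the LATERAL formulation below are illustrative and not taken from the answer.

```sql
-- Hypothetical index name; supports "latest row per p_id" lookups.
CREATE INDEX tbl_p_id_date_idx ON tbl (p_id, date DESC);

SELECT sum(l.amount) AS total
FROM  (SELECT DISTINCT p_id FROM tbl) p   -- a separate table listing all p_id would be cheaper
CROSS  JOIN LATERAL (
    SELECT t.amount
    FROM   tbl t
    WHERE  t.p_id = p.p_id
    AND    t.date < timestamp '2020-01-01 03:00:00' + interval '1 hour'
    ORDER  BY t.date DESC                 -- latest row first
    LIMIT  1
    ) l;
```

Whether this beats DISTINCT ON depends on how many rows there are per p_id, so it is worth comparing both with EXPLAIN ANALYZE on real data.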
Special running sum for every hour
You seem to want the sum for every distinct hour in the table.
You can simply distill distinct hours, and apply the simple solution to every one of them in a LATERAL subquery:
WITH grid AS (  -- or some other source?
    SELECT DISTINCT date_trunc('hour', date) AS hour
    FROM   tbl
    )
SELECT *
FROM   grid
CROSS  JOIN LATERAL (
    SELECT sum(amount)
    FROM  (
        SELECT DISTINCT ON (p_id) amount
        FROM   tbl
        WHERE  date < grid.hour + interval '1h'
        ORDER  BY p_id, date DESC
        ) sub1
    ) sub2
ORDER  BY grid.hour;
This should be fine while there are not many distinct hours. But the query does not scale very well: the latest amount for every p_id is computed at full (steadily increasing) cost for every additional hour.
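If the result should come back as a single JSON object like the sample response in the question, the per-hour rows can be folded with json_object_agg(). This is a sketch on top of the LATERAL query above; the key format produced by to_char() and the aliases g, s and total are additions for illustration.

```sql
WITH grid AS (
    SELECT DISTINCT date_trunc('hour', date) AS hour
    FROM   tbl
    )
SELECT json_object_agg(
           to_char(g.hour, 'YYYY-MM-DD"T"HH24:MI:SS')  -- key: hour in ISO-like text form
         , s.total                                      -- value: sum for that hour
           ORDER BY g.hour
       ) AS response
FROM   grid g
CROSS  JOIN LATERAL (
    SELECT sum(amount) AS total
    FROM  (
        SELECT DISTINCT ON (p_id) amount
        FROM   tbl
        WHERE  date < g.hour + interval '1 hour'
        ORDER  BY p_id, date DESC
        ) sub1
    ) s;
```

With the sample data this yields the two keys from the question, with values 25 and 55.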
Optimize running sum for many hours
I expect this recursive CTE to scale better (while having considerable overhead initially), as it only needs to join to one row per p_id for every next hour:
WITH RECURSIVE
   values AS (
    SELECT DISTINCT ON (1, 2)
           date_trunc('hour', date) AS hour, p_id, amount
    FROM   tbl
    ORDER  BY 2, 1, date DESC  -- !
    )
 , hours AS (
    SELECT hour, row_number() OVER (ORDER BY hour) AS hnr
    FROM  (SELECT DISTINCT hour FROM values) sub
    )
 , cte AS (
    SELECT p_id, h.hour, v.amount, 2 AS next_hnr
    FROM  (SELECT DISTINCT p_id FROM values) i  -- all IDs
    JOIN   hours h ON hnr = 1                   -- first hour
    LEFT   JOIN values v USING (p_id, hour)
    UNION ALL
    SELECT c.p_id, h.hour, COALESCE(v.amount, c.amount), c.next_hnr + 1
    FROM   cte c
    JOIN   hours h ON h.hnr = c.next_hnr
    LEFT   JOIN values v ON v.p_id = c.p_id
                        AND v.hour = h.hour
    )
SELECT hour, sum(amount)
FROM   cte
GROUP  BY 1
ORDER  BY 1;
Plus, there may be more potential to optimize, depending on value distribution, like indicated for the simple case at the top.
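A different technique that might be worth benchmarking, not part of the answer above: instead of carrying every p_id forward hour by hour, compute how much each p_id's latest amount changes in each hour and take a running sum of those changes with window functions. This sketch assumes amount is never NULL; the CTE names latest and deltas are made up for illustration, and its performance is untested.

```sql
WITH latest AS (   -- latest amount per (p_id, hour)
    SELECT DISTINCT ON (p_id, date_trunc('hour', date))
           date_trunc('hour', date) AS hour, p_id, amount
    FROM   tbl
    ORDER  BY p_id, date_trunc('hour', date), date DESC
    )
 , deltas AS (     -- change against the previous amount of the same p_id
    SELECT hour
         , amount - COALESCE(lag(amount) OVER (PARTITION BY p_id ORDER BY hour), 0) AS delta
    FROM   latest
    )
SELECT hour
     , sum(sum(delta)) OVER (ORDER BY hour) AS total  -- running sum of per-hour changes
FROM   deltas
GROUP  BY hour
ORDER  BY hour;
```

For the sample data, the per-hour changes are 25 and 30, so the running totals are 25 and 55, matching the expected response.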
Aside: calling your timestamp column "date" is rather misleading as that is a different basic data type.
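If renaming is an option, a single statement is enough. The new name created_at is purely hypothetical, and every query above would have to be adjusted to use it.

```sql
-- "created_at" is a hypothetical name; pick whatever fits your schema.
ALTER TABLE tbl RENAME COLUMN date TO created_at;
```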
OTHER TIPS
Look for
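WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY p_id ORDER BY date DESC) AS rn
    FROM source_table
    -- Include every row up to the end of the needed hour, not only rows inside it,
    -- so that the latest amount from earlier hours is carried over.
    WHERE date < timestamp '2020-01-01 03:00:00' + interval '1 hour'
)
-- The literal stands in for the needed hour (truncated to the hour).
SELECT timestamp '2020-01-01 03:00:00' AS hour, SUM(amount)
FROM cte
WHERE rn = 1;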