Add cumulative sum to time-series query PostgreSQL 9.5
07-01-2021
Question
I wrote a query that returns a time series over a date range and interval, showing the revenue for each time interval:
SELECT
interval_date,
coalesce(campaign_revenue,0) AS campaign_revenue
FROM
-- generate_series helps fill the empty gaps in the following JOIN
generate_series(
$2::timestamp,
$3::timestamp,
$4) AS interval_date -- could be '1 day', '1 hour' or '1 minute'.
LEFT OUTER JOIN
-- This SELECT gets all timeseries rows that have data
(SELECT
date_trunc($4, s.created) AS interval,
SUM(s.revenue) campaign_revenue
FROM
sale_event AS s
WHERE
s.campaignid = $1 AND s.created BETWEEN $2 AND $3 AND s.event_type = 'session_closed'
GROUP BY
interval) results
ON
(results.interval = interval_date);
The query takes every row of the sale_event table, truncates the created timestamp to the given interval (aligning it with the wanted time-series granularity), groups by this interval, and sums up the revenue column over the rows where event_type is session_closed.
This works very well and gives me the revenue in the specified interval. The result may look like:
interval_date | campaign_revenue
------------------------------------
2018-08-05 | 0.0
2018-08-06 | 1.5
2018-08-07 | 0.0
2018-08-08 | 0.5
2018-08-09 | 1.0
That is the result when the provided range is 2018-08-05 to 2018-08-09 and interval = '1 day'.
I want to add to the result the sum of revenue up to that date. So if there was a total revenue of 10.0 before 2018-08-05, the result would be:
interval_date | campaign_revenue | total_campaign_revenue
-----------------------------------------------------------------
2018-08-05 | 0.0 | 10.0
2018-08-06 | 1.5 | 11.5
2018-08-07 | 0.0 | 11.5
2018-08-08 | 0.5 | 12.0
2018-08-09 | 1.0 | 13.0
Solution
If I understand correctly, you can just add a window function outside of your query, like:
SELECT interval_date, campaign_revenue
, SUM(campaign_revenue) OVER (ORDER BY interval_date)
-- coalesce guards against NULL when there is no revenue before $2
+ coalesce((SELECT SUM(s.revenue)
FROM sale_event AS s
WHERE s.campaignid = $1
AND s.created < $2
AND s.event_type = 'session_closed'), 0) AS total_campaign_revenue
FROM (
SELECT interval_date
, coalesce(campaign_revenue,0) AS campaign_revenue
FROM
-- generate_series helps fill the empty gaps in the following JOIN
...
interval) results
ON (results.interval = interval_date)
) AS sub; -- a subquery in FROM needs an alias in PostgreSQL 9.5
Another option is to apply the window function directly and use a FILTER clause for campaign_revenue.
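One possible reading of the FILTER suggestion, as a minimal sketch on toy data: read all rows up to the end of the range, fold everything before the range start into the first bucket with GREATEST, and let FILTER keep that earlier revenue out of campaign_revenue. The inline sale_event rows and the literal dates stand in for the real table and for $2 / $3; they are assumptions for illustration, and gap filling with generate_series is left out.

```sql
-- Toy rows standing in for sale_event (hypothetical data).
WITH sale_event (created, revenue) AS (
   VALUES (timestamp '2018-08-01 10:00', 10.0)  -- before the requested range
        , (timestamp '2018-08-06 09:00',  1.5)
        , (timestamp '2018-08-08 12:00',  0.5)
        , (timestamp '2018-08-09 18:00',  1.0)
)
SELECT GREATEST(date_trunc('day', created), timestamp '2018-08-05') AS interval_ts
       -- only revenue inside the range counts for the interval itself
     , coalesce(SUM(revenue) FILTER (WHERE created >= timestamp '2018-08-05'), 0) AS campaign_revenue
       -- window function over the aggregate: running total including earlier revenue
     , SUM(SUM(revenue)) OVER (ORDER BY GREATEST(date_trunc('day', created), timestamp '2018-08-05')) AS total_campaign_revenue
FROM   sale_event
WHERE  created <= timestamp '2018-08-09 23:59:59'
GROUP  BY 1
ORDER  BY 1;
```

On this toy data the first bucket (2018-08-05) has a campaign_revenue of 0 but carries the earlier 10.0 into the running total, which then grows to 11.5, 12.0 and 13.0.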
OTHER TIPS
It might be faster to read all relevant rows from the underlying table in one scan.
And you can run a window function over an aggregate function in the same SELECT.
Test this with EXPLAIN (ANALYZE, TIMING OFF) to see which is faster:
SELECT interval_ts
, coalesce(revenue , 0) AS campaign_revenue
, coalesce(total_revenue, 0) AS total_campaign_revenue
FROM generate_series($2::timestamp, $3::timestamp, $4) AS interval_ts
LEFT JOIN (
SELECT date_trunc($4, created) AS interval_ts
, SUM(revenue) AS revenue
, SUM(SUM(revenue)) OVER (ORDER BY date_trunc($4, created)) AS total_revenue
FROM sale_event AS s
WHERE campaignid = $1
AND created <= $3 -- read all relevant rows in one scan
AND event_type = 'session_closed'
GROUP BY date_trunc($4, created)
) results USING (interval_ts);
The JOIN excludes leading surplus rows in the subquery automatically.
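Since the queries here take parameters ($1 to $4), one way to run the suggested EXPLAIN comparison in psql is through a prepared statement. A minimal sketch, assuming a stand-in query over generate_series; the statement name running_total is hypothetical, and the real query would go in its place:

```sql
-- Hypothetical stand-in query with a single integer parameter.
PREPARE running_total (int) AS
SELECT g, SUM(g) OVER (ORDER BY g) AS running
FROM   generate_series(1, $1) AS g;

-- ANALYZE actually executes the statement; TIMING OFF skips per-node timing
-- while still reporting row counts and the total execution time.
EXPLAIN (ANALYZE, TIMING OFF) EXECUTE running_total(1000);

DEALLOCATE running_total;
```

Running each candidate a few times and comparing the reported execution time gives a fairer picture than a single run.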
SUM(SUM(revenue)) OVER (ORDER BY date_trunc($4, created)) works because, quoting the manual:
The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this sets the frame to be all rows from the partition start up through the current row's last ORDER BY peer.
Exactly what you need.
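To see the default frame at work, here is a small sketch on toy data (the inline sale_event rows are hypothetical). The inner SUM aggregates per day, the outer SUM is the window function, and the two total columns differ only in spelling out the frame explicitly:

```sql
-- Toy rows standing in for sale_event (hypothetical data).
WITH sale_event (created, revenue) AS (
   VALUES (timestamp '2018-08-06 09:00', 1.0)
        , (timestamp '2018-08-06 15:00', 0.5)
        , (timestamp '2018-08-08 12:00', 0.5)
)
SELECT date_trunc('day', created) AS interval_ts
     , SUM(revenue) AS revenue
       -- default frame
     , SUM(SUM(revenue)) OVER (ORDER BY date_trunc('day', created)) AS total_default_frame
       -- same frame, written out
     , SUM(SUM(revenue)) OVER (ORDER BY date_trunc('day', created)
                               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total_explicit_frame
FROM   sale_event
GROUP  BY 1
ORDER  BY 1;
```

Both total columns return 1.5 for 2018-08-06 and 2.0 for 2018-08-08.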
Related:
- Calculating Cumulative Sum in PostgreSQL
- PostgreSQL: running count of rows for a query 'by minute'
- How do I get the aggregate of a window function in Postgres?
Remaining weakness: The total is missing for intervals with no revenue. If that's not acceptable, we can fix it by forming groups with a running count of non-null totals and taking the first value of each group. So:
SELECT interval_ts, campaign_revenue, total_revenue
, coalesce(first_value(total_revenue) OVER (PARTITION BY grp ORDER BY interval_ts), 0) AS total_campaign_revenue
FROM (
SELECT interval_ts
, coalesce(revenue, 0) AS campaign_revenue
, total_revenue
, count(total_revenue) OVER (ORDER BY interval_ts) AS grp -- increments at every non-null total
FROM generate_series($2::timestamp, $3::timestamp, $4) AS interval_ts
LEFT JOIN (
SELECT date_trunc($4, created) AS interval_ts
, SUM(revenue) AS revenue
, SUM(SUM(revenue)) OVER (ORDER BY date_trunc($4, created)) AS total_revenue
FROM sale_event AS s
WHERE campaignid = $1
AND created <= $3 -- read all relevant rows in one scan
AND event_type = 'session_closed'
GROUP BY date_trunc($4, created)
) results USING (interval_ts)
) sub;
With the added overhead, I am not sure it can compete. Still might if your selection is small and the table is big.
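The grouping trick is easier to follow in isolation. A minimal sketch, assuming a toy table t that stands in for the joined result, with NULL totals for the intervals that had no revenue (the table, its values and the column name filled are hypothetical):

```sql
-- Toy stand-in for the joined result (hypothetical data).
WITH t (interval_ts, total_revenue) AS (
   VALUES (date '2018-08-05', NULL::numeric)
        , (date '2018-08-06', 11.5)
        , (date '2018-08-07', NULL)
        , (date '2018-08-08', 12.0)
)
SELECT interval_ts, total_revenue, grp
       -- the first row of each group holds the last known total
     , coalesce(first_value(total_revenue) OVER (PARTITION BY grp ORDER BY interval_ts), 0) AS filled
FROM  (
   SELECT *
          -- count() ignores NULL, so grp only increments on rows with a total
        , count(total_revenue) OVER (ORDER BY interval_ts) AS grp
   FROM   t
   ) sub
ORDER  BY interval_ts;
```

Here grp comes out as 0, 1, 1, 2 and filled as 0, 11.5, 11.5, 12.0. Note that leading intervals before the first non-null total fall into group 0 and get 0 from coalesce.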
Minor notes:
- You don't need parentheses around join conditions.
- Don't call your timestamp "date". That's misleading. I use interval_ts instead of interval_date. I'd rather not use the SQL keyword interval as column alias, even if that's allowed in Postgres.
- I work with the same column alias interval_ts on both sides to allow the shorter USING syntax, which does require parentheses. This only exposes one instance of the joined column interval_ts to the outer query, so the unqualified name still isn't ambiguous.
- Don't omit the AS key word for column aliases (while that's ok for table aliases).