Add cumulative sum to time-series query PostgreSQL 9.5

https://dba.stackexchange.com/questions/214577

07-01-2021
Question

I wrote a query that returns a time series over a date range at a given interval, showing the revenue for each interval:

SELECT
    interval_date,
    coalesce(campaign_revenue,0) AS campaign_revenue
FROM
    -- generate_series helps fill the empty gaps in the following JOIN
    generate_series(
        $2::timestamp,
        $3::timestamp,
        $4) AS interval_date -- could be '1 day', '1 hour' or '1 minute'.
LEFT OUTER JOIN
    -- This SELECT gets all timeseries rows that have data
    (SELECT
        date_trunc($4, s.created) AS interval,
        SUM(s.revenue) campaign_revenue
    FROM
        sale_event AS s
    WHERE
        s.campaignid = $1 AND s.created BETWEEN $2 AND $3 AND s.event_type = 'session_closed'
    GROUP BY
        interval) results
ON
    (results.interval = interval_date);

The query takes every row of the sale_event table, truncates the created timestamp to the chosen interval (aligning it with the granularity of the time series), groups by that interval, and sums the revenue column over the rows where event_type is session_closed.
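To illustrate the truncation and gap-filling steps in isolation, here is a minimal sketch with literal values (the timestamps are made up). It assumes, as far as the documented signatures go, that date_trunc() takes a field name such as 'day' or 'hour' while generate_series() takes an interval step such as '1 day', so a single $4 parameter may need to be passed in two forms:

```sql
-- date_trunc() takes a field name ('day', 'hour', 'minute', ...)
SELECT date_trunc('hour', timestamp '2018-08-06 09:37:12') AS interval_start;
-- 2018-08-06 09:00:00

-- generate_series() takes an interval step and returns one row per step,
-- including steps for which sale_event has no rows
SELECT interval_date
FROM   generate_series(timestamp '2018-08-06 09:00'
                     , timestamp '2018-08-06 12:00'
                     , interval '1 hour') AS interval_date;
-- returns 4 rows: 09:00, 10:00, 11:00 and 12:00
```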

This works very well and gives me the revenue in the specified interval. The result may look like:

interval_date   |   campaign_revenue
------------------------------------
 2018-08-05     |   0.0
 2018-08-06     |   1.5
 2018-08-07     |   0.0
 2018-08-08     |   0.5
 2018-08-09     |   1.0

This is the result for the range 2018-08-05 to 2018-08-09 with interval = '1 day'.

I want to add the sum of all revenue up to that date to the result. So if there was a total revenue of 10.0 before 2018-08-05, the result would be:

interval_date   |   campaign_revenue   |   total_campaign_revenue
-----------------------------------------------------------------
 2018-08-05     |   0.0                |   10.0
 2018-08-06     |   1.5                |   11.5
 2018-08-07     |   0.0                |   11.5
 2018-08-08     |   0.5                |   12.0
 2018-08-09     |   1.0                |   13.0

Solution

If I understand correctly, you can just wrap your query and add a window function outside of it, like:

SELECT interval_date, campaign_revenue
     , SUM(campaign_revenue) OVER (ORDER BY interval_date)
       -- add the revenue accumulated before the range; 0 if there is none
     + coalesce((SELECT SUM(s.revenue)
                 FROM sale_event AS s
                 WHERE s.campaignid = $1
                   AND s.created < $2
                   AND s.event_type = 'session_closed'), 0) AS total_campaign_revenue
FROM (
    SELECT interval_date
         , coalesce(campaign_revenue,0) AS campaign_revenue
    FROM
        -- generate_series helps fill the empty gaps in the following JOIN
        ...
        interval) results
    ON (results.interval = interval_date)
) AS sub;  -- PostgreSQL 9.5 requires an alias for a subquery in FROM

Another option is to apply the window function directly and use a FILTER clause for campaign_revenue.
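One way to read that second option, sketched with made-up sample data and literal dates in place of $2 and $3: fold all rows from before the range into the first interval, use FILTER so that campaign_revenue only counts rows inside the range, and run the window function over the joined result. The GREATEST() trick and the column name all_revenue are my assumptions, not something the answer spells out:

```sql
WITH sale_event (created, revenue) AS (
   VALUES (timestamp '2018-08-01 10:00', 10.0)   -- before the range
        , (timestamp '2018-08-06 09:30',  1.5)
        , (timestamp '2018-08-08 14:00',  0.5)
        , (timestamp '2018-08-09 08:15',  1.0)
)
SELECT interval_ts
     , coalesce(r.campaign_revenue, 0) AS campaign_revenue
       -- the window runs after the join, so gaps carry the last total forward
     , coalesce(SUM(r.all_revenue) OVER (ORDER BY interval_ts), 0) AS total_campaign_revenue
FROM   generate_series(timestamp '2018-08-05'
                     , timestamp '2018-08-09'
                     , interval '1 day') AS interval_ts
LEFT   JOIN (
   -- rows older than the range start are moved into the first interval
   SELECT GREATEST(date_trunc('day', created), timestamp '2018-08-05') AS interval_ts
          -- only rows inside the range count as revenue of the interval
        , SUM(revenue) FILTER (WHERE created >= timestamp '2018-08-05') AS campaign_revenue
        , SUM(revenue) AS all_revenue
   FROM   sale_event
   WHERE  created < timestamp '2018-08-10'        -- end of the last interval
   GROUP  BY 1
   ) r USING (interval_ts)
ORDER  BY interval_ts;
```

With this sample data the totals should come out as 10.0, 11.5, 11.5, 12.0 and 13.0, matching the table in the question. The filters on campaignid and event_type are left out to keep the sketch short.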

OTHER TIPS

It might be faster to read all relevant rows from the underlying table in one scan.
And you can run a window function over an aggregate function in the same SELECT.

Test this with EXPLAIN (ANALYZE, TIMING OFF) to see which is faster:

SELECT interval_ts
     , coalesce(revenue      , 0) AS campaign_revenue
     , coalesce(total_revenue, 0) AS total_campaign_revenue    
FROM   generate_series($2::timestamp, $3::timestamp, $4) AS interval_ts
LEFT   JOIN (
   SELECT date_trunc($4, created) AS interval_ts
        , SUM(revenue)                                              AS revenue
        , SUM(SUM(revenue)) OVER (ORDER BY date_trunc($4, created)) AS total_revenue
   FROM   sale_event AS s
   WHERE  campaignid = $1
   AND    created <= $3                   -- read all relevant rows in one scan
   AND    event_type = 'session_closed'
   GROUP  BY date_trunc($4, created)
   ) results USING (interval_ts);

The JOIN excludes leading surplus rows in the subquery automatically.

SUM(SUM(revenue)) OVER (ORDER BY date_trunc($4, created)) works because, quoting the manual:

"The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this sets the frame to be all rows from the partition start up through the current row's last ORDER BY peer."

Exactly what you need.
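A small self-contained illustration of a window function over an aggregate, using made-up values instead of sale_event (the names t, day and running_total are only for this sketch):

```sql
SELECT day
     , SUM(revenue)                          AS revenue        -- aggregate per group
     , SUM(SUM(revenue)) OVER (ORDER BY day) AS running_total  -- window over the aggregate
FROM  (
   VALUES (date '2018-08-06', 1.0)
        , (date '2018-08-06', 0.5)
        , (date '2018-08-08', 0.5)
        , (date '2018-08-09', 1.0)
   ) AS t(day, revenue)
GROUP  BY day
ORDER  BY day;
-- 2018-08-06 | 1.5 | 1.5
-- 2018-08-08 | 0.5 | 2.0
-- 2018-08-09 | 1.0 | 3.0
```

The inner SUM(revenue) is evaluated per group first. The outer SUM(...) OVER then adds up those group sums from the first row through the current one, which is the default frame quoted above.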

Remaining weakness: the running total is missing for intervals with no revenue (coalesce() only turns it into 0). If that's not acceptable, we can use the technique from this related answer to fix it:

Carry over long sequence of missing values with Postgres

So:

SELECT interval_ts, campaign_revenue, total_revenue
     , coalesce(first_value(total_revenue) OVER (PARTITION BY grp ORDER BY interval_ts), 0) AS total_campaign_revenue
FROM  (
   SELECT interval_ts
        , coalesce(revenue, 0) AS campaign_revenue
        , total_revenue
          -- count() ignores NULL, so grp only increases on rows that have a total
        , count(total_revenue) OVER (ORDER BY interval_ts) AS grp
   FROM   generate_series($2::timestamp, $3::timestamp, $4) AS interval_ts
   LEFT   JOIN (
      SELECT date_trunc($4, created) AS interval_ts
           , SUM(revenue) AS revenue
           , SUM(SUM(revenue)) OVER (ORDER BY date_trunc($4, created)) AS total_revenue
      FROM   sale_event AS s
      WHERE  campaignid = $1
      AND    created <= $3                   -- read all relevant rows in one scan
      AND    event_type = 'session_closed'
      GROUP  BY date_trunc($4, created)
      ) results USING (interval_ts)
   ) sub;

With the added overhead, I am not sure it can compete. It still might if your selection is small and the table is big.
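The carry-over technique can be seen in isolation in this minimal sketch, with made-up values in place of the joined result (the names t, ts, val and val_filled are only for illustration):

```sql
SELECT ts, val
       -- the first row of each group is the one holding the non-NULL value
     , first_value(val) OVER (PARTITION BY grp ORDER BY ts) AS val_filled
FROM  (
   SELECT ts, val
          -- running count of non-NULL values: stays flat across gaps
        , count(val) OVER (ORDER BY ts) AS grp
   FROM  (
      VALUES (1, 10.0), (2, NULL), (3, NULL), (4, 12.0), (5, NULL)
      ) AS t(ts, val)
   ) sub
ORDER  BY ts;
-- val_filled: 10.0, 10.0, 10.0, 12.0, 12.0
```

Rows before the first non-NULL value would land in group 0 and stay NULL, which is why the query above wraps first_value() in coalesce(..., 0).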

Minor notes:

You don't need parentheses around join conditions.
Don't call your timestamp "date". That's misleading. I use interval_ts instead of interval_date.
I'd rather not use the SQL keyword interval as a column alias, even if that's allowed in Postgres.
- https://www.postgresql.org/docs/current/static/sql-keywords-appendix.html
I use the same column alias interval_ts on both sides of the join to allow the shorter USING syntax, which does require parentheses. This exposes only one instance of the joined column interval_ts to the outer query, so the unqualified name is not ambiguous.
Don't omit the AS keyword for column aliases (while that's OK for table aliases).

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange