Question

Essentially my question is: how does one perform aggregate operations involving overlapping ranges in PostgreSQL 9.3 (or 9.4)? The specific problem I have at hand is that, given a range, I want to find the maximum sum() of applicable overlapping ranges. A simple example:

create table event (
  event_id int primary key,
  event_type_id int not null,
  period tstzrange not null,
  quantity int not null
);

insert into event (event_id, event_type_id, period, quantity) values
(1, 1,'[2016-01-06 09:00:00+00,2016-01-08 17:00:00+00]',1),
(2, 1,'[2016-01-07 09:00:00+00,2016-01-07 11:00:00+00]',1),
(3, 1,'[2016-01-07 13:00:00+00,2016-01-07 17:00:00+00]',1),
(4, 2,'[2016-01-07 12:00:00+00,2016-01-07 17:00:00+00]',1);

Given a query with the following clauses:

select ... 
where event_type_id = 1
and period && '[2016-01-07 00:00:00+00,2016-01-07 23:59:00+00]'::tstzrange 
group by event_type_id

The desired result would be: 3, i.e. the maximum sum(quantity) where the ranges of the same event_type_id overlap within a given timestamp range.


Solution

The task as I understand it

Pick rows from a table where the period overlaps with a given time frame. Determine distinct ranges of overlapping periods within that set and return the greatest sum(quantity) from any range.

Requires Postgres 9.2+, since there are no range types in older versions.

Assumptions

  • "Overlapping" is meant in a cascading manner, like tiles on a traditional roof: those are "overlapping" (rainproof), though the highest tile does not directly overlap with the lowest.

  • All values in period have inclusive bounds ([]). Else you have to adjust for exclusive bounds. (The range of the input parameter can still have arbitrary bounds.)

  • We filter for exactly one event_type_id. Else you have to add PARTITION BY event_type_id to the window definition (see the sketch after this list).

  • quantity is an integer. Else you have to adjust for the type in calculations.

  • Quantities for overlapping periods are counted fully, even if parts of the period are outside your given time frame.

  • Even works for duplicates on (event_type_id, period).
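
For reference, a sketch of how the single-subquery version presented below could be adapted to cover all event types in one pass; the DISTINCT ON logic and the partitioned windows are my addition, not part of the original query:

SELECT DISTINCT ON (event_type_id)
       event_type_id
     , running_sum - lag(running_sum, 1, 0) OVER (PARTITION BY event_type_id
                                                  ORDER BY p_start) AS sum_quantity
FROM  (
   SELECT event_type_id
        , lower(period) AS p_start
        ,(sum(quantity)                      OVER w)::int AS running_sum
        , lead(lower(period), 1, 'infinity') OVER w
                        > max(upper(period)) OVER w AS range_end
   FROM   event
   WHERE  period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange
   WINDOW w AS (PARTITION BY event_type_id ORDER BY lower(period))
   ) sub
WHERE  range_end
ORDER  BY event_type_id, sum_quantity DESC;  -- DISTINCT ON keeps the biggest sum per type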

Best performance with a single subquery

This should be dynamite.

SELECT running_sum - lag(running_sum, 1, 0) OVER (ORDER BY p_start) AS sum_quantity
FROM  (
   SELECT lower(period) AS p_start
        ,(sum(quantity)                      OVER w)::int AS running_sum
        , lead(lower(period), 1, 'infinity') OVER w
                        > max(upper(period)) OVER w AS range_end
   FROM   event
   WHERE  event_type_id = 1
   AND    period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange 
   WINDOW w AS (ORDER BY lower(period))
   ) sub
WHERE  range_end
ORDER  BY 1 DESC
LIMIT  1;

All three window functions in the subquery can use the same window. This avoids additional sort operations and should be fastest.

Verbose CTE variant with more explanation

Same query, just more verbose and slower, since CTEs materialize derived tables and act as optimization barriers.

WITH cte1 AS (
   SELECT quantity
        , lower(period) AS p_start
        , upper(period) AS p_end
   FROM   event
   WHERE  event_type_id = 1
   AND    period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange
   )
,    cte2 AS (
   SELECT (sum(quantity)               OVER w)::int AS running_sum
        , lead(p_start, 1, 'infinity') OVER w               -- next start ..
                          > max(p_end) OVER w AS range_end  -- .. after last end
        , p_start, p_end
   FROM   cte1
   WINDOW w AS (ORDER BY p_start)
   )
SELECT running_sum - lag(running_sum, 1, 0) OVER (ORDER BY p_start) AS sum_quantity
                    -- subtract the previous sum to get the sum of this range
     , p_end::text
FROM   cte2
WHERE  range_end    -- only rows at the end of each range
ORDER  BY 1 DESC    -- biggest sum first
LIMIT  1;           -- only return the winner


You need an index for this to be fast with big tables. The best option would be a GiST index on (event_type_id, period).
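
For example (a sketch; the index name is arbitrary, and combining a plain integer column with a range column in one GiST index requires the additional module btree_gist):

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- provides GiST operator classes for basic types like int

CREATE INDEX event_type_period_gist_idx
    ON event USING gist (event_type_id, period);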

Explanation

Filter rows that match your conditions, then sort by the start of the time range (lower(period)) and calculate:

  1. A running sum of quantity (running_sum).
  2. The start of the next period: (lead(lower(period), 1, 'infinity')). Defaults to 'infinity' for the last row to include the last range.
  3. The latest end of any period so far max(upper(period)).

If 2. is later than 3. it's the end of a (sub-)range (range_end).

In the outer SELECT, filter rows with range_end and subtract the previous total to get the sum for the range. ORDER BY that result and return the first row (LIMIT 1), i.e. the greatest sum_quantity. Voilà.
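
Applied to the sample data from the question (event_type_id = 1 and a covering input range), the subquery produces roughly this; the lead() and max() columns are shown separately here for illustration:

 p_start                | running_sum | next start (lead)      | max end so far         | range_end
 2016-01-06 09:00:00+00 | 1           | 2016-01-07 09:00:00+00 | 2016-01-08 17:00:00+00 | false
 2016-01-07 09:00:00+00 | 2           | 2016-01-07 13:00:00+00 | 2016-01-08 17:00:00+00 | false
 2016-01-07 13:00:00+00 | 3           | infinity               | 2016-01-08 17:00:00+00 | true

Events 2 and 3 never overlap each other directly, but both overlap event 1, so all three form one cascading range (the "roof tiles" assumption). Only the last row passes WHERE range_end; its running_sum is 3, lag() defaults to 0, so sum_quantity = 3, the desired result from the question.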

Aside

To select all of Jan 7th, 2016, the clean expression is:

'[2016-01-07 00:00:00+00,2016-01-08 00:00:00+00)'::tstzrange

Not:

'[2016-01-07 00:00:00+00,2016-01-07 23:59:00+00]'::tstzrange
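
A quick way to see the difference (a minimal check, not from the original answer): the closed range above misses everything after 23:59:00.

SELECT '2016-01-07 23:59:30+00'::timestamptz <@
       '[2016-01-07 00:00:00+00,2016-01-07 23:59:00+00]'::tstzrange;  -- false

SELECT '2016-01-07 23:59:30+00'::timestamptz <@
       '[2016-01-07 00:00:00+00,2016-01-08 00:00:00+00)'::tstzrange;  -- true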


Since the default precision of timestamp values is 6 decimal digits (microseconds resolution), you could also use:

'[2016-01-07 00:00:00+00,2016-01-07 23:59:59.999999+00]'::tstzrange

But that's messy and depends on an implementation detail that might change (even if unlikely). It's not subject to rounding errors, though, since timestamps are stored as integer values in modern Postgres.
