Fastest way to count how many dates' ranges cover each date from series

https://dba.stackexchange.com/questions/208241

02-01-2021
|

Question

I have a table (in PostgreSQL 9.4) that looks like this:

CREATE TABLE dates_ranges (kind int, start_date date, end_date date);
INSERT INTO dates_ranges VALUES 
    (1, '2018-01-01', '2018-01-31'),
    (1, '2018-01-01', '2018-01-05'),
    (1, '2018-01-03', '2018-01-06'),
    (2, '2018-01-01', '2018-01-01'),
    (2, '2018-01-01', '2018-01-02'),
    (3, '2018-01-02', '2018-01-08'),
    (3, '2018-01-05', '2018-01-10');

Now I want to calculate for the given dates and for every kind, into how many rows from dates_ranges each date falls. Zeros could be possibly omitted.

Desired result:

+-------+------------+----+
|  kind | as_of_date |  n |
+-------+------------+----+
|     1 | 2018-01-01 |  2 |
|     1 | 2018-01-02 |  2 |
|     1 | 2018-01-03 |  3 |
|     2 | 2018-01-01 |  2 |
|     2 | 2018-01-02 |  1 |
|     3 | 2018-01-02 |  1 |
|     3 | 2018-01-03 |  1 |
+-------+------------+----+

I've come up with two solutions, one with LEFT JOIN and GROUP BY

SELECT
kind, as_of_date, COUNT(*) n
FROM
    (SELECT d::date AS as_of_date FROM generate_series('2018-01-01'::timestamp, '2018-01-03'::timestamp, '1 day') d) dates
LEFT JOIN
    dates_ranges ON dates.as_of_date BETWEEN start_date AND end_date
GROUP BY 1,2 ORDER BY 1,2

and one with LATERAL, which is slightly faster:

SELECT
    kind, as_of_date, n
FROM
    (SELECT d::date AS as_of_date FROM generate_series('2018-01-01'::timestamp, '2018-01-03'::timestamp, '1 day') d) dates,
LATERAL
    (SELECT kind, COUNT(*) AS n FROM dates_ranges WHERE dates.as_of_date BETWEEN start_date AND end_date GROUP BY kind) ss
ORDER BY kind, as_of_date

I'm wondering is it any better way to write this query? And how to include pairs date-kind with 0 count?

In reality there is a few distinct kinds, period of up to five years (1800 dates), and ~30k rows in dates_ranges table (but it could grow significantly).

There are no indexes. To be precise in my case it's a result of subquery, but I've wanted to limit question to one issue, so it's more general.

Solution

The following query also works if "missing zeros" are OK:

select *
from (
  select
    kind,
    generate_series(start_date, end_date, interval '1 day')::date as d,
    count(*)
  from dates_ranges
  group by 1, 2
) x
where d between date '2018-01-01' and date '2018-01-03'
order by 1, 2;

but it's not any faster than the lateral version with the small dataset. It might scale better though, as no join is required, but the above version aggregates over all the rows, so it may lose out there again.

The following query tries to avoid unnecessary work by removing any series that don't overlap anyway:

select
  kind,
  generate_series(greatest(start_date, date '2018-01-01'), least(end_date, date '2018-01-03'), interval '1 day')::date as d,
  count(*)
from dates_ranges
where (start_date, end_date + interval '1 day') overlaps (date '2018-01-01', date '2018-01-03' + interval '1 day')
group by 1, 2
order by 1, 2;

-- and I got to use the overlaps operator! Note that you have to add interval '1 day' to the right as the overlaps operator considers time periods to be open on the right (which is fairly logical because a date is often considered to be a timestamp with time component of midnight).

OTHER TIPS

And how to include pairs date-kind with 0 count?

Build a grid of all combinations then LATERAL join to your table, like this:

SELECT k.kind, d.as_of_date, c.n
FROM  (SELECT DISTINCT kind FROM dates_ranges) k
CROSS  JOIN (
   SELECT d::date AS as_of_date
   FROM   generate_series(timestamp '2018-01-01', timestamp '2018-01-03', interval '1 day') d
   ) d
CROSS  JOIN LATERAL (
   SELECT count(*)::int AS n
   FROM   dates_ranges
   WHERE  kind = k.kind
   AND    d.as_of_date BETWEEN start_date AND end_date
   ) c
ORDER  BY k.kind, d.as_of_date;

Should also be as fast as possible.

I had LEFT JOIN LATERAL ... on true at first, but there is an aggregate in the subquery c, so we always get a row and can use CROSS JOIN as well. No difference in performance.

If you have a table holding all relevant kinds, use that instead of generating the list with subquery k.

The cast to integer is optional. Else you get bigint.

Indexes would help, especially a multicolumn index on (kind, start_date, end_date). Since you are building on a subquery, this may or may not be possible to achieve.

Using set-returning functions like generate_series() in the SELECT list is generally not advisable in Postgres versions before 10 (unless you know exactly what you are doing). See:

What is the expected behaviour for multiple set-returning functions in select clause?

If you have lots of combinations with few or no rows, this equivalent form may be faster:

SELECT k.kind, d.as_of_date, count(dr.kind)::int AS n
FROM  (SELECT DISTINCT kind FROM dates_ranges) k
CROSS JOIN (
   SELECT d::date AS as_of_date
   FROM   generate_series(timestamp '2018-01-01', timestamp '2018-01-03', interval '1 day') d
   ) d
LEFT   JOIN dates_ranges dr ON dr.kind = k.kind
                           AND d.as_of_date BETWEEN dr.start_date AND dr.end_date
GROUP  BY 1, 2
ORDER  BY 1, 2;

Using the `daterange` type

PostgreSQL has a daterange. Using it is pretty simple. Starting with your sample data we move to use the type on the table.

BEGIN;
  ALTER TABLE dates_ranges ADD COLUMN myrange daterange;
  UPDATE dates_ranges
    SET myrange = daterange(start_date, end_date, '[]');
  ALTER TABLE dates_ranges
    DROP COLUMN start_date,
    DROP COLUMN end_date;
COMMIT;

-- Now you can create GIST index on it...
CREATE INDEX ON dates_ranges USING gist (myrange);

TABLE dates_ranges;
 kind |         myrange         
------+-------------------------
    1 | [2018-01-01,2018-02-01)
    1 | [2018-01-01,2018-01-06)
    1 | [2018-01-03,2018-01-07)
    2 | [2018-01-01,2018-01-02)
    2 | [2018-01-01,2018-01-03)
    3 | [2018-01-02,2018-01-09)
    3 | [2018-01-05,2018-01-11)
(7 rows)

I want to calculate for the given dates and for every kind, into how many rows from dates_ranges each date falls.

Now to query it we reverse the procedure, and generate a date series but here's the catch the query itself can use the containment (@>) operator to check that the dates are in range, using an index.

Note we use timestamp without time zone (to stop DST hazards)

SELECT d1.kind, day::date, count(d2.kind)
FROM dates_ranges AS d1
CROSS JOIN LATERAL generate_series(
  lower(myrange)::timestamp without time zone,
  upper(myrange)::timestamp without time zone,
  '1 day'
) AS gs(day)
INNER JOIN dates_ranges AS d2
  ON d2.myrange @> day::date
GROUP BY d1.kind, day;

Which is the itemized day-overlaps on the index.

As a side bonus, with the daterange type you can stop insertions of ranges that overlap with others using an EXCLUDE CONSTRAINT

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange

Fastest way to count how many dates' ranges cover each date from series

Using the daterange type

Using the `daterange` type