Question

I have the following table with values for different stations from 2014-01-01 to 2014-01-04. The data has some date gaps, and I want to fill them in: every station should get a row for each missing date, with the value left as NULL. I'm working with PostgreSQL 10.9.

This is my table:

CREATE TABLE stations (station_id text, value integer, date date);
INSERT INTO stations (station_id, value, date) VALUES 
('001', 10, '2014-01-01'),
('001', 30, '2014-01-03'),
('002', 40, '2014-01-01'),
('002', 50, '2014-01-02'),
('003', 20, '2014-01-01'),
('003', 10, '2014-01-02'),
('003', 70, '2014-01-04');

I also have a table holding unique stations with identifiers.

And I want something like this:

| station | value | date       |
|---------|-------|------------|
| 001     | 10    | 2014-01-01 |
| 001     | NULL  | 2014-01-02 |
| 001     | 30    | 2014-01-03 |
| 001     | NULL  | 2014-01-04 |
| 002     | 40    | 2014-01-01 |
| 002     | 50    | 2014-01-02 |
| 002     | NULL  | 2014-01-03 |
| 002     | NULL  | 2014-01-04 |
| 003     | 20    | 2014-01-01 |
| 003     | 10    | 2014-01-02 |
| 003     | NULL  | 2014-01-03 |
| 003     | 70    | 2014-01-04 |

Following some DBA Stack Exchange questions, I tried a combination of a LEFT JOIN with a LATERAL join:

WITH complete_dates_station AS (
    SELECT station_id,
           generate_series(date '2014-01-01', date '2014-12-31', interval '1 day')::date AS dt
    FROM stations
    GROUP BY station_id
), temp_join AS (
    SELECT station_id,
           dt,
           s.value
    FROM complete_dates_station
        LEFT JOIN LATERAL (
            SELECT s.value
            FROM stations s
            WHERE s.station_id = complete_dates_station.station_id
            AND s.date = complete_dates_station.dt
            ORDER BY s.station_id, date DESC
            LIMIT 1) AS s ON TRUE
    ORDER BY station_id, dt
)
SELECT * FROM temp_join;

This works like a charm, but the join is really slow on my complete table, which has more than 2M rows and a date range spanning over 18 years (I gave up after 4 hours of running). I tried a simpler approach with a regular LEFT JOIN, but rows without a match come back with a NULL station_id instead of the station:

WITH complete_dates_station AS (
    SELECT station_id,
           generate_series(date '2014-01-01', date '2014-12-31', interval '1 day')::date AS dt
    FROM stations
    GROUP BY station_id
)
SELECT s.station_id,
       c.dt,
       s.value
FROM complete_dates_station c
    LEFT OUTER JOIN stations s
    ON c.station_id = s.station_id
    AND c.dt = s.date;

which yields the following:

| station | value | date       |
|---------|-------|------------|
| 001     | 10    | 2014-01-01 |
| NULL    | NULL  | 2014-01-02 |
| 001     | 30    | 2014-01-03 |
| NULL    | NULL  | 2014-01-04 |
| 002     | 40    | 2014-01-01 |
| 002     | 50    | 2014-01-02 |
| NULL    | NULL  | 2014-01-03 |
| NULL    | NULL  | 2014-01-04 |
| 003     | 20    | 2014-01-01 |
| 003     | 10    | 2014-01-02 |
| NULL    | NULL  | 2014-01-03 |
| 003     | 70    | 2014-01-04 |

Is there any way to optimize the first query, or a simpler approach to fill the station gaps in the second query? I already tried multicolumn indexes on my source table, but the query still takes a very long time.


Solution

You also have a table holding unique stations with identifiers. It could look like this:

CREATE TABLE uniq_stations (station_id text);
INSERT INTO uniq_stations VALUES
('001'),
('002'),
('003');

There will be more columns, which are irrelevant for us.

This should be much faster, then:

SELECT station_id, s.value, date
FROM   uniq_stations u
CROSS  JOIN (
   SELECT generate_series (timestamp '2014-01-01'
                         , timestamp '2014-01-04'
                         , interval  '1 day')::date
   ) d(date)
LEFT   JOIN stations s USING (station_id, date)
ORDER  BY station_id, date;  -- optional

db<>fiddle here

You do not need a LATERAL join at all: the date series is the same for every station. A plain CROSS JOIN builds the complete Cartesian product of stations and days, then a LEFT [OUTER] JOIN attaches existing combinations from table stations (an unfortunate table name for its content, btw). LATERAL joins are great when needed, but plain joins are faster.

Also, this fills in stations for which all days are missing, which would not work at all without uniq_stations: a station with no rows in stations at all still gets its full set of NULL rows, as long as it is listed in uniq_stations. You may or may not have such cases.

One of the expensive pieces in this puzzle is identifying unique stations, a task we can skip completely if the added uniq_stations table provides what we need. Else we might use DISTINCT ON or a recursive CTE to make use of a matching index, as sketched below.

That is still more expensive than reading unique rows from a table, but much faster than what you had, which was a grand waste of CPU cycles, frankly.
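If there is no uniq_stations table, here is a minimal sketch of both fallbacks, assuming the multicolumn index on stations (station_id, date) suggested below:

-- Simple, but reads the whole table:
SELECT DISTINCT ON (station_id) station_id
FROM   stations
ORDER  BY station_id;

-- Emulated loose index scan with a recursive CTE: hops from one
-- station_id to the next via the index, much faster when there are
-- few stations but many rows per station.
WITH RECURSIVE uniq AS (
   (SELECT station_id FROM stations ORDER BY station_id LIMIT 1)
   UNION ALL
   SELECT (SELECT s.station_id
           FROM   stations s
           WHERE  s.station_id > u.station_id
           ORDER  BY s.station_id
           LIMIT  1)
   FROM   uniq u
   WHERE  u.station_id IS NOT NULL   -- stop after the last station
   )
SELECT station_id
FROM   uniq
WHERE  station_id IS NOT NULL;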

Finally, a multicolumn index on stations (station_id, date) should deliver top-notch performance, even for a big stations table. (The higher the percentage of rows retrieved from that table, the less the index matters.)
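For completeness, the index in question; the name is arbitrary:

-- Multicolumn index matching both join columns:
CREATE INDEX stations_station_id_date_idx ON stations (station_id, date);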

There is a reason I use generate_series(timestamp, timestamp, interval): there is no variant of the function taking date arguments. With plain dates, the call resolves to the timestamptz variant instead (timestamptz being the preferred type of the datetime category), whose results depend on the session's timezone setting and can shift around DST changes.
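A quick way to watch that type resolution happen (the dates are arbitrary):

-- date arguments are silently promoted to timestamptz, not timestamp:
SELECT pg_typeof(g)
FROM   generate_series(date '2014-01-01', date '2014-01-02', interval '1 day') g
LIMIT  1;  -- returns: timestamp with time zone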


Aside: your station_id should probably be of type integer. Faster than text, and smaller, too, if numbers go beyond 999.
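If you change it, a sketch of the migration (assuming every station_id is purely numeric; note that leading zeros like '001' are lost in the cast):

ALTER TABLE stations
   ALTER COLUMN station_id TYPE integer USING station_id::integer;
-- uniq_stations.station_id needs the same change to keep the join working.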

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange