Question

I have daily timeseries for companies in my dataset and use PostgreSQL.
For every company all rows with NULL in column3 shall be deleted until the first NOT NULL entry in this column for this company. Then all consecutive missing values are filled in with the value of the last observable value for this company that is NOT NULL.

You can imagine the following example data:

        date           company        column3
1 2004-01-01             A               5  
2 2004-01-01             B               NULL
3 2004-01-01             C               NULL  
4 2004-01-02             A               NULL
5 2004-01-02             B               7
6 2004-01-02             C               NULL
7 2004-01-03             A               6
8 2004-01-03             B               7
9 2004-01-03             C               9    
10 2004-01-04            A               NULL
11 2004-01-04            B               NULL
12 2004-01-04            C               NULL

It would be great if I manage to write a query that delivers

        date           company        column3
1 2004-01-01             A               5  
2 2004-01-02             A               5
3 2004-01-02             B               7
4 2004-01-03             A               6
5 2004-01-03             B               7
6 2004-01-03             C               9
7 2004-01-04             A               6
8 2004-01-04             B               7
9 2004-01-04             C               9

I tried:

SELECT a.date, a.company, COALESCE(a.column3, (SELECT b.column3 FROM mytable b 
WHERE b.company=a.company AND b.colmun3 IS NOT NULL ORDER BY b.company=a.company 
DESC LIMIT 1)) FROM mytable a;

There are two problems with the code:

  1. It does not delete all records with NULL values until the first NOT NULL value, but
    fills in all missing values.
  2. ...with the first observation in the column and not with the last observation before
    the missing value.
Was it helpful?

Solution

I suggest two subquery levels with window functions instead of correlated subqueries:

SELECT *
FROM  (
   SELECT the_date, company, max(col3) OVER (PARTITION BY company, grp) AS col3
   FROM (
      SELECT *, count(col3) OVER (PARTITION BY company ORDER BY the_date) AS grp
      FROM   tbl
      ) sub1
   ) sub2
WHERE  col3 IS NOT NULL
ORDER  BY the_date, company;

Produces the requested result.

db<>fiddle here
Old sqlfiddle

This assumes unique entries per (company, the_date). Should be much faster for tables with more than just a few rows. A (unique to enforce uniqueness?!) index would help performance a lot:

CREATE INDEX tbl_company_date_idx ON tbl (company, the_date);

How?

The aggregate function count() ignores NULL values when counting. Used as aggregate-window function, it computes the running count of a the column according to the default window definition, which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This results in the count being "stuck" for rows with NULL values, thereby forming a peer group that shares the same (non-null) value.

In the second window function, the only non-null value per group is easily extracted with max(). The group before the first non-null value retains NULL, which is easily eliminated in the final SELECT.

See:

OTHER TIPS

Try:

SELECT *
FROM (
  SELECT id,
         date,
         company,
         case when column3 is not null
              then column3
              else (
                     SELECT column3
                     FROM mytable t1
                     WHERE t1.company = t.company
                       AND t1.date < t.date
                       AND t1.column3 IS NOT NULL
                     ORDER BY t1.date DESC LIMIT 1
                    )
          end column3
  FROM mytable T
) AS subq
WHERE column3 IS NOT NULL;

demo: http://sqlfiddle.com/#!15/0cdce/12

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top