Delete rows with specific values in one column and a missing value in another column of the same or consecutive row

StackOverflow https://stackoverflow.com/questions/21997272

Question

My dataset consists of daily timeseries for different companies and I work with PostgreSQL. I have a indicator variable in my dataset, taking values 1, -1 and most of the time 0. If the indicator variable is not 0, and a company has a missing value in another column on that day (indicator day) or the next day, the company shall be excluded from the dataset entirely.

We can think of the following example data:

date             company   indicator   value
2012-01-02       A         0           2
2012-01-02       B         0           9
2012-01-02       C         0           1
2012-01-02       D         0           3
2012-01-03       A         1           NULL
2012-01-03       B         0           NULL
2012-01-03       C        -1           1
2012-01-03       D         0           2
2012-01-04       A         0           1
2012-01-04       B         0           1
2012-01-04       C         0           NULL
2012-01-04       D         1           4
2012-01-05       A         0           4
2012-01-05       B         0           2
2012-01-05       C         0           1
2012-01-05       D         0           7

So A has to be excluded because it has a missing value on the indicator day, and C because it has a missing value on the day following the indicator day.

I tried the following:

    CREATE TABLE to_delete
    AS SELECT * FROM mytable
    WHERE company IN(
                   SELECT company 
                   FROM mytable 
                   WHERE date BETWEEN (SELECT date FROM mytable WHERE indicator != 0)
                          AND (SELECT date+1 FROM mytable WHERE indicator != 0) 
                   AND indicator != 0)
    AND date BETWEEN (SELECT date FROM mytable WHERE indicator != 0)
                 AND (SELECT date+1 FROM mytable WHERE indicator != 0) 

    DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_delete);

It works if there exists only one indicator value that is unequal to zero in the example dataset. With more than one, PostgreSQL returns an error, saying that my subquery returned more than one row.

I am really struggling to fix this problem. Do you know a solution, or may be a completely other approach to achieve the wanted result?

Was it helpful?

Solution

I would largely simplify with an EXISTS semi-join.

Delete only offending rows

SELECT * FROM tbl t
-- DELETE FROM tbl t
WHERE  indicator <> 0
AND EXISTS (
   SELECT 1
   FROM   tbl t1
   WHERE  day IN (t.day, t.day + 1)
   AND    t1.company = t.company
   AND    t1.value IS NULL
   )

-> SQLfiddle

Using column name day instead of date, cause I never use basic type names as identifiers.

day + 1 works if day is of data type date (as it should be).

Delete the whole company

the company shall be excluded from the dataset entirely.

DELETE FROM tbl t
USING (
    SELECT DISTINCT company
    FROM   tbl t
    WHERE  indicator <> 0
    AND EXISTS (
        SELECT 1
        FROM   tbl t1
        WHERE  day IN (t.day, t.day + 1)
        AND    t1.company = t.company
        AND    t1.value IS NULL
        )
    ) del
WHERE t.company = del.company

-> SQLfiddle

OTHER TIPS

DELETE FROM test WHERE company IN (
WITH 
    for_check AS (SELECT date, company FROM test WHERE indicator != 0)
SELECT test.company 
FROM test 
INNER JOIN for_check fc 
    ON test.date IN (fc.date, fc.date + 1) 
        AND fc.company = test.company 
WHERE test.value IS NULL
)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top