How do you select the rolling, most recent value for each person for each month in a range where that most recent value is less than 6 months old?

https://dba.stackexchange.com/questions/148423

03-10-2020

Question

I need help writing a SQL query for Redshift (Postgres will do) likely involving WINDOW functions, PARTITIONS, LAST_VALUE and other things above my head.

Users can submit survey responses at any time (usually every few months). After 6 months the score is no longer fresh/relevant.

DROP TABLE IF EXISTS users;
CREATE TABLE users (
    user_id       INTEGER UNIQUE,
    user_name     VARCHAR(50)
);
INSERT INTO users VALUES
    (1, 'Steve Wozniak'),
    (2, 'Steve Jobs'),
    (3, 'Tony Ive');

DROP TABLE IF EXISTS responses;
CREATE TABLE responses (
    response_id     INTEGER UNIQUE,
    user_id         INTEGER,
    date            DATE,      
    score           INTEGER
);
INSERT INTO responses VALUES
    (1, 1, '2016-08-21', 2),
    (2, 1, '2016-02-04', 8),
    (3, 1, '2016-04-11', 4),
    (4, 1, '2016-06-21', 10),
    (5, 2, '2015-11-04', 9),
    (6, 2, '2015-11-22', 8),
    (7, 2, '2016-07-11', 10),
    (8, 2, '2016-08-15', 2);

I would like to return a recordset grouped by month that contains the rolling, most recent score (LAST_VALUE) for each user up to that month, provided that response is no more than 6 months old.

The result set would contain the following rows relating to User #1. NULL values shown for clarity but can be omitted:

+---------+------------+-------+
| User Id | Date       | Score |
+---------+------------+-------+
| 1       | 2015-11-01 | NULL  | <= No score submitted yet
| 1       | 2015-12-01 | NULL  | <= No score submitted yet
| 1       | 2016-01-01 | NULL  | <= No score submitted yet 
| 1       | 2016-02-01 | NULL  | <= No score submitted yet
| 1       | 2016-03-01 | 8     |
| 1       | 2016-04-01 | 8     |
| 1       | 2016-05-01 | 4     |
| 1       | 2016-06-01 | 4     |
| 1       | 2016-07-01 | 10    |
| 1       | 2016-08-01 | 10    |
| 1       | 2016-09-01 | 2     |
| 1       | 2016-10-01 | 2     |
+---------+------------+-------+

And for #2:

+---------+------------+-------+
| User Id | Date       | Score |
+---------+------------+-------+
| 2       | 2015-11-01 | NULL  | <= No score submitted yet
| 2       | 2015-12-01 | 8     |
| 2       | 2016-01-01 | 8     |  
| 2       | 2016-02-01 | 8     |
| 2       | 2016-03-01 | 8     |
| 2       | 2016-04-01 | 8     |
| 2       | 2016-05-01 | 8     |
| 2       | 2016-06-01 | NULL  | <= 2015-11-22 SCORE OLDER THAN 6 MONTHS
| 2       | 2016-07-01 | NULL  | <= 2015-11-22 SCORE OLDER THAN 6 MONTHS
| 2       | 2016-08-01 | 10    | 
| 2       | 2016-09-01 | 2     |
| 2       | 2016-10-01 | 2     |
+---------+------------+-------+

All 12 months (or the entire series) should be populated unless every value is NULL. One option might be generate_series() in Postgres, or a numbers table in Redshift (https://www.periscopedata.com/blog/generate-series-in-redshift-and-mysql.html).

Ultimately I'm going to need to reproduce the same thing grouped by year and week of the year, but I can probably extrapolate those queries if I can figure this out.

Solution

2015-11-22 is not older than 6 months as of 2016-05-01 unless you also truncate the response date to the month.

> select '2015-11-22'::date + interval '6 months';
+---------------------+
| ?column?            |
|---------------------|
| 2016-05-22 00:00:00 |
+---------------------+

> select date_trunc('month', '2015-11-22'::date) + interval '6 months';
+---------------------------+
| ?column?                  |
|---------------------------|
| 2016-05-01 00:00:00+02:00 |
+---------------------------+

So the query changes slightly depending on what you consider expired. I've included both versions: the date condition that follows the example is active, and the exact 6-month calculation is left in a comment.
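
To see where the two expiry rules diverge, here is a minimal sketch that evaluates both of them for the 2015-11-22 response at three monthly ticks. It assumes PostgreSQL; the column names fresh_exact and fresh_truncated are invented for illustration and are not part of the queries below.

```sql
-- Illustrative comparison only; fresh_exact and fresh_truncated are
-- hypothetical column names introduced for this sketch.
select
    t.tick::date as month_start,
    -- exact rule: the response falls within the 6 months up to the tick
    (date '2015-11-22' between t.tick - interval '6 months' and t.tick) as fresh_exact,
    -- truncated rule: start of the response month + 6 months must be after the tick
    (date '2015-11-22' < t.tick
        and date_trunc('month', timestamp '2015-11-22') + interval '6 months' > t.tick
    ) as fresh_truncated
from generate_series(
    timestamp '2016-04-01',
    timestamp '2016-06-01',
    interval '1 month'
) as t(tick)
order by month_start;
```

The two columns should disagree only for the 2016-05-01 row, where the exact rule still treats the score as fresh and the truncated rule does not.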

Using Lateral join:

with users as (
    -- I assume you have user table so this can be omitted
    -- first and last are used to limit the join
    select user_id, min(date) as first, max(date) as last from responses group by user_id
), boundaries as (
    select 
        date_trunc('month', min(date)) as low 
        --, date_trunc('month', max(date)) as high 
        -- If you want to use the high in generate_series as upper boundary
    from responses
)
select user_id, tick as "date", score 
from 
    generate_series(
        (select low from boundaries), 
        date_trunc('month', current_date + interval '1 month'), interval '1 month'
    ) as tick
join users on tick.tick between users.first and users.last + interval '6 months'
-- # If you omit the users CTE and you have a users table
-- cross join users
left outer join lateral (
    select score from responses 
    where users.user_id = responses.user_id 
    -- # Proper 6 months calculation
    --and responses.date between tick.tick - interval '6 months' and tick.tick 
    -- # As example showed
    and responses.date < tick.tick and date_trunc('month', responses.date) + interval '6 months' > tick.tick 
    order by responses.date desc limit 1
) a on true
order by 1, 2;

Using Window functions:

with boundaries as (
    select 
        date_trunc('month', min(date)) as low
    from responses
)
select distinct
    responses.user_id,
    tick as "date",
    -- newest response inside the window for this user and month
    first_value(score) over (partition by responses.user_id, tick.tick order by responses.date desc) as score
from 
    generate_series(
        (select low from boundaries), 
        date_trunc('month', current_date + interval '1 month'), interval '1 month'
    ) as tick
join responses on (
    -- # Proper 6 months calculation
    -- responses.date between tick.tick - interval '6 months' and tick.tick 
    -- # As it was in the example
    responses.date < tick.tick and 
    date_trunc('month', responses.date) > tick.tick - interval '6 months'
) 
order by 1, 2;
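
The question also mentions needing the same result per year and week. One possible adaptation of the lateral-join version is sketched below, assuming PostgreSQL, ISO weeks (Monday start), and the exact 6-month rule. The CTE name user_span and the output columns week_start, iso_year and iso_week are hypothetical names chosen for this sketch.

```sql
-- Hypothetical weekly variant: one row per user per ISO week.
with user_span as (
    -- first and last response per user, used only to limit the join
    select user_id, min(date) as first_date, max(date) as last_date
    from responses
    group by user_id
)
select
    user_span.user_id,
    t.tick::date as week_start,
    extract(isoyear from t.tick) as iso_year,
    extract(week from t.tick) as iso_week,
    a.score
from generate_series(
    (select date_trunc('week', min(date)::timestamp) from responses),
    date_trunc('week', current_date::timestamp + interval '1 week'),
    interval '1 week'
) as t(tick)
join user_span
    on t.tick between user_span.first_date and user_span.last_date + interval '6 months'
left outer join lateral (
    -- most recent response that is at most 6 months old at this tick
    select score
    from responses
    where responses.user_id = user_span.user_id
      and responses.date between t.tick - interval '6 months' and t.tick
    order by responses.date desc
    limit 1
) a on true
order by 1, 2;
```

Swapping the lateral subquery's date condition for the truncated form used above would reproduce the other expiry rule at weekly granularity.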

Licensed under: CC-BY-SA with attribution
