How do you select the rolling, most recent value for each person for each month in a range where that most recent value is less than 6 months old?
03-10-2020
Question
I need help writing a SQL query for Redshift (Postgres will do) likely involving WINDOW functions, PARTITIONS, LAST_VALUE and other things above my head.
Users can submit survey responses at any time (usually every few months). After 6 months the score is no longer fresh/relevant.
DROP TABLE IF EXISTS users;
CREATE TABLE users (
    user_id INTEGER UNIQUE,
    user_name VARCHAR(50)
);
INSERT INTO users VALUES
    (1, 'Steve Wozniak'),
    (2, 'Steve Jobs'),
    (3, 'Tony Ive');

DROP TABLE IF EXISTS responses;
CREATE TABLE responses (
    response_id INTEGER UNIQUE,
    user_id INTEGER,
    date DATE,
    score INTEGER
);
INSERT INTO responses VALUES
    (1, 1, '2016-08-21', 2),
    (2, 1, '2016-02-04', 8),
    (3, 1, '2016-04-11', 4),
    (4, 1, '2016-06-21', 10),
    (5, 2, '2015-11-04', 9),
    (6, 2, '2015-11-22', 8),
    (7, 2, '2016-07-11', 10),
    (8, 2, '2016-08-15', 2);
I would like to return a recordset grouped by month that contains the rolling, most recent score (LAST_VALUE) for each user up to that month, provided that response is no more than 6 months old.
The result set would contain the following rows relating to User #1. NULL values shown for clarity but can be omitted:
+---------+------------+-------+
| User Id | Date | Score |
+---------+------------+-------+
| 1 | 2015-11-01 | NULL | <= No score submitted yet
| 1 | 2015-12-01 | NULL | <= No score submitted yet
| 1 | 2016-01-01 | NULL | <= No score submitted yet
| 1 | 2016-02-01 | NULL | <= No score submitted yet
| 1 | 2016-03-01 | 8 |
| 1 | 2016-04-01 | 8 |
| 1 | 2016-05-01 | 4 |
| 1 | 2016-06-01 | 4 |
| 1 | 2016-07-01 | 10 |
| 1 | 2016-08-01 | 10 |
| 1 | 2016-09-01 | 2 |
| 1 | 2016-10-01 | 2 |
+---------+------------+-------+
And for #2:
+---------+------------+-------+
| User Id | Date | Score |
+---------+------------+-------+
| 2 | 2015-11-01 | NULL | <= No score submitted yet
| 2 | 2015-12-01 | 8 |
| 2 | 2016-01-01 | 8 |
| 2 | 2016-02-01 | 8 |
| 2 | 2016-03-01 | 8 |
| 2 | 2016-04-01 | 8 |
| 2 | 2016-05-01 | 8 |
| 2 | 2016-06-01 | NULL | <= 2015-11-22 SCORE OLDER THAN 6 MONTHS
| 2 | 2016-07-01 | NULL | <= 2015-11-22 SCORE OLDER THAN 6 MONTHS
| 2 | 2016-08-01 | 10 |
| 2 | 2016-09-01 | 2 |
| 2 | 2016-10-01 | 2 |
+---------+------------+-------+
All 12 months (or the entire series) should be populated unless every value is NULL. Possibly use generate_series() in Postgres or a numbers table in Redshift (https://www.periscopedata.com/blog/generate-series-in-redshift-and-mysql.html).
The NULL values can be omitted (shown for clarity).
Ultimately I'm going to need to reproduce the same result grouped by year and by week of the year, but I can probably extrapolate those queries if I can figure this one out.
Solution
Note that 2015-11-22 is not older than 6 months on 2016-05-01 unless you also truncate the response date to the month:
> select '2015-11-22'::date + interval '6 months';
+---------------------+
| ?column? |
|---------------------|
| 2016-05-22 00:00:00 |
+---------------------+
> select date_trunc('month', '2015-11-22'::date) + interval '6 months';
+---------------------------+
| ?column? |
|---------------------------|
| 2016-05-01 00:00:00+02:00 |
+---------------------------+
So the query changes a bit depending on what you consider expired. I've included both versions: the condition that reproduces the example is active, and the exact 6-month date condition is left in a comment.
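To see where the two rules actually disagree, here is a small sketch that evaluates both for the 2015-11-22 response over three month starts. The column aliases (month_start, fresh_exact, fresh_truncated) are names I made up for illustration, and the literals are cast to timestamp only to keep time zones out of the output.

```sql
-- Compare the two freshness rules for user 2's 2015-11-22 response.
select
    tick::date as month_start,
    -- exact rule: the response date lies within the 6 months before tick
    (timestamp '2015-11-22' between tick - interval '6 months' and tick) as fresh_exact,
    -- truncated rule: the response month plus 6 months is still after tick
    (date_trunc('month', timestamp '2015-11-22') + interval '6 months' > tick) as fresh_truncated
from generate_series(
    timestamp '2016-04-01',
    timestamp '2016-06-01',
    interval '1 month'
) as tick;
-- month_start | fresh_exact | fresh_truncated
-- 2016-04-01  | t           | t
-- 2016-05-01  | t           | f
-- 2016-06-01  | f           | f
```

The only month where the rules differ is 2016-05-01, so that is the row to check against your own definition of "expired".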
Using Lateral join:
with users as (
    -- I assume you have a users table, so this CTE can be omitted.
    -- first and last are used to limit the join.
    select user_id, min(date) as first, max(date) as last
    from responses
    group by user_id
), boundaries as (
    select
        date_trunc('month', min(date)) as low
        -- , date_trunc('month', max(date)) as high
        -- (if you want to use high as the upper boundary in generate_series)
    from responses
)
select user_id, tick as "date", score
from
    generate_series(
        (select low from boundaries),
        date_trunc('month', current_date + interval '1 month'),
        interval '1 month'
    ) as tick
    join users on tick.tick between users.first and users.last + interval '6 months'
    -- If you omit the users CTE and you have a users table:
    -- cross join users
    left outer join lateral (
        select score
        from responses
        where users.user_id = responses.user_id
            -- Proper 6 months calculation:
            -- and responses.date between tick.tick - interval '6 months' and tick.tick
            -- As the example showed:
            and responses.date < tick.tick
            and date_trunc('month', responses.date) + interval '6 months' > tick.tick
        order by responses.date desc
        limit 1
    ) a on true
order by 1, 2;
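If you want every month populated for every user, including the NULL rows from the question, the cross join mentioned in the comments can be spelled out roughly like this. It is only a sketch: it assumes the users and responses tables from the question, hard-codes the 2015-11-01 to 2016-10-01 range from the expected output, and uses the truncated-month rule.

```sql
-- One row per user per month; score stays NULL when no fresh response exists.
select u.user_id, tick::date as "date", a.score
from generate_series(
    timestamp '2015-11-01',   -- assumed range, taken from the question's example
    timestamp '2016-10-01',
    interval '1 month'
) as tick
cross join users u
left join lateral (
    -- latest response before this month that has not yet expired
    select r.score
    from responses r
    where r.user_id = u.user_id
      and r.date < tick
      and date_trunc('month', r.date::timestamp) + interval '6 months' > tick
    order by r.date desc
    limit 1
) a on true
order by 1, 2;
```

With the sample data this also returns twelve all-NULL rows for user 3, who has no responses. Add "where a.score is not null" if you would rather drop the empty rows.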
Using Window functions:
with boundaries as (
    select
        date_trunc('month', min(date)) as low
    from responses
)
select distinct
    responses.user_id,
    tick as "date",
    -- latest score among the responses that are still fresh for this month
    first_value(score) over (
        partition by responses.user_id, tick.tick
        order by responses.date desc
    ) as score
from
    generate_series(
        (select low from boundaries),
        date_trunc('month', current_date + interval '1 month'),
        interval '1 month'
    ) as tick
    join responses on (
        -- Proper 6 months calculation:
        -- responses.date between tick.tick - interval '6 months' and tick.tick
        -- As it was in the example:
        responses.date < tick.tick and
        date_trunc('month', responses.date) > tick.tick - interval '6 months'
    )
order by 1, 2;
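The question also asks about a weekly breakdown. The same window-function approach should carry over by stepping the series in weeks instead of months; the sketch below is an untested adaptation for Postgres only, not something I have checked on Redshift. It keeps the 6-month freshness window, uses the exact date rule rather than the truncated one, and the week_start alias is my own.

```sql
with boundaries as (
    -- date_trunc('week', ...) snaps to the Monday that starts the week
    select date_trunc('week', min(date)) as low
    from responses
)
select distinct
    r.user_id,
    tick::date as week_start,
    -- latest score among the responses still fresh for this week
    first_value(r.score) over (
        partition by r.user_id, tick
        order by r.date desc
    ) as score
from generate_series(
    (select low from boundaries),
    date_trunc('week', current_date + interval '1 week'),
    interval '1 week'
) as tick
join responses r
  on r.date < tick
 and r.date >= tick - interval '6 months'
order by 1, 2;
```

Because this is an inner join, weeks without a fresh response are simply omitted rather than returned as NULL rows.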