How to enrich event-based data per row based on future events

https://dba.stackexchange.com/questions/137827

01-10-2020

Question

I'm trying to find a not-too-hacky way to solve the following issue. The data is currently available in both MySQL and Redshift, so a solution for either is fine, though Redshift is preferable.

Let's say I have a data set like this:

+----------+---------+---------------------+-----------+-------------+
| event_id | user_id | event_date          | meta_id_2 | meta_bool_1 |
+----------+---------+---------------------+-----------+-------------+
| 22829501 |       4 | 2016-04-23 09:30:00 |      1035 |           1 |
| 22829499 |       4 | 2016-04-23 09:30:00 |      1035 |           1 |
| 22896804 |       4 | 2016-04-25 09:30:00 |      1029 |           1 |
| 22717814 |       4 | 2016-04-26 08:30:00 |         1 |           1 |
| 22717817 |       4 | 2016-04-26 08:30:00 |         1 |           1 |
| 22717815 |       4 | 2016-04-26 08:30:00 |         1 |           1 |
| 22841023 |       4 | 2016-04-27 09:30:00 |        20 |           1 |
| 22841025 |       4 | 2016-04-27 09:30:00 |        20 |           1 |
| 23034222 |       4 | 2016-04-27 09:30:00 |        20 |           1 |
| 23073873 |       4 | 2016-04-30 08:30:00 |      1037 |           1 |
| 23072919 |       4 | 2016-05-03 08:00:00 |        19 |           1 |
| 23072922 |       4 | 2016-05-03 08:00:00 |        19 |           1 |
| 23072918 |       4 | 2016-05-03 08:00:00 |        19 |           1 |
| 23219747 |       4 | 2016-05-05 08:30:00 |         1 |           1 |
| 23219810 |       4 | 2016-05-06 08:30:00 |      1029 |           1 |
| 23219737 |       4 | 2016-05-08 09:45:00 |         5 |           1 |
| 23201307 |       4 | 2016-05-09 08:30:00 |      1029 |           1 |
| 23201309 |       4 | 2016-05-09 08:30:00 |      1029 |           1 |
| 22992337 |       7 | 2016-04-26 08:30:00 |         1 |           1 |
| 23016519 |       7 | 2016-04-29 08:30:00 |         4 |           1 |
| 23073876 |       7 | 2016-04-30 08:30:00 |      1037 |           1 |
| 22854488 |       7 | 2016-05-25 09:30:00 |        20 |           1 |
| 22854485 |       7 | 2016-05-25 09:30:00 |        20 |           1 |
| 22172836 |       9 | 2016-04-26 08:30:00 |         1 |           0 |
| 22172835 |       9 | 2016-04-30 09:30:00 |      1029 |           0 |
| 23199467 |       9 | 2016-05-03 08:30:00 |         1 |           0 |
| 23256119 |       9 | 2016-05-06 12:30:00 |      1029 |           0 |
| 23240659 |       9 | 2016-05-07 09:30:00 |      1029 |           0 |
| 23240629 |       9 | 2016-05-10 08:30:00 |         1 |           0 |
| 23240657 |       9 | 2016-05-14 09:30:00 |      1029 |           0 |
| 23240634 |       9 | 2016-05-17 08:30:00 |         1 |           0 |
| 23240654 |       9 | 2016-05-21 09:30:00 |      1029 |           0 |
| 23240635 |       9 | 2016-05-24 08:30:00 |         1 |           0 |
| 23240650 |       9 | 2016-05-28 09:30:00 |      1029 |           0 |
| 23240637 |       9 | 2016-05-31 08:30:00 |         1 |           0 |
| 23240642 |       9 | 2016-06-04 09:30:00 |      1029 |           0 |
| 22898124 |      10 | 2016-04-25 10:30:00 |         1 |           0 |
| 23032733 |      10 | 2016-04-27 08:30:00 |         1 |           0 |
| 23072866 |      10 | 2016-04-29 18:00:00 |         1 |           0 |
| 23092129 |      10 | 2016-05-02 19:30:00 |         1 |           0 |
+----------+---------+---------------------+-----------+-------------+

I want to have four additional columns for each row, each a true/false flag. One indicates whether there are one or more rows within 30 days of the current row (for the same user_id) with the same value for meta_id_2. Another indicates whether there are one or more rows within 30 days (for the same user_id) with the same value for meta_bool_1. The end result should have one row per event with the additional columns mentioned above.

I had started off doing something like this:

SELECT a.event_id,
  EXISTS(SELECT 1 FROM events b WHERE a.user_id = b.user_id AND DATEDIFF(day, b.event_date, a.event_date) <= 30 AND b.meta_id_2 = a.meta_id_2) AS has_same_meta_id_2_in_30
  FROM events a;

This works until you add more than one EXISTS, and Redshift does not support this correlated subquery pattern. MySQL is incredibly slow doing this on a large table, for obvious reasons.

Does anyone have a solution that might work for this type of thing?

Solution

Because the two dates are compared inside a function (DATEDIFF), the optimiser may be unable to use an index on the date column. Changing the condition to

b.event_date <= DATE_ADD(a.event_date, INTERVAL 30 DAY)

may help (or try it the other way around).
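
To make the suggestion concrete, here is a minimal sketch of the original EXISTS query with the date test written as a plain range on b.event_date, in MySQL syntax. It assumes "within 30 days" means a strictly later event no more than 30 days after the current one, as the title suggests. The original DATEDIFF test appears to bound only one side, and it also lets every row match itself. The index name and its column order are hypothetical, not something taken from the question.

```sql
-- Hypothetical composite index: the name and column order are assumptions.
CREATE INDEX idx_events_user_date ON events (user_id, event_date);

SELECT a.event_id,
       EXISTS (
         SELECT 1
         FROM events b
         WHERE b.user_id   = a.user_id
           AND b.meta_id_2 = a.meta_id_2
           -- Assumed window: strictly after the current event...
           AND b.event_date >  a.event_date
           -- ...and at most 30 days later. The column is compared directly,
           -- not wrapped in a function, so an index range scan is possible.
           AND b.event_date <= DATE_ADD(a.event_date, INTERVAL 30 DAY)
       ) AS has_same_meta_id_2_in_30
FROM events a;
```

Whether the optimiser actually uses the index depends on the data and the MySQL version, so it is worth checking the plan with EXPLAIN.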

Refactoring the query as an outer join will likely produce a different execution plan:

SELECT
  a.event_id,
  <other columns>
FROM events a
LEFT JOIN events b
  ON a.user_id = b.user_id
  AND DATEDIFF(day, b.event_date, a.event_date) <= 30
  AND b.meta_id_2 = a.meta_id_2

This pattern can be repeated for subsequent comparisons.
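
One thing to watch with the join form is that it returns one row per matching pair, not one row per event. A possible way to get back to one row per event, and to cover several comparisons with a single self-join, is conditional aggregation. The sketch below uses Redshift syntax and assumes the window is "strictly later, up to 30 days after the current event". The second output column name is made up for illustration.

```sql
-- Sketch only. Assumed window: b is later than a by at most 30 days.
SELECT
  a.event_id,
  -- 1 if any later row for the same user shares meta_id_2, else 0
  MAX(CASE WHEN b.meta_id_2 = a.meta_id_2 THEN 1 ELSE 0 END)
    AS has_same_meta_id_2_in_30,
  -- 1 if any later row for the same user shares meta_bool_1, else 0
  -- (column alias is hypothetical)
  MAX(CASE WHEN b.meta_bool_1 = a.meta_bool_1 THEN 1 ELSE 0 END)
    AS has_same_meta_bool_1_in_30
FROM events a
LEFT JOIN events b
  ON  b.user_id    = a.user_id
  AND b.event_date >  a.event_date
  AND b.event_date <= DATEADD(day, 30, a.event_date)
-- Collapse the matched pairs back to one row per event.
GROUP BY a.event_id;
```

Events with no later row in the window still appear, because the LEFT JOIN leaves b as NULL and both CASE expressions fall through to 0. In MySQL the same query should work with DATE_ADD(a.event_date, INTERVAL 30 DAY) in place of DATEADD.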

Licensed under: CC-BY-SA with attribution
