Question

I found solutions that find the next event date, but not ones that would include all the data from the event. With cheating, I can get it done, but that only works in mysql and fails in vertica.

Here is the problem I am trying to solve:

I want to show all events a with data from the first event X that follows a and is not of type a. So here is the cut and paste example so you can play with it to see what actually works:

CREATE TABLE events (user_id int ,created_at int, event varchar(20));
INSERT INTO events values (0,0, 'a');
INSERT INTO events values (0,1, 'b');
INSERT INTO events values (0,2, 'c');
INSERT INTO events values (0,3, 'a');
INSERT INTO events values (0,4, 'c');
INSERT INTO events values (0,5, 'b');
INSERT INTO events values (0,6, 'a');
INSERT INTO events values (0,7, 'a');
INSERT INTO events values (0,8, 'd');

SELECT * FROM events;
+---------+------------+-------+
| user_id | created_at | event |
+---------+------------+-------+
|       0 |          0 | a     |
|       0 |          1 | b     |
|       0 |          2 | c     |
|       0 |          3 | a     |
|       0 |          4 | c     |
|       0 |          5 | b     |
|       0 |          6 | a     |
|       0 |          7 | a     |
|       0 |          8 | d     |
+---------+------------+-------+
9 rows in set (0.00 sec)

Here is the result I know how to get in both, but I cannot seem to be able to get the event info in it as well:

SELECT user_id, MAX(purchased) AS purchased, spent 
FROM ( 
    SELECT
        e1.user_id AS user_id, e1.created_at AS purchased, 
        MIN(e2.created_at) AS spent
    FROM events e1, events e2
    WHERE
        e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND
        e1.Event = 'a' AND e2.Event != 'a'
    GROUP BY e1.user_id, e1.created_at
) e3 GROUP BY user_id, spent;

 user_id | purchased | spent 
---------+-----------+-------
       0 |         0 |     1
       0 |         3 |     4
       0 |         7 |     8

Now if I want the event type in there as well, it does not work with the query above, because you either have to use the event field in the group-by (not what we want) or with an aggregate (not what we want either). Funny enough in mysql it works, but I consider that cheating and since I have to use vertica for this, it won't help me:

SELECT user_id, MAX(purchased) as purchased, spent, event FROM (
    SELECT 
        e1.User_ID AS user_id, 
        e1.created_at AS purchased, 
        MIN(e2.created_at) AS spent, 
        e2.event AS event 
    FROM events e1, events e2 
    WHERE 
        e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND 
        e1.Event = 'a' AND e2.Event != 'a' 
    GROUP BY
        e1.user_id,e1.created_at
 ) e3 GROUP BY user_id, spent;


+---------+-----------+-------+-------+
| user_id | purchased | spent | event |
+---------+-----------+-------+-------+
|       0 |         0 |     1 | b     |
|       0 |         3 |     4 | c     |
|       0 |         7 |     8 | d     |
+---------+-----------+-------+-------+
3 rows in set (0.00 sec)

For vertica the same query throws an error: ERROR 2640: Column "e2.event" must appear in the GROUP BY clause or be used in an aggregate function

Whats an elegant solution to get the two events paired up with all their columns and without cheating so that I can get the same result as shown above when executing in vertica or some other database that does not allow the cheat? In the sample data I have exactly one additional column I want, which is the event type, but in the real world it would be two or three columns.

Please try it out with the sample data posted before answering :)

Was it helpful?

Solution

Ok, while I'm not 100% sure I understand what you're trying to do, see if this won't work:

SELECT e3.user_id, MAX(e3.purchased) AS purchased, e3.spent, e.event
FROM ( 
    SELECT
        e1.user_id AS user_id, e1.created_at AS purchased, 
        MIN(e2.created_at) AS spent
    FROM events e1, events e2
    WHERE
        e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND
        e1.Event = 'a' AND e2.Event != 'a'
    GROUP BY e1.user_id, e1.created_at
) e3 
 JOIN events e on e3.user_id = e.user_id and e3.spent = e.created_at
GROUP BY e3.user_id, e3.spent, e.event

Essentially I'm just joining on the event table again assuming user_id and created_at are your primary key.

And here is the SQL Fiddle.

OTHER TIPS

Try this...

With    Cte As
(
        Select  Row_Number() Over (Partition By [user_id] Order By [created_at]) As row_num,
                [user_id],
                [created_at],
                [event]
        From    [events]
)
Select  c1.[user_id],
        c1.[created_at] As purchased,
        c2.[created_at] As spent,
        c2.[event]
From    Cte c1
Left    Join Cte c2
        On  c1.row_num = c2.row_num - 1
Where   c1.event = 'a'
And     c2.event <> 'a'

I usually do the "next" calculations using correlated subqueries, and then join back to the original table. In this case, I am assuming that , uniquely defines a row.

Here is the query:

SELECT user_id, MAX(purchased) as purchased, spent, event
FROM (
    SELECT e.User_ID, e.created_at AS purchased, 
           MIN(enext.created_at) AS spent,
           min(enext.event) AS event 
    FROM (select e.*,
                 (select MIN(e2.created_at)
                  from event e2
                  where e2.user_id = e.user_id and e2.created_at > e.created_at and e2.event <> 'a'
                 ) nextcreatedat
          from events e
          where e.event = 'a'
         ) e left outer join
         events enext
         on e.user_id = enext.user_id and
            e.nextcreatedat = enext.create_at
    GROUP BY e.user_id, e.created_at
    ) e3
 GROUP BY user_id, spent;

The aggregation GROUP BY e.user_id, e.created_at is not necessary, but I've left it in to remain similar to the original query.

Because Vertica supports cumulative sums, there is a way to do this more efficiently, but it would not work in MySQL.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top