Question

My problem is pretty general and can probably be solved in many ways. But what is a smart way considering time and memory?

I have time series data of user interactions of the following form:

cookie_id     interaction
---------     -----------
1234          did_something
1234          viewed_banner*
1234          did_something
1234          did_something
1234          viewed_and_clicked_banner*
...           ...

I want to use it to train models that predict whether a user will click on a banner whenever one is displayed (i.e. the interactions marked with *). To do this I need to aggregate all previous interactions whenever a point of interest (either viewed_banner or viewed_and_clicked_banner) shows up in the feed:

cookie_id     interaction
---------     -----------
1234          did_something
1234          viewed_banner               <- point of interest

cookie_id     interaction
---------     -----------
1234          did_something
1234          viewed_banner
1234          did_something
1234          did_something
1234          viewed_and_clicked_banner   <- point of interest

This is the core of the problem: splitting the data up into overlapping groups! After doing this, each group can be aggregated into, for instance:

cookie_id   did_something   viewed_banner   viewed_and_cli...   clicked?
---------   -------------   -------------   -----------------   --------
1234        1               0               0                   no
1234        3               1               0                   yes

Here the numbers in did_something and viewed_banner are the counts of these interactions (not including the current point of interest), but other kinds of aggregation could be performed as well. The clicked? attribute simply records which of the two kinds of "point of interest" was the last interaction in the feed.

I have tried looking at Pandas' apply and groupby methods, but cannot come up with something that generates the desired overlapping groups.

The alternative is to use some for-loops, but I would rather avoid that if there is a simple and efficient way to solve the problem. A rough sketch of the loop-based version I have in mind is shown below.
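This is only a sketch, assuming a single cookie_id and that points of interest carry a trailing *; aggregate_with_loop is just an illustrative name:

from collections import Counter

def aggregate_with_loop(interactions):
    """Snapshot the running interaction counts at every point of interest."""
    rows, counts = [], Counter()
    for raw in interactions:
        if raw.endswith("*"):
            name = raw.rstrip("*")
            # Snapshot everything seen so far, plus the clicked? label.
            rows.append(dict(counts, clicked="clicked" in name))
            counts[name] += 1  # the point of interest counts for later groups
        else:
            counts[raw] += 1
    return rows

feed = ["did_something", "viewed_banner*", "did_something",
        "did_something", "viewed_and_clicked_banner*"]
print(aggregate_with_loop(feed))
# [{'did_something': 1, 'clicked': False},
#  {'did_something': 3, 'viewed_banner': 1, 'clicked': True}]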


Solution

Here is what I tried; I think more data is needed to verify the code:

data = """cookie_id     interaction
1234          did_something
1234          viewed_banner*
1234          did_something
1234          did_something
1234          viewed_and_clicked_banner*
"""

import io

import pandas as pd

# Parse the whitespace-delimited sample into a DataFrame.
df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Mark the points of interest: rows whose interaction ends with "*".
flag = df.interaction.str.endswith("*")

# Build a grouping key: each run of rows without "*" gets one label and
# each "*" row gets its own label (see the step-by-step trace below).
group_flag = flag.astype(float).mask(~flag).ffill(limit=1).fillna(0).cumsum()

# The "*" markers are no longer needed.
df["interaction"] = df.interaction.str.rstrip("*")
interest_df = df[flag]  # only the point-of-interest rows

def f(s):
    return s.value_counts()

# Count interactions per group, accumulate the counts across groups, and
# keep every other row (the groups without "*").
df2 = df.groupby(group_flag).interaction.apply(f).unstack().fillna(0).cumsum()
result = df2[::2].reset_index(drop=True)
result["clicked"] = interest_df.interaction.str.contains("clicked").reset_index(drop=True)
print(result)

output:

  did_something  viewed_and_clicked_banner  viewed_banner clicked
0              1                          0              0   False
1              3                          0              1    True

The basic idea is to split the dataframe into groups:

  • even-numbered groups (0, 2, …) are runs of consecutive rows without *
  • odd-numbered groups (1, 3, …) each contain exactly one row ending with *

It assumes that the first row of the dataframe does not end with *.

Then do value_counts for every group and combine the results into a dataframe. Applying cumsum() to the counts and keeping every other row (df2[::2], i.e. the even-numbered groups) gives the right running counts for each point of interest.
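To make the grouping key concrete, here is what each step of the group_flag expression produces for the sample data (a hand-written trace; the comments show the values each step yields):

import pandas as pd

flag = pd.Series([False, True, False, False, True])
s = flag.astype(float).mask(~flag)  # NaN  1.0  NaN  NaN  1.0
s = s.ffill(limit=1)                # NaN  1.0  1.0  NaN  1.0
s = s.fillna(0)                     # 0.0  1.0  1.0  0.0  1.0
group_flag = s.cumsum()             # 0.0  1.0  2.0  2.0  3.0  <- group labels

The ffill(limit=1) step pushes each 1 exactly one row forward, so cumsum() increases once at every * row and once again on the row after it; this is what gives each * row its own group.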

I don't know how the clicked column is calculated. Can you explain this in detail?
