Here is what I tried; I think it needs more data to verify the code:
data = """cookie_id interaction
1234 did_something
1234 viewed_banner*
1234 did_something
1234 did_something
1234 viewed_and_clicked_banner*
"""
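import pandas as pd
import io
# Python 3: data is a str, so wrap it in StringIO; sep=r"\s+" splits on whitespace
df = pd.read_csv(io.StringIO(data), sep=r"\s+")
# True for rows whose interaction ends with "*"
flag = df.interaction.str.endswith("*")
# Start a new group at every "*" row and at the row right after it
group_flag = flag.astype(float).mask(~flag).ffill(limit=1).fillna(0).cumsum()
df["interaction"] = df.interaction.str.rstrip("*")
interest_df = df[flag]
def f(s):
    return s.value_counts()
# Per-group counts, one column per interaction, accumulated over the groups
df2 = df.groupby(group_flag).interaction.apply(f).unstack().fillna(0).cumsum()
# Keep every other row: the groups of rows without "*"
result = df2[::2].reset_index(drop=True)
result["clicked"] = interest_df.interaction.str.contains("clicked").reset_index(drop=True)
print(result)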
Output:
   did_something  viewed_and_clicked_banner  viewed_banner  clicked
0              1                          0              0    False
1              3                          0              1     True
The basic idea is to split the dataframe into groups:
- odd groups are runs of consecutive rows without *
- even groups are single rows with *
This assumes that the first row in the dataframe does not end with *.
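To make the grouping step easier to follow, here is a minimal sketch that breaks the group_flag one-liner into its intermediate steps. The flag values are typed in by hand to mirror the sample data, and the names marked, spread and starts are only illustrative.

```python
import pandas as pd

# Flags mirroring the sample data: True marks a row whose interaction ends with "*".
flag = pd.Series([False, True, False, False, True])

marked = flag.astype(float).mask(~flag)   # 1.0 on "*" rows, NaN elsewhere
spread = marked.ffill(limit=1)            # also mark the row right after each "*" row
starts = spread.fillna(0)                 # 1.0 wherever a new group begins, else 0.0
group_flag = starts.cumsum().astype(int)  # running total of starts = group label

print(group_flag.tolist())  # [0, 1, 2, 2, 3]
```

Labels 0 and 2 are the runs of rows without * (the first and third groups), while labels 1 and 3 are the single * rows. If two * rows are adjacent, or the first row has a *, this alternation no longer holds, which is why the assumption above matters.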
Then call value_counts() for every group and combine the results into a dataframe. Applying cumsum() to the counts and dropping the even rows gives the right counts.
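The counting step can also be sketched on its own. Note that this uses pd.crosstab in place of the groupby / value_counts / unstack chain; that is a swapped-in alternative and not what the code above does, although on the sample data it should give the same counts. The group labels and interactions are typed in by hand as an assumption.

```python
import pandas as pd

# Group labels and interactions mirroring the sample data ("*" already stripped).
group_flag = pd.Series([0, 1, 2, 2, 3], name="group")
interaction = pd.Series(
    ["did_something", "viewed_banner", "did_something",
     "did_something", "viewed_and_clicked_banner"],
    name="interaction",
)

# One row per group, one column per interaction; cells hold per-group counts.
counts = pd.crosstab(group_flag, interaction)

# Running totals over the groups, then keep only the groups without "*" (labels 0, 2, ...).
result = counts.cumsum()[::2].reset_index(drop=True)
print(result)
```

Each remaining row holds the cumulative counts of everything seen before the corresponding * row, because the * row itself sits in the following (dropped) group.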
I don't know how the clicked column is calculated. Can you explain this in detail?