How to merge all the data to have a final dataset [closed]
16-12-2020
Question
I am working on a problem that spans several tables. The end goal is to predict whether a customer will end up subscribing, based on their purchases.
The main table contains user_id, reason_reg and source:
|user_id|reason_reg|source|
|-------|----------|------|
| 1 | 2 | 3 |
| 2 | 3 | 1 |
I then have the purchase data, where a customer can have more than one entry:
|user_id|product_id|
|-------|----------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 2 | A |
| 2 | E |
Ideally, I want a single dataset in which user_id is the unique identifier, so there are no duplicated rows for any user.
The final dataset (in my head) could look like this:
|user_id|reason_reg|source|product_id_A|product_id_B|product_id_C|product_id_D|product_id_E|
|-------|----------|------|------------|------------|------------|------------|------------|
| 1 | 2 | 3 | 1 | 1 | 1 | 1 | 0 |
| 2 | 3 | 1 | 1 | 0 | 0 | 0 | 1 |
My questions are:
Is the approach correct?
Is there a dataframe method or library that does this automatically, or should I do it myself with pandas before feeding the data to an algorithm?
In your opinion, is there a better way to approach the problem? (I could also add additional columns, such as total_products with the number of products the user bought.)
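Solution
This is perhaps a slightly hacky way to get there, but you can build what you want from the second table (df2 here) like this:
# Mark every purchase row with a 1 so the pivot has a value to aggregate
df2['count'] = 1
# One row per user and one column per product
pivot = df2.pivot_table(index='user_id', columns='product_id', values='count').reset_index()
# Products a user never bought come out as NaN, so replace them with 0
pivot = pivot.fillna(0)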
You would then merge this with the first dataset (df1 here, with pandas imported as pd) like this:
finaldf = pd.merge(df1, pivot, on='user_id')
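To make the pivot-and-merge steps concrete, here is a minimal end-to-end sketch built on the two sample tables from the question. The `aggfunc="max"`, `fill_value=0`, `add_prefix("product_id_")` and `how="left"` choices are assumptions added for illustration, not part of the original answer: they keep the indicators at 0/1 even if a user buys the same product twice, reproduce the column names from the question, and keep users who have no purchases.

```python
import pandas as pd

# Sample data copied from the two tables in the question.
df1 = pd.DataFrame({"user_id": [1, 2], "reason_reg": [2, 3], "source": [3, 1]})
df2 = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "product_id": ["A", "B", "C", "D", "A", "E"],
})

# Mark every purchase row with a 1 so the pivot has a value to aggregate.
df2["count"] = 1

# One row per user, one 0/1 column per product.
pivot = (
    df2.pivot_table(
        index="user_id",
        columns="product_id",
        values="count",
        aggfunc="max",   # assumption: repeat purchases still give 1
        fill_value=0,    # products never bought become 0 instead of NaN
    )
    .add_prefix("product_id_")  # assumption: match the names in the question
    .reset_index()
)
pivot.columns.name = None  # drop the leftover "product_id" axis label

# A left merge keeps every user from df1; users without purchases
# would get NaN in the product columns and need a fillna(0) afterwards.
finaldf = pd.merge(df1, pivot, on="user_id", how="left")
print(finaldf)
```

With the sample data, every user has at least one purchase, so no extra `fillna` is needed after the merge.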
Another handy function for generating dummy (one-hot) columns from categorical variables is
pd.get_dummies()
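As a rough sketch of how `pd.get_dummies()` could be applied to the purchase table from the question: it produces one row per purchase, so the rows still have to be collapsed to one per user. The `groupby(...).max()` step and the `per_user` name are assumptions added for illustration.

```python
import pandas as pd

# Purchase table copied from the question.
df2 = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "product_id": ["A", "B", "C", "D", "A", "E"],
})

# One-hot encode product_id; column names become product_id_A, product_id_B, ...
dummies = pd.get_dummies(df2, columns=["product_id"], dtype=int)

# Still one row per purchase, so collapse to one row per user.
# max() gives 1 if the user bought the product at least once, else 0.
per_user = dummies.groupby("user_id", as_index=False).max()
print(per_user)
```

The resulting `per_user` frame can be merged with the first table on `user_id` in the same way as the pivot.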
The approach seems OK to me, and engineering a few more features would not be a bad idea either.
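One such extra feature is the total_products column suggested in the question. A minimal sketch, assuming the same sample tables; the `totals` and `features` names are hypothetical:

```python
import pandas as pd

# Sample data copied from the two tables in the question.
df1 = pd.DataFrame({"user_id": [1, 2], "reason_reg": [2, 3], "source": [3, 1]})
df2 = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "product_id": ["A", "B", "C", "D", "A", "E"],
})

# Count purchase rows per user.
totals = (
    df2.groupby("user_id")["product_id"]
    .count()
    .rename("total_products")
    .reset_index()
)

# Left merge keeps users with no purchases; give them a count of 0.
features = df1.merge(totals, on="user_id", how="left")
features["total_products"] = features["total_products"].fillna(0).astype(int)
print(features)
```

This counts purchase rows, so a product bought twice counts twice; using `nunique()` instead of `count()` would count distinct products.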