How to merge all the data to have a final dataset [closed]

https://datascience.stackexchange.com/questions/85692

16-12-2020

Question

I am working on a problem that involves several tables. The end goal is to predict whether a customer will end up subscribing, based on their purchases.

The main table contains user_id, reason_reg and source:

|user_id|reason_reg|source|
|-------|----------|------|
|   1   |     2    |  3   |
|   2   |     3    |  1   |

Then I have the purchase data, where a customer can have more than one entry:

|user_id|product_id|
|-------|----------|
|   1   |     A    |
|   1   |     B    |
|   1   |     C    |
|   1   |     D    |
|   2   |     A    |
|   2   |     E    |

Ideally I want a single dataset where the unique identifier is user_id and there are no duplicated rows for this value.

The final dataset (in my head) could look like this:

|user_id|reason_reg|source|product_id_A|product_id_B|product_id_C|product_id_D|product_id_E|
|-------|----------|------|------------|------------|------------|------------|------------|
|   1   |     2    |   3  |      1     |      1     |      1     |      1     |      0     |
|   2   |     3    |   1  |      1     |      0     |      0     |      0     |      1     |

My questions are:

Is the approach correct?
Is there a dataframe method or library that does this automatically, or should I do it myself with pandas before feeding the data to an algorithm?
In your opinion, is there a better way to approach the problem? (I could also add additional columns, such as total_products with the number of products the user bought.)

Solution

A slightly hacky way to get there, maybe, but you can do this to get what you want from the second table (called df2 here):

import pandas as pd

df2['count'] = 1  # marker: every purchase row counts as 1
pivot = df2.pivot_table(index='user_id', columns='product_id', values='count', aggfunc='max').reset_index()
pivot = pivot.fillna(0)  # products a user never bought become 0

You would then merge this with the first table (called df1 here) like this:

finaldf = pd.merge(df1, pivot, on='user_id')
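To make the pivot-then-merge idea concrete, here is a minimal end-to-end sketch built on the two sample tables from the question. The `product_id_` prefix, the `how="left"` join and the final integer cast are assumptions chosen to reproduce the layout the question asks for; they are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the two tables in the question.
df1 = pd.DataFrame({"user_id": [1, 2], "reason_reg": [2, 3], "source": [3, 1]})
df2 = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "product_id": ["A", "B", "C", "D", "A", "E"],
})

# Marker column: every purchase row contributes a 1.
df2["count"] = 1

# One row per user, one column per product. aggfunc="max" keeps the flag
# at 1 even if a user bought the same product more than once.
pivot = (
    df2.pivot_table(index="user_id", columns="product_id",
                    values="count", aggfunc="max", fill_value=0)
       .add_prefix("product_id_")
       .reset_index()
)
pivot.columns.name = None  # drop the leftover "product_id" axis label

# Assumption: a left join, so users without purchases are kept
# and get 0 in every product column.
final_df = df1.merge(pivot, on="user_id", how="left").fillna(0)

# Cast the indicator columns to plain integers.
product_cols = [c for c in final_df.columns if c.startswith("product_id_")]
final_df[product_cols] = final_df[product_cols].astype(int)

print(final_df)
```

With the default inner join, users who appear in the main table but have no purchases would be dropped, which is why the sketch uses a left join followed by `fillna(0)`.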

Another great tool for generating dummies for categorical variables is:

pd.get_dummies()
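As a rough sketch of how `pd.get_dummies` could be applied to the purchase table from the question: it produces one indicator row per purchase, so a `groupby` is still needed to get down to one row per user. The use of `max()` for that collapse is an assumption about the desired result (a 0/1 flag rather than a purchase count).

```python
import pandas as pd

# Purchase table from the question.
df2 = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "product_id": ["A", "B", "C", "D", "A", "E"],
})

# One indicator column per product, still one row per purchase.
# Columns are named product_id_A, product_id_B, ... automatically.
dummies = pd.get_dummies(df2, columns=["product_id"], dtype=int)

# Collapse to one row per user: max() gives 1 if the user ever
# bought the product, 0 otherwise.
per_user = dummies.groupby("user_id", as_index=False).max()

print(per_user)
```

Replacing `max()` with `sum()` would give the number of purchases per product instead of a flag.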

The approach seems OK to me, and adding some more features would not be a bad idea either.
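One possible way to add such features, sketched on the sample tables from the question. `total_products` is the column suggested in the question; `distinct_products` is a hypothetical extra added here only for illustration.

```python
import pandas as pd

# Sample data copied from the two tables in the question.
df1 = pd.DataFrame({"user_id": [1, 2], "reason_reg": [2, 3], "source": [3, 1]})
df2 = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "product_id": ["A", "B", "C", "D", "A", "E"],
})

# Per-user aggregates: number of purchase rows and number of
# different products (the latter is a hypothetical feature).
features = (
    df2.groupby("user_id")["product_id"]
       .agg(total_products="count", distinct_products="nunique")
       .reset_index()
)

# Left join so users without purchases are kept, with 0 for both features.
enriched = df1.merge(features, on="user_id", how="left").fillna(
    {"total_products": 0, "distinct_products": 0}
)

print(enriched)
```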

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange