What are the approaches to aggregate categorical variables?

https://datascience.stackexchange.com/questions/46780

01-11-2019

Question

I am working on a clickstream dataset. I have come up with the following example dataset to explain my problem:

ClickTimeStamp        | SessionID | ART_weekOfYear | PagenameClicked | TimeSpentPerSession | CustID | ContractID | ... | TARGET |
2017-01-04 16:48:00   | 1         | 1              | P1              | 1                   | abc    | xyz        |     | 1      |
2017-01-04 16:48:53   | 1         | 1              | P2              | 1                   | abc    | xyz        |     | 1      |
2017-01-11 10:09:57   | 2         | 2              | P1              | 2                   | abc    | xyz        |     | 1      |
2017-01-11 10:11:24   | 2         | 2              | P4              | 2                   | abc    | xyz        |     | 1      |
2017-01-27 13:22:39   | 3         | 4              | P1              | 2                   | abc    | mnp        |     | 0      |
2017-01-27 13:24:01   | 3         | 4              | P7              | 2                   | abc    | mnp        |     | 0      |

Each row of the dataset above is a single click, and TARGET indicates (let's say) whether the contract was retained (1) or not (0). Keep in mind that TARGET is defined at the contract level.

Now I aggregate the dataset above as needed (i.e. per ContractID), and the training set looks like this:

CustID | ContractID | ... | SessionID_conct | ART_weekOfYear_conct | PagenameClicked  | TimeSpentPerSession_avg | TARGET |
abc    | xyz        |     | "1-2"           | "1-2"                | "P1->P2->P1->P4" | 1.5                     | 1      |
abc    | mnp        |     | "3"             | "4"                  | "P1->P7"         | 2                       | 0      |
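
For reference, the aggregation described above can be reproduced with pandas roughly as follows. This is a minimal sketch, not the exact code used: the helper names `join_unique` and `join_path` are made up for illustration, the elided `...` columns are left out, and TARGET is assumed to be constant within a contract.

```python
import pandas as pd

# Click-level rows copied from the example table (the "..." columns are omitted).
clicks = pd.DataFrame({
    "ClickTimeStamp": pd.to_datetime([
        "2017-01-04 16:48:00", "2017-01-04 16:48:53",
        "2017-01-11 10:09:57", "2017-01-11 10:11:24",
        "2017-01-27 13:22:39", "2017-01-27 13:24:01",
    ]),
    "SessionID": [1, 1, 2, 2, 3, 3],
    "ART_weekOfYear": [1, 1, 2, 2, 4, 4],
    "PagenameClicked": ["P1", "P2", "P1", "P4", "P1", "P7"],
    "TimeSpentPerSession": [1, 1, 2, 2, 2, 2],
    "CustID": ["abc"] * 6,
    "ContractID": ["xyz"] * 4 + ["mnp"] * 2,
    "TARGET": [1, 1, 1, 1, 0, 0],
})


def join_unique(values):
    # Hypothetical helper: distinct values, in order of first appearance, joined by "-".
    return "-".join(values.astype(str).drop_duplicates())


def join_path(values):
    # Hypothetical helper: the full click path, keeping repeated pages.
    return "->".join(values)


train = (
    clicks.sort_values("ClickTimeStamp")
    .groupby(["CustID", "ContractID"], sort=False)
    .agg(
        SessionID_conct=("SessionID", join_unique),
        ART_weekOfYear_conct=("ART_weekOfYear", join_unique),
        PagenameClicked=("PagenameClicked", join_path),
        TimeSpentPerSession_avg=("TimeSpentPerSession", "mean"),
        TARGET=("TARGET", "first"),  # assumes TARGET is constant per contract
    )
    .reset_index()
)
print(train)
```

Note that the average here is taken over click rows, which is what gives 1.5 for contract xyz in the example.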

PROBLEM: For numerical features I simply took the average (as for TimeSpentPerSession_avg), but for categorical features aggregation is not straightforward. In reality, my categorical features, such as "PagenameClicked", have very high cardinality, so I cannot simply convert them to dummy variables and then aggregate those as numerical features.

I would like to know possible ways to represent categorical features such that the dimensionality doesn't explode and the new representation can still be aggregated per ContractID.

I have tried Entity Embeddings and read this paper for details. I transformed each categorical feature into a 16-dimensional embedding representation. However, I am now stuck on how to aggregate these embedding vectors for each ContractID. Kindly let me know if anyone has worked in this direction or has a better solution.
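
To make the stuck step concrete, here is a sketch of one common option, element-wise pooling (mean and max) of the click-level embedding vectors per ContractID. It is only an illustration under stated assumptions: the `page_embedding` table holds random vectors standing in for the learned entity embeddings, and the column names `page_emb_*` are made up. Pooling of this kind ignores the order of clicks, so it may or may not be adequate here.

```python
import numpy as np
import pandas as pd

EMB_DIM = 16  # embedding size mentioned in the question

clicks = pd.DataFrame({
    "ContractID": ["xyz", "xyz", "xyz", "xyz", "mnp", "mnp"],
    "PagenameClicked": ["P1", "P2", "P1", "P4", "P1", "P7"],
})

# Assumption: random vectors stand in for the learned embedding table.
rng = np.random.default_rng(0)
pages = sorted(clicks["PagenameClicked"].unique())
page_embedding = {page: rng.normal(size=EMB_DIM) for page in pages}

# One embedding row per click, aligned with the click-level index.
emb = pd.DataFrame(
    np.stack([page_embedding[p] for p in clicks["PagenameClicked"]]),
    columns=[f"page_emb_{i}" for i in range(EMB_DIM)],
    index=clicks.index,
)

# Pool element-wise per contract: mean and max give 2 * EMB_DIM features,
# regardless of how many clicks or distinct pages a contract has.
grouped = emb.groupby(clicks["ContractID"])
contract_features = pd.concat(
    [grouped.mean().add_suffix("_mean"), grouped.max().add_suffix("_max")],
    axis=1,
)
print(contract_features.shape)
```

The resulting frame has one row per ContractID and a fixed width, so it can be joined onto the aggregated numerical features.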

Thanks a lot for reading this question. :)

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange