Question

In my dataset I have a 'text' column and a 'followers' column containing lists of follower IDs, e.g. '1093777852477116417, 936194589043683328, ...'. Some of the 'followers' values contain thousands of IDs.

I am preprocessing the data for LSTM, and I will do word embedding on the text column.

My question is, should I add the follower IDs to the word embedding of the text column, or should I hash the follower IDs and add an extra LSTM input layer for the IDs?

Thanks in advance!


Solution

It depends…

The general rule of thumb is that an item should occur at least ~40 times for an embedding model to learn a robust representation of it. If most follower IDs repeat often, an embedding layer can learn which IDs co-occur; if the follower IDs are sparse, hashing (which assigns IDs to a fixed number of buckets pseudo-randomly) is the better choice.
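One way to apply that rule of thumb is a hybrid encoding: count how often each follower ID appears, give frequent IDs their own vocabulary slot (suitable for a learned embedding), and hash the rare ones into a fixed number of buckets. A minimal sketch in plain Python; the example rows, threshold, and bucket count are assumptions for illustration:

```python
import hashlib
from collections import Counter

# Hypothetical rows: each 'followers' value is a comma-separated ID string.
rows = [
    "1093777852477116417, 936194589043683328",
    "936194589043683328, 1093777852477116417",
    "936194589043683328",
]

# Count occurrences of each follower ID across the dataset.
counts = Counter()
for row in rows:
    counts.update(id_.strip() for id_ in row.split(","))

MIN_OCCURRENCES = 2  # the answer suggests ~40; lowered here for the toy data
NUM_BUCKETS = 1000   # hashing space for rare IDs

# Frequent IDs get their own vocabulary index (candidates for embedding).
vocab = {id_: i for i, id_ in enumerate(
    sorted(id_ for id_, c in counts.items() if c >= MIN_OCCURRENCES))}

def encode(follower_id: str) -> int:
    """Frequent IDs map to their vocab index; rare IDs are hashed
    into one of NUM_BUCKETS buckets (the 'hashing trick')."""
    if follower_id in vocab:
        return vocab[follower_id]
    # md5 rather than built-in hash() so the mapping is stable across runs.
    digest = hashlib.md5(follower_id.encode()).hexdigest()
    return len(vocab) + int(digest, 16) % NUM_BUCKETS
```

The resulting integer indices can feed an `Embedding` layer whose input dimension is `len(vocab) + NUM_BUCKETS`; rare IDs share bucket slots, so only the frequent IDs get individually meaningful vectors.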

Which method works better is an empirical question: build both models, benchmark them, and keep the preprocessing pipeline that performs best on your task.

Licensed under: CC-BY-SA with attribution