A colleague of mine is facing an interesting situation: he has a categorical feature with quite a large set of possible values (roughly 300 distinct values).

The usual data science approach would be to perform One-Hot Encoding. However, isn't it a bit extreme to one-hot encode a dictionary that large (roughly 300 values)? Is there any best practice on when to choose embedding vectors over One-Hot Encoding?
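
To make the comparison concrete, here is a rough sketch of the two options as I understand them (the placeholder feature values, the embedding dimension of 16, and the scikit-learn/PyTorch choices are just assumptions for illustration, not part of the actual pipeline):

```python
# Minimal sketch: one-hot encoding vs. a learned embedding for a ~300-value feature
import numpy as np
from sklearn.preprocessing import OneHotEncoder
import torch
import torch.nn as nn

# Placeholder category values standing in for the real dictionary
values = np.array([f"cat_{i}" for i in range(300)]).reshape(-1, 1)

# Option 1: One-Hot Encoding -> one 300-dimensional, mostly-zero column block
ohe = OneHotEncoder(handle_unknown="ignore")
one_hot = ohe.fit_transform(values)            # sparse matrix of shape (300, 300)

# Option 2: a learned embedding -> a dense 16-dimensional vector per value
embedding = nn.Embedding(num_embeddings=300, embedding_dim=16)
ids = torch.arange(300)                        # integer ids for each category
dense = embedding(ids)                         # tensor of shape (300, 16)
```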


Additional information: how would you handle the previous case if new values can be added to the dictionary? Re-training seems to be the only solution; however, with One-Hot Encoding the data dimension grows at the same time, which may cause additional trouble. Embedding vectors, on the other hand, can keep the same dimension even when new values appear.
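
For the embedding side, the workaround I had in mind is reserving an extra "unknown" index so the table dimensions never change when an unseen value shows up. A minimal sketch of that idea (the index-0 convention and the dimension 16 are arbitrary assumptions):

```python
# Sketch: embedding table with one extra slot reserved for unknown/future values
import torch
import torch.nn as nn

vocab = {f"cat_{i}": i + 1 for i in range(300)}   # known values map to indices 1..300
UNK = 0                                           # reserved index for unseen values

embedding = nn.Embedding(num_embeddings=301, embedding_dim=16)

def encode(value: str) -> torch.Tensor:
    idx = vocab.get(value, UNK)                   # unseen values fall back to UNK
    return embedding(torch.tensor(idx))           # always a 16-dimensional vector

known = encode("cat_42")      # uses the learned vector for a known category
unseen = encode("cat_999")    # falls back to the UNK vector; dimensions unchanged
```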

How would you handle such a case? Embedding vectors clearly seem more appropriate to me, but I would like to validate my opinion and check whether there is another solution that could be more suitable.
