Question

When ML algorithms, e.g. Vowpal Wabbit or some of the factorization machines winning click-through-rate competitions on Kaggle, mention that features are 'hashed', what does that actually mean for the model? Let's say there is a variable that represents the ID of an internet ad and takes on values such as '236BG231'. I understand that this feature is hashed to an integer. But my question is:

  • Is the integer now used in the model as an integer (numeric value), OR
  • is the hashed value still treated like a categorical variable and one-hot encoded? Is the hashing trick then just a way to save space with large data?

Solution

The second bullet is the point of feature hashing: the hashed integer is still used as the index of a column in a sparse, one-hot-style representation, which is what saves space. Depending on the hash function and the number of buckets you get varying degrees of collision, which acts as a kind of dimensionality reduction.
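A minimal sketch of that idea using scikit-learn's FeatureHasher; the field name "ad_id" and the ID values are just illustrative:

```python
from sklearn.feature_extraction import FeatureHasher

# Hash categorical values into a fixed number of buckets (columns).
hasher = FeatureHasher(n_features=2**10, input_type="dict")

rows = [
    {"ad_id=236BG231": 1},   # categorical value encoded as a feature-name string
    {"ad_id=9F00A12C": 1},
]

X = hasher.transform(rows)   # scipy.sparse matrix of shape (2, 1024)

# Each row has a single non-zero entry: the hashed bucket behaves like a
# one-hot column, except two distinct IDs may collide in the same bucket.
print(X.shape)
print(X.nonzero())
```

So the model never sees the hash as a numeric magnitude; it only uses it to pick which sparse column is switched on.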

Also, in the specific case of Kaggle competitions, feature hashing and one-hot encoding help with feature expansion/engineering: all possible tuples of features (usually just second order, but sometimes third) are hashed as additional features. This explicitly creates interaction terms that are often predictive where the individual features are not.
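A sketch of hashed second-order interactions, assuming a few illustrative categorical fields ("ad_id", "site_id", "device"):

```python
from itertools import combinations
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2**18, input_type="string")

def to_tokens(row):
    # raw categorical features as "field=value" strings
    tokens = [f"{k}={v}" for k, v in row.items()]
    # add every pairwise (second-order) interaction as an extra hashed token
    tokens += [f"{a}&{b}" for a, b in combinations(tokens, 2)]
    return tokens

rows = [
    {"ad_id": "236BG231", "site_id": "news.example.com", "device": "mobile"},
    {"ad_id": "9F00A12C", "site_id": "sports.example.com", "device": "desktop"},
]

X = hasher.transform(to_tokens(r) for r in rows)
print(X.shape)  # (2, 262144) sparse; 3 raw + 3 interaction tokens per row
```

The hash space stays fixed no matter how many interaction tokens are generated, which is what makes this feasible on large categorical data.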

In most cases this technique, combined with feature selection and elastic-net regularization in logistic regression, behaves much like a one-hidden-layer neural network, so it performs quite well in competitions.
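For completeness, a toy sketch of that pipeline: hashed features fed into an elastic-net-regularized logistic regression via SGDClassifier. The labels and hyperparameters are purely illustrative.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2**18, input_type="string")
X = hasher.transform([["ad_id=236BG231", "device=mobile"],
                      ["ad_id=9F00A12C", "device=desktop"]])
y = np.array([1, 0])  # toy click / no-click labels

clf = SGDClassifier(loss="log_loss",       # logistic regression (was "log" in older scikit-learn)
                    penalty="elasticnet",  # mix of L1 and L2 regularization
                    l1_ratio=0.15,
                    alpha=1e-4)
clf.fit(X, y)
print(clf.predict_proba(X))
```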

Licensed under: CC-BY-SA with attribution