Question

I was just playing around with one-hot representations of features and thought of the following:

Say we have 4 categories for a given feature (e.g. fruit): {Apple, Orange, Pear, Melon}. In this case one-hot encoding would yield:

Apple:  [1 0 0 0]
Orange: [0 1 0 0]
Pear:   [0 0 1 0]
Melon:  [0 0 0 1]
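
For reference, a minimal sketch of producing this encoding with pandas (the column name and toy data are just for illustration):

```python
import pandas as pd

# Toy data: one categorical feature (names are illustrative only)
df = pd.DataFrame({"fruit": ["Apple", "Orange", "Pear", "Melon"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["fruit"])
print(one_hot)
```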

The above means that we quadruple the feature space as we go from having one feature to having four.

This looks like it wastes a few bits, since we can represent 4 values with $\log_{2}4=2$ bits/features:

Apple:  [0 0]
Orange: [0 1]
Pear:   [1 0]
Melon:  [1 1]
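
A quick sketch of this compact encoding, assuming we simply take each category's index and write it in base 2:

```python
import math

categories = ["Apple", "Orange", "Pear", "Melon"]
n_bits = math.ceil(math.log2(len(categories)))  # 2 bits for 4 categories

# Map each category to the binary representation of its index
binary_codes = {
    cat: [int(b) for b in format(i, f"0{n_bits}b")]
    for i, cat in enumerate(categories)
}
print(binary_codes)  # {'Apple': [0, 0], 'Orange': [0, 1], 'Pear': [1, 0], 'Melon': [1, 1]}
```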

Would there be a problem with this representation in any of the most common machine learning models?

Solution

Good idea but...

You encode not just to transform categorical features into numerical ones, but to hand that information to your model in a form it can actually use.

Let's say you do that and feed it to a linear model to predict the price, and that Pear is really expensive (500 €) while Melon is cheap (1 €).

Your model coefficients with one-hot encoding will be something like:

$\text{price} = 500 \cdot x_{\text{Pear}} + 1 \cdot x_{\text{Melon}} + \dots$

where $x_{\text{Pear}}$ and $x_{\text{Melon}}$ are the 0/1 indicator columns.

With your compact encoding, the linear combination breaks down: Pear [1 0] and Melon [1 1] share their first bit, so any coefficient that raises the price of Pear also raises the price of Melon. What would the coefficients be? With only two bit coefficients (plus an intercept) you cannot reproduce four arbitrary category prices.
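
A small numeric sketch of this (the Apple and Orange prices are made up) with scikit-learn's LinearRegression: with one-hot columns an ordinary least-squares fit can recover any set of per-category prices exactly, while the 2-bit encoding has only an intercept plus two coefficients for four independent prices, so it cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Per-category prices: Apple, Orange, Pear, Melon (Apple/Orange values invented)
prices = np.array([2.0, 3.0, 500.0, 1.0])

# One-hot design matrix: one indicator column per category
X_onehot = np.eye(4)

# 2-bit design matrix: Apple=[0,0], Orange=[0,1], Pear=[1,0], Melon=[1,1]
X_bits = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

for name, X in [("one-hot", X_onehot), ("2-bit  ", X_bits)]:
    model = LinearRegression().fit(X, prices)
    print(name, "predictions:", model.predict(X).round(2))
# The one-hot model reproduces [2, 3, 500, 1]; the 2-bit model cannot.
```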

One could argue that with decision trees this won't happen, since a tree can make splits on the individual bits... but it would have to make two splits before it can tell whether a sample is a Melon (and a greedy decision tree won't necessarily choose them), so you lose something here too.
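
A sketch of the tree argument with the same hypothetical prices: restricted to a single split (a stump), a tree on one-hot features can already isolate the expensive Pear, whereas with the 2-bit encoding every individual bit groups Pear with some other fruit, so one split is never enough.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Apple, Orange, Pear, Melon prices (Apple/Orange values invented)
prices = np.array([2.0, 3.0, 500.0, 1.0])

X_onehot = np.eye(4)
X_bits = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

for name, X in [("one-hot", X_onehot), ("2-bit  ", X_bits)]:
    stump = DecisionTreeRegressor(max_depth=1).fit(X, prices)
    print(name, "depth-1 predictions:", stump.predict(X).round(1))
# One-hot: a single split on the Pear column isolates the 500 € price.
# 2-bit: any single split lumps Pear together with another fruit.
```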

You could run the experiments yourself and see whether this is what you get. In the end, this should be science, and one can run experiments to test a hypothesis.

On the other hand, yes, one-hot encoding does increase computation time and memory use, since you create many features from a single one; if the feature has high cardinality you can end up with a very large number of columns.
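
To make the cardinality point concrete, a sketch (the 10,000-category column is purely illustrative): a single feature blows up into 10,000 one-hot columns. scikit-learn's OneHotEncoder returns a sparse matrix by default, which keeps memory manageable, but a dense version of the same matrix would be enormous.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One high-cardinality feature: 1,000,000 rows, 10,000 distinct categories (illustrative)
rng = np.random.default_rng(0)
X = rng.integers(0, 10_000, size=(1_000_000, 1))

# The single column becomes 10,000 one-hot columns; the default output is
# a SciPy sparse matrix, which only stores the non-zero entry per row
X_encoded = OneHotEncoder().fit_transform(X)
print(X_encoded.shape)  # (1000000, 10000)
```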
