For a very simple linear regression, can we quantify the prediction accuracy hit between one-hot encoding and a simple numerical mapping?

datascience.stackexchange https://datascience.stackexchange.com/questions/86435

Question

Suppose I had a simple linear regression model whose input (X) variable took the following values:

[North]
[East]
[West]
[South]
[North, East]
...
[North, East, West, South]

and I decided to enumerate them like this:

[North] - 0
[East] - 1
[West] - 2
[South] - 3
[North, East] - 4
...
[North, East, West, South] - 15
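
In Python, my mapping is essentially a lookup table like this (a sketch; the middle entries are elided as above):

direction_codes = {
    ("North",): 0,
    ("East",): 1,
    ("West",): 2,
    ("South",): 3,
    ("North", "East"): 4,
    # ... remaining combinations ...
    ("North", "East", "West", "South"): 15,
}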

Someone took a look at my model and told me to use a One Hot Encoder (or One Hot Binary Encoder) instead of assigning inputs like this.

My question is: from a linear regression perspective, what is the advantage of using OHE over my simple numerical mapping? If we can quantify an accuracy loss, would it be substantial? And if I had 10 model variables that I had to map like this, would the loss be more substantial?

I want to know what sacrifice I'm making by not using OHE.


Solution

The issue with numerical encoding in this context is that it forces your input variable X to be treated as ordinal when it's likely not. It tells your model that the target increases or decreases monotonically with the order in which you encode your inputs. Say you encoded your data like this:

[North] - 0
[East] - 1
[West] - 2
[South] - 3
[North, East] - 4
...
[North, East, West, South] - 15

If you train a linear regression model with this encoding, you are telling it that [North] indicates either a higher or a lower target than [North, East, West, South], and that [East], [West], [South], and [North, East] fall somewhere in between. But what if [West] typically has a lower target than both [North] and [North, East, West, South]? Then you would be enforcing a constraint on your model that simply isn't true, as the sketch below illustrates.
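
Here is a minimal sketch of that constraint (the targets are made up purely for illustration, with [West], coded 2, given the lowest value). With a single integer-coded feature, linear regression can only predict a + b * code, so its fitted values are forced to be monotone in the code, while one-hot encoding gives each category its own coefficient:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up targets for the categories coded 0..3; [West] (code 2)
# has the lowest target, so the true pattern is not monotone.
codes = np.array([[0], [1], [2], [3]])
y = np.array([10.0, 12.0, 3.0, 11.0])

# Integer encoding: predictions are a + b * code, monotone by construction.
ordinal_fit = LinearRegression().fit(codes, y)
print(ordinal_fit.predict(codes))   # [9.9, 9.3, 8.7, 8.1] -- misses [West] badly

# One-hot encoding: one coefficient per category, so each category
# can take an arbitrary fitted value (here it recovers y exactly).
onehot = OneHotEncoder().fit_transform(codes).toarray()
onehot_fit = LinearRegression().fit(onehot, y)
print(onehot_fit.predict(onehot))   # [10, 12, 3, 11]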

To quantify the accuracy hit, split your data with a 70/30 train-test split and compare how your numerical encoding performs against one-hot encoding on the test set. An alternative is Target Encoding: it keeps the feature space small (one-hot encoding adds one column per category) while attempting to make the encoded input vary monotonically with the target. I would recommend testing all three encoding methods to find the best fit for your problem, along the lines of the sketch below.
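
A sketch of that comparison (the data is synthetic, generated only to make the snippet runnable; sklearn's TargetEncoder requires scikit-learn >= 1.3):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, TargetEncoder

# Synthetic stand-in data: 16 category labels (one per direction
# combination), each with its own mean target plus noise.
rng = np.random.default_rng(0)
labels = [f"combo_{i}" for i in range(16)]
df = pd.DataFrame({"direction": rng.choice(labels, size=500)})
means = dict(zip(labels, rng.normal(0.0, 5.0, size=16)))
df["y"] = df["direction"].map(means) + rng.normal(0.0, 1.0, size=len(df))

X_train, X_test, y_train, y_test = train_test_split(
    df[["direction"]], df["y"], test_size=0.3, random_state=0
)

encoders = {
    "ordinal": OrdinalEncoder(),
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
    "target": TargetEncoder(),  # scikit-learn >= 1.3
}

for name, enc in encoders.items():
    model = make_pipeline(enc, LinearRegression()).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: test RMSE = {rmse:.3f}")

On data like this, where the category means have no natural order, you should expect one-hot and target encoding to come out well ahead of the ordinal mapping.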

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange