Label encoding, one-hot encoding, or none when using a decision tree?

https://datascience.stackexchange.com/questions/85952

16-12-2020

Question

I've been learning about decision trees from multiple resources, but I still don't fully understand the data preprocessing step.

The video at https://www.youtube.com/watch?v=PHxYNGo8NcI&t=535s&ab_channel=codebasics uses a decision tree with a label encoder, while another resource says we don't need to convert categories to strings. I'm confused.

Given I have data that looks like

    gender    level    score
    male      1        34
    female    2        77
    female    1        44

If we use a label encoder, we would only need to convert gender. However, if that maps male = 0 and female = 1, wouldn't the machine treat female > male? And if it ignores ordinality, it will ignore level 1 < level 2 and treat level 1 and level 2 as if they were the same level, which is not true.

What is the right preprocessing step and why?

Solution

"If we are using label encoder we would only need to convert gender however if that maps male = 0, female = 1 wouldn't the machine treat female > male?"

You are correct: using a label encoder to encode categorical features is wrong in general, for the reason you mention. Note that the scikit-learn documentation advises against using LabelEncoder on features; it is meant to be used only on the response (target) variable.
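As a minimal sketch of the difference, the snippet below encodes a three-valued nominal feature both ways with scikit-learn. The `colors` data is hypothetical and not part of the question; it is used only because a binary feature like gender cannot show the ordering problem.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical nominal feature with three unordered categories.
colors = np.array(["red", "green", "blue", "green"])

# LabelEncoder sorts the classes and assigns 0..n-1, which implies
# blue < green < red: an order that has no meaning for colors.
label_codes = LabelEncoder().fit_transform(colors)
print(label_codes)  # [2 1 0 1]

# OneHotEncoder expects a 2-D array (samples x features) and emits one
# indicator column per category, so no order is implied.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(colors.reshape(-1, 1)).toarray()
print(encoder.categories_[0])  # ['blue' 'green' 'red']
print(one_hot)
```

With the integer codes, a threshold split such as `color <= 0.5` can only isolate "blue", and no single split can isolate "green"; with the indicator columns, any one category can be isolated in one split.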

In the particular case of a binary variable like "gender" used in a decision tree, using a label encoder actually does not matter, because the only thing the decision tree algorithm can do with such a variable is separate its two values: the condition gender > 0.5 and the condition gender == female give exactly the same result.
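The equivalence can be checked by hand on the three rows from the question. This is only an illustrative sketch in plain Python, not how any particular library implements its splits; the mapping male = 0, female = 1 is the one assumed in the question.

```python
# Rows from the question: (gender, level, score).
rows = [("male", 1, 34), ("female", 2, 77), ("female", 1, 44)]

# Integer codes as stated in the question: male = 0, female = 1.
codes = {"male": 0, "female": 1}

# Numeric split on the encoded value, as a threshold-based tree would do.
threshold_right = [r for r in rows if codes[r[0]] > 0.5]
threshold_left = [r for r in rows if codes[r[0]] <= 0.5]

# Categorical split on the raw value.
equality_right = [r for r in rows if r[0] == "female"]
equality_left = [r for r in rows if r[0] != "female"]

# Both conditions send every row to the same side.
print(threshold_right == equality_right)  # True
print(threshold_left == equality_left)    # True
```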

Also note that whether a variable is interpreted as ordinal or not is a matter of implementation. For example, in Weka it's possible to specify that a feature is categorical ("nominal").

"and if it ignores ordinality it will ignore level1 < level2."

Not necessarily: in theory a model can have features of different types (e.g. some categorical and some numerical), so gender can be treated as unordered while level keeps its order. However, this may depend on the implementation as well.
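One way to mix feature types in scikit-learn, sketched here under assumptions, is to one-hot encode only the nominal column and pass the ordered column through unchanged. The choice of `score` as the target and of a regression tree is an assumption made for illustration; the question does not say what is being predicted.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# The table from the question.
df = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "level": [1, 2, 1],
    "score": [34, 77, 44],
})

# gender is nominal, so it gets indicator columns; level is already
# numeric and ordered, so it is passed through as-is.
preprocess = ColumnTransformer(
    transformers=[("gender", OneHotEncoder(handle_unknown="ignore"), ["gender"])],
    remainder="passthrough",
)

# Assumption: score is the target, so a regression tree is used.
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(df[["gender", "level"]], df["score"])

predictions = model.predict(df[["gender", "level"]])
print(predictions)
```

Here the tree can still split on `level <= 1.5`, so the order between level 1 and level 2 is preserved, while gender carries no implied order.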

Licensed under: CC-BY-SA with attribution
