Tree-based algorithms and ordinal features

https://datascience.stackexchange.com/questions/84065

14-12-2020

Question

For tree-based methods (e.g., decision trees, random forests, gradient boosting), does the spacing between the numeric values used to encode an ordinal feature matter? (I can see why it matters for linear models, but it is not clear to me for tree-based methods.)

For example: is there a difference between converting an ordinal feature from ['Low', 'Medium', 'High'] to [1, 2, 3] and converting it to [1, 99, 876]?

Solution

Edit: I misread your post. The answer is no, it shouldn't matter which intervals are used. As long as the order is unchanged, the splits your tree finds on this data will effectively be the same: the threshold values differ numerically, but they partition the samples identically.
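To make this concrete, here is a minimal pure-Python sketch of an exhaustive threshold search on a single feature using Gini impurity. It is not scikit-learn's implementation; the helper names (`gini`, `best_split`) and the toy data are made up for illustration. It compares the two encodings from the question.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity, left_mask) for the best
    `value <= threshold` split on one feature (hypothetical helper)."""
    n = len(values)
    distinct = sorted(set(values))
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (thr, score, [x <= thr for x in values])
    return best

# Toy data (made up): the target tends to increase with the ordinal level.
levels = ["Low"] * 3 + ["Medium"] * 3 + ["High"] * 3
target = [0, 0, 0, 0, 1, 1, 1, 1, 1]

enc_a = {"Low": 1, "Medium": 2, "High": 3}
enc_b = {"Low": 1, "Medium": 99, "High": 876}

thr_a, score_a, mask_a = best_split([enc_a[v] for v in levels], target)
thr_b, score_b, mask_b = best_split([enc_b[v] for v in levels], target)

print(thr_a, thr_b)      # the thresholds differ numerically
print(mask_a == mask_b)  # but the samples sent left/right are the same
```

The impurity of each candidate split depends only on which samples fall on each side, so any order-preserving encoding produces the same set of candidate partitions.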

See my original answer if you want more context:

In scikit-learn specifically (I cannot speak for other implementations of tree-based models), the tree estimators do not accept categorical data as input. The user must first convert it to numbers using one-hot encoding, numerical (ordinal) encoding, etc.

When a tree is fitted, it iterates through candidate splits, and each split on a column is an inequality (i.e., is this column's value above or below a threshold?). Because of this, the order in which your categories are encoded does matter.

If you one-hot encode your data, each column is considered independently. A likely split would be "is this value greater or less than 0.5?", as that is effectively the only split available when the options are 0 or 1.
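As a small illustration of the 0.5 threshold, the sketch below one-hot encodes a toy column by hand and lists the candidate thresholds for each resulting column, taking midpoints between consecutive distinct values. The data and the `candidate_thresholds` helper are assumptions for illustration only.

```python
levels = ["Low", "Medium", "High", "Medium", "Low"]
categories = ["Low", "Medium", "High"]

# One-hot encoding: one 0/1 indicator column per category.
one_hot = {c: [1 if v == c else 0 for v in levels] for c in categories}

def candidate_thresholds(column):
    """Midpoints between consecutive distinct values (hypothetical helper)."""
    distinct = sorted(set(column))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

thresholds = {c: candidate_thresholds(col) for c, col in one_hot.items()}
print(thresholds)
```

Each indicator column holds only 0 and 1, so the search has a single candidate per column, and the split amounts to asking "is this sample in category X or not?".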

If you numerically encode your data, the tree will simply pick the threshold that best optimizes your split criterion (usually Gini impurity). The bias you introduce by manually choosing the order of the numerical encoding can affect how well the tree is able to split on this column. If you use a logical encoding in which the target increases as the encoded value of your feature increases, your tree will be able to split on this column more effectively, and you will likely see better results.
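The effect of the encoding order can be sketched with a single split. Below, the same toy data (made up for illustration) is encoded once in its natural order and once with 'High' placed in the middle; the hypothetical helper `best_single_split_impurity` returns the lowest weighted Gini impurity achievable with one threshold.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_single_split_impurity(values, labels):
    """Lowest weighted Gini impurity over all `value <= threshold` splits."""
    n = len(values)
    distinct = sorted(set(values))
    scores = []
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        scores.append((len(left) * gini(left) + len(right) * gini(right)) / n)
    return min(scores)

# Toy data (made up): only 'High' belongs to class 1.
levels = ["Low", "Low", "Medium", "Medium", "High", "High"]
target = [0, 0, 0, 0, 1, 1]

ordered = {"Low": 1, "Medium": 2, "High": 3}    # order follows the target
scrambled = {"Low": 1, "High": 2, "Medium": 3}  # 'High' sits in the middle

imp_ordered = best_single_split_impurity([ordered[v] for v in levels], target)
imp_scrambled = best_single_split_impurity([scrambled[v] for v in levels], target)

print(imp_ordered)    # one threshold isolates 'High' completely
print(imp_scrambled)  # no single threshold can isolate the middle value
```

With the scrambled order, a deeper tree can still isolate 'High' using two splits on the same column, so the cost of a poor ordering is extra depth rather than a hard limit.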

For this reason I recommend numerically encoding with a TargetEncoder (https://contrib.scikit-learn.org/category_encoders/targetencoder.html). This way the order of your numerical encoding will make sense for the decision tree.
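The core idea behind target encoding can be sketched in plain Python: replace each category with the mean of the target observed for that category. This is only an illustration of the idea on made-up data, not the category_encoders implementation, which as far as I know also applies smoothing toward the global mean.

```python
from collections import defaultdict

# Toy data (made up for illustration).
colors = ["red", "blue", "green", "blue", "red", "green", "blue"]
target = [1, 0, 1, 0, 1, 0, 0]

sums = defaultdict(float)
counts = defaultdict(int)
for c, y in zip(colors, target):
    sums[c] += y
    counts[c] += 1

# Unsmoothed per-category mean of the target.
encoding = {c: sums[c] / counts[c] for c in sums}
encoded = [encoding[c] for c in colors]

print(encoding)
print(encoded)
```

Because the encoded values are sorted by average target, a threshold on this column separates low-target categories from high-target ones. The mapping should be computed on the training data only and then applied to validation or test data, otherwise the target leaks into the features.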

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange