Does converting a continuous variable to a discrete (categorical) variable increase the accuracy of a tree-based model?

datascience.stackexchange https://datascience.stackexchange.com/questions/75077

11-12-2020

Question

I've read other questions about whether a continuous feature should be converted to categorical or not, but I'm interested in the case of tree-based classifiers such as Decision Trees, Random Forests, Gradient Boosting, etc.

My intuition is that since tree-based classifiers try to find the optimal split or the best test at each node, providing a categorical feature would make the splits more accurate than providing a continuous feature.

My question is: will this pre-processing lead to higher accuracy for tree-based models, or the opposite? Or does it depend on the data?


Solution

I actually agreed with the now-deleted answer: you're unlikely to see improvement (at least with any consistency).

As you point out, binning a continuous variable just reduces the allowable split points for a tree model. This reduces the capacity of the overall model. That may be beneficial or detrimental, depending on how it affects the bias-variance tradeoff: I would expect it to help a lot on a fully-grown single decision tree, but probably hurt a little in a well-tuned ensemble of trees.
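This comparison can be run directly with scikit-learn's `KBinsDiscretizer`. The sketch below is illustrative, not from the original answer: the synthetic dataset, the bin count of 5, and the specific estimators are arbitrary choices, and results will vary with the data.

```python
# Sketch: compare a fully-grown single tree and a boosted ensemble,
# each fit on raw features vs. quantile-binned features.
# Dataset and hyperparameters are arbitrary illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

models = {
    "full tree": DecisionTreeClassifier(random_state=0),
    "boosted ensemble": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    # Accuracy on the raw continuous features.
    raw = cross_val_score(model, X, y, cv=5).mean()
    # Accuracy after binning each feature into 5 ordinal bins,
    # which restricts the split points available to the trees.
    binned = cross_val_score(
        make_pipeline(KBinsDiscretizer(n_bins=5, encode="ordinal"), model),
        X, y, cv=5,
    ).mean()
    print(f"{name}: raw={raw:.3f}, binned={binned:.3f}")
```

If the reasoning above holds, the overfit single tree should gain more (or lose less) from binning than the ensemble does, though on any particular dataset either outcome is possible.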

There are two nice examples in the sklearn docs illustrating their binning preprocessor:
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization.html
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html

Extending the second one to (1) also include a full decision tree and (2) add binning-then-nonlinear-models produces the following:
[Figure: classifiers on three datasets, with and without binning]

At least in these simplistic examples, my suggestions above hold.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange