Question

Dataset

I am using this dataset for the analysis (generated using sklearn's make_regression).

I was trying to learn the DecisionTreeRegressor algorithm from the sklearn library. I used the following code to fit the regressor.

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor as DTR

# Dataset generated with make_regression (the exact parameters are illustrative)
X, y = make_regression(n_samples=100, n_features=5, random_state=42)

regressor1 = DTR(max_depth=2)
regressor1.fit(X, y)

y_pred1 = regressor1.predict(X)

These are the leaf node values that I got:

[image: leaf values of the fitted tree]

It seems like the decision tree first split on prop 2 at -1.0923644932716892 at the root node, and then, on the right child of the root, it split again on prop 2 at 0.0340153523120978.

But what I learned about decision trees is that once a split is done on a property, that property should not be used again in the same branch. Why is the sklearn library doing this?


Solution

Good job looking at the tree and understanding what has happened.

There is no problem with splitting on the same feature multiple times. A continuous feature has many candidate split points, so the tree keeps subsetting and refining. At each node, the split criterion picks whatever is the "best" greedy split at that point. If a feature is income, perhaps the best split is at \$100,000. Then, on the high side, there is another split at \$10,000,000, since those people behave differently from the \$100,100-income people.

Even a categorical variable may be split on again. For example, black and blonde hair go left and all other hair colors go right; later, separating black from blonde is the best split available.

I saw a research project in which the scikit-learn code was adjusted to give split priority to features that had already been used, in order to reduce the number of features in the model; greedy splitting can otherwise pull in extra features through spurious interactions. It worked well.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange