Question

I recently came across a Coursera course on "How to win Kaggle competitions" where they explain how we can engineer a categorical feature from each leaf node of a decision tree.

[Video Link][1]

I cannot understand this concept. Any suggestions or pointers towards understanding it would be great.

For example, assume the following random training data:

Gender  Age  Sample_Ftre
M       25   1.5
M       26   1.5
F       28   1.5
F       27   1.5
M       26   1.5

Can anyone explain what the value of the new engineered_feature from the decision tree would be, and how to calculate it?


Solution

I will describe my personal approach to achieving what is taught in the course you mentioned (a mind-blowing course, I think!).

To generate DT features for your training set:

First of all, split your training data into k folds; this is necessary to avoid overfitting. Then:

  1. Hide a fold and build a DT using the (k-1) remaining folds. Control the number of leaf nodes in your DT using the max_leaf_nodes parameter in sklearn. Set it to the number of categories (levels) you want in your new categorical feature.
  2. Predict output for the held-out fold. When you are predicting using a DT, each observation falls into a leaf. So, the value of your new feature will be the index of the leaf each observation ends up in. You can get this index using the apply method in sklearn.

Repeat these two steps for each of the k folds (see the sketch below).
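
A minimal sketch of this out-of-fold procedure with sklearn. It assumes X and y are NumPy arrays holding the training features and target, and n_levels is the number of categories you want; these names are illustrative, not from the course.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    n_levels = 8                                  # desired number of leaves / categories
    leaf_feature = np.zeros(len(X), dtype=int)    # new feature for the training set

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(X):
        # 1. Build a DT on the (k-1) remaining folds, capping the number of leaves.
        dt = DecisionTreeClassifier(max_leaf_nodes=n_levels, random_state=0)
        dt.fit(X[train_idx], y[train_idx])
        # 2. The feature value is the index of the leaf each held-out
        #    observation ends up in (sklearn's apply method).
        leaf_feature[val_idx] = dt.apply(X[val_idx])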

To generate DT features for your test set:

  1. Build a DT using the entire training set. Don't forget to set the max_leaf_nodes parameter to the same value used above.
  2. Predict output for the test set. Get the leaf index for each test observation.

Now you have a new categorical feature containing the leaf indexes from the DT predictions.
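
A corresponding sketch for the test set, under the same assumptions (X, y, X_test and n_levels as above):

    # One DT fit on the entire training set, same max_leaf_nodes as above.
    dt_full = DecisionTreeClassifier(max_leaf_nodes=n_levels, random_state=0)
    dt_full.fit(X, y)
    leaf_feature_test = dt_full.apply(X_test)    # leaf index per test observation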

Problems with this procedure

  • Since you cannot control the sequence of splits in each DT, the indexes found in each CV fold may not match. For example, category "0" found in the first fold may not be the same as category "0" found in the second fold.
  • At the same time, it is not recommended (following the advice in the course) to build a DT using the entire training set and then predict leaf indexes for both the training and test sets with that same DT. This may lead to overfitting, because you are introducing target information into your training set.

So, what can we do?

Instead of using the leaf index as the feature value, you can count the number of nodes each observation passes through until it falls into a leaf, and use that as the feature value. In other words, it is the path length from the root node to the leaf node. I think it is a more informative value, and it is the principle behind Isolation Forests.
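
One way to get that path length in sklearn, assuming the dt_full tree from the sketch above: decision_path() returns an indicator matrix of the nodes each sample passes through, so the row sum is the number of nodes from root to leaf (inclusive).

    # Row sums of the node-indicator matrix give the root-to-leaf path length.
    node_indicator = dt_full.decision_path(X_test)
    path_length = np.asarray(node_indicator.sum(axis=1)).ravel()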

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange