Question

In order to solve an imbalanced-dataset problem, I experimented with Random Forest in the following manner (somewhat inspired by deep learning):

I trained a Random Forest on the input data, then used the trained model's predicted probabilities for each label as additional input features to train a second Random Forest.

Pseudocode for this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# First-level model
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf_model = RandomForestClassifier()
rf_model.fit(train_X, train_y)
pred = rf_model.predict(test_X)
print('******************RANDOM FOREST CM*******************************')
print(confusion_matrix(test_y, pred))
print('******************************************************************')

# Use the first model's predicted probabilities as extra features
predict_prob = rf_model.predict_proba(X)
X['first_level_0'] = predict_prob[:, 0]
X['first_level_1'] = predict_prob[:, 1]

# Second-level model trained on the augmented features
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf_model = RandomForestClassifier()
rf_model.fit(train_X, train_y)
pred = rf_model.predict(test_X)

print('******************RANDOM FOREST 2 CM*******************************')
print(confusion_matrix(test_y, pred))
print('******************************************************************')

And I was able to see a considerable improvement in recall. Is this approach mathematically sound? I used the second layer of Random Forest so that it could correct the errors made by the first layer, essentially combining the principle of boosting with Random Forest's bagging technique. Looking for thoughts.


Solution

The underlying idea is fine, but you've fallen into a common data leakage trap. By recombining the data and then resplitting, your second model's test set includes some of the first model's training set. The first model knows the labels on those datapoints and, especially if overfit, passes along that information in its predictions. So the score you see for the ensemble is probably optimistically biased.

The most common fix is to use k-fold cross-validation to produce out-of-fold predictions over the entire training dataset: each row's prediction comes from a model that never saw that row's label, so those predictions can safely be used as features for the second model.
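A minimal sketch of that out-of-fold approach, using `cross_val_predict` from sklearn on a synthetic imbalanced dataset (the dataset and all hyperparameters here are placeholders, not from the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic imbalanced dataset standing in for the question's X, y
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Out-of-fold probabilities: each row is predicted by a fold-model
# that never saw that row during training, so no label leakage.
rf1 = RandomForestClassifier(random_state=0)
oof_prob = cross_val_predict(rf1, train_X, train_y, cv=5,
                             method='predict_proba')

# Augment the training features with the out-of-fold probabilities
train_aug = np.hstack([train_X, oof_prob])

# For the test set, use a first-level model fit on all training data
rf1.fit(train_X, train_y)
test_aug = np.hstack([test_X, rf1.predict_proba(test_X)])

# Second-level model trained on the augmented, leakage-free features
rf2 = RandomForestClassifier(random_state=0)
rf2.fit(train_aug, train_y)
print(confusion_matrix(test_y, rf2.predict(test_aug)))
```

The key difference from the question's code is that the second model's features on the training set come from fold-models, and the train/test split is made once, before any probabilities are computed.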

Note that sklearn now has such stacked ensembles built in:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
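For illustration, here is how `StackingClassifier` could express the question's two-layer Random Forest; it handles the out-of-fold bookkeeping internally (5-fold CV by default). The dataset is synthetic and the choice of `passthrough=True`, which also feeds the original features to the final estimator to mirror the question's setup, is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for the question's X, y
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0))],
    final_estimator=RandomForestClassifier(random_state=0),
    # Pass the original features through alongside the first-level
    # probabilities, matching the question's feature-augmentation idea
    passthrough=True,
)
stack.fit(train_X, train_y)
print(stack.score(test_X, test_y))
```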

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange