Overfitting for minority class after SMOTE w/ random forests

https://datascience.stackexchange.com/questions/11656

16-10-2019
|

سؤال

I used SMOTE to make a predictive model, with class 1 having 1800 samples and 35000+ of class 0 samples. Hence, as per SMOTE, synthetic samples were created and the random forest was trained.

However, I am now getting most results as class 1 when I test my model. I just tried to test it on the training set and this is what I got:

Without SMOTE

With SMOTE

I've also tried hyperparameter optimisation, but that hasn't worked

Thanks

PS: Used SMOTE implementation in pandas with UnbalancedDataset library

المحلول

smote algorithm depends on the data set you have. If you have severe data imbalance, like the one you have in your case smote algorithm may not be able to help if the variations within the minority class is very high and the similarities between the two classes is very high. But How to know if this is the case. Try to duplicate samples from the minority class train a non linear svm and check the results if the classification accuracy is very low then this the case.

Smote use knn to create new samples but if the variation within the minority class is very high then using smote will use samples that are not real neighbors even. To be honest with you , there is no clear solution for this problem but i can suggest the followings: 1. Try borderline smote : it is a modified version of smote algorithm 2. Try smote boosting : it is a modified version of adaboost where adaboost algorithm is augmented with smote 3. If you can modify the smote boost to consider borderline smote instead of smote

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى datascience.stackexchange