Problem

I have a binary classification task with an imbalance between the two classes. I want to compare SMOTE against downsampling the majority class to the size of the minority class.

I trained the classifier with 3-fold cross-validation using the two methodologies:

  • SMOTE to increase the minority class to the size of the majority class
  • Downsampling the majority class to the minority class size by random subsampling

To test which methodology works better, I trained my classifier (Random Forests) with 3-fold cross-validation.
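For concreteness, here is a minimal sketch of that comparison using imbalanced-learn (the package choice and the synthetic data are my assumptions; the original code is not shown). Note that imblearn's Pipeline applies the resampler only while fitting on each training fold, so the validation folds are never resampled:

```python
# Minimal sketch: SMOTE vs. random undersampling, Random Forest, 3-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline  # resamples only inside training folds

# Synthetic imbalanced data standing in for the real task (95/5 split)
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for name, sampler in [("SMOTE", SMOTE(random_state=0)),
                      ("undersample", RandomUnderSampler(random_state=0))]:
    pipe = Pipeline([("resample", sampler),
                     ("rf", RandomForestClassifier(random_state=0))])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="balanced_accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```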

The confusion matrices I get from the 3-fold CV seem to favor SMOTE (better classification performance on both classes). I assumed this CV could be used to choose the better methodology.

However, when I test the classifier on a real held-out test set (never used for training or validation), I don't see any clear superiority of SMOTE over random subsampling of the majority class. The minority class is classified better, but at the expense of performance on the majority class.

Is this a limitation of the SMOTE algorithm, or does my model-selection methodology (3-fold CV) have some flaw?


Solution

It's difficult to say without the actual data.

However, I can tell you that SMOTE creates artificial instances by interpolating between minority-class neighbours; when used extensively, these synthetic points can "deviate" from the actual minority-class data. How far they deviate is difficult to determine: many factors come into play, first the data itself, then the neighbourhood parameters (e.g. the number of nearest neighbours used for the interpolation).
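To make the "deviation" point concrete, here is a toy sketch of my own (the two-sub-cluster layout and the gap test are illustrative assumptions, not from the question). When the minority class has internal structure, SMOTE with a large k_neighbors interpolates across it and places synthetic points in regions where no real minority sample exists:

```python
# Toy sketch: SMOTE interpolates between minority neighbours, so with two
# minority sub-clusters and a large neighbourhood it generates points in the
# empty gap between them -- territory actually occupied by the majority class.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
minority = np.vstack([rng.normal(-5, 0.3, size=(10, 2)),   # sub-cluster A
                      rng.normal(5, 0.3, size=(10, 2))])   # sub-cluster B
majority = rng.normal(0, 0.5, size=(200, 2))               # lives in the gap
X = np.vstack([minority, majority])
y = np.array([1] * 20 + [0] * 200)

for k in (1, 15):
    X_res, _ = SMOTE(k_neighbors=k, random_state=0).fit_resample(X, y)
    synth = X_res[len(X):]  # imblearn appends the synthetic samples at the end
    frac_in_gap = (np.abs(synth[:, 0]) < 4).mean()
    print(f"k_neighbors={k}: {frac_in_gap:.0%} of synthetic points fall in the gap")
```

With k_neighbors=1 the interpolation partners stay inside each sub-cluster, so almost no synthetic point lands in the gap; with k_neighbors=15 many do.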

P.S. You could try boosting over many random undersamples. Instead of a Random Forest, you could try AdaBoost, for instance, where each base classifier is trained on a different subsample of the majority class.
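For what it's worth, this suggestion corresponds closely to what imbalanced-learn ships as RUSBoostClassifier: AdaBoost in which every boosting round is fitted on a fresh random undersample of the majority class. A minimal sketch on the same kind of toy data as above:

```python
# Sketch of the P.S. suggestion: boosting over many random undersamples,
# via imbalanced-learn's RUSBoostClassifier (AdaBoost + random undersampling).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.ensemble import RUSBoostClassifier

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
clf = RUSBoostClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
print(f"RUSBoost: {scores.mean():.3f} +/- {scores.std():.3f}")
```

imblearn's EasyEnsembleClassifier (a bag of AdaBoost learners, each trained on its own undersample) is a closely related option.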

License: CC BY-SA with attribution
Not affiliated with datascience.stackexchange