Question

I would like to ask how I should judge the following results (good or not):

OVER-SAMPLING
              precision    recall  f1-score   support

         0.0       1.00      0.85      0.92       873
         1.0       0.87      1.00      0.93       884

    accuracy                           0.92      1757
   macro avg       0.93      0.92      0.92      1757
weighted avg       0.93      0.92      0.92      1757

Confusion Matrix: 
 [[742 131]
 [  2 882]]

I have a dataset with 3500 observations (3000 with class 0 and 500 with class 1). I would like to predict class 1 (target variable). Since this is an imbalanced-class problem, I had to consider re-sampling methods. The result shown above is from over-sampling. Do you think it overfits and/or that over-sampling cannot be a good re-sampling method for my case? I am looking at the f1-score column, since it is a text classification problem.


Solution

In order to get accurate results, you should not oversample the test set! Otherwise you are simply evaluating on synthetic samples that you yourself created. The support column in your classification report should mirror the imbalance of your dataset.

From what I understand, you have 3500 samples, you applied oversampling (probably bringing the total to around 6000), and then took 1757 of these for testing. This evaluation scheme is wrong. Take a look at the illustration below for the correct scheme.

      |--- train --> oversample train set --> train model---|
set --|                                                     |--> evaluation on test set
      |--- test --------------------------------------------|
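The scheme above can be sketched in code. This is a minimal illustration using scikit-learn with toy data standing in for your 3000/500 dataset; the classifier and the random-oversampling step are illustrative choices, not your exact pipeline:

```python
# Sketch of the correct scheme: split FIRST, then oversample only the training set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from sklearn.metrics import classification_report

rng = np.random.RandomState(0)
# Toy imbalanced data standing in for the 3000/500 split described above
X = rng.randn(3500, 20)
y = np.array([0] * 3000 + [1] * 500)

# 1) Split before any resampling, so the test set keeps the real imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# 2) Oversample the minority class in the TRAINING set only
minority = X_train[y_train == 1]
upsampled = resample(minority, replace=True,
                     n_samples=int((y_train == 0).sum()), random_state=0)
X_bal = np.vstack([X_train[y_train == 0], upsampled])
y_bal = np.array([0] * int((y_train == 0).sum()) + [1] * len(upsampled))

# 3) Train on the balanced training set, evaluate on the untouched test set
clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print(classification_report(y_test, clf.predict(X_test)))
```

Note that the support in the resulting report reflects the original 6:1 imbalance, unlike the roughly equal supports (873 vs 884) in your report.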

OTHER TIPS

In order to detect overfitting you need to separate your data into a training set, which you use to estimate the parameters of your model, and a test set, where you evaluate the model with the parameters kept fixed (often done repeatedly via cross-validation). I understand from your results that you are not making such a separation of the data, so you are not able to detect overfitting.
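A minimal sketch of what that separation lets you see: compare training-set vs test-set scores. The data and model here are toy assumptions (random labels, an unpruned decision tree) chosen purely to make the gap visible:

```python
# Sketch: a large train/test gap signals overfitting.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
X = rng.randn(500, 10)
y = (rng.rand(500) > 0.5).astype(int)   # random labels: nothing real to learn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# An unpruned tree can memorize the training data
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_f1 = f1_score(y_tr, clf.predict(X_tr))
test_f1 = f1_score(y_te, clf.predict(X_te))

# Near-perfect training score with a much lower test score = overfitting
print(f"train f1: {train_f1:.2f}, test f1: {test_f1:.2f}")
```

Without the held-out test set, only the (near-perfect) training score would be visible and the overfitting would go undetected.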

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange