Problem

I am testing different models (SVM, Logistic Regression, Naive Bayes, Random Forest) for predicting whether an email is spam. My target is a binary variable, and I am analysing only the text, no other fields. My dataset's label distribution is:

Label  
0.0    3333
1.0     768

As you can see, there is a serious class imbalance. I read about downsampling and upsampling, so I applied them to the dataset before training and testing. I got good results in terms of F1, recall and accuracy for upsampling (above 88%; max 97%), but poor results for downsampling (<= 76%). For instance:

Down
              precision    recall  f1-score   support

         0.0       0.79      0.43      0.56       102
         1.0       0.61      0.87      0.76       114


Confusion Matrix: 
 [[ 49  60]
 [ 12 100]]


Up
              precision    recall  f1-score   support

         0.0       1.00      0.85      0.91       873
         1.0       0.87      1.00      0.94       884



Confusion Matrix: 
 [[772 141]
 [  20 822]]

I would like to ask whether these values can be considered good results or not. I am considering a publication (not only to include a similar analysis), so I would like to check whether such results can be considered reliable despite the imbalance.

Any suggestions and advice would be greatly appreciated.

Solution

There seems to be a mistake in your method:

I read about downsampling and upsampling, so I applied them to the dataset before training and testing.

It is incorrect to change the distribution of the test set. Resampling should be applied only to the training set. The goal is to force the model to take both classes into account, because with an imbalanced dataset the model tends to focus on the majority class. However, the true class proportions in the "real" data stay the same, and the test set should follow these true proportions. Otherwise the performance looks artificially good on the test set, even though the classifier will make more mistakes on real data, since real data does not follow the resampled distribution.
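For example, here is a minimal sketch of the intended order of operations with scikit-learn: split first, upsample only the training portion, and evaluate on the untouched test set. The column names ("text", "Label"), the TF-IDF + Logistic Regression pipeline, and the use of sklearn.utils.resample are illustrative assumptions, not your actual setup:

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.utils import resample
    import pandas as pd

    # df is assumed to be a DataFrame with a "text" column and the binary "Label" column
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["Label"], test_size=0.2, stratify=df["Label"], random_state=42)

    # Upsample the minority class in the training set ONLY
    train = pd.DataFrame({"text": X_train, "Label": y_train})
    majority = train[train["Label"] == 0.0]
    minority = train[train["Label"] == 1.0]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    train_bal = pd.concat([majority, minority_up])

    # Fit the classifier on the balanced training data
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(train_bal["text"])
    clf = LogisticRegression(max_iter=1000).fit(X_train_vec, train_bal["Label"])

    # Evaluate on the untouched test set, which keeps the true class proportions
    X_test_vec = vectorizer.transform(X_test)
    print(classification_report(y_test, clf.predict(X_test_vec)))

The report computed this way reflects the imbalance the classifier will face on real data. As an alternative to explicit resampling, classifiers such as LogisticRegression and SVC also accept class_weight="balanced".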

So the performance values that you obtain on a resampled test set are meaningless, I'm afraid.

I am considering a publication (not only to include a similar analysis), so I would like to check whether such results can be considered reliable despite the imbalance.

If you are considering a peer-reviewed publication, you must also make sure that your contribution is original (i.e. new) and brings some advantage over existing methods. This means that you need to know the state of the art in spam classification (a lot of papers have already been published on this task) and show how your method improves on the existing methods. Ideally this is done by demonstrating that your new method obtains better performance than the state-of-the-art methods on a benchmark dataset. But it is usually hard to beat state-of-the-art performance on a well-known problem.
