Question

I am trying to build a model that predicts if an email is spam/not-spam. After building a logistic regression model, I have got the following results:

          precision    recall  f1-score   support

         0.0       0.92      0.99      0.95       585
         1.0       0.76      0.35      0.48        74

    accuracy                           0.92       659
   macro avg       0.84      0.67      0.72       659
weighted avg       0.91      0.92      0.90       659

Confusion Matrix: 
 [[577   8]
 [ 48  26]]

Accuracy:  0.9150227617602428

The F1-score is the metric I am looking at. I am having difficulty interpreting the results: I think they are very bad! How could I improve them? I am currently considering a model that looks at the full text of the emails (subject + body).
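A minimal sketch of the kind of baseline I am describing, in case it helps (the CSV path and column names are placeholders, not my exact code):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix

    # Placeholder dataset: raw email text plus a 0/1 label (1 = spam)
    df = pd.read_csv("emails.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2,
        stratify=df["label"], random_state=42)

    # TF-IDF features on the raw text, then a logistic regression
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_vec, y_train)

    y_pred = model.predict(X_test_vec)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))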

Update, after Erwan's answer:

I oversampled the dataset and these are my results:

Logistic regression
              precision    recall  f1-score   support

         0.0       0.94      0.77      0.85       573
         1.0       0.81      0.96      0.88       598

    accuracy                           0.86      1171
   macro avg       0.88      0.86      0.86      1171
weighted avg       0.88      0.86      0.86      1171

Random Forest
              precision    recall  f1-score   support

         0.0       0.97      0.54      0.69       573
         1.0       0.69      0.98      0.81       598

    accuracy                           0.77      1171
   macro avg       0.83      0.76      0.75      1171
weighted avg       0.83      0.77      0.75      1171

Solution

In your results you can observe the usual problem with imbalanced data: the classifier favors the majority class 0 (I assume this is the "ham" class). In other words, it tends to assign "ham" to instances which are actually "spam" (false negative errors). You can think of it like this: on the "easy" instances the classifier predicts the correct class, but on the difficult instances (where it "doesn't know") it chooses the majority class, because that is the most likely a priori.
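You can read this directly off the confusion matrix you posted: out of 74 actual spam emails, 48 are misclassified as ham. The per-class scores in your report follow from simple arithmetic on those numbers:

    # Confusion matrix from the question, rows = actual, columns = predicted:
    # [[TN, FP],
    #  [FN, TP]]  with class 1 = spam
    tn, fp, fn, tp = 577, 8, 48, 26

    recall_spam = tp / (tp + fn)      # 26 / 74 ~= 0.35
    precision_spam = tp / (tp + fp)   # 26 / 34 ~= 0.76
    f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
    print(recall_spam, precision_spam, f1_spam)  # 0.35, 0.76, 0.48 as reported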

There are many things you could do:

  • Undersampling the majority class or oversampling the minority class is the easy way to deal with class imbalance.
  • Better feature engineering is more work, but it is often where the biggest improvement comes from. For example, I guess that you use all the words in the emails as features, right? In that case you probably have far too many features, which can cause overfitting; try reducing dimensionality by removing rare words.
  • Try different models, for instance Naive Bayes or decision trees. By the way, decision trees are a good way to investigate what happens inside the model. (A sketch combining these three suggestions follows this list.)
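Here is a sketch combining the three suggestions, assuming a scikit-learn setup with TF-IDF features like yours (the function and variable names are illustrative, not from your code):

    import numpy as np
    from sklearn.utils import resample
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    def train_balanced(texts_train, y_train, texts_test, y_test):
        """Oversample spam, drop rare words, and compare two model families.

        texts_* are lists of raw email strings, y_* are 0/1 labels (1 = spam).
        """
        texts_train = np.array(texts_train, dtype=object)
        y_train = np.array(y_train)

        # 1. Oversample the minority class (spam) on the training data only,
        #    so the test set keeps the true class balance.
        spam_idx = np.where(y_train == 1)[0]
        ham_idx = np.where(y_train == 0)[0]
        spam_up = resample(spam_idx, replace=True,
                           n_samples=len(ham_idx), random_state=42)
        idx = np.concatenate([ham_idx, spam_up])

        # 2. Drop rare words: min_df=5 keeps only words that appear in at
        #    least 5 training emails, which shrinks the feature space a lot.
        vectorizer = TfidfVectorizer(min_df=5)
        X_train = vectorizer.fit_transform(texts_train[idx])
        X_test = vectorizer.transform(texts_test)

        # 3. Try a different model family alongside logistic regression.
        for model in (LogisticRegression(max_iter=1000), MultinomialNB()):
            model.fit(X_train, y_train[idx])
            print(type(model).__name__)
            print(classification_report(y_test, model.predict(X_test)))

Note that the oversampling happens only on the training split: oversampling the whole dataset before splitting duplicates spam emails across train and test, which inflates the test scores.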
Licensed under: CC-BY-SA with attribution