How to improve results from an ML model? (spam classification)
17-12-2020
Question
I am trying to build a model that predicts if an email is spam/not-spam. After building a logistic regression model, I have got the following results:
              precision  recall  f1-score  support
         0.0       0.92    0.99      0.95      585
         1.0       0.76    0.35      0.48       74
    accuracy                         0.92      659
   macro avg       0.84    0.67      0.72      659
weighted avg       0.91    0.92      0.90      659
Confusion Matrix (rows = actual class, columns = predicted class):
[[577   8]
 [ 48  26]]
Accuracy: 0.9150227617602428
The F1-score is the metric I am looking at. I am having difficulty interpreting the results: I think these are very bad results! May I ask how I could improve them? I am currently considering a model that looks at the full text of the emails (subject + body).
After Erwan's answer:
I oversampled the dataset and these are my results:
Logistic regression
              precision  recall  f1-score  support
         0.0       0.94    0.77      0.85      573
         1.0       0.81    0.96      0.88      598
    accuracy                         0.86     1171
   macro avg       0.88    0.86      0.86     1171
weighted avg       0.88    0.86      0.86     1171
Random Forest
              precision  recall  f1-score  support
         0.0       0.97    0.54      0.69      573
         1.0       0.69    0.98      0.81      598
    accuracy                         0.77     1171
   macro avg       0.83    0.76      0.75     1171
weighted avg       0.83    0.77      0.75     1171
Solution
In your results you can observe the usual problem with imbalanced data: the classifier favors the majority class 0 (I assume this is class "ham"). In other words, it tends to assign "ham" to instances which are actually "spam" (false negative errors). You can think of it like this: on the "easy" instances the classifier gets the correct class, but on the difficult instances (where it "doesn't know") it chooses the majority class, because that is the most likely guess a priori.
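As a quick sanity check, the class-1 (spam) numbers in the first report follow directly from the confusion matrix, which is how you can see that the weak F1 comes almost entirely from missed spam (false negatives):

```python
# Recompute the class-1 (spam) metrics from the confusion matrix
# [[TN FP]    [[577   8]
#  [FN TP]] =  [ 48  26]]
tn, fp, fn, tp = 577, 8, 48, 26

precision = tp / (tp + fp)  # 26/34 ~ 0.76 (few false alarms)
recall    = tp / (tp + fn)  # 26/74 ~ 0.35 (most spam is missed)
f1 = 2 * precision * recall / (precision + recall)  # ~ 0.48

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision is acceptable, but recall is the bottleneck: 48 of the 74 spam emails are classified as ham, which drags the F1 score down.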
There are many things you could do:
- Undersampling the majority class or oversampling the minority class is the easiest way to deal with class imbalance.
- Better feature engineering is more work, but it is often where the biggest improvement comes from. For example, I guess you use all the words in the emails as features, right? If so, you probably have too many features, which likely causes overfitting; try reducing dimensionality by removing rare words.
- Try different models, for instance Naive Bayes or Decision Trees. By the way, Decision Trees are a good way to investigate what happens inside the model.
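The three suggestions above can be sketched in a few lines with scikit-learn (which the classification reports suggest you are already using). This is only a minimal illustration on a made-up toy corpus, not your actual pipeline: rare words are dropped with `min_df`, the minority class is oversampled with `sklearn.utils.resample`, and two different models are fitted on the rebalanced data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample

# Tiny imbalanced toy corpus, purely for illustration (6 ham vs 2 spam)
ham  = ["meeting at noon", "see you tomorrow", "project update attached",
        "lunch on friday", "notes from the call", "thanks for the report"]
spam = ["win money now", "free money click now"]
texts  = ham + spam
labels = np.array([0] * len(ham) + [1] * len(spam))

# 1) Remove rare words: min_df=2 keeps only terms that appear in at
#    least 2 emails, shrinking the feature space (less overfitting).
vec = CountVectorizer(min_df=2)
X = vec.fit_transform(texts).toarray()

# 2) Oversample the minority class until both classes have equal counts.
X_spam_up, y_spam_up = resample(X[labels == 1], labels[labels == 1],
                                n_samples=int((labels == 0).sum()),
                                replace=True, random_state=0)
X_bal = np.vstack([X[labels == 0], X_spam_up])
y_bal = np.concatenate([labels[labels == 0], y_spam_up])

# 3) Try different models on the rebalanced data.
for model in (LogisticRegression(max_iter=1000), MultinomialNB()):
    model.fit(X_bal, y_bal)
    pred = model.predict(vec.transform(["free money now"]).toarray())
    print(type(model).__name__, pred)
```

Note that oversampling must be done on the training split only; if you oversample before splitting, copies of the same email can leak into the test set and inflate the scores (your second report, with a near-balanced test support of 573/598, suggests this may have happened).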