Question

I am new to Machine Learning and building models, but a lot of tutorials have given me the chance to learn more about this topic. I am trying to build a predictive model for detecting fake news. The distribution of the labels 1 and 0 is the following:

       T
0    2015
1     798

It is not well balanced, unfortunately, as you can see. I split the dataset as follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)

i.e. 70% train and 30% test. I hope it makes sense, though I have unbalanced classes. Then, after cleaning the text by removing stopwords and punctuation (should I have done something else?), I ran different models, specifically Multinomial Naive Bayes, SVM and Logistic Regression, getting the following results:

MNB : 84%

  precision    recall  f1-score   support

           0       0.88      0.90      0.89       476
           1       0.45      0.40      0.42        95

    accuracy                           0.82       571
   macro avg       0.66      0.65      0.66       571
weighted avg       0.81      0.82      0.81       571

SVM: Accuracy: 0.8336252189141856

Precision: 0.5 Recall: 0.2736842105263158 (Terrible results!)

Logistic regression: 0.8546409807355516

All the tutorials show that the steps for building a good model when you have text data are removing stopwords, punctuation and extra words. I have done all these things, but there is probably something more I could do to improve the results. I read that, in general, those who get results above 99% have run into problems like overfitting; however, I would really like to get to 92% (at least). What do you think? How could I further improve the models? Do you think that having unbalanced classes could have affected the results?

Any suggestions would be greatly appreciated.


Solution

A few ideas:

  • As mentioned by @weareglenn, in general there is no way to know whether the performance obtained on some data is good or bad unless we know the performance of other systems applied to the same task and dataset. So yes, your results are "acceptable" (at least they do the minimum job of beating the random baseline). However, given that your approach is quite basic (no offense!), it's reasonably likely that the performance could be improved. But that's just an educated guess, and there's no way to know by how much.
  • To me the level of imbalance is not that bad. Given the low recall on the minority class (fake news), you could try to oversample it if you want to increase recall (see the first sketch after this list), but be aware that this is likely to decrease precision (i.e. increase False Positive errors: class 0 predicted as 1). In my opinion you don't have to, unless your task requires minimizing False Negative errors.
  • You could try a lot of things with the features, and I'm quite confident that there is room for improvement at this level:
    • First, as mentioned by @weareglenn, you should try without removing punctuation, maybe even without removing stop words.
    • Then you could play with the frequency: very often, excluding the words with a low frequency in the global training vocabulary allows the model to generalize better (i.e. it avoids overfitting). Try different minimum frequency thresholds: 2, 3, 4, ... (depending on how large your data is); see the vectorizer sketch after this list.
    • More advanced: use feature selection, preferably with a method such as genetic learning, but it might take time because it will redo the training+test process many times. Individual feature selection (e.g. with information gain or conditional entropy) might work, but it's rarely very good.
    • If you want to go very advanced, you could even borrow methods from automatic stylometry, i.e. methods used to identify the style of a document/author (the PAN shared tasks are a good source of data/systems). Some use quite complex methods and features which could be relevant for identifying fake news. A simple thing I like to try is using character n-grams as features (also covered in the vectorizer sketch after this list); it's sometimes surprisingly effective. You could also imagine using more advanced linguistic features: lemmas, Part-Of-Speech (POS) tags.
  • You didn't mention Decision Trees in your methods; I would definitely give them a try (random forests for the ensemble-method version). A short sketch follows below.
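
For the oversampling idea above, here is a minimal sketch using the imbalanced-learn package (an extra dependency), assuming X_train and X_test are already the vectorized feature matrices from your pipeline; SMOTE could be swapped in the same way:

    from imblearn.over_sampling import RandomOverSampler
    from sklearn.naive_bayes import MultinomialNB

    # duplicate minority-class (label 1) rows until both classes are equally frequent;
    # resample only the training split, never the test split
    ros = RandomOverSampler(random_state=42)
    X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

    model = MultinomialNB().fit(X_train_res, y_train_res)
    print(model.score(X_test, y_test))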
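
For the frequency-threshold and character n-gram ideas, a sketch with scikit-learn's TfidfVectorizer; train_texts and test_texts are placeholder names for your raw document strings, and the min_df / ngram_range values are just starting points to tune:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # word features: drop terms that appear in fewer than 3 training documents
    word_vec = TfidfVectorizer(min_df=3)
    X_train_words = word_vec.fit_transform(train_texts)
    X_test_words = word_vec.transform(test_texts)

    # character n-grams (2 to 5 characters, within word boundaries), often surprisingly effective
    char_vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5), min_df=3)
    X_train_chars = char_vec.fit_transform(train_texts)
    X_test_chars = char_vec.transform(test_texts)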
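
For the tree-based suggestion, a minimal random forest sketch on the same vectorized features (the hyperparameters are illustrative, and class_weight='balanced' is an optional extra to counter the imbalance):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    rf = RandomForestClassifier(n_estimators=300, class_weight='balanced', random_state=42)
    rf.fit(X_train, y_train)
    print(classification_report(y_test, rf.predict(X_test)))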

Other tips

If you have a lot of data, down-sample your negative class to achieve a 50/50 split on your fake news/real news classification. If you don't have much data, you can use techniques like SMOTE to up-sample the lesser class.
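
A sketch of both options with the imbalanced-learn package (an extra dependency), assuming X_train is the vectorized training matrix; only the training split should be resampled:

    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import SMOTE

    # option 1: down-sample the real-news class to a 50/50 split
    X_down, y_down = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

    # option 2: up-sample the fake-news class with synthetic examples
    X_up, y_up = SMOTE(random_state=42).fit_resample(X_train, y_train)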

You seem to have better accuracy than randomly choosing fake/real, which is a good sign. The prior probability of the negative class, based on your data split, is 71.6% (2015 out of 2813), and you are able to achieve 85.4% accuracy with LogReg. Don't get too down on that (especially if you are new to ML).

I would recommend checking out Gradient Boosting or Bagging algos for this kind of NLP problem; these usually yield the best results for me when I'm faced with sparse text data in classification.
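
A rough sketch of both with scikit-learn, assuming the sparse TF-IDF matrices from your current pipeline (hyperparameters are illustrative only):

    from sklearn.ensemble import GradientBoostingClassifier, BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
    gb.fit(X_train, y_train)

    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)
    bag.fit(X_train, y_train)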

As for the punctuation and stop words, removing them is a common first step; however, it's not good general advice for every problem. Do you think the presence of exclamation points might help weed out some fake news in your data? If so, I would keep the punctuation. If not, you're probably already on the right track. Removing stop words and punctuation only makes sense if the context of your specific problem calls for it.

More generally, your desire to reach 92% accuracy might not be achievable given the difficulty of your problem. This is not to say it's impossible, but keep in mind that the tutorials you follow online are picked precisely to show that you can get good results. Some projects are simply harder than others (and some are not even possible given the context).

Good luck!

With an imbalanced dataset, we don't look at the accuracy as a whole; either check the precision/recall trade-off or the per-class accuracy.

With that in mind, I believe your 85% accuracy is not of much use on its own.
The individual recalls are:
Class_0: 0.90
Class_1: $\color{red}{0.40}$
This implies that 60 out of every 100 fake news items are missed.
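
These per-class numbers can be computed directly instead of relying on overall accuracy; a small sketch, where y_pred stands for your model's predictions on the test set:

    from sklearn.metrics import precision_score, recall_score

    # one value per class, in label order [0, 1]
    print(recall_score(y_test, y_pred, average=None))
    print(precision_score(y_test, y_pred, average=None))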

Also, the supports of 95 and 476 add up to only about 20% of the total data, and the split does not look stratified on y either. Not sure why that is, when the split is supposed to be 30% and stratified.

It suggests that the model is not able to learn, probably because of the class imbalance, though 798:2015 is not too imbalanced.

Please follow the strategies for handling an imbalanced dataset, e.g. undersampling, oversampling, using appropriate metrics, etc. [Check the internet/SE for that]
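
One further option, not listed above but commonly used: most scikit-learn classifiers accept a class_weight argument, which penalizes mistakes on the minority class more heavily without resampling the data. A minimal sketch:

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    # 'balanced' reweights classes inversely to their frequency in y_train
    logreg = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_train, y_train)
    svm = LinearSVC(class_weight='balanced').fit(X_train, y_train)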

Yes, having unbalanced classes will affect your results. Besides the data augmentation techniques suggested above, you could also consider using Optuna with a risk-based performance score that accounts for how undesirable false negatives are relative to false positives.

This was the motivation for my master's thesis and I would love to see it implemented somewhere. Even using ROC Area Under the Curve (AUC) is not as meaningful as risk; see the last link at the bottom of this answer for an illustrative figure.
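
As an illustration only (the cost weights and the tuned parameter below are hypothetical, not taken from the thesis), an Optuna study could maximize a risk-based score that penalizes false negatives more than false positives:

    import optuna
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    FN_COST, FP_COST = 5.0, 1.0   # hypothetical: a missed fake story costs 5x a false alarm

    def objective(trial):
        C = trial.suggest_float('C', 1e-3, 1e2, log=True)
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        # in practice, score on a validation split or via cross-validation, not the test set
        tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
        return -(FN_COST * fn + FP_COST * fp)   # maximize the negative total cost

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50)
    print(study.best_params)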

Licensed under: CC-BY-SA with attribution