Question

I'm doing a project on document classification using a naive Bayes classifier in Python. I have used the nltk Python module for this. The docs are from the Reuters dataset. I performed preprocessing steps such as stemming and stopword elimination and then computed tf-idf for the index terms. I used these values to train the classifier, but the accuracy is very poor (53%). What should I do to improve the accuracy?

Solution

A few points that might help:

  • Don't use a stoplist; it lowers accuracy (but do remove punctuation).
  • Look at word features and take only the top 1000, for example. Reducing dimensionality will improve your accuracy a lot.
  • Use bigrams as well as unigrams; this will up the accuracy a bit (see the sketch after this list).
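
For illustration, here is a minimal sketch of the top-1000-words and bigram ideas in NLTK. The variable labeled_docs and the helper names are assumptions for the example, not part of the original answer:

```python
# Sketch only: top-N unigrams plus bigrams as boolean features for
# nltk.NaiveBayesClassifier. `labeled_docs` is an assumed list of
# (token_list, label) pairs.
from collections import Counter
from nltk.util import bigrams
import nltk

def build_feature_extractor(labeled_docs, top_n=1000):
    # Keep only the N most frequent words to reduce dimensionality.
    counts = Counter(word for tokens, _ in labeled_docs for word in tokens)
    vocab = {word for word, _ in counts.most_common(top_n)}

    def features(tokens):
        feats = {f"has({w})": True for w in tokens if w in vocab}
        # Add bigram presence features on top of the unigrams.
        feats.update({f"bigram({a},{b})": True for a, b in bigrams(tokens)})
        return feats

    return features

# Usage:
# extract = build_feature_extractor(labeled_docs)
# train_set = [(extract(tokens), label) for tokens, label in labeled_docs]
# classifier = nltk.NaiveBayesClassifier.train(train_set)
```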

You may also find that alternative weighting techniques such as log(1 + TF) * log(IDF) will improve accuracy. Good luck!
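
If you want to try that weighting by hand, a rough sketch might look like the following (docs is an assumed list of tokenized documents):

```python
# Illustrative only: computing log(1 + TF) * log(IDF) weights directly.
import math
from collections import Counter

def log_tfidf(docs):
    n_docs = len(docs)
    # Document frequency: number of documents each word appears in.
    df = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            word: math.log(1 + count) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weighted

# Example:
# log_tfidf([["oil", "prices", "oil"], ["grain", "prices"]])
```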

OTHER TIPS

There could be many reasons for the classifier not working, and there are many ways to tweak it.

  • did you train it with enough positive and negative examples?
  • how did you train the classifier? did you give it every word as a feature, or did you also add more features for it to train on (like the length of the text, for example)? See the sketch after this list.
  • what exactly are you trying to classify? does the specified classification have specific words that are related to it?
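
To make the second bullet concrete, here is a hedged sketch of the two options: every word as a boolean feature versus words plus an extra feature such as text length. The function names are illustrative, not from the original answer:

```python
# `tokens` is an assumed list of words for one document.
def word_features(tokens):
    # Option 1: every word as a boolean feature.
    return {word: True for word in tokens}

def richer_features(tokens):
    # Option 2: words plus an extra feature describing the whole document.
    feats = word_features(tokens)
    feats["doc_length"] = len(tokens)
    return feats
```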

So the question is rather broad. Maybe if you give more details you could get more relevant suggestions.

If you are using the nltk naive Bayes classifier, it's likely you're actually using smoothed multivariate Bernoulli naive Bayes text classification. This could be an issue if your feature extraction function maps into the set of all floating-point values (which it sounds like it might, since you're using tf-idf) rather than the set of all boolean values.

If your feature extractor returns tf-idf values, then I think nltk.NaiveBayesClassifier will check if it is true that

tf-idf(word1_in_doc1) == tf-idf(word1_in_class1)

rather than asking the question appropriate to whatever continuous distribution would fit tf-idf values.

This could explain your low accuracy, especially if one category occurs 53% of the time in your training set.

You might want to check out the multinomial naive bayes classifier implemented in scikit-learn.
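
A minimal sketch of what that might look like (train_texts and train_labels are assumed placeholders for your raw documents and their classes):

```python
# Multinomial naive Bayes in scikit-learn, which handles term counts and
# tf-idf weights properly, unlike a boolean Bernoulli model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),  # log(1 + tf) scaling
    MultinomialNB(),
)
# model.fit(train_texts, train_labels)
# predictions = model.predict(test_texts)
```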

For more information on multinomial and multivariate Bernoulli classifiers, see this very readable paper.

Like Maus was saying, NLTK Naive Bayes (NB) uses a Bernoulli model plus smoothing to keep feature conditional probabilities from being zero (for features not seen by the classifier in training). A common smoothing technique is Laplace smoothing, where you add 1 to the numerator of the conditional probability, but I believe NLTK adds 0.5 to the numerator. The NLTK NB model uses boolean values and computes its conditionals based on that, so using tf-idf as a feature will not produce good or even meaningful results.
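
For illustration only (this is not NLTK's actual code), add-gamma smoothing of a conditional probability works roughly like this:

```python
# Sketch of Lidstone-style smoothing: an unseen feature still gets a small
# nonzero probability instead of zero.
def smoothed_prob(feature_count, class_count, n_bins, gamma=0.5):
    # gamma=1.0 is Laplace smoothing; the answer above says NLTK adds 0.5.
    return (feature_count + gamma) / (class_count + gamma * n_bins)

# A feature never seen with this class is no longer impossible:
# smoothed_prob(0, 100, 2)  ->  0.5 / 101 ≈ 0.005
```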

If you want to stay within NLTK, then you should use the words themselves (and bigrams) as features. Check out this article by Jacob Perkins on text processing with NB in NLTK: http://streamhacker.com/tag/information-gain/. It does a great job explaining and demonstrating some of the things you can do to pre-process your data, and it uses the movie reviews corpus from NLTK for sentiment classification.

There is another Python module for text processing called scikit-learn, which has various NB models in it, like Multinomial NB, which uses the frequency of each word instead of just the occurrence of each word when computing its conditional probabilities.
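
A short sketch of that frequency-versus-occurrence distinction (train_texts and train_labels are, again, assumed placeholders):

```python
# MultinomialNB sees how often each word occurs; BernoulliNB only sees
# whether it occurs at all.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

counts = CountVectorizer()             # word frequencies
presence = CountVectorizer(binary=True)  # word presence/absence

# multinomial = MultinomialNB().fit(counts.fit_transform(train_texts), train_labels)
# bernoulli = BernoulliNB().fit(presence.fit_transform(train_texts), train_labels)
```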

Here is some literature on NB and how both the Multinomial and Bernoulli models work: http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html; navigate through the literature using the previous/next buttons on the webpage.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow