Ways to improve the accuracy of a Naive Bayes Classifier?

https://stackoverflow.com/questions/3473612

28-09-2019
|

Question

I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.

I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this, some papers say AdaBoost with NB doesn't give better results, others do). Do you know of any other extensions to NB that may possibly give better accuracy?

Solution

In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train--noticeably faster than any classifier-builder i have everused).

so when you want to improve classifier prediction, you can look in several places:

tune your classifier (adjusting the classifier's tunable paramaters);
apply some sort of classifier combination technique (eg, ensembling, boosting, bagging); or you can
look at the data fed to the classifier--either add more data, improve your basic parsing, or refine the features you select from the data.

w/r/t naive Bayesian classifiers, parameter tuning is limited; i recommend to focus on your data--ie, the quality of your pre-processing and the feature selection.

I. Data Parsing (pre-processing)

i assume your raw data is something like a string of raw text for each data point, which by a series of processing steps you transform each string into a structured vector (1D array) for each data point such that each offset corresponds to one feature (usually a word) and the value in that offset corresponds to frequency.

stemming: either manually or by using a stemming library? the popular open-source ones are Porter, Lancaster, and Snowball. So for instance, if you have the terms programmer, program, progamming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program) so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want.
synonym finding: same idea as stemming--fold related words into a single word; so a synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term
neutral words: words with similar frequencies across classes make poor features

II. Feature Selection

consider a prototypical use case for NBCs: filtering spam; you can quickly see how it fails and just as quickly you can see how to improve it. For instance, above-average spam filters have nuanced features like: frequency of words in all caps, frequency of words in title, and the occurrence of exclamation point in the title. In addition, the best features are often not single words but e.g., pairs of words, or larger word groups.

III. Specific Classifier Optimizations

Instead of 30 classes use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else') then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', etc.

The Fisher Method (probably the most common way to optimize a Naive Bayes classifier.) To me, i think of Fisher as normalizing (more correctly, standardizing) the input probabilities An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document then combines these feature probabilities and compares that combined probability with the probability of a random set of features.

OTHER TIPS

I would suggest using a SGDClassifier as in this and tune it in terms of regularization strength.

Also try to tune the formula in TFIDF you're using by tuning the parameters of TFIFVectorizer.

I usually see that for text classification problems SVM or Logistic Regressioin when trained one-versus-all outperforms NB. As you can see in this nice article by Stanford people for longer documents SVM outperforms NB. The code for the paper which uses a combination of SVM and NB (NBSVM) is here.
Second, tune your TFIDF formula (e.g. sublinear tf, smooth_idf).
Normalize your samples with l2 or l1 normalization (default in Tfidfvectorization) because it compensates for different document lengths.
Multilayer Perceptron, usually gets better results than NB or SVM because of the non-linearity introduced which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try to use n-grams which is configurable in tfidfvectorizer.
If your documents have structure (e.g. have titles) consider using different features for different parts. For example add title_word1 to your document if word1 happens in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
Recently Facebook published their FastText classification code which performs very well across many tasks, be sure to try it.

Using Laplacian Correction along with AdaBoost.

In AdaBoost, first a weight is assigned to each data tuple in the training dataset. The intial weights are set using the init_weights method, which initializes each weight to be 1/d, where d is the size of the training data set.

Then, a generate_classifiers method is called, which runs k times, creating k instances of the Naïve Bayes classifier. These classifiers are then weighted, and the test data is run on each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.

keeping the n size small also make NB to give high accuracy result. and at the core, as the n size increase its accuracy degrade,

Select features which have less correlation between them. And try using different combination of features at a time.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow