Training data size for a Bayesian classifier

https://stackoverflow.com/questions/8997597

14-11-2019

Question

I am using Apache Mahout to perform sentiment analysis in the customer support domain. Since I could not find a suitable training data set, I built my own. I now have 100 support emails with positive sentiment and 100 with negative sentiment.

The problem is that I cannot reach acceptable accuracy. It stays around 55%, which is barely better than chance for two equally sized classes; something around 70% would be satisfactory. Note that I am using Apache Mahout's Complementary Naive Bayes classifier.

To state the question precisely: is the small size of the data set what is bringing down the accuracy? If not, what should I tweak?

Solution

For the benefit of anyone who comes across this question in the future, here is how I raised the accuracy of my classifier from 50% to around 78%:

Perform stemming on both the training and input data.
Remove stop words from both the training and input data.
Convert the training and input data to lower case (or upper case).
Have a nearly equal number of samples in each category of the training data.
Fine-tune the n-gram level according to your domain.
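
As a rough illustration of the text-related steps above (lowercasing, stop word removal, stemming, and n-gram generation), here is a minimal sketch in plain Python. It is not Mahout code and not necessarily what I ran: the stop word list is a tiny made-up sample, `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's, and all function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "i", "my", "it", "this"}

# Suffixes stripped by the crude stemmer below, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip one common English suffix; a rough stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as space-joined strings."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Collect unigrams up to max_n-grams for one document."""
    tokens = preprocess(text)
    result = []
    for n in range(1, max_n + 1):
        result.extend(ngrams(tokens, n))
    return result


print(features("The printer is NOT working"))
```

The same function has to be applied to the training mails and to every mail you classify later, otherwise the vocabularies will not match. Note that "not" is deliberately left out of the stop word list here: for sentiment, negation words tend to carry signal, and a bigram such as "not work" only survives if they are kept.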

Together, these steps should raise your accuracy substantially.
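
The point about having a nearly equal number of samples per category can be handled in several ways; one simple option is to downsample the larger classes. The sketch below assumes the training data is a list of `(label, text)` pairs, which is a layout chosen for illustration only, and `balance_by_downsampling` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """Cut every class down to the size of the smallest one.

    samples is a list of (label, text) pairs (an assumed layout).
    """
    if not samples:
        return []

    # Group the samples by their label.
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append((label, text))

    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    balanced = []
    for group in by_label.values():
        # Draw `smallest` samples without replacement from each class.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


mails = [("positive", "thanks, works great")] * 5 + [("negative", "still broken")] * 2
balanced = balance_by_downsampling(mails)
print(len(balanced))
```

Downsampling throws data away, which hurts when the set is already small; collecting more mails for the smaller class is the better fix when that is possible.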

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow