Training data size for a Bayesian classifier
14-11-2019
Question
I am using Apache Mahout to perform sentiment analysis in the customer support domain. Since I could not find a suitable training data set, I built my own. I now have 100 support mails with positive sentiment and 100 with negative sentiment.
The problem is that I cannot get decent accuracy. It stays at around 55%, which is barely better than chance for two classes. Around 70% would be satisfactory. Note that I am using Apache Mahout's complementary naive Bayes classifier.
To put the question precisely: is the small size of the data set what is bringing down the accuracy? If not, what should I tweak?
Solution
For the benefit of anyone who finds this question later, here are the changes that raised the accuracy of my classifier from about 50% to around 78%:
- Perform stemming on training and input data
- Perform stop word removal on training and input data
- Convert training and input data to lower case (or uppercase)
- Keep a nearly equal number of samples in each category of the training data
- Fine-tune the n-gram level for your domain
Together, these changes should raise your accuracy substantially.
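The steps in the list can be sketched roughly as follows. This is a minimal, standalone Python illustration, not Mahout code: the stop word list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), and the function names `preprocess` and `balance` are all assumptions made for the example. The same function should be applied to both the training mails and the mails you classify later.

```python
import random
import re
from collections import defaultdict

# Tiny illustrative stop word list (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and",
              "i", "my", "it", "for"}

# Suffixes removed by the crude stemmer below, checked in this order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def crude_stem(word):
    """Strip one common suffix; a hypothetical stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, max_n=2):
    """Lowercase, drop stop words, stem, then emit 1..max_n word n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


def balance(samples, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)
    smallest = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, texts in by_label.items():
        for text in rng.sample(texts, smallest):
            balanced.append((text, label))
    return balanced
```

For example, `preprocess("The printer is Crashing again!")` yields the unigrams `printer`, `crash`, `again` followed by the bigrams `printer crash` and `crash again`. Changing `max_n` is the knob for tuning the n-gram level; which value works best depends on the domain and is worth checking against a held-out set.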