Question

I am using Apache Mahout to perform sentiment analysis in the customer support domain. Since I could not find a suitable training data set, I built my own. I now have 100 support mails labeled positive and 100 labeled negative.

The problem is that I cannot get good accuracy. It stays somewhere around 55%, which is barely better than chance for two balanced classes. Something around 70% would be satisfactory. Note that I am using Apache Mahout's complementary naive Bayes classifier.

To put the question precisely: is the small data set size what is bringing down the accuracy? If not, what should I tweak?


Solution

For the benefit of anyone who finds this question later, here is how I raised my classifier's accuracy from about 50% to around 78%:

  • Perform stemming on both the training data and the input data
  • Remove stop words from both the training data and the input data
  • Convert both the training data and the input data to a single case (lower or upper)
  • Keep a nearly equal number of samples in each category of the training data
  • Fine-tune the n-gram level for your domain

Together, these steps should raise your accuracy noticeably.
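The preprocessing steps listed above can be sketched outside Mahout to show what each one does to a support mail. This is only a minimal illustration in Python, not the pipeline I actually ran: the stop-word list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter), and the function names `preprocess` and `balance` are all assumptions made for this example.

```python
import random
import re

# Tiny illustrative stop-word list (assumption: real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "i", "my", "to",
              "of", "and", "for", "with", "it", "this"}

# Suffixes tried in order by the crude stemmer below.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def crude_stem(token):
    """Strip one common suffix; a hypothetical stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, max_n=2):
    """Lowercase, drop stop words, stem, then emit 1..max_n word n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    features = list(tokens)
    for n in range(2, max_n + 1):
        features += [" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1)]
    return features


def balance(samples_by_label, seed=0):
    """Downsample every category to the size of the smallest one."""
    rng = random.Random(seed)
    size = min(len(items) for items in samples_by_label.values())
    return {label: rng.sample(items, size)
            for label, items in samples_by_label.items()}


if __name__ == "__main__":
    print(preprocess("The agent RESOLVED my issues quickly"))
```

The same `preprocess` function has to be applied to the training mails and to every mail you classify later, otherwise the classifier sees tokens at prediction time that it never saw during training. Varying `max_n` is the simplest way to experiment with the n-gram level.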

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow