Question

I used the Stanford movie review dataset for a sentiment analysis experiment.

I managed to build a basic application on top of Spark using the Naive Bayes classification algorithm.

The pre-processing steps I used in the Spark ML pipeline:

  • Tokenization
  • Bigrams
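For readers unfamiliar with these two steps, here is a minimal plain-Python sketch of what they produce. It is not the Spark code from the application; the `tokenize` and `bigrams` helpers and the regular expression are illustrative assumptions that only mimic the general idea of a tokenizer followed by an n-gram stage with n = 2.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of letters, digits, or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

def bigrams(tokens):
    """Return each pair of adjacent tokens joined by a single space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = tokenize("This does not look good")
features = bigrams(tokens)
print(features)
```

Bigrams such as "does not" and "not look" keep a little word-order context that single tokens lose, which is often useful for negation.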

The dataset also ships with a test set that is separate from the training set. After training, I got around 97% accuracy, which I believe is pretty good for Naive Bayes.

Now, can I use this model to predict sentiment for other kinds of text, such as emails or chats? My guess is that the dataset contains a large enough collection of words to make good predictions, and that certain English phrases, such as "I don't like this" or "This does not look good", mean the same thing regardless of the business context, whether movies, emails, or chats.

I have not run the experiment, because the data I need belongs to the customer and privacy restrictions prevent us from accessing it.

Any help/guidance would be much appreciated.


Solution

It depends.

You're basically asking if your sample (training data) is representative of the population (all written words).

  1. Are you doing sentiment analysis on movie reviews? It'll work great.
  2. Are you doing sentiment analysis on TV reviews? It'll probably work great.
  3. Are you doing sentiment analysis on book reviews? I would give better than 50-50 odds it'll work.
  4. Are you doing sentiment analysis on Twitter posts? Now we're getting shaky. People tend to write much less, use less formal language, and use more emojis, which your movie review model wouldn't have seen.
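One rough way to put a number on how "shaky" a new domain is: measure what fraction of its tokens never appeared in the training text. The sketch below is only an illustration under simple assumptions. The `tokenize` and `oov_rate` helpers are hypothetical names, the example sentences are made up, and the tokenizer ignores emojis entirely, so real tweets would likely look even worse than this suggests.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (emojis and punctuation are dropped)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def oov_rate(train_texts, target_texts):
    """Fraction of target-domain tokens that never occur in the training texts."""
    vocab = {tok for text in train_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)

# Made-up examples standing in for the two domains.
movie_reviews = [
    "The plot was dull and the acting was worse",
    "A wonderful film with a great cast",
]
tweets = [
    "omg this film was great lol",
    "ugh worst plot ever smh",
]

rate = oov_rate(movie_reviews, tweets)
print(f"out-of-vocabulary rate: {rate:.2f}")
```

A high rate does not prove the model will fail, but it is a cheap warning sign that the training sample is not representative of the new domain.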

That being said, there are definitely "generic" sentiment analysis services, such as Algorithmia. Try out your model against Algorithmia on what you would consider a generic set of data (e.g. a bunch of tweets) and see how it does.
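If you can label even a small sample of public text by hand, comparing accuracy in-domain and out-of-domain is straightforward. The sketch below assumes your model can be wrapped in a function that maps a string to a label; `toy_predict` is a deliberately naive stand-in, not your Naive Bayes model, and the labeled examples are invented for illustration.

```python
def accuracy(predict, labeled_examples):
    """Share of (text, label) pairs for which predict(text) equals the label."""
    if not labeled_examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return correct / len(labeled_examples)

def toy_predict(text):
    """Hypothetical stand-in classifier: 'neg' if the word 'not' appears, else 'pos'."""
    return "neg" if "not" in text.lower().split() else "pos"

# Invented labeled samples for each domain.
in_domain = [
    ("This movie is not good", "neg"),
    ("A great film", "pos"),
]
out_of_domain = [
    ("meh 👎", "neg"),
    ("love it", "pos"),
]

in_acc = accuracy(toy_predict, in_domain)
out_acc = accuracy(toy_predict, out_of_domain)
print(f"in-domain: {in_acc:.2f}, out-of-domain: {out_acc:.2f}")
```

A large gap between the two numbers is the signal to look for; public tweets or forum posts can serve as a proxy when the customer data is off limits.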

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange