Question

I need your help in determining the best approach for classifying industry-specific sentences (e.g. movie reviews) as "positive" vs "negative". I've seen libraries such as OpenNLP before, but it's too low-level: it just gives me the basic sentence composition. What I need is a higher-level structure:

- hopefully with wordlists
- hopefully trainable on my set of data

Thanks!


Solution

What you are looking for is commonly dubbed Sentiment Analysis. Typically, sentiment analysis is not able to handle delicate subtleties, like sarcasm or irony, but it fares pretty well if you throw a large set of data at it.

Sentiment analysis usually needs quite a bit of pre-processing: at least tokenization, sentence boundary detection and part-of-speech tagging. Sometimes syntactic parsing is important as well. Doing it properly is an entire branch of research in computational linguistics, and I wouldn't advise coming up with your own solution unless you take the time to study the field first.
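To give a feel for the first two pre-processing steps, here is a minimal sketch in plain Python. The function names and the regular expressions are my own assumptions, not part of OpenNLP or any other toolkit, and the sentence splitter is deliberately naive: it will break on abbreviations such as "Mr." or "e.g.". Part-of-speech tagging is left out because it needs a trained model.

```python
import re

def split_sentences(text):
    """Naive sentence boundary detection: split after '.', '!' or '?'
    when followed by whitespace. Abbreviations will be split wrongly."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Lowercase the sentence and extract word tokens, keeping an
    internal apostrophe so that "didn't" stays one token."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())

# Made-up review text, for illustration only.
review = "The plot was thin. But the acting? Superb! I didn't expect that."
for sentence in split_sentences(review):
    print(tokenize(sentence))
```

A real toolkit replaces both functions with trained models, which is exactly why rolling your own is more work than it looks.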

OpenNLP has some tools to aid sentiment analysis, but if you want something more serious, you should look into the LingPipe toolkit. It has some built-in SA-functionality and a nice tutorial. And you can train it on your own set of data, but don't think that it is entirely trivial :-).

Googling for the term will probably also give you some resources to work with. If you have any more specific question, just ask, I'm watching the nlp-tag closely ;-)

OTHER TIPS

Some approaches to sentiment analysis reuse strategies that are popular in other text classification tasks. The most common is to transform each film review into a word vector and feed the labeled vectors into a classifier algorithm as training data. Most popular data mining packages can help you here. You could have a look at this tutorial on sentiment classification, which illustrates how to run an experiment using the open source RapidMiner toolkit.
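The word-vector idea can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the class, the choice of a multinomial Naive Bayes classifier with add-one smoothing, and the four toy training reviews are all made up, and none of it reflects how RapidMiner works internally.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Turn a review into a word-count vector (a Counter keyed by token)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # label -> number of reviews
        self.word_counts = {}               # label -> Counter of word counts
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(bag_of_words(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        vector = bag_of_words(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word, n in vector.items():
                if word in self.vocab:      # ignore words never seen in training
                    score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy, hand-written training data; a real experiment needs thousands of reviews.
train = [
    ("a wonderful, moving film with superb acting", "positive"),
    ("great story and great cast", "positive"),
    ("dull plot and terrible acting", "negative"),
    ("a boring, awful waste of time", "negative"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
print(model.predict("superb cast and a wonderful story"))
print(model.predict("terrible, boring plot"))
```

With so little training data the model only recognizes words it has literally seen, so treat it as a demonstration of the pipeline rather than a usable classifier.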

Incidentally, there is a good data set, made available for research purposes, for detecting opinion in film reviews. It is based on IMDB user reviews, and you can find a lot of related research in the area that shows how the data set is used.

It's worth bearing in mind that the effectiveness of these methods can only be judged from a statistical viewpoint, so you can pretty much assume there will be misclassifications and cases where opinion is hard to detect. As already noted in this thread, detecting things like irony and sarcasm can be very difficult indeed.
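In practice, "judging statistically" means comparing predicted labels against hand-labeled ones on reviews the classifier has not seen. Here is a minimal sketch of that bookkeeping; the function name and the five example labels are invented for illustration.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Return (accuracy, confusion), where confusion counts each
    (gold label, predicted label) pair."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    confusion = Counter(zip(gold, predicted))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    return correct / len(gold), confusion

# Invented labels for five held-out reviews.
gold = ["positive", "positive", "negative", "negative", "negative"]
predicted = ["positive", "negative", "negative", "negative", "positive"]

accuracy, confusion = evaluate(gold, predicted)
print(f"accuracy = {accuracy:.2f}")
for (g, p), n in sorted(confusion.items()):
    print(f"gold={g:<8} predicted={p:<8} count={n}")
```

The off-diagonal pairs of the confusion counts are the misclassifications; reading a sample of them is usually the quickest way to see where the method struggles.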

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow