Question

There's a hotel-review dataset with 1500 positive and 1500 negative files. To determine the accuracy of my algorithm, I first have to check the percentage positivity or negativity of each original file in the dataset.

I tried the basic percentage criterion:

positivity % = no. of positive words / (total positive words + total negative words)
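For reference, here is a minimal sketch of that criterion. The word lists are hypothetical toy examples rather than a real opinion lexicon, and the function name `positivity_percent` is made up for illustration.

```python
import re

# Hypothetical toy word lists; a real run would use a full opinion lexicon.
POSITIVE_WORDS = {"nice", "beautiful", "clean", "friendly"}
NEGATIVE_WORDS = {"dirty", "rude", "noisy", "bad"}


def positivity_percent(text):
    """Return 100 * positive / (positive + negative), or None if no opinion words occur."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos + neg == 0:
        return None  # the ratio is undefined without any opinion words
    return 100.0 * pos / (pos + neg)


print(positivity_percent("She's the most beautiful lady I've ever seen."))
print(positivity_percent("She is a nice lady."))
```

With these toy lists both sentences contain exactly one positive word and no negative ones, so they get the same value. A count-based ratio cannot tell the two apart.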

But this has no sound basis, so I can't build on it. Is there any other method or basis I can work from?

Example: "She's the most beautiful lady I've ever seen." should get a higher positivity percentage than "She is a nice lady."

I'm doing the work in Python.

No correct solution

OTHER TIPS

The first thing you can try is switching from a binary category for words (positive vs. negative) to a sliding scale. The SentiWordNet project provides this.

However, on your specific example this could actually make things worse: nice gives P = 0.875, whereas beautiful only gets P = 0.75. Of course you could fix the SentiWordNet ratings if you disagree, but I'd suggest doing that kind of tuning using an automatic system, with as much domain-specific training data as you can find.
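To make the sliding-scale idea concrete, here is a minimal sketch that averages graded scores instead of counting words. It does not call SentiWordNet; the small dictionary is a hypothetical stand-in that reuses the two P values quoted above, and the other entries are made up. Real SentiWordNet scores are attached to word senses (synsets) rather than plain words, so a real lookup would also have to pick a sense.

```python
import re

# Hypothetical graded lexicon with positivity scores in [0, 1].
# "nice" and "beautiful" use the values quoted above; the rest are invented.
POSITIVE_SCORE = {
    "nice": 0.875,
    "beautiful": 0.75,
    "clean": 0.5,
    "dirty": 0.0,
}


def mean_positivity(text, lexicon=POSITIVE_SCORE):
    """Average the graded scores of the words found in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[token] for token in tokens if token in lexicon]
    if not scores:
        return None  # no scored words in this text
    return sum(scores) / len(scores)


print(mean_positivity("She's the most beautiful lady I've ever seen."))  # 0.75
print(mean_positivity("She is a nice lady."))                            # 0.875
```

The "nice" sentence comes out ahead, which is the opposite of what the question asks for.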

BTW, there are at least a couple of Python interfaces to SentiWordNet.

Going back to your example, the key difference is the "the most [SOMETHING] I've ever seen" structure. Handling it requires switching from a bag-of-words approach to actually parsing and understanding the sentence. I have no useful leads to give you there, so I'll be as delighted as you if someone says there is a ready-made open-source package already doing that :-)
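Short of real parsing, a crude stopgap is to pattern-match that one structure and boost the word inside it. The sketch below is only a regular-expression heuristic, not sentence understanding; the scores, the `BOOST` factor and the function name are all hypothetical choices for illustration.

```python
import re

# Matches "the most <word> ... ever", e.g. "the most beautiful lady I've ever seen".
SUPERLATIVE = re.compile(r"\bthe most (\w+)\b.*?\bever\b", re.IGNORECASE)

BASE_SCORE = {"nice": 0.875, "beautiful": 0.75}  # hypothetical graded lexicon
BOOST = 1.5  # hypothetical multiplier for words inside the superlative frame


def boosted_positivity(text):
    """Average graded scores, boosting words found in a 'the most X ... ever' frame."""
    tokens = re.findall(r"[a-z']+", text.lower())
    emphasised = {match.group(1).lower() for match in SUPERLATIVE.finditer(text)}
    scores = []
    for token in tokens:
        if token in BASE_SCORE:
            score = BASE_SCORE[token]
            if token in emphasised:
                score = min(1.0, score * BOOST)  # cap at the top of the scale
            scores.append(score)
    return sum(scores) / len(scores) if scores else None


print(boosted_positivity("She's the most beautiful lady I've ever seen."))  # 1.0
print(boosted_positivity("She is a nice lady."))                            # 0.875
```

This gets the example ordering right, but it only covers one hard-coded phrasing and would miss "the loveliest lady I have met" or anything negated.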

I'd also like to mention the importance of context. Without any context, "She's a beautiful lady" and "She is a nice lady" are both simple and positive. But in the context of the hotel reviews, and their relevance to me, maybe "nice" is more useful than "beautiful". And, for fun, compare these two:

  • "The receptionist was a nice lady."

  • "At breakfast, at a table near to me, was the most beautiful lady I've ever seen. It was a welcome distraction from the food."

That is the challenge I love about sentiment analysis; the commercial applications are just excuses to work on problems like that!

Licensed under: CC-BY-SA with attribution