Question

I'm trying to create a sentiment analysis tool to analyse tweets about Manchester United football club over a three-day period and determine whether people view the club positively or negatively. I am currently following this guide (with Java as my coding language):

http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html

I am using Apache Flume to download my tweets into Apache Hadoop and then am intending to use Apache Hive to query the tweets. I may also use Apache Oozie to partition the tweets effectively.

In the link I posted above, it is mentioned that I need to have a training dataset to train the classifier I will create to analyse the tweets. The sample classifier provided has some 5000 tweets. As I am doing this for a summer project for uni, I feel I should probably create my own dataset.

What is the minimum number of tweets I should use to make this classifier effective? Is there a recommended number? For example, if I manually labelled a hundred tweets, or five hundred, or a thousand, would that be enough?


Solution

There is no exact number of examples needed to train a classifier. You can have a large dataset in which all the instances share the same attributes, so your classifier will just memorize one pattern, or you can have a smaller dataset with good instances, so your classifier will give better results.

You can train the classifier on the sample dataset provided in the post and use cross-validation to select the best classifier.

Once you have the best classifier, you can compare it with the classifier provided in the post and choose the better of the two.
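As a rough illustration of the cross-validation step, here is a minimal sketch in plain Java. Everything in it is an assumption made for the example: the `LabelledTweet` type, the toy `WordVoteClassifier` (a stand-in, not the classifier from the linked guide), and the made-up tweets in `main`. The point is only the shape of k-fold evaluation: shuffle the labelled tweets, hold out one fold at a time for testing, train on the rest, and average the accuracy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class CrossValidationSketch {

    /** One hand-labelled tweet (hypothetical type for this sketch). */
    static final class LabelledTweet {
        final String text;
        final boolean positive;

        LabelledTweet(String text, boolean positive) {
            this.text = text;
            this.positive = positive;
        }
    }

    /** Toy stand-in classifier: each word votes +1/-1 per positive/negative training tweet it appeared in. */
    static final class WordVoteClassifier {
        private final Map<String, Integer> score = new HashMap<>();

        void train(List<LabelledTweet> tweets) {
            for (LabelledTweet t : tweets) {
                for (String word : t.text.toLowerCase().split("\\s+")) {
                    score.merge(word, t.positive ? 1 : -1, Integer::sum);
                }
            }
        }

        boolean predict(String text) {
            int total = 0;
            for (String word : text.toLowerCase().split("\\s+")) {
                total += score.getOrDefault(word, 0);
            }
            return total >= 0; // ties and unseen words default to "positive"
        }
    }

    /** Mean accuracy over k folds; every tweet is used for testing exactly once. */
    static double crossValidate(List<LabelledTweet> data, int k, long seed) {
        if (k < 2 || k > data.size()) {
            throw new IllegalArgumentException("k must be between 2 and the dataset size");
        }
        List<LabelledTweet> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));

        double accuracySum = 0.0;
        for (int fold = 0; fold < k; fold++) {
            List<LabelledTweet> train = new ArrayList<>();
            List<LabelledTweet> test = new ArrayList<>();
            for (int i = 0; i < shuffled.size(); i++) {
                if (i % k == fold) {
                    test.add(shuffled.get(i));
                } else {
                    train.add(shuffled.get(i));
                }
            }
            // A fresh classifier per fold, so test tweets never leak into training.
            WordVoteClassifier classifier = new WordVoteClassifier();
            classifier.train(train);

            int correct = 0;
            for (LabelledTweet t : test) {
                if (classifier.predict(t.text) == t.positive) {
                    correct++;
                }
            }
            accuracySum += (double) correct / test.size();
        }
        return accuracySum / k;
    }

    public static void main(String[] args) {
        // Made-up data purely for demonstration; replace with your own labelled tweets.
        List<LabelledTweet> data = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            data.add(new LabelledTweet("united played great today " + i, true));
            data.add(new LabelledTweet("united were awful today " + i, false));
        }
        System.out.printf("5-fold accuracy: %.2f%n", crossValidate(data, 5, 42L));
    }
}
```

To compare two candidate classifiers, you would run the same folds (same seed) for each and keep the one with the higher mean accuracy. Running it on your own labelled sets of 100, 500 and 1000 tweets would also give you a feel for how much extra data actually helps.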

OTHER TIPS

Datasets are all different and their content often changes (unpredictably) with time. Sometimes you will find that 100 annotated tweets are enough to reach very good performance, because the language use was uniform. Sometimes, tens of thousands of tweets will not be enough. And just when you think your classifier is good, two days pass and what people talk about and how they talk about it changes. That same classifier is now useless. There is a large body of research on active learning and content analysis in changing data streams. Here and here are some papers to start your research.

PS If possible, use ready-made data sets. From personal experience, data annotation is extremely hard. Tweets are very tedious to read, and after you have stared at them for one hour you will make many mistakes and be bored out of your mind.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow