Question

Say I download n tweets, remove all words of length <= 2 from them, and then label each tweet as 'Negative' or 'Non-negative', so that this forms my training set.

But instead of having well-defined attributes, like the Iris data set has Sepal Length, Sepal Width, Petal Length and Petal Width, in my data set every word simply becomes an attribute, and different example tweets will have different numbers of attributes.

Can I use this data set and treat my problem as a classification problem, and try to predict whether a new tweet is Negative or Non-negative?

Or what would you suggest as the best way to predict whether a tweet is negative or not?


Solution

You are describing a standard text classification problem. In this setting, the set of features is a (finite) set of words instead of sepal length, sepal width, and so on.

As a result, each document is represented with respect to all of these features, so all documents have the same number of features. Most of the values will be zero, because any one tweet contains only a few of the words, which makes each vector very sparse.
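To make the representation concrete, here is a minimal sketch in plain Python. The helper names (tokenize, build_vocabulary, vectorize) and the three sample tweets are made up for illustration; they are not from any library. The vocabulary is built from the training tweets only, so every tweet maps to a vector of the same length.

```python
import re

def tokenize(tweet):
    """Lowercase, keep alphabetic runs, and drop words of length <= 2."""
    return [w for w in re.findall(r"[a-z]+", tweet.lower()) if len(w) > 2]

def build_vocabulary(tweets):
    """Map every distinct training word to a fixed column index."""
    words = sorted({w for t in tweets for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}

def vectorize(tweet, vocab):
    """Return a word-count vector with one entry per vocabulary word."""
    vec = [0] * len(vocab)
    for w in tokenize(tweet):
        if w in vocab:  # words never seen in training are ignored
            vec[vocab[w]] += 1
    return vec

# Hypothetical sample tweets, for illustration only.
tweets = ["I hate this phone", "What a lovely day", "This day is awful"]
vocab = build_vocabulary(tweets)
vectors = [vectorize(t, vocab) for t in tweets]
```

Each vector has one slot per vocabulary word, and most slots are zero. With a realistic vocabulary of tens of thousands of words you would store only the non-zero entries (a sparse format) rather than full lists.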

This is the standard way to predict polarity/sentiment, but it is worth reading up on the topic before you go further. I would suggest Sebastiani's survey on text classification.
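As one possible starting point, here is a rough sketch of a multinomial Naive Bayes classifier with add-one smoothing over word counts. This is my choice of a common baseline, not something the survey prescribes, and the function names and the four labeled tweets are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(tweet):
    """Lowercase, keep alphabetic runs, and drop words of length <= 2."""
    return [w for w in re.findall(r"[a-z]+", tweet.lower()) if len(w) > 2]

def train(labeled_tweets):
    """Collect per-class word counts, class frequencies, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_tweets:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(tweet, model):
    """Return the class with the highest log-probability for the tweet."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: fraction of training tweets carrying this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokenize(tweet):
            if w in vocab:  # skip words never seen in training
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled tweets, for illustration only.
training = [
    ("I hate this awful phone", "Negative"),
    ("Worst service ever, really awful", "Negative"),
    ("What a lovely day", "Non negative"),
    ("Really love this great phone", "Non negative"),
]
model = train(training)
```

On this toy data, predict("awful service", model) returns "Negative". With real tweets you would want far more training examples and a held-out set to measure accuracy.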

Regards,

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow