Domanda

say I download 'n' number of tweets and remove words with length <= 2 from them and then label each tweet as 'Negative' or 'Non negative', so that this forms my training set.

but instead of having well defined attributes like how an Iris data-set has Sepal Length, Sepal Width, Petal Length and Petal Width, in my data-set simply every word becomes an attribute and different example tweets will have different number of attributes.

Can I use this data-set and consider my problem as a classification problem ? and try to predict whether a new tweet is Negative or Non-Negative?

or what would you suggest as the best way to predict whether a tweet is negative or not ?

È stato utile?

Soluzione

You are describing a standar text classification problem. In this setting, the set of features is a (finite) set of words instead of the Sepal length, width, ...

As a result, each document is represented with respect to all such features (all documents have the same number of features) but most of the values will be zero, creating a very sparse vector.

This is the best way to predict polarity/sentiment but you should improve your knowledge of the topic a bit more. I would suggest a read of Sebastiani's survey on Text Classification.

Regards,

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top