Question

Linked: https://stackoverflow.com/questions/18154278/is-there-a-maximum-size-for-the-nltk-naive-bayes-classifer

I'm having trouble implementing a scikit-learn machine learning algorithm in my code. One of the authors of scikit-learn kindly helped me in the question linked above, but I can't quite get it working, and since my original question was about a different matter, I thought it best to open a new one.

This code takes an input file of tweets and reads their text and sentiment into a dictionary. It then parses each line of text, adding the text to one list and its sentiment to another (on the advice of the author in the linked question above).

However, despite using the code from the link and looking up the API as best I can, I think I am missing something. Running the code below first gives me a bunch of output separated by colons, like this:

  (0, 299)  0.270522159585
  (0, 271)  0.32340892262
  (0, 266)  0.361182814311
  : :
  (48, 123) 0.240644787937

followed by:

['negative', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', etc]

and then:

ValueError: empty vocabulary; perhaps the documents only contain stop words

Am I assigning the classifier in the wrong way? This is my code:

import csv
import string
import HTMLParser  # Python 2; Python 3 moved this to html.parser
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

test_file = 'RawTweetDataset/SmallSample.csv'
#test_file = 'RawTweetDataset/Dataset.csv'
sample_tweets = 'SampleTweets/FlumeData2.txt'
csv_file = csv.DictReader(open(test_file, 'rb'), delimiter=',', quotechar='"')

tweetsDict = {}

for line in csv_file:
    tweetsDict[line['SentimentText']] = line['Sentiment']

tweets = []
labels = []
shortenedText = ""
for (text, sentiment) in tweetsDict.items():
    text = HTMLParser.HTMLParser().unescape(text.decode("cp1252", "ignore"))
    exclude = set(string.punctuation)
    for punct in string.punctuation:
        text = text.replace(punct,"")
    cleanedText = [e.lower() for e in text.split() if not e.startswith(('http', '@'))]
    shortenedText = [e.strip() for e in cleanedText if e not in exclude]

    text = ' '.join(ch for ch in shortenedText if ch not in exclude)
    tweets.append(text.encode("utf-8", "ignore"))
    labels.append(sentiment)

vectorizer = TfidfVectorizer(input='content')
X = vectorizer.fit_transform(tweets)
y = labels
classifier = MultinomialNB().fit(X, y)

X_test = vectorizer.fit_transform(sample_tweets)
y_pred = classifier.predict(X_test)

Update: Current code:

import glob

all_files = glob.glob(tweet_location)  # tweet_location: placeholder for the tweet file pattern
for filename in all_files:
    with open(filename, 'r') as file:
        for line in file.readlines():
            X_test = vectorizer.transform([line])
            y_pred = classifier.predict(X_test)
            print line
            print y_pred

This always produces something like:

happy bday trish
['negative'] << Never changes, always negative

Solution

The problem is here:

X_test = vectorizer.fit_transform(sample_tweets)

fit_transform is intended to be called on the training set, not the test set. On the test set, call transform.
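
As a minimal sketch of the intended pattern (reusing the names from your code; list_of_sample_tweets stands for a list of tweet strings read from the sample file):

# Learn the vocabulary and IDF weights from the training tweets only
X = vectorizer.fit_transform(tweets)
classifier = MultinomialNB().fit(X, y)

# At prediction time, reuse the already-fitted vocabulary: transform, never fit again
X_test = vectorizer.transform(list_of_sample_tweets)
y_pred = classifier.predict(X_test)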

Also, sample_tweets is a filename, not a sequence of documents. Passing the bare string to the vectorizer makes it iterate over the filename character by character, and since no single character matches the default token pattern, that is what triggers the "empty vocabulary" error. Open the file and read the tweets from it before passing them to the vectorizer. If you do that, then you should finally be able to do something like

for tweet, sentiment in zip(list_of_sample_tweets, y_pred):
    print("Tweet: %s" % tweet)
    print("Sentiment: %s" % sentiment)

OTHER TIPS

To do this in TextBlob (as alluded to in the comments), you would do

from textblob import TextBlob  # older TextBlob releases used: from text.blob import TextBlob

tweets = ['This is tweet one, and I am happy.', 'This is tweet two and I am sad']

for tweet in tweets:
    blob = TextBlob(tweet)
    print blob.sentiment  # returns (polarity, subjectivity)
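
If you want a positive/negative label comparable to the classifier's output, one option is to threshold the polarity score; the 0.0 cutoff below is an assumption, not a TextBlob convention:

for tweet in tweets:
    polarity = TextBlob(tweet).sentiment[0]  # first element is polarity, in [-1.0, 1.0]
    print 'positive' if polarity >= 0 else 'negative'  # 0.0 cutoff is an arbitrary choice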
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow