Linked: https://stackoverflow.com/questions/18154278/is-there-a-maximum-size-for-the-nltk-naive-bayes-classifer
I'm having trouble getting a scikit-learn machine learning algorithm working in my code. One of the scikit-learn authors kindly helped me in the question linked above, but I can't quite get it working, and since my original question was about a different matter, I thought it best to open a new one.
This code takes an input of tweets and reads each one's text and sentiment into a dictionary. It then parses each line of text, adding the text to one list and its sentiment to another (on the advice of the author in the linked question above).
However, despite using the code from the link and reading the API docs as best I can, I think I'm missing something. Running the code below first prints a bunch of (row, column) pairs with values, like this:
(0, 299) 0.270522159585
(0, 271) 0.32340892262
(0, 266) 0.361182814311
: :
(48, 123) 0.240644787937
followed by:
['negative', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', etc]
and then:
ValueError: empty vocabulary; perhaps the documents only contain stop words
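For what it's worth, I can reproduce the exact error message with a tiny snippet. My guess (and it is only a guess) is that the vectorizer somewhere ends up iterating over single characters instead of whole tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(input='content')
# Feed the vectorizer one-character "documents" (as if a bare string
# had been iterated character by character). None of them match the
# default tokenizer (which needs 2+ word characters), so fitting
# fails with the same ValueError:
try:
    vec.fit_transform(list('SampleTweets/FlumeData2.txt'))
    msg = ''
except ValueError as e:
    msg = str(e)
print(msg)
```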
Am I assigning the classifier in the wrong way? This is my code:
test_file = 'RawTweetDataset/SmallSample.csv'
#test_file = 'RawTweetDataset/Dataset.csv'
sample_tweets = 'SampleTweets/FlumeData2.txt'
csv_file = csv.DictReader(open(test_file, 'rb'), delimiter=',', quotechar='"')

# Map each tweet's text to its sentiment label
tweetsDict = {}
for line in csv_file:
    tweetsDict[line['SentimentText']] = line['Sentiment']

tweets = []
labels = []
shortenedText = ""
for (text, sentiment) in tweetsDict.items():
    # Unescape HTML entities, then strip punctuation, URLs and @mentions
    text = HTMLParser.HTMLParser().unescape(text.decode("cp1252", "ignore"))
    exclude = set(string.punctuation)
    for punct in string.punctuation:
        text = text.replace(punct, "")
    cleanedText = [e.lower() for e in text.split() if not e.startswith(('http', '@'))]
    shortenedText = [e.strip() for e in cleanedText if e not in exclude]
    text = ' '.join(ch for ch in shortenedText if ch not in exclude)
    tweets.append(text.encode("utf-8", "ignore"))
    labels.append(sentiment)

vectorizer = TfidfVectorizer(input='content')
X = vectorizer.fit_transform(tweets)
y = labels
classifier = MultinomialNB().fit(X, y)

X_test = vectorizer.fit_transform(sample_tweets)
y_pred = classifier.predict(X_test)
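For reference, here is what the cleaning step above does to a made-up tweet (the tweet text is invented, not from my dataset):

```python
import string

# An invented example tweet, not from my real data
text = "Happy bday!! http://t.co/xyz So happy for you :)"

# Same cleaning as above: strip punctuation, then drop URL tokens
for punct in string.punctuation:
    text = text.replace(punct, "")
words = [w.lower() for w in text.split() if not w.startswith(('http', '@'))]
print(' '.join(words))
```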
Update: Current code:
all_files = glob.glob (tweet location)
for filename in all_files:
    with open(filename, 'r') as file:
        for line in file.readlines():
            X_test = vectorizer.transform([line])
            y_pred = classifier.predict(X_test)
            print line
            print y_pred
This always produces something like:
happy bday trish
['negative'] << Never changes, always negative
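Oddly, a self-contained toy version of the same fit/transform/predict round trip (with invented sentences, not my real data) behaves as I'd expect, so I assume the problem is in how my own data reaches the classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy training data, not my real tweets
train = ["happy great day", "sad awful day", "love this so much", "hate this so much"]
train_labels = ['positive', 'negative', 'positive', 'negative']

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(train)          # vocabulary is learned here
toy_clf = MultinomialNB().fit(toy_X, train_labels)

# New text goes through transform(), which reuses the learned vocabulary
toy_pred = toy_clf.predict(toy_vec.transform(["happy bday trish"]))
print(toy_pred[0])
```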