There's probably a fix for the numpy overflow issue, but since this is just a movie review classifier for learning NLTK / text classification (and you probably don't want training to take a long time anyway), I'll offer a simple workaround: restrict the words used in the feature sets. You can find the 300 most commonly used words across all reviews like this (you can obviously raise that number if you want):
all_words = nltk.FreqDist(movie_reviews.words())
top_words = {word for word, count in all_words.most_common(300)}
Note that FreqDist.keys() isn't ordered by frequency (and isn't sliceable on Python 3), so most_common is the right way to get the top entries. Then all you have to do is check membership in top_words in your feature extractor for reviews. Also, just as a suggestion, it's more efficient to use a dictionary comprehension than to convert a list of tuples to a dict. So this might look like:
def word_feats(words):
    return {word: True for word in words if word in top_words}
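To see what the extractor produces, here's a minimal, corpus-free sketch — the small top_words set below is just a stand-in for the 300 most common words you'd build from movie_reviews:

```python
# Stand-in for the set of most common words (normally built from the corpus).
top_words = {"great", "terrible", "plot", "acting"}

def word_feats(words):
    # Only words present in top_words become features, keeping feature sets small.
    return {word: True for word in words if word in top_words}

review = ["the", "acting", "was", "great", "but", "the", "plot", "dragged"]
print(word_feats(review))  # {'acting': True, 'great': True, 'plot': True}
```

The resulting dicts plug straight into nltk's classifier API as the feature half of (featureset, label) pairs.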