There's probably a fix for the numpy overflow issue, but since this is just a movie review classifier for learning NLTK / text classification (and you probably don't want training to take a long time anyway), I'll offer a simple workaround: restrict the words used in the feature sets. You can find the 300 most commonly used words across all reviews like this (you can obviously raise that number if you want):
all_words = nltk.FreqDist(movie_reviews.words())
top_words = {word for word, count in all_words.most_common(300)}
Note that FreqDist.keys() isn't ordered by frequency (and isn't sliceable on Python 3), so most_common is the right way to get the top entries. Then all you have to do is check membership in top_words in your feature extractor for reviews. Also, just as a suggestion, it's more efficient to use a dictionary comprehension than to convert a list of tuples to a dict. So this might look like:
def word_feats(words):
    return {word: True for word in words if word in top_words}
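To see what the extractor produces, here's a minimal, corpus-free sketch — the small top_words set below is just a stand-in for the 300 most common words you'd build from movie_reviews:

```python
# Stand-in for the set of most common words (normally built from the corpus).
top_words = {"great", "terrible", "plot", "acting"}

def word_feats(words):
    # Only words present in top_words become features, keeping feature sets small.
    return {word: True for word in words if word in top_words}

review = ["the", "acting", "was", "great", "but", "the", "plot", "dragged"]
print(word_feats(review))  # {'acting': True, 'great': True, 'plot': True}
```

The resulting dicts plug straight into nltk's classifier API as the feature half of (featureset, label) pairs.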