CountVectorizer will extract the trigrams for you (use ngram_range=(3, 3)); the text feature extraction documentation introduces this. Then just use MultinomialNB exactly as before, with the transformed feature matrix.
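Assuming scikit-learn, a minimal sketch of that recipe (the corpus and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus and labels, purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
    "logs and mats are objects",
]
labels = [0, 0, 1, 1]

# ngram_range=(3, 3) keeps only word trigrams as features.
vectorizer = CountVectorizer(ngram_range=(3, 3))
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["the cat sat on the log"])))
```

Note the usual pattern: fit_transform on the training text, then transform (only) on new text, so train and test share the same trigram vocabulary.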
Note that this is actually modeling:
P(document | label) = P(word_X, word_{X-1}, word_{X-2} | label) * P(word_{X-1}, word_{X-2}, word_{X-3} | label) * ... * P(word_3, word_2, word_1 | label)
How different is that? Well, that first term can be written as
P(word_X, word_{X-1}, word_{X-2} | label) = P(word_X | word_{X-1}, word_{X-2}, label) * P(word_{X-1}, word_{X-2} | label)
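That rewrite is just the chain rule of probability. A quick numeric check on a made-up joint distribution (the distribution itself is arbitrary; the conditional is defined as the usual ratio of joint to marginal, so the identity holds exactly):

```python
import itertools

# Arbitrary made-up joint distribution over three binary "words",
# normalized to sum to 1.
raw = {(a, b, c): a + 2 * b + 3 * c + 1
       for a, b, c in itertools.product([0, 1], repeat=3)}
total = sum(raw.values())
p = {k: v / total for k, v in raw.items()}

# Marginal P(b, c), obtained by summing out a.
p_bc = {}
for (a, b, c), v in p.items():
    p_bc[(b, c)] = p_bc.get((b, c), 0.0) + v

# Chain rule: P(a, b, c) = P(a | b, c) * P(b, c),
# where P(a | b, c) = P(a, b, c) / P(b, c).
for (a, b, c), p_abc in p.items():
    p_a_given_bc = p_abc / p_bc[(b, c)]
    assert abs(p_abc - p_a_given_bc * p_bc[(b, c)]) < 1e-12
```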
Of course, all the other terms can be written that way too, so (writing just the word indices and dropping the conditioning on the label for brevity) you end up with
P(X | X-1, X-2) P(X-1 | X-2, X-3) ... P(3 | 2, 1) P(X-1, X-2) P(X-2, X-3) ... P(2, 1)
Now, P(X-1, X-2) can be written as P(X-1 | X-2) P(X-2). So if we do that for all those terms, we have
P(X | X-1, X-2) P(X-1 | X-2, X-3) ... P(3 | 2, 1) P(X-1 | X-2) P(X-2 | X-3) ... P(2 | 1) P(X-2) P(X-3) ... P(1)
So this is actually like using trigrams, bigrams, and unigrams (though not estimating the bigram/unigram terms directly).