Question

I am using the NgramModel from nltk to calculate the probability of finding a certain word in a sentence. My problem is that it returns exactly the same probability for a given word every time, regardless of the context. Here is some sample code that demonstrates the problem.

from nltk.corpus import brown
from nltk.probability import LidstoneProbDist, WittenBellProbDist
from nltk.model import NgramModel

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

lm = NgramModel(3, brown.words(categories='news'), estimator=estimator)
>>> print lm.prob("word", ["This is a context which generates a word"])
0.00493261081006
>>> print lm.prob("word", ["This is a context of a word"])
0.00493261081006
>>> print lm.prob("word", ["This word"])
0.00493261081006
>>> print lm.prob("word", ["word"])
0.00493261081006
>>> print lm.prob("word", ["adnga"])
0.00493261081006

Solution

There are a few problems here. First, the context should be a list of the individual words preceding the target (for a trigram model, the two preceding words), not a single string containing the whole sentence; and it shouldn't contain the word itself unless the word is genuinely repeated. Second, the Brown corpus is small, so unless you query an n-gram that was actually observed in the training data, you fall back to the smoothing model and get the same answer; in your example you are hitting the smoothing model every time. In my example I use bigrams instead, so that I am not constantly hitting the smoothing model. Third, in practice LidstoneProbDist is pretty bad: it's the simplest thing that could possibly work for smoothing, not something you'd want to use for real. SimpleGoodTuringProbDist is much better.

from nltk.corpus import brown
from nltk.probability import LidstoneProbDist
from nltk.model import NgramModel

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

lm = NgramModel(2, brown.words(categories='news'), estimator=estimator)

lm.prob("good", ["very"])          # 0.0024521936223426436
lm.prob("good", ["not"])           # 0.0019510849023145812
lm.prob("good", ["unknown_term"])  # 0.017437821314436573
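To see why every unseen context collapses to the same number, here is a minimal stdlib-only sketch of Lidstone-smoothed trigram probabilities (a toy illustration with a made-up corpus and a hypothetical `lidstone_prob` helper, not NLTK's implementation):

```python
from collections import Counter

def lidstone_prob(word, context, trigram_counts, bigram_counts, vocab_size, gamma=0.2):
    """Lidstone-smoothed estimate: (c(context, word) + gamma) / (c(context) + gamma * V)."""
    num = trigram_counts[(context, word)] + gamma
    den = bigram_counts[context] + gamma * vocab_size
    return num / den

# Tiny toy corpus; counts are illustrative, not taken from Brown.
corpus = "this is a test this is a word".split()
trigram_counts = Counter((tuple(corpus[i:i + 2]), corpus[i + 2]) for i in range(len(corpus) - 2))
bigram_counts = Counter(tuple(corpus[i:i + 2]) for i in range(len(corpus) - 1))
V = len(set(corpus))

# Any context never seen in training has count 0, so the estimate reduces to
# gamma / (gamma * V) = 1 / V -- the same value for every unseen context.
p1 = lidstone_prob("word", ("totally", "unseen"), trigram_counts, bigram_counts, V)
p2 = lidstone_prob("word", ("another", "unseen"), trigram_counts, bigram_counts, V)
assert p1 == p2 == 1 / V

# An observed context, by contrast, actually changes the estimate.
p3 = lidstone_prob("word", ("is", "a"), trigram_counts, bigram_counts, V)
```

Passing a whole sentence as one string makes the context a single unseen "token", which is why every query in the question lands in the uniform fallback.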
Licensed under: CC-BY-SA with attribution