Training and evaluating bigram/trigram distributions with NgramModel in nltk, using Witten-Bell smoothing

StackOverflow https://stackoverflow.com/questions/15697623

  •  30-03-2022

Question

I would like to train an NgramModel on one set of sentences, using Witten-Bell smoothing to estimate the unseen ngrams, and then use it to get the log-likelihood of a test set having been generated by that distribution. I want to do almost the same thing as in the documentation example found here: http://nltk.org/_modules/nltk/model/ngram.html, but with Witten-Bell smoothing instead. Here's some toy code trying to do about what I want to do:

from nltk.probability import WittenBellProbDist
from nltk import NgramModel

est = lambda fdist, bins: WittenBellProbDist(fdist)
fake_train = [str(t) for t in range(3000)]
fake_test = [str(t) for t in range(2900, 3010)]

lm = NgramModel(2, fake_train, estimator = est)

print lm.entropy(fake_test)

Unfortunately, when I try running that, I get the following error:

Traceback (most recent call last):
  File "ngram.py", line 8, in <module>
    lm = NgramModel(2, fake_train, estimator = est)
  File "/usr/lib/python2.7/dist-packages/nltk/model/ngram.py", line 63, in __init__
    self._model = ConditionalProbDist(cfd, estimator, len(cfd))
  File "/usr/lib/python2.7/dist-packages/nltk/probability.py", line 2016, in __init__
    **factory_kw_args)
  File "ngram.py", line 4, in <lambda>
    est = lambda fdist, bins: WittenBellProbDist(fdist)
  File "/usr/lib/python2.7/dist-packages/nltk/probability.py", line 1210, in __init__
    self._P0 = self._T / float(self._Z * (self._N + self._T))
ZeroDivisionError: float division by zero

What's causing this error? As far as I can tell I'm using everything correctly according to the documentation, and this works fine when I use Lidstone instead of Witten-Bell.

As a second question, my data is a collection of disjoint sentences. How can I use the sentences like a list of lists of strings, or do something equivalent that would produce the same distribution? (Of course I could just concatenate all the sentences into one list with a dummy token separating them, but that wouldn't produce the same distribution.) The documentation says in one place that a list of lists of strings is allowed, but I then found a bug report where the documentation was supposedly edited to say that it isn't (and when I try a list of lists of strings I do get an error).


Solution

It's apparently been a known issue for almost 3 years. The ZeroDivisionError is caused by the following lines in __init__:

if bins == None: 
    bins = freqdist.B() 
self._freqdist = freqdist 
self._T = self._freqdist.B() 
self._Z = bins - self._freqdist.B() 

Whenever the bins argument is not specified, it defaults to None, so self._Z is really just freqdist.B() - freqdist.B() = 0, and

self._P0 = self._T / float(self._Z * (self._N + self._T))

reduces down to,

self._P0 = freqdist.B() / 0.0
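
One way to sidestep the division by zero (though not the NotImplementedError described next) is to pass an explicit bins value larger than freqdist.B() when constructing WittenBellProbDist. A minimal sketch, where the "+ 1" is an arbitrary illustrative choice, not a statistically motivated one:

from nltk.probability import WittenBellProbDist

# Sketch only: forcing bins > freqdist.B() makes self._Z nonzero, so _P0
# can be computed; it does not by itself make the smoothing meaningful.
est = lambda fdist, bins: WittenBellProbDist(fdist, bins=fdist.B() + 1)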

Additionally, if you specify bins as any value greater than freqdist.B(), then when this line of your code executes,

print lm.entropy(fake_test)

you will receive NotImplementedError because within the WittenBellProbDist class,

def discount(self): 
    raise NotImplementedError()

The discount method is apparently also used in prob and logprob of the NgramModel class, so you won't be able to call those either.

One way to fix these problems, without changing NLTK, would be to inherit from WittenBellProbDist and override the relevant methods.
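
A minimal sketch of that approach, assuming you are happy with returning the Witten-Bell unseen mass T / (N + T) as the discount (check that this matches how NgramModel actually uses discount() before relying on it):

from nltk.probability import WittenBellProbDist

class PatchedWittenBellProbDist(WittenBellProbDist):
    # Hypothetical subclass: supplies the discount() that the parent
    # class leaves unimplemented, using the probability mass reserved
    # for unseen events, T / (N + T).
    def discount(self):
        if self._N + self._T == 0:
            return 0.0
        return self._T / float(self._N + self._T)

est = lambda fdist, bins: PatchedWittenBellProbDist(fdist, bins=fdist.B() + 1)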

OTHER TIPS

Update Dec 2018

NLTK 3.4 contains the reworked ngram modeling module, importable as nltk.lm.
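
A minimal sketch of the nltk.lm API (names as of NLTK 3.4+; the training sentences below are placeholders). Note that it takes a list of tokenized sentences directly, which also addresses the second part of the question:

from nltk.lm import WittenBellInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Placeholder data: each sentence is already a list of tokens.
train_sents = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
n = 2

# Build padded training ngrams and the vocabulary stream from the sentences.
train_data, padded_vocab = padded_everygram_pipeline(n, train_sents)

lm = WittenBellInterpolated(n)
lm.fit(train_data, padded_vocab)

# entropy() expects an iterable of ngram tuples from the test data.
test_ngrams = [('a', 'b'), ('b', 'c')]
print(lm.entropy(test_ngrams))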

I would stay away from NLTK's NgramModel for the time being. There is currently a smoothing bug that causes the model to greatly overestimate likelihoods when n > 1. This applies to all estimators, including WittenBellProbDist and even LidstoneProbDist. I think this error has been around for a few years, which suggests that this part of NLTK is not well tested.

See: https://github.com/nltk/nltk/issues/367

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow