Python NLTK exercise: chapter 5

Question 1

wsj is of type nltk.corpus.reader.util.ConcatenatedCorpusView, which behaves like a list (this is why you can use methods like index()). Behind the scenes, however, NLTK never reads the whole list into memory; it reads only the parts of the underlying file that it needs. It seems that if you iterate over a CorpusView object and call index() (which has to iterate again) at the same time, the file object returns None.

This way it works, though it is less elegant than a list comprehension:

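  wanted = set(cfd2['VN'].keys())  # build the lookup set once, not per token
  for i in range(len(wsj)):
    if wsj[i][0] in wanted:
      # max() keeps i-1 from becoming -1 when the match is the first token
      print wsj[max(i-1, 0):(i+1)]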

Question 2

Looks like both the index call and the slicing cause an exception:

import nltk

# simplify_tags=True is the NLTK 2.x API
wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
cfd2 = nltk.ConditionalFreqDist((t,w) for w,t in wsj)
wanted = cfd2['VN'].keys()

# just getting the index -> exception before 60 items
for w, t in wsj:
    if w in wanted:
        print wsj.index((w,t))

# just slicing -> sometimes finishes, sometimes throws exception
for i, (w,t) in enumerate(wsj):
    if w in wanted:
        print wsj[i-1:i+1]

I'm guessing it's caused by accessing previous items in a stream that you are iterating over.

It works fine if you iterate once over wsj to create a list of indices and use them in a second iteration to grab the slices:

results = [
    wsj[j-1:j+1]
    for j in [
        i for i, (w,t) in enumerate(wsj)
        if w in wanted
    ]
]
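
If the trouble really does come from reaching back into the view while iterating over it, another option is to avoid random access altogether and remember the previous item yourself. The following is only a minimal sketch: the helper name with_previous and the small hand-made sample list are assumptions for illustration, and with the real corpus you would pass wsj and wanted instead.

```python
def with_previous(tagged_words, wanted):
    """Yield [previous, current] for every word in `wanted`, in a single pass."""
    prev = None
    for pair in tagged_words:
        word, _tag = pair
        if word in wanted:
            # The very first token has no predecessor, so yield it alone.
            yield [pair] if prev is None else [prev, pair]
        prev = pair


# Hand-made stand-in for the tagged corpus (not real treebank data).
sample = [('has', 'V'), ('been', 'VN'), ('asked', 'VN'), ('.', '.')]
pairs = list(with_previous(sample, {'been', 'asked'}))
```

Because the generator only ever moves forward, it never asks the sequence for an earlier position, so it should not matter whether the input is a list or a lazily loaded view.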

As a side note: calling index() without a start argument returns the first match every time, so a repeated item is always reported at the position of its first occurrence.
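
To illustrate the point about index(), here is a small sketch on a plain Python list. The sample data and the helper name all_indices are made up for the example. list.index() accepts an optional start position, so passing the position just after the last hit finds the next occurrence instead of the first one again.

```python
def all_indices(items, value):
    """Return every position of `value` in `items` using index(value, start)."""
    positions = []
    start = 0
    while True:
        try:
            pos = items.index(value, start)
        except ValueError:
            # index() raises ValueError once there are no more matches.
            return positions
        positions.append(pos)
        start = pos + 1  # resume the search just past the last hit


# Hand-made sample with a repeated item (not real treebank data).
sample = [('the', 'DET'), ('.', '.'), ('the', 'DET'), ('.', '.')]
first_only = sample.index(('.', '.'))         # always the first match
every_hit = all_indices(sample, ('.', '.'))   # one position per occurrence
```

Inside a loop, enumerate() already gives you the current position, so in most cases you do not need index() at all.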

Question 3

wsj is of type ConcatenatedCorpusView, and I think it is choking on the punctuation tuple ('.', '.'). The easiest solution is to convert the ConcatenatedCorpusView to a list explicitly, which reads the whole corpus into memory:

wsj = list(wsj)

Iteration works fine then. Getting the index of a duplicate item is a separate problem. See: https://gist.github.com/denten/11388676