Question

I am new to Python and I came across a code sample using NLTK as below (in IDLE):

>>> letters = nltk.FreqDist('ageqwst')
>>> words = nltk.corpus.words.words()
>>> [word for word in words if nltk.FreqDist(word) <= letters]

and the output generated showed words that can be made with characters in letters.

I am wondering how the nltk.FreqDist(word) <= letters part works.

I looked up the NLTK documentation on FreqDist and could not find a satisfying answer. I found that type(letters) returns an object of the nltk.probability.FreqDist class, and I would like to know how this object comparison works in Python.

Does Python also have an overridable method like Java's compareTo?

Thanks,

Was it helpful?

Solution

The FreqDist.__init__(samples) constructor builds a dict-like mapping where:

  • key = sample
  • value = count (frequency) of the sample

So in your case:

nltk.FreqDist('ageqwst')
<FreqDist: 'a': 1, 'e': 1, 'g': 1, 'q': 1, 's': 1, 't': 1, 'w': 1>
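
For intuition, FreqDist counts samples much like collections.Counter from the standard library (in NLTK 3 it is in fact a Counter subclass), so you can reproduce the counts yourself; a minimal sketch:

>>> from collections import Counter
>>> Counter('ageqwst')
Counter({'a': 1, 'g': 1, 'e': 1, 'q': 1, 'w': 1, 's': 1, 't': 1})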

Then in your list comprehension statement,

[word for word in words if nltk.FreqDist(word) <= letters]

it builds a FreqDist for each word in the corpus, so the if clause is comparing two FreqDist objects. With the <= operator, it keeps only words whose letter counts are less than or equal to the counts in the sample, letters. The important thing to note here is the "less than" piece: a word does not have to use every letter of the sample, but it may not use any letter more often than the sample does (a letter missing from the sample counts as zero, so words containing it are rejected).
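
This also answers the compareTo question: rather than a single compareTo method, Python classes can override the rich comparison methods __lt__, __le__, __eq__, and so on, and FreqDist overrides them to compare counts key by key. Roughly, the <= test boils down to something like the hypothetical helper below (a sketch of the idea, not NLTK's actual source):

def is_sub_distribution(word_fd, sample_fd):
    # Every letter of the word may appear at most as often as it does in the
    # sample; a letter missing from the sample has a count of 0, so any word
    # using it fails the test.
    return all(count <= sample_fd[letter] for letter, count in word_fd.items())

# e.g. is_sub_distribution(nltk.FreqDist('stage'), nltk.FreqDist('ageqwst')) -> True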

So if we instead change the operator to strict equality,

[word for word in words if nltk.FreqDist(word) == letters]

it would return an empty list, since no word in the corpus contains exactly one occurrence of each of the letters in 'ageqwst' and nothing else.
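
You can confirm this without building the whole list; assuming the words corpus is available, something like:

>>> letters = nltk.FreqDist('ageqwst')
>>> any(nltk.FreqDist(w) == letters for w in nltk.corpus.words.words())
False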

Take this statement for example:

words = nltk.corpus.words.words()
foo = nltk.FreqDist('foo')

print([word for word in words if nltk.FreqDist(word) <= foo])
# ['f', 'foo', 'o', 'of', 'of']

No surprises here, and we also see that our original sample ('foo') appears in the list. If we again switch the operator to strict equality,

print([word for word in words if nltk.FreqDist(word) == foo])
# ['foo']

we get a list of the only word that has the exact same sample distribution as ours.

One final example:

words = nltk.corpus.words.words()
bar = nltk.FreqDist('bar')

print([word for word in words if nltk.FreqDist(word) <= bar])
# ['a', 'ar', 'b', 'ba', 'bar', 'bra', 'r', 'ra', 'rab', 'a']

We still see our sample ('bar') in the list; however, there are two other words with the same letter distribution as ours, so if we run

print([word for word in words if nltk.FreqDist(word) == bar])
# ['bar', 'bra', 'rab']

we get our original sample ('bar') plus its two anagrams, 'bra' and 'rab'. This highlights that the order of the sample is irrelevant, which is consistent with the behavior of Python mapping types.
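
You can see this order-insensitivity directly by comparing FreqDist objects built from anagrams:

>>> nltk.FreqDist('bar') == nltk.FreqDist('rab')
True
>>> nltk.FreqDist('bar') == nltk.FreqDist('barr')
False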

I would highly recommend you read through the NLTK Book. Yes, it is long and, yes, it is dry at times, but it goes into a lot of the theory and methodology behind the different modules. Given the level of curiosity in your question, I think you would find it insightful.

Other tips

Basically, we get a dictionary with each individual character as the key and the count (frequency) of that particular letter in the string as the value. So if we have:

fdist = nltk.FreqDist('abcdefg')

We'll get:

FreqDist({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1,'f': 1, 'g': 1})

So every letter appears exactly once. Next, if we use:

wordlist = nltk.corpus.words.words()

We will get the whole words corpus to compare against our sample fdist dictionary. Now if we write this list comprehension:

[w for w in wordlist if nltk.FreqDist(w) <= fdist]

We will get a whole bunch of words made up only of letters present in our string 'abcdefg', with each letter appearing no more often than its frequency in the fdist dictionary. The output is of the form:

['a','abed','ace','ad','ade','ae','age','aged','b','ba','bac','bad','bade',...]
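
If you want to try this end to end, here is a minimal self-contained sketch; it assumes the words corpus has already been fetched with nltk.download('words') (uncomment that line on the first run):

import nltk

# nltk.download('words')  # fetch the corpus on the first run

fdist = nltk.FreqDist('abcdefg')        # letter counts of the sample
wordlist = nltk.corpus.words.words()    # the full words corpus

# keep only words built from the letters of 'abcdefg', each letter used
# no more often than it appears in the sample
matches = [w for w in wordlist if nltk.FreqDist(w) <= fdist]
print(matches[:10])
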
Licensed under: CC-BY-SA with attribution