The FreqDist.__init__(samples) constructor creates a dict where:
- key = sample
- value = count (frequency) of the sample
So in your case:
nltk.FreqDist('ageqwst')
<FreqDist: 'a': 1, 'e': 1, 'g': 1, 'q': 1, 's': 1, 't': 1, 'w': 1>
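If you want to poke at the counts without NLTK installed, here is a minimal sketch that uses collections.Counter from the standard library as a stand-in for FreqDist. In NLTK 3, FreqDist is a subclass of Counter, so the counting should behave the same way; on older versions, treat that as an assumption.

```python
from collections import Counter

# Counter stands in for nltk.FreqDist here: both map each sample to its count.
letters = Counter('ageqwst')
print(sorted(letters.items()))

# A repeated letter gets a count above 1.
foo = Counter('foo')
print(foo['o'], foo['f'], foo['z'])  # a sample that never occurs counts as 0
```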
Then in your list comprehension statement,
[word for word in words if nltk.FreqDist(word) <= letters]
it builds a FreqDist for each word in the corpus, so the if clause has two FreqDist dictionaries to compare. With the <= operator, a word passes only if each of its letters occurs no more often in the word than it does in the sample, letters. The important thing to note here is the "less than" piece: a letter the word does not use has a count of 0, so a word does not have to use every letter in the sample, but it cannot use a letter the sample lacks.
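As a rough sketch of what that <= test amounts to, the helper below checks that every letter of a word occurs no more often than it does in the sample. Note that fits_within is a hypothetical name of my own, not part of NLTK, and that it uses collections.Counter in place of FreqDist so it runs without the corpus.

```python
from collections import Counter

def fits_within(word, sample):
    """Return True if `word` can be spelled using only the letters in `sample`."""
    word_counts = Counter(word)
    sample_counts = Counter(sample)
    # Every letter of the word must appear at most as often as in the sample.
    # Letters absent from the sample have a count of 0, so they fail the check.
    return all(count <= sample_counts[letter]
               for letter, count in word_counts.items())

print(fits_within('gate', 'ageqwst'))   # uses four of the sample letters once each
print(fits_within('sat', 'ageqwst'))    # unused sample letters are fine
print(fits_within('eggs', 'ageqwst'))   # needs two g's, the sample has one
print(fits_within('zoo', 'ageqwst'))    # 'z' and 'o' are not in the sample
```

The first two calls should come back True and the last two False, which mirrors how the list comprehension keeps or drops each word.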
So if we change the operator to strict equality,
[word for word in words if nltk.FreqDist(word) == letters]
it would return an empty list, since no word in the provided corpus consists of exactly one occurrence of each letter in the sample, 'ageqwst'.
Take this statement for example:
import nltk
words = nltk.corpus.words.words()
foo = nltk.FreqDist('foo')
print([word for word in words if nltk.FreqDist(word) <= foo])
>>> ['f', 'foo', 'o', 'of', 'of']
No surprises here, and we also see that our original sample ('foo') appears in the list. So if we change our operator to strict equality,
print([word for word in words if nltk.FreqDist(word) == foo])
>>> ['foo']
we get a list of the only word that has the exact same sample distribution as ours.
One final example:
words = nltk.corpus.words.words()
bar = nltk.FreqDist('bar')
print([word for word in words if nltk.FreqDist(word) <= bar])
>>> ['a', 'ar', 'b', 'ba', 'bar', 'bra', 'r', 'ra', 'rab', 'a']
We still see our sample ('bar') in the list; however, there are two other words with the same sample distribution as ours, so if we run
print([word for word in words if nltk.FreqDist(word) == bar])
>>> ['bar', 'bra', 'rab']
we still get our original sample ('bar') plus two permutations of it, 'bra' and 'rab'. This highlights the fact that the order of the samples is irrelevant, which is consistent with the behavior of Python mapping types.
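You can check the same point on a small scale. This sketch again uses collections.Counter as a stand-in for FreqDist, and a made-up word list rather than the NLTK corpus, so the names and data here are illustrative only.

```python
from collections import Counter

# Hypothetical mini word list, not the NLTK corpus.
words = ['bar', 'bra', 'rab', 'ba', 'barb', 'foo']
bar = Counter('bar')

# Equality compares the count of each letter, so letter order plays no part.
anagrams = [word for word in words if Counter(word) == bar]
print(anagrams)
```

Only the entries with exactly one 'b', one 'a' and one 'r' survive; 'ba' is missing a letter and 'barb' has one 'b' too many.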
I would highly recommend you read through the NLTK Book. Yes, it's long and, yes, it's dry at times, but it goes into a lot of the theory and methodology behind the different modules. Based on the level of curiosity in your question, I think you would find it insightful.