Extracting a set of words with Python/NLTK, then comparing it to a standard English dictionary
Question
I have:
from __future__ import division
import nltk, re, pprint
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]
f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishtext]
which is straight from the NLTK manual. What I want to do next is to compare vocab to an exhaustive set of English words, like the OED, and extract the difference -- the set of Finnegans Wake words that are not, and probably never will be, in the OED. I'm much more of a verbal person than a math-oriented person, so I haven't figured out how to do that yet, and the manual goes into way too much detail about stuff I don't actually want to do. I'm assuming it's just one or two more lines of code, though.
Solution
If your English dictionary is indeed a set (hopefully of lowercased words), then
set(vocab) - english_dictionary
gives you the set of words which are in the vocab set but not in the english_dictionary one. (It's a pity that you turned vocab into a list with that sorted call, since you need to turn it back into a set to perform operations such as this set difference!)
If your English dictionary is in some different format, not really a set or not comprised only of lowercased words, you'll have to tell us what that format is for us to be able to help!-)
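To make the set-difference idea concrete, here is a minimal sketch with made-up toy words (not the OP's actual data):

```python
# Toy illustration of the set difference described above.
# Both collections are already sets of lowercased words.
vocab = {"riverrun", "the", "past", "eve", "quark"}
english_dictionary = {"the", "past", "eve"}

# Words in vocab that the dictionary doesn't contain
unknown = vocab - english_dictionary
print(sorted(unknown))  # ['quark', 'riverrun']
```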
Edit: given that the OP's edit shows that both words (what was previously called vocab) and englishwords (what I previously called english_dictionary) are in fact lists of lowercased words, then
newwords = set(words) - set(englishwords)
or
newwords = set(words).difference(englishwords)
are two ways to express "the set of words that are not englishwords". The former is slightly more concise; the latter is perhaps a bit more readable (since it uses the word "difference" explicitly, instead of a minus sign) and perhaps a bit more efficient, since it doesn't explicitly transform the list englishwords into a set -- though, if speed is crucial, this needs to be checked by measurement, since "internally" difference still needs to do some kind of "transformation-to-set"-like operation.
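A quick sketch (again with toy data) showing that the two spellings give the same result, and that difference happily accepts a plain list without an explicit set() conversion:

```python
# Toy lists standing in for the OP's tokenized texts
words = ["riverrun", "the", "the", "quark"]
englishwords = ["the", "a", "of"]

a = set(words) - set(englishwords)          # operator form: both sides must be sets
b = set(words).difference(englishwords)     # method form: accepts any iterable
assert a == b
print(sorted(a))  # ['quark', 'riverrun']
```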
If you're keen to have a list as the result instead of a set, sorted(newwords) will give you an alphabetically sorted list (list(newwords) would give you a list a bit faster, but in totally arbitrary order, and I suspect you'd rather wait a tiny extra amount of time and get, in return, a nicely alphabetized result;-).
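Putting it together, the tail end of the pipeline might look like this (toy lists stand in for the OP's real words and englishwords):

```python
# Assuming words and englishwords are lists of lowercased tokens, as in the question
words = ["riverrun", "past", "eve", "quark", "quark"]
englishwords = ["past", "eve", "the"]

newwords = set(words) - set(englishwords)
print(sorted(newwords))  # alphabetized list: ['quark', 'riverrun']
print(list(newwords))    # same items, but in arbitrary order
```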