Extracting a set of words with Python/NLTK, then comparing it to a standard English dictionary
Question
I have:
from __future__ import division
import nltk, re, pprint
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]
f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishtext]
which is straight from the NLTK manual. What I want to do next is to compare vocab to an exhaustive set of English words, like the OED, and extract the difference -- the set of Finnegans Wake words that are not, and probably never will be, in the OED. I'm much more of a verbal person than a math-oriented person, so I haven't figured out how to do that yet, and the manual goes into way too much detail about stuff I don't actually want to do. I'm assuming it's just one or two more lines of code, though.
Solution
If your English dictionary is indeed a set (hopefully of lowercased words), then
set(vocab) - english_dictionary
gives you the set of words which are in the vocab set but not in the english_dictionary one. (It's a pity that you turned vocab into a list with that sorted call, since you need to turn it back into a set to perform operations such as this set difference!)
If your English dictionary is in some different format, not really a set or not comprised only of lowercased words, you'll have to tell us what that format is for us to be able to help!-)
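To make the set-difference idea concrete, here is a minimal sketch with made-up toy words (not the OP's actual data):

```python
# Toy illustration of the set difference described above.
# Both collections are already sets of lowercased words.
vocab = {"riverrun", "the", "past", "eve", "quark"}
english_dictionary = {"the", "past", "eve"}

# Words in vocab that the dictionary doesn't contain
unknown = vocab - english_dictionary
print(sorted(unknown))  # ['quark', 'riverrun']
```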
Edit: given that the OP's edit shows that both words (what was previously called vocab) and englishwords (what I previously called english_dictionary) are in fact lists of lowercased words, then
newwords = set(words) - set(englishwords)
or
newwords = set(words).difference(englishwords)
are two ways to express "the set of words that are not englishwords". The former is slightly more concise; the latter is perhaps a bit more readable (since it uses the word "difference" explicitly, instead of a minus sign) and perhaps a bit more efficient, since it doesn't explicitly transform the list englishwords into a set -- though, if speed is crucial, this needs to be checked by measurement, since "internally" difference still needs to do some kind of "transformation-to-set"-like operation.
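A quick sketch (again with toy data) showing that the two spellings give the same result, and that difference happily accepts a plain list without an explicit set() conversion:

```python
# Toy lists standing in for the OP's tokenized texts
words = ["riverrun", "the", "the", "quark"]
englishwords = ["the", "a", "of"]

a = set(words) - set(englishwords)          # operator form: both sides must be sets
b = set(words).difference(englishwords)     # method form: accepts any iterable
assert a == b
print(sorted(a))  # ['quark', 'riverrun']
```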
If you're keen to have a list as the result instead of a set, sorted(newwords) will give you an alphabetically sorted list (list(newwords) would give you a list a bit faster, but in totally arbitrary order, and I suspect you'd rather wait a tiny extra amount of time and get, in return, a nicely alphabetized result;-).
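Putting it together, the tail end of the pipeline might look like this (toy lists stand in for the OP's real words and englishwords):

```python
# Assuming words and englishwords are lists of lowercased tokens, as in the question
words = ["riverrun", "past", "eve", "quark", "quark"]
englishwords = ["past", "eve", "the"]

newwords = set(words) - set(englishwords)
print(sorted(newwords))  # alphabetized list: ['quark', 'riverrun']
print(list(newwords))    # same items, but in arbitrary order
```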