Question

I'd like to write a simple function to check whether a word 'exists' in WordNet via NLTK.

import nltk
from nltk.corpus import wordnet as wn

def is_known(word):
    """Return True if this word "exists" in WordNet
       (or at least in nltk.corpus.stopwords)."""
    if word.lower() in nltk.corpus.stopwords.words('english'):
        return True
    # wn.synsets() returns an empty list when WordNet has no entry for the word
    return len(wn.synsets(word)) > 0

Why would words like could, since, without, although return False? Don't they appear in WordNet? Is there any better way to find out whether a word exists in WN (using NLTK)?

My first try was to eliminate "stopwords" which are words like to, if, when, then, I, you, but there are still very common words (like could) which I can't find.

Solution

WordNet does not contain these words or words like them. For an explanation, see the following from the WordNet docs:

Q. Why is WordNet missing: of, an, the, and, about, above, because, etc.
A. WordNet only contains "open-class words": nouns, verbs, adjectives, and adverbs. Thus, excluded words include determiners, prepositions, pronouns, conjunctions, and particles.

You also won't find these kinds of words in the online version of WordNet.
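
To see this distinction in practice, a check can report why a word was or was not found instead of returning a bare True/False. The following is only a sketch: `classify_word`, `nltk_classifier`, and their parameter names are hypothetical, and the lookups are passed in as arguments so the logic can be tried without the WordNet data. `nltk_classifier` shows one possible way to wire it to NLTK, assuming the `wordnet` and `stopwords` corpora have been downloaded.

```python
def classify_word(word, has_synsets, closed_class_words):
    """Return 'wordnet', 'closed-class', or 'unknown' for a word.

    has_synsets: callable returning True if WordNet has a synset for the word.
    closed_class_words: set of lowercase function words (e.g. a stopword list).
    """
    if has_synsets(word):
        return 'wordnet'
    if word.lower() in closed_class_words:
        return 'closed-class'
    return 'unknown'


def nltk_classifier():
    """Build a one-argument classifier backed by NLTK.

    The import is done lazily, so defining this function does not
    require NLTK; calling it does (and needs the corpora installed).
    """
    from nltk.corpus import stopwords, wordnet as wn
    stop = set(stopwords.words('english'))  # build the set once, not per call
    return lambda word: classify_word(word, lambda w: bool(wn.synsets(w)), stop)
```

As the question shows, a stopword list does not cover every closed-class word, so an 'unknown' result for a word like could does not mean the word is rare; it only means neither source lists it.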

OTHER TIPS

You can try to extract all the lemmas in wordnet and then check against that list:

from nltk.corpus import wordnet as wn
from itertools import chain

# In NLTK 3 and later, lemma_names is a method, not an attribute.
all_lemmas = set(chain.from_iterable(s.lemma_names() for s in wn.all_synsets()))

def in_wordnet(word):
    return word in all_lemmas

print(in_wordnet('can'))
print(in_wordnet('could'))

[out]:

True
False

Note that WordNet contains lemmas, not words. Also note that a word/lemma can be polysemous and may not really be a content word in a given sentence, e.g.

I can foo bar. vs The water can is heavy
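
Because the set above holds lemmas, an inflected form such as a plural will not match it directly. One possible workaround is sketched below, assuming a lemmatizer callable such as NLTK's `wn.morphy`, which returns a base form or `None`. The function name `in_wordnet_lemmatized` and its parameters are hypothetical.

```python
def in_wordnet_lemmatized(word, lemmas, lemmatize):
    """Check a surface form against a set of WordNet lemma names.

    lemmas: set of lemma names (multi-word lemmas use underscores in WordNet).
    lemmatize: callable mapping a word to its base form, or None if unknown.
    """
    key = word.replace(' ', '_')
    # Try the form as given first (proper nouns are capitalized in WordNet),
    # then the lowercased form.
    for candidate in (key, key.lower()):
        if candidate in lemmas:
            return True
        base = lemmatize(candidate)
        if base is not None and base in lemmas:
            return True
    return False

# With NLTK (assumes the wordnet corpus is installed):
#   from nltk.corpus import wordnet as wn
#   in_wordnet_lemmatized('dogs', all_lemmas, wn.morphy)
```

This still does not say which sense of a word is meant in a sentence. To restrict a lookup to one part of speech, `wn.synsets` accepts a `pos` argument, e.g. `wn.synsets('can', pos=wn.NOUN)`.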

Licensed under: CC-BY-SA with attribution