Question

I'm using the WordNet Lemmatizer via NLTK on the Brown Corpus (to determine whether the nouns in it are used more in their singular form or their plural form), i.e.:

from nltk.stem.wordnet import WordNetLemmatizer
l = WordNetLemmatizer()

I've noticed that even the simplest queries, such as the one below, take quite a long time (at least a second or two):

l.lemmatize('cats')

Presumably this is because a web connection must be made to WordNet for each query?
I'm wondering if there is a way to keep using the WordNet Lemmatizer but have it perform much faster. For instance, would it help to download WordNet onto my machine? Any other suggestions?

I'm trying to figure out whether the WordNet Lemmatizer can be made faster, rather than trying a different one, because I've found it gives better results than stemmers like Porter and Lancaster.
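
For reference, here is a simplified sketch of the kind of loop I'm running (the singular/plural test is rough: a noun counts as singular when it already equals its lemma):

from collections import Counter
from nltk.corpus import brown
from nltk.stem.wordnet import WordNetLemmatizer

l = WordNetLemmatizer()
counts = Counter()
for word, tag in brown.tagged_words():
    if tag.startswith('NN'):                    # noun tags in the Brown tagset
        lemma = l.lemmatize(word.lower(), 'n')  # this call is the slow part
        counts['singular' if lemma == word.lower() else 'plural'] += 1

print(counts)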


Solution 2

I've used the lemmatizer like this:

# To download the corpora first: python -m nltk.downloader all
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()      # create a lemmatizer object
lemma = lmtzr.lemmatize('cats')  # 'cat'

It is not slow at all on my machine. There is no need to connect to the web to do this.

OTHER TIPS

It doesn't query the internet; NLTK reads WordNet from your local machine. When you run the first query, NLTK loads WordNet from disk into memory:

>>> from time import time
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> t = time(); wnl.lemmatize('dogs'); print time() - t, 'seconds'
u'dog'
3.38199806213 seconds
>>> t = time(); wnl.lemmatize('cats'); print time() - t, 'seconds'
u'cat'
0.000236034393311 seconds
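
If you would rather pay that one-time load cost at startup (say, in a long-running service) instead of on the first query, you can force the load explicitly. A minimal sketch, relying on the fact that nltk.corpus.wordnet is a lazy corpus loader with an ensure_loaded() method:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wordnet.ensure_loaded()  # trigger the one-time load from disk now

wnl = WordNetLemmatizer()
wnl.lemmatize('dogs')    # fast even on the first call: u'dog'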

It is still rather slow if you have to lemmatize many thousands of phrases. However, if you are doing a lot of redundant queries, you can get some speedup by caching the results of the function:

from nltk.stem import WordNetLemmatizer
from functools32 import lru_cache  # Python 2 backport of functools.lru_cache

wnl = WordNetLemmatizer()
# Memoize up to 50,000 distinct queries; repeats are served from the cache.
lemmatize = lru_cache(maxsize=50000)(wnl.lemmatize)

lemmatize('dogs')
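
On Python 3 the same pattern needs no third-party package, since lru_cache lives in the standard library's functools. A sketch of the equivalent setup:

from functools import lru_cache
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# Same idea as above: cache up to 50,000 distinct queries.
lemmatize = lru_cache(maxsize=50000)(wnl.lemmatize)

lemmatize('dogs')  # 'dog'

Note that the cache keys on the exact arguments, so only repeated calls with the same word (and part of speech) are served from the cache.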
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow