Question

I ran the following:

from nltk import WordNetLemmatizer as wnl
wnl().lemmatize("American")
wnl().lemmatize("Americans")

Both calls simply return their argument unchanged. I would like "Americans" to reduce to "American". Does anybody have an idea how to make this happen?

I assumed I'd have to modify whatever internal dictionary the lemmatizer is using. Is this correct? Does anybody know a better way?

Thanks!

Solution

You can convert the word to lower case before giving it to the lemmatizer, and restore the case afterwards.

I have used this code in the past:

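from nltk import WordNetLemmatizer as wnl

word = 'Americans'
# Lemmatize the lowercased form of the word
lemmatized = wnl().lemmatize(word.lower())
if word.istitle():
    # The original started with a capital letter, so restore it
    word = lemmatized.capitalize()
else:
    word = lemmatized
# word is now 'American'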

This assumes that no word contains uppercase letters beyond the first one (like "MySpace"), which held for my data at the time. I think the assumption is generally safe, since words with internal uppercase letters tend to be proper nouns, and those usually do not need to be lemmatized.
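If mixed-case words do turn up in your input, one option is to detect them and skip lemmatization for them. The following is only a minimal sketch: the helper name `is_mixed_case` is made up for illustration, and it assumes that "neither lowercase, all uppercase, nor title case" is a good enough test for words like "MySpace".

```python
def is_mixed_case(word):
    """Return True for words such as 'MySpace' or 'iPhone'.

    A word counts as mixed case here when it contains at least one
    uppercase letter but is neither all uppercase nor title case.
    """
    return word != word.lower() and not word.isupper() and not word.istitle()


for candidate in ["Americans", "AMERICANS", "americans", "MySpace", "iPhone"]:
    print(candidate, is_mixed_case(candidate))
```

Words for which this returns True could be passed through unchanged instead of being lowercased and lemmatized.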

If you also need to handle all-UPPERCASE words, you can add a branch for that case:

word = 'AMERICANS'
lemmatized = wnl().lemmatize(word.lower())
if word.istitle():
    word = lemmatized.capitalize()
elif word.isupper():
    word = lemmatized.upper()
else:
    word = lemmatized
# word = 'AMERICAN'
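
The same logic can be wrapped in a reusable function. This is a sketch rather than a definitive implementation: the names `lemmatize_keep_case` and `toy_lemmatize` are hypothetical, and the lemmatizer is passed in as a callable so the case handling can be tried without WordNet data installed.

```python
def lemmatize_keep_case(word, lemmatize):
    """Lemmatize `word` in lowercase, then reapply its original casing.

    `lemmatize` is any callable that maps a lowercase string to its lemma.
    """
    lemma = lemmatize(word.lower())
    if word.istitle():
        # Title case in, title case out
        return lemma.capitalize()
    if word.isupper():
        # All caps in, all caps out
        return lemma.upper()
    # Lowercase or mixed-case input: return the lowercase lemma as-is
    return lemma


# Stand-in lemmatizer for demonstration only (NOT WordNet):
# it just strips a single trailing 's'.
def toy_lemmatize(word):
    return word[:-1] if word.endswith("s") else word


for w in ["Americans", "AMERICANS", "americans"]:
    print(w, "->", lemmatize_keep_case(w, toy_lemmatize))
```

With NLTK, you would pass `wnl().lemmatize` as the second argument in place of the toy function.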

Other tips

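Apparently case matters to WordNet: the lowercase form is lemmatized as expected. Alternatively, you can use PorterStemmer, which strips suffixes by rule instead of looking words up. Note that newer NLTK releases lowercase the stemmer's input by default, so depending on your version the last call may return 'american' instead of 'American'.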

>>> wnl().lemmatize('americans')
'american'
>>> from nltk.stem import PorterStemmer as ps
>>> ps().stem('Americans')
'American'
Licensed under: CC-BY-SA with attribution