Question

I would like to apply lemmatization to reduce words to their base forms. I know that WordNet provides this functionality for English, but I am also interested in lemmatizing Dutch, French, Spanish and Italian words. Is there a reliable, proven way to go about this? Thank you!

Solution

Try the Pattern library from CLiPS. It supports German, English, Spanish, French and Italian, and also Dutch, so it covers the languages you asked about: http://www.clips.ua.ac.be/pattern

Unfortunately, it only works with Python 2; Python 3 is not supported yet.

OTHER TIPS

The textacy library (http://textacy.readthedocs.io/en/latest/api_reference.html) provides tools for building a bag of words or a bag of terms, with lemmatization available as an option. I've tried it with Spanish and it works reasonably well.

doc.to_bag_of_terms(ngrams=2, named_entities=True, lemmatize=True, as_strings=True)

The library automatically detects the language of the text and lemmatizes accordingly. However, you can also specify the language explicitly:

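import textacy
text = 'Los gatos y los perros juegan juntos en el patio de su casa'
# lang='es' sets the language explicitly instead of relying on auto-detection
doc = textacy.Doc(text, lang='es')
# Count the words, normalizing each one to its lemma
print(doc.to_bag_of_words(normalize='lemma', as_strings=True))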

You'll get output like the following: {'perro': 1, 'y': 1, 'gato': 1, 'jugar': 1, 'casar': 1, 'Los': 1, 'patio': 1}

The library recognizes some of the words well, but the lemmas are not perfect: in the output above, 'casa' became 'casar' and 'Los' was left unlemmatized. Hope this helps.
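
To show what a lemma-normalized bag of words amounts to, independent of any particular library, here is a minimal sketch using only the Python standard library. The lookup table TOY_LEMMAS_ES and the helper bag_of_lemmas are hypothetical names made up for this illustration; they are not part of textacy or Pattern, and a real lemmatizer relies on a full dictionary or a trained model for each language rather than a hand-written table.

```python
from collections import Counter

# Hypothetical toy lemma table, for illustration only. A real lemmatizer
# uses a full dictionary or a statistical model for each language.
TOY_LEMMAS_ES = {
    "los": "el",
    "gatos": "gato",
    "perros": "perro",
    "juegan": "jugar",
    "juntos": "junto",
}


def bag_of_lemmas(text, lemmas):
    """Count lemmas in text; words missing from the table are kept as-is."""
    tokens = text.lower().split()
    return Counter(lemmas.get(tok, tok) for tok in tokens)


text = "Los gatos y los perros juegan juntos en el patio de su casa"
bag = bag_of_lemmas(text, TOY_LEMMAS_ES)
print(bag["el"], bag["gato"], bag["casa"])
```

Because inflected forms such as 'Los', 'los' and 'el' all map to the same lemma, they are counted together. Words that are not in the table fall through unchanged, which is a deliberate simplification: it avoids wrong guesses like 'casa' turning into 'casar', at the cost of leaving unknown words unlemmatized.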

Licensed under: CC-BY-SA with attribution