Which word stemmer should I use in nltk?

https://stackoverflow.com/questions/1264847

13-09-2019

Question

My goal is to analyze some corpus (Twitter for now) for emotional content. Just today I realized it would make more sense to search for word stems than to maintain an exhaustive list of emotional words. So I've been exploring nltk.stem, only to find that it offers several different stemmers. I'd like to ask the Stack Overflow linguists whether LancasterStemmer, PorterStemmer, RegexpStemmer, RSLPStemmer, or WordNetStemmer is best, preferably with some justification.

Solution

RSLP is for Portuguese, and I'm guessing you want English. Regexp would require you to develop your own stemming expressions, so I think that can be ignored as well. The WordNet stemmer (WordNetLemmatizer in current NLTK releases) requires that you know the part of speech of each word, so you'd have to do POS tagging first in order to use it. I've used the Porter stemming algorithm and it's pretty good. The Lancaster algorithm is newer, so it might be better, though it tends to stem more aggressively. You might want to try a combination of stemmers, where you run each one on a word and keep the shortest stem. Anyway, the bottom line is that PorterStemmer is a good default choice.
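As a rough illustration of the "shortest stem" idea, here is a minimal sketch that runs Porter and Lancaster side by side. The helper name shortest_stem and the sample words are my own choices for illustration, not part of NLTK.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

# Neither stemmer needs any extra corpus download.
porter = PorterStemmer()
lancaster = LancasterStemmer()


def shortest_stem(word, stemmers=(porter, lancaster)):
    """Hypothetical helper: stem `word` with every stemmer and
    return the shortest candidate (ties go to the earlier stemmer)."""
    candidates = [stemmer.stem(word) for stemmer in stemmers]
    return min(candidates, key=len)


# Print the stems side by side to judge which stemmer suits the corpus.
for word in ["happiness", "angrily", "fearful", "running"]:
    print(word, porter.stem(word), lancaster.stem(word), shortest_stem(word))
```

Shorter stems match more word forms but also conflate more unrelated words, so it is worth eyeballing the output on a sample of your own data before settling on this rule.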

OTHER TIPS

It may be a bit different from what you are asking, but the Nodebox Linguistics library contains an is_emotive() function which seems to check whether words are recursive hyponyms of certain emotional words. From commonsense.py:

    ekman = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
    other = ["emotion", "feeling", "expression"]

Not a stemmer, but an interesting approach to check out.
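If you would rather stay inside NLTK, something similar can be approximated with its WordNet interface. The sketch below is not Nodebox's implementation, only my guess at the idea: it assumes the WordNet corpus has been downloaded (nltk.download("wordnet")), looks at noun senses only, and the names EMOTION_ROOTS and is_emotive are chosen here for illustration.

```python
from nltk.corpus import wordnet as wn

# Root words taken from the two lists quoted above.
EMOTION_ROOTS = ["anger", "disgust", "fear", "joy", "sadness", "surprise",
                 "emotion", "feeling", "expression"]


def is_emotive(word, roots=EMOTION_ROOTS):
    """Return True if any noun sense of `word` is one of the root
    senses or has a root sense among its (recursive) hypernyms."""
    root_synsets = {s for root in roots for s in wn.synsets(root, pos=wn.NOUN)}
    for synset in wn.synsets(word, pos=wn.NOUN):
        if synset in root_synsets:
            return True
        # closure() walks the hypernym relation transitively.
        ancestors = set(synset.closure(lambda s: s.hypernyms()))
        if ancestors & root_synsets:
            return True
    return False
```

Calling is_emotive("rage") or is_emotive("table") then gives a quick yes/no per word. Because every noun sense of each root is accepted, broad roots such as "feeling" and "expression" will let through some non-emotional words, so treat the result as a first filter rather than a final answer.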

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow