Question

I noticed that after applying Porter stemming (from the NLTK library) I get strange stems such as "cowardli" or "contrari". To me they don't look like stems at all.

Is this expected? Could it be that I made a mistake somewhere?

Here is my code:

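import nltk

# `string` holds the raw text to process
string = string.lower()
tokenized = nltk.tokenize.regexp_tokenize(string, "[a-z]+")

# Build the stopword set once instead of reloading the list for every token
stopwords = set(nltk.corpus.stopwords.words("english"))
filtered = [w for w in tokenized if w not in stopwords]

stemmer = nltk.stem.porter.PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]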

And here is the text I used for processing: http://pastebin.com/XUMNCYAU (the beginning of "Crime and Punishment" by Dostoevsky).


Solution

First, let's look at the different stemmers and the lemmatizer that NLTK provides:

>>> from nltk import stem
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> porter = stem.porter.PorterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> wnl = stem.wordnet.WordNetLemmatizer()
>>> word = "cowardly"
>>> lancaster.stem(word)
'coward'
>>> porter.stem(word)
u'cowardli'
>>> snowball.stem(word)
u'coward'
>>> wnl.stem(word)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'WordNetLemmatizer' object has no attribute 'stem'
>>> wnl.lemmatize(word)
'cowardly'

Note: WordNetLemmatizer is not a stemmer, so it has no stem() method. Its lemmatize() returns the lemma of "cowardly", which in this case is the same word.

The Porter stemmer is the only one here that changes cowardly -> cowardli. Let's look at the code to see why: http://www.nltk.org/_modules/nltk/stem/porter.html#PorterStemmer.

This seems to be the part that turns ly -> li:

def _step1c(self, word):
    """step1c() turns terminal y to i when there is another vowel in the stem.
    --NEW--: This has been modified from the original Porter algorithm so that y->i
    is only done when y is preceded by a consonant, but not if the stem
    is only a single consonant, i.e.

       (*c and not c) Y -> I

    So 'happy' -> 'happi', but
      'enjoy' -> 'enjoy'  etc

    This is a much better rule. Formerly 'enjoy'->'enjoi' and 'enjoyment'->
    'enjoy'. Step 1c is perhaps done too soon; but with this modification that
    no longer really matters.

    Also, the removal of the vowelinstem(z) condition means that 'spy', 'fly',
    'try' ... stem to 'spi', 'fli', 'tri' and conflate with 'spied', 'tried',
    'flies' ...
    """
    if word[-1] == 'y' and len(word) > 2 and self._cons(word, len(word) - 2):
        return word[:-1] + 'i'
    else:
        return word
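
To see the rule in isolation, here is a minimal standalone sketch of that step in plain Python. The names `is_consonant` and `step1c` are hypothetical stand-ins for the stemmer's `_cons` and `_step1c` methods, written under the assumption that `_cons` treats 'y' as a consonant only at the start of a word or after a vowel. It covers this one step only, not the full Porter algorithm.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Return True if word[i] counts as a consonant (assumed _cons behaviour)."""
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        # 'y' is a consonant at the start of a word or right after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def step1c(word):
    """Turn a terminal 'y' into 'i' when the letter before it is a consonant."""
    if len(word) > 2 and word[-1] == "y" and is_consonant(word, len(word) - 2):
        return word[:-1] + "i"
    return word


for w in ["cowardly", "contrary", "happy", "enjoy"]:
    print(w, "->", step1c(w))
# cowardly -> cowardli
# contrary -> contrari
# happy -> happi
# enjoy -> enjoy
```

In "cowardly" and "contrary" the final y follows a consonant (l and r), so the rule fires. The "cowardli" and "contrari" stems therefore come from this rule in the stemmer, not from a mistake in the preprocessing code.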
Licensed under: CC-BY-SA with attribution