Question

I am implementing a search application over a corpus of large text documents. While processing each file, I tokenize all the words and call Step 1 of the Porter Stemmer algorithm (http://tartarus.org/~martin/PorterStemmer/csharp2.txt).

Step 1 gets rid of plurals and the -ed or -ing suffixes.

I noticed that a word like 'this' will be stemmed into 'thi'.

Is that the normal operation of the algorithm? I ask because I wanted the word 'this' itself as a token.


Solution

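From what you describe, my hunch is that 'this' is treated as a plural form by the Porter Stemmer algorithm and reduced to 'thi'. Step 1a strips a trailing -s from any word that does not end in -sses, -ies, or -ss, and it does not check whether the word is actually a plural.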

I do not find an explicit reference to non-plural words ending with s in Porter's paper.

http://tartarus.org/~martin/PorterStemmer/def.txt
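
To make the behavior concrete, here is a minimal Python sketch of the four Step 1a suffix rules as listed in the definition above (SSES -> SS, IES -> I, SS -> SS, S -> nothing, longest match first). It is not the C# code you linked, only an illustration of the rules. The names `step1a`, `stem_token`, and `PROTECTED_WORDS` are made up for this example, and the protected-word list is an assumption of mine, not part of Porter's algorithm.

```python
def step1a(word: str) -> str:
    """Apply only the Step 1a suffix rules; the longest matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (cats -> cat, but also this -> thi)
    return word


# Hypothetical workaround, not part of the Porter algorithm:
# words listed here bypass stemming entirely.
PROTECTED_WORDS = {"this"}


def stem_token(token: str) -> str:
    """Return the token unchanged if it is protected, otherwise apply Step 1a."""
    if token in PROTECTED_WORDS:
        return token
    return step1a(token)


if __name__ == "__main__":
    for w in ["caresses", "ponies", "caress", "cats", "this"]:
        print(w, "->", step1a(w), "| with protection:", stem_token(w))
```

So 'thi' looks like the expected output of the rules rather than a bug in the implementation. If you need 'this' to survive, one option is to check tokens against an exception list before stemming, as `stem_token` does here. Since the same stemming is normally applied to query terms as well, 'this' in a query would also become 'thi' and still match the index.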

Licensed under: CC-BY-SA with attribution