Question

I am implementing a search application over a corpus of large text documents. While processing each file, I tokenize all the words and call Step1 of the Porter Stemmer algorithm (http://tartarus.org/~martin/PorterStemmer/csharp2.txt).

Step1 gets rid of plurals and -ed or -ing...

I noticed that a word like 'this' will be stemmed into 'thi'.

Is that the normal operation of the algorithm? I expected the token for 'this' to stay as 'this'.


Solution

From what you describe, yes, this looks like normal behaviour. Step 1a of the Porter Stemmer strips a final "s" unless the word ends in "ss", so 'this' is treated like a plural form and reduced to 'thi'.

I could not find any explicit handling of non-plural words ending in "s" in Porter's definition of the algorithm:

http://tartarus.org/~martin/PorterStemmer/def.txt
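
To make the behaviour concrete, here is a minimal sketch of the Step 1a suffix rules as listed in def.txt (SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is written in Python rather than C#, the function name `step1a` is my own, and it covers only Step 1a, not the rest of the stemmer, so treat it as an illustration rather than the reference implementation.

```python
def step1a(word: str) -> str:
    """Sketch of Porter Step 1a only; the first matching (longest) suffix wins."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS, e.g. caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # IES -> I, e.g. ponies -> poni
    if word.endswith("ss"):
        return word       # SS -> SS, e.g. caress stays caress
    if word.endswith("s"):
        return word[:-1]  # S -> (nothing), e.g. cats -> cat, and also this -> thi
    return word


for w in ("caresses", "ponies", "caress", "cats", "this"):
    print(w, "->", step1a(w))
```

The last rule has no way to tell a plural from a word that simply happens to end in "s", which is why 'this' comes out as 'thi'.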

Licensed under CC-BY-SA with attribution
Not affiliated with StackOverflow