Question

I am trying to implement the Porter stemming algorithm, but I got stuck at this passage from its description:

where the square brackets denote arbitrary presence of their contents. Using (VC){m} to denote VC repeated m times, this may again be written as

[C](VC){m}[V].

m will be called the "measure" of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:

m=0    TR,  EE,  TREE,  Y,  BY.
m=1    TROUBLE,  OATS,  TREES,  IVY.
m=2    TROUBLES,  PRIVATE,  OATEN,  ORRERY.

I don't understand what this "measure" is. What does it stand for?


Solution

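The measure m is the number of (VC) groups in the word, where V stands for a run of one or more vowels and C for a run of one or more consonants. In other words, it counts how many times a vowel run is immediately followed by a consonant run. For example,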

"TROUBLES" has:

Optional initial consonants [C] = "TR".

First vowels-consonants group (VC) = "OUBL".

Second vowels-consonants group (VC) = "ES".

Optional ending vowels [V] is empty.

So the measure is two, the number of times (VC) was "matched".
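To make the counting concrete, here is a minimal Python sketch. The function names is_consonant and measure are my own and are not taken from any reference implementation. It assumes Porter's rule that Y counts as a vowel only when it is preceded by a consonant, which is what makes BY come out as m=0 and IVY as m=1 in the examples from the question.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant (word must be lowercase)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of the word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the (VC) groups in the form [C](VC){m}[V]."""
    word = word.lower()
    m = 0
    prev_was_vowel = False
    for i in range(len(word)):
        cons = is_consonant(word, i)
        # A vowel run ends, and a (VC) group starts to close,
        # at the first consonant that follows a vowel.
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


if __name__ == "__main__":
    for w in ["TREE", "BY", "TROUBLE", "IVY", "TROUBLES", "ORRERY"]:
        print(w, measure(w))
```

Each (VC) group is counted once, at the point where the word switches from a vowel to a consonant, so long runs such as "OU" or "BL" in "TROUBLES" do not inflate the count.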

Licensed under: CC-BY-SA with attribution