Question

In understanding string matching: What is the exact difference between word stemming and depluralization?

Or do they mean the same thing?


Solution

First, stemming refers to the process of reducing a word to its stem. However, that may mean a number of different things. Most linguists differentiate between at least two ways of doing it:

  1. Removing grammatical, but not derivational morphemes. Grammatical morphemes are components of the word that are related to its grammatical role in a particular sentence, e.g. number, case, gender, tense, aspect etc.

  2. Removing both grammatical and derivational morphemes. Derivational morphemes are components of the word that are related to its derivation from another word, e.g. the "-er" in "worker" is related to how it is derived (or can be considered as derived) from "work".

Therefore, depluralization, which is a rather unusual term but obviously refers to removing a plural morpheme (such as the "-s" at the end of "computers"), is one part of the first kind of stemming: the removal of grammatical (but not derivational) morphemes.

In English, the morphology of nouns is largely limited to the plural ("computers") and the genitive, i.e. the possessive ("computer's"). As far as English is concerned, depluralization may therefore be seen as (almost) synonymous with (grammatical) stemming, at least to the extent that stemming is applied to nouns and, to some degree, adjectives, as it is in information retrieval, for example. However, wherever verbs are considered, past tense, passive voice and other inflectional forms are subject to stemming (but not to depluralization).
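To make the distinction concrete, here is a minimal Python sketch. It is a toy illustration, not a real stemmer: the function names and the short suffix lists are assumptions made up for this example, and the rules ignore spelling changes ("-ies", "-es") and irregular forms.

```python
def depluralize(word: str) -> str:
    """Strip only a regular plural "-s" (toy rule, hypothetical helper)."""
    if word.endswith("'s") or word.endswith("ss"):
        return word  # genitive, or an "ss" ending that is not a plural
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


# Grammatical (inflectional) endings only; this list is illustrative.
INFLECTIONAL_SUFFIXES = ("ing", "ed", "'s", "s")


def strip_inflection(word: str) -> str:
    """First kind of stemming: remove grammatical morphemes only."""
    if word.endswith("ss"):
        return word
    for suffix in INFLECTIONAL_SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def strip_inflection_and_derivation(word: str) -> str:
    """Second kind of stemming: also remove the derivational "-er"."""
    stem = strip_inflection(word)
    if stem.endswith("er") and len(stem) - 2 >= 3:
        return stem[:-2]
    return stem


if __name__ == "__main__":
    for w in ("computers", "computer's", "worked", "workers"):
        print(w, depluralize(w), strip_inflection(w),
              strip_inflection_and_derivation(w))
```

In this sketch, `depluralize` touches only the plural, `strip_inflection` also handles the genitive and verb endings, and only the last function reduces "workers" all the way to "work".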

Furthermore, in languages other than English, even nouns may have a very rich morphology, including morphemes for such things as case, politeness level, or special kinds of plural (such as dual). And then, depluralization (if you want to use that term at all) would refer to only a very small part of the overall stemming process.

Another related term is lemmatization, which is often used synonymously with stemming. One distinction that many people (myself included) make between the two is this:

  • Stemming is used to refer to a rule-based or machine-learning-based technique that removes parts of a word (mostly endings) that look like grammatical morphemes.

  • Lemmatization is used to refer to a process that does the same, but uses an actual dictionary of the language to deal with highly irregular forms (such as the plural "women").

(But, again, not everyone will agree with this distinction.)
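Under that distinction, the difference can be sketched in a few lines of Python. Everything here is an assumption for illustration: the tiny `IRREGULAR_FORMS` table stands in for a real dictionary of the language, and the single "-s" rule stands in for a real stemmer's rule set.

```python
# Hypothetical mini-dictionary of irregular plurals and their lemmas.
IRREGULAR_FORMS = {
    "women": "woman",
    "children": "child",
    "mice": "mouse",
}


def rule_based_stem(word: str) -> str:
    """Stemming in the above sense: strip what looks like an ending."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def lemmatize(word: str) -> str:
    """Lemmatization in the above sense: consult the dictionary first,
    then fall back to the rule for regular forms."""
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    return rule_based_stem(word)


if __name__ == "__main__":
    for w in ("computers", "women"):
        print(w, rule_based_stem(w), lemmatize(w))
```

The rule alone leaves "women" unchanged because it has no ending that looks like a plural; only the dictionary lookup maps it to "woman".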

OTHER TIPS

They are not the same. There are several approaches to stemming a word; depluralization is just one of them.

Just one quick example: a stemmer might stem "childish" into "child", or the word "stemmer" into "stem", while a depluralization algorithm will not.

Stemming converts multiple words with the same root into one word, e.g. "cats", "catlike" and "catty" into "cat".

Depluralization converts plural words into their singular form, e.g. "cats" into "cat".
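The examples above can be reproduced with a toy Python sketch. The suffix list and both function names are assumptions chosen to fit exactly these words; real stemmers use far larger rule sets and differ in which of these forms they actually conflate.

```python
# Illustrative suffixes only: a mix of derivational and plural endings.
SUFFIXES = ("like", "ish", "er", "s", "y")


def toy_stem(word: str) -> str:
    """Strip one known suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        # Keep at least three characters of the word.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "catt" -> "cat", "stemm" -> "stem"
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def depluralize_only(word: str) -> str:
    """Strip nothing but a regular plural "-s"."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


if __name__ == "__main__":
    for w in ("cats", "catlike", "catty", "childish", "stemmer"):
        print(w, toy_stem(w), depluralize_only(w))
```

Here `toy_stem` maps "cats", "catlike" and "catty" to "cat", while `depluralize_only` changes only "cats" and leaves the other words as they are.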

Additional info on stemming and its algorithms: http://en.wikipedia.org/wiki/Stemming#Algorithms

Licensed under: CC-BY-SA with attribution