Question

Natural Language Processing (NLP), especially for English, has evolved to the stage where stemming would become an archaic technology if a "perfect" lemmatizer existed, since stemmers change the surface form of a word/token into meaningless stems.

Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.

Stemmers

[in]: having
[out]: hav

Lemmatizers

[in]: having
[out]: have
  • So the question is: are English stemmers useful at all today, given that we have a plethora of lemmatization tools for English?

  • If not, then how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocessing?

  • How could the lemmatization task be easily scaled to other languages with morphological structures similar to English's?


Solution

Q1: "[..] are English stemmers useful at all today, given that we have a plethora of lemmatization tools for English?"

Yes. Stemmers are much simpler, smaller and usually faster than lemmatizers, and for many applications their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval. You replace all drive/driving by driv in both the searched documents and the query. You do not care if it is drive or driv or x17a$ as long as it clusters inflectionally related words together.
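To make this concrete, here is a minimal sketch of stem-based indexing (the Porter stemmer and the tiny document collection are my own choices for illustration; any stemmer would do): the inverted index is keyed on stems rather than surface forms, and the query is normalized the same way.

    from collections import defaultdict
    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()

    docs = {
        1: "he was driving to work",
        2: "she drives a red car",
        3: "long drives relax me",
    }

    # Index documents by stem: 'driving' and 'drives' collapse to the same
    # key, whatever string the stemmer happens to produce for it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stemmer.stem(token)].add(doc_id)

    # The query is normalized the same way before lookup.
    print(index[stemmer.stem("drive")])  # all three documents match

Whether the shared key ends up being drive, driv, or x17a$ makes no difference to the retrieval result.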

Q2: "[..] how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocessing?"

What is your definition of a lemma? Does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take semantics into account?

If you want to include derivation (which most people would say includes verbing nouns, etc.), then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want change (as in change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? How about nerve - unnerve, earth - unearth - earthling, ...? It really depends on the application.

If you take semantics into account (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some applications may not care about this at all, some might want to distinguish basic semantics, and some might want it fine-grained.

Q3: "How could the lemmatization task be easily scaled to other languages with morphological structures similar to English's?"

What do you mean by "morphological structures similar to English's"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, templatic, ...).

With the possible exception of agglutinative languages, I would argue that a lookup table (say, a compressed trie) is the best solution, possibly with some backup rules for unknown words such as proper names. The lookup is followed by some kind of disambiguation, ranging from trivial (take the first analysis, or the first one consistent with the word's POS tag) to much more sophisticated. The more sophisticated disambiguators are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although combinations of machine learning and manually created rules have been used as well (see e.g. this).
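As a rough, invented illustration of that pipeline (a toy table standing in for the compressed trie, trivial POS-consistent disambiguation, and a backup rule for unknown words):

    # Toy lookup table: surface form -> list of (lemma, POS) analyses.
    # A real table would be generated from a morphological description
    # and stored compactly, e.g. in a compressed trie.
    LOOKUP = {
        "drove":  [("drive", "VERB"), ("drove", "NOUN")],  # past tense vs. 'a drove of cattle'
        "drives": [("drive", "VERB"), ("drive", "NOUN")],
        "has":    [("have", "VERB")],
        "saw":    [("see", "VERB"), ("saw", "NOUN")],
    }

    def lemmatize(form, pos_tag=None):
        analyses = LOOKUP.get(form.lower())
        if analyses is None:
            # Backup rule for unknown words (proper names, typos, ...):
            # here we simply return the surface form unchanged.
            return form
        if pos_tag is not None:
            # Trivial disambiguation: first analysis consistent with the POS tag.
            for lemma, pos in analyses:
                if pos == pos_tag:
                    return lemma
        # Even more trivial: just take the first analysis.
        return analyses[0][0]

    print(lemmatize("drove", "VERB"))  # drive
    print(lemmatize("drove", "NOUN"))  # drove
    print(lemmatize("Prague"))         # Prague (unknown word, backup rule)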

Obviously, for most languages you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, you can use two-level morphology. Or you can do something in between, such as Hana (myself). (Note that these are all full morphological analyzers that include lemmatization.) Or you can learn the lemmatizer in an unsupervised manner à la Yarowsky and Wicentowski, possibly with manual post-processing to correct the most frequent words.

There are way too many options and it really all depends on what you want to do with the results.

OTHER TIPS

One classical application of either stemming or lemmatization is the improvement of search engine results: By applying stemming (or lemmatization) to the query as well as (prior to indexing) to all tokens indexed, users searching for, say, "having" are able to find results containing "has".

(Arguably, verbs are somewhat uncommon in most search queries, but the same principle applies to nouns, especially in languages with a rich noun morphology.)

For the purpose of search result improvement, it is not actually important whether the stem (or lemma) is meaningful ("have") or not ("hav"). It only needs to be able to represent the word in question and all its inflectional forms. In fact, some systems use numbers or other kinds of id-strings instead of a stem or lemma (or base form or whatever it may be called).
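As a minimal sketch of this idea (the documents and the choice of normalizer are invented for illustration), the index below stores opaque integer ids for the equivalence classes; whether the underlying key is have, hav, or a number is irrelevant, and the stemmer could be swapped for a lemmatizer with no other change.

    from collections import defaultdict
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    class_ids = {}  # stem -> opaque integer id

    def normalize(token):
        # Map a token to the id of its equivalence class; the id itself
        # carries no meaning, it only has to be consistent.
        key = stemmer.stem(token.lower())
        return class_ids.setdefault(key, len(class_ids))

    docs = {1: "she drives to work", 2: "long drives are tiring"}

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)

    # The query goes through the same normalization before lookup.
    print(index[normalize("driving")])  # {1, 2}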

Hence, this is an example of an application where stemmers (by your definition) are as good as lemmatizers.


However, I am not quite convinced that your (implied) definitions of "stemmer" and "lemmatizer" are generally accepted. I am not sure whether there is any generally accepted definition of these terms, but the way I define them is as follows:

Stemmer: A function that reduces inflectional forms to stems or base forms, using rules and lists of known suffixes.

Lemmatizer: A function that performs the same reduction, but using a comprehensive full-form dictionary to be able to deal with irregular forms.

Based on these definitions, a lemmatizer is essentially a higher-quality (and more expensive) version of a stemmer.
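To make the contrast concrete, here is a deliberately tiny, invented sketch of both definitions (the suffix list and the full-form dictionary are far smaller than in any real tool): the rule-based stemmer handles regular suffixes but stumbles on irregular forms, while the dictionary-backed lemmatizer covers them.

    # Stemmer in the sense above: rules over a list of known suffixes.
    SUFFIXES = ["ing", "ed", "es", "s"]

    def stem(word):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    # Lemmatizer in the sense above: a full-form dictionary covering
    # irregular forms, falling back to the stemmer for the rest.
    FULL_FORMS = {
        "has": "have", "had": "have", "having": "have",
        "went": "go", "gone": "go", "goes": "go",
    }

    def lemmatize(word):
        return FULL_FORMS.get(word, stem(word))

    print(stem("having"), lemmatize("having"))  # hav have
    print(stem("went"), lemmatize("went"))      # went go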

The answer is highly dependent on the task or the specific field of study within Natural Language Processing (NLP) that we are talking about.

It is worth pointing out that for some specific tasks, such as Sentiment Analysis (a popular sub-field of NLP), it has been shown that using a stemmer or lemmatizer as a preprocessing feature when training a machine learning model does not have a noticeable effect on the model's accuracy, no matter how good the tool is. Even though it may improve performance slightly, there are more important features, such as dependency parses, with considerably more potential to be exploited in such systems.

It is also important to take the characteristics of the language we are working with into consideration.

Stemming just removes or truncates the last few characters of a word, often producing stems with incorrect meanings or spellings. Lemmatization considers the context and converts the word to its meaningful base form, called a lemma. Sometimes the same word can have multiple different lemmas, so we should identify the part-of-speech (POS) tag of the word in its specific context. Here are examples illustrating the differences and use cases (a short code sketch after the list shows them in practice):

  1. If you lemmatize the word 'Caring', it would return 'Care'. If you stem it, it would return 'Car', which is erroneous.
  2. If you lemmatize the word 'Stripes' in a verb context, it would return 'Strip'. If you lemmatize it in a noun context, it would return 'Stripe'. If you just stem it, it would simply return 'Strip'.
  3. You would get the same results whether you lemmatize or stem words such as walking, running, swimming... to walk, run, swim, etc.
  4. Lemmatization is computationally expensive since it involves look-up tables and the like. If you have a large dataset and performance is an issue, go with stemming. Remember that you can also add your own rules to stemming. If accuracy is paramount and the dataset isn't huge, go with lemmatization.
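A short NLTK sketch of cases 1-3 (assuming nltk and its WordNet data are installed; note that the exact stems depend on the stemmer chosen: the aggressive Lancaster stemmer yields the truncated 'car' discussed above, while the milder Porter stemmer restores the final 'e' for this particular word):

    from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer
    # Requires: pip install nltk, then nltk.download("wordnet")

    lancaster = LancasterStemmer()
    porter = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Case 1: stemming may truncate to a non-word; lemmatization does not.
    print(lancaster.stem("caring"))              # car
    print(porter.stem("caring"))                 # care
    print(lemmatizer.lemmatize("caring", "v"))   # care

    # Case 2: the POS tag decides between lemmas.
    print(lemmatizer.lemmatize("stripes", "v"))  # strip
    print(lemmatizer.lemmatize("stripes", "n"))  # stripe

    # Case 3: for regular forms, stemming and lemmatization agree.
    print(porter.stem("walking"))                # walk
    print(lemmatizer.lemmatize("walking", "v"))  # walk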
Licensed under: CC-BY-SA with attribution