Question

I'm looking for a library which can perform a morphological analysis on German words, i.e. it converts any word into its root form and provides meta information about the analysed word.

For example:

gegessen -> essen
wurde [...] gefasst -> fassen
Häuser -> Haus
Hunde -> Hund

My wishlist:

  • It has to work with both nouns and verbs.
  • I'm aware that this is a very hard task given the complexity of the German language, so I'm also looking for libraries which provide only approximations or may only be 80% accurate.
  • I'd prefer libraries which don't work with dictionaries, but again I'm open to compromise given the circumstances.
  • I'd also prefer C/C++/Delphi Windows libraries, because that would make them easier to integrate but .NET, Java, ... will also do.
  • It has to be a free library. (L)GPL, MPL, ...

EDIT: I'm aware that there is no way to perform a morphological analysis without any dictionary at all, because of the irregular words. When I say I prefer a library without a dictionary, I mean those full-blown dictionaries which map each and every word:

arbeite -> arbeiten
arbeitest -> arbeiten
arbeitet -> arbeiten
arbeitete -> arbeiten
arbeitetest -> arbeiten
arbeiteten -> arbeiten
arbeitetet -> arbeiten
gearbeitet -> arbeiten
arbeite -> arbeiten
... 

Those dictionaries have several drawbacks, including the huge size and the inability to process unknown words.

Of course all exceptions can only be handled with a dictionary:

esse -> essen
isst -> essen
eßt -> essen
aß -> essen
aßt -> essen
aßen -> essen
...

(My mind is spinning right now :) )

Was it helpful?

Solution

I think you are looking for a "stemming algorithm".

Martin Porter's approach is well known among linguists. The Porter stemmer is basically an affix stripping algorithm, combined with a few substitution rules for those special cases.

Most stemmers deliver stems that are linguistically "incorrect". For example: both "beautiful" and "beauty" can result in the stem "beauti", which, of course, is not a real word. This doesn't matter, though, if you're using those stems to improve search results in information retrieval systems. Lucene comes with support for the Porter stemmer, for instance.

Porter also devised a simple programming language for developing stemmers, called Snowball.

There are also stemmers for German available in Snowball. A C version, generated from the Snowball source, is also available on the website, along with a plain text explanation of the algorithm.

Here's the German stemmer in Snowball: http://snowball.tartarus.org/algorithms/german/stemmer.html
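
For a quick impression of what such a stemmer produces, here is a minimal Java sketch using the Snowball German stemmer as bundled with Lucene. The class org.tartarus.snowball.ext.GermanStemmer and its setCurrent/stem/getCurrent methods are taken from the lucene-analyzers-common / libstemmer Java distributions; exact artifact and package names may differ between versions, so treat this as a sketch rather than a drop-in snippet:

    import org.tartarus.snowball.ext.GermanStemmer;

    public class SnowballDemo {
        public static void main(String[] args) {
            // Snowball's generated German stemmer (class name assumed from the
            // lucene-analyzers-common / libstemmer Java distributions).
            GermanStemmer stemmer = new GermanStemmer();
            for (String word : new String[] {"Häuser", "Hunde", "gegessen", "arbeitete"}) {
                stemmer.setCurrent(word.toLowerCase());
                stemmer.stem();
                System.out.println(word + " -> " + stemmer.getCurrent());
            }
        }
    }

As described above, the result is a stem, not a dictionary lemma: plural nouns like Häuser and Hunde typically come out close to haus and hund, while an irregular participle like gegessen keeps its ge- prefix instead of being mapped to essen.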

If you're looking for the corresponding stem of a word as you would find it in a dictionary, along with information on the part of speech, you should Google for "lemmatization".

Other tips

(Disclaimer: I'm linking my own Open Source projects here)

This data, in the form of a word list, is available at http://www.danielnaber.de/morphologie/. It could be combined with a word splitter library (like jwordsplitter) to cover compound nouns that are not in the list.
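
For the compound-splitting part, here is a minimal sketch of how jwordsplitter is typically used. The class de.danielnaber.jwordsplitter.GermanWordSplitter and its splitWord method are taken from the project's documentation; treat the exact constructor flag and return values as assumptions and check the version you download:

    import java.util.List;
    import de.danielnaber.jwordsplitter.GermanWordSplitter;

    public class SplitDemo {
        public static void main(String[] args) throws Exception {
            // "true" is assumed to hide linking elements (Fugenelemente),
            // e.g. the "s" in "Hochzeitstorte".
            GermanWordSplitter splitter = new GermanWordSplitter(true);
            List<String> parts = splitter.splitWord("Hochzeitstorte");
            System.out.println(parts);  // something like [Hochzeit, Torte]
        }
    }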

Or just use LanguageTool from Java, which has the word list embedded in the form of a compact finite state machine (plus it also includes compound splitting).
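
If you go the LanguageTool route, a hedged sketch of how lemmas can be read from its Java API looks like this (class and method names such as JLanguageTool.analyzeText and AnalyzedToken.getLemma are from the public API but may differ slightly between releases):

    import java.util.List;
    import org.languagetool.AnalyzedSentence;
    import org.languagetool.AnalyzedToken;
    import org.languagetool.AnalyzedTokenReadings;
    import org.languagetool.JLanguageTool;
    import org.languagetool.language.GermanyGerman;

    public class LemmaDemo {
        public static void main(String[] args) throws Exception {
            JLanguageTool lt = new JLanguageTool(new GermanyGerman());
            // analyzeText tokenizes, tags and lemmatizes using the embedded
            // finite-state morphology mentioned above.
            List<AnalyzedSentence> sentences = lt.analyzeText("Die Hunde wurden gefasst.");
            for (AnalyzedSentence sentence : sentences) {
                for (AnalyzedTokenReadings readings : sentence.getTokensWithoutWhitespace()) {
                    for (AnalyzedToken reading : readings.getReadings()) {
                        System.out.println(readings.getToken() + " -> "
                                + reading.getLemma() + " (" + reading.getPOSTag() + ")");
                    }
                }
            }
        }
    }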

You asked this a while ago, but you might still want to give morphisto a try.

Here's an example on how to do it in Ubuntu:

  1. Install the Stuttgart finite-state transducer tools

    $ sudo apt-get install sfst

  2. Download the morphisto morphology, e.g. morphisto-02022011.a

  3. Compact it, e.g.

    $ fst-compact morphisto-02022011.a morphisto-02022011.ac

  4. Use it! Here are some examples:

    $ echo Hochzeit | fst-proc morphisto-02022011.ac
    ^Hochzeit/hohZeit<+NN>/hohZeit<+NN>/hohZeit<+NN>/hohZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>$

    $ echo gearbeitet | fst-proc morphisto-02022011.ac
    ^gearbeitet/arbeiten<+ADJ>/arbeiten<+ADJ>/arbeiten<+V>$

Have a look at LemmaGen (http://lemmatise.ijs.si/), a project that aims at providing a standardized open source multilingual platform for lemmatisation. It does exactly what you want.

I don't think that this can be done without a dictionary.

Rules-based approaches will invariably trip over things like

gegessen -> essen
gegangen -> angen

(Note to people who don't speak German: the correct solution in the second case is "gehen".)
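
To make that concrete, here is a tiny self-contained Java sketch of such a rule-based approach; the two rules are invented purely for illustration:

    public class NaiveRules {

        /** A made-up rule set for German past participles, for illustration only. */
        static String toInfinitive(String participle) {
            // Rule "learned" from gegessen -> essen: drop the "geg" prefix.
            if (participle.startsWith("geg") && participle.endsWith("en")) {
                return participle.substring(3);
            }
            // Rule for regular weak verbs: drop "ge", replace the final "t" with "n".
            if (participle.startsWith("ge") && participle.endsWith("et")) {
                return participle.substring(2, participle.length() - 1) + "n";
            }
            return participle;
        }

        public static void main(String[] args) {
            System.out.println(toInfinitive("gegessen"));   // essen    (correct)
            System.out.println(toInfinitive("gearbeitet")); // arbeiten (correct)
            System.out.println(toInfinitive("gegangen"));   // angen    (wrong, should be gehen)
        }
    }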

Have a look at Leo. They offer the data which you are after; maybe it gives you some ideas.

One can use morphisto with ParZu (https://github.com/rsennrich/parzu). ParZu is a dependency parser for German.

This means that ParZu also disambiguates the output from morphisto.

There are some tools out there which you could use, like the morphology component in the Matetools, Morphisto etc. But the pain is integrating them into your tool chain. A very good wrapper around quite a lot of these linguistic tools is DKPro (https://dkpro.github.io/dkpro-core/), a framework using UIMA. It allows you to write your own preprocessing pipeline using different linguistic tools from different sources, which are all downloaded automatically onto your computer and talk to each other. You can use it from Java, Groovy or even Jython. DKPro provides easy access to two morphological analyzers, MateMorphTagger and SfstAnnotator.
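
A rough uimaFIT-style sketch of what such a DKPro Core pipeline can look like. The component and type names here (OpenNlpSegmenter, SfstAnnotator, MorphologicalFeatures and its value feature) are assumptions based on the DKPro Core documentation, so check the exact packages and types for your release:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import static org.apache.uima.fit.util.JCasUtil.select;

    import org.apache.uima.fit.factory.JCasFactory;
    import org.apache.uima.fit.pipeline.SimplePipeline;
    import org.apache.uima.jcas.JCas;

    import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
    import de.tudarmstadt.ukp.dkpro.core.sfst.SfstAnnotator;
    // The annotation type produced by the analyzer is assumed here; depending on
    // the DKPro Core version it may be Morpheme or MorphologicalFeatures.
    import de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.morph.MorphologicalFeatures;

    public class DkproMorphDemo {
        public static void main(String[] args) throws Exception {
            JCas jcas = JCasFactory.createJCas();
            jcas.setDocumentLanguage("de");
            jcas.setDocumentText("Die Häuser wurden gebaut.");

            // Models are downloaded automatically by DKPro Core on first use.
            SimplePipeline.runPipeline(jcas,
                    createEngineDescription(OpenNlpSegmenter.class),
                    createEngineDescription(SfstAnnotator.class));

            for (MorphologicalFeatures m : select(jcas, MorphologicalFeatures.class)) {
                // getValue() is assumed to hold the raw morphological analysis string.
                System.out.println(m.getCoveredText() + " -> " + m.getValue());
            }
        }
    }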

You don't want to use a stemmer like the Porter stemmer: it reduces the word form in a way which does not make any sense linguistically and does not have the behaviour you describe. If you only want to find the basic form (the infinitive for a verb, the nominative singular for a noun), then you should use a lemmatizer. You can find a list of German lemmatizers here. TreeTagger is widely used. You can also use the more complex analysis provided by a morphological analyzer like SMORS. It will give you something like this (example from the SMORS website):

And here is the analysis of "unübersetzbarstes", showing prefixation, suffixation and gradation:

    un<PREF>übersetzen<V>bar<SUFF><+ADJ><Sup><Neut><Nom><Sg><St>

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow