Question

I'm looking for a way to write a program that creates short German sentences from a restricted letter set. The sentences can be nonsense but should be grammatically correct. The following examples only contain the letters "aeilmnost":

  • "Antonia ist mit Tina im Tal."
  • "Tamina malt mit lila Tinte Enten."
  • "Tina nimmt alle Tomaten mit."

For this task I need a dictionary like this one (found in the answer to "Where can I find a parsable list of German words?"). The research area for programmatically creating text is NLG (Natural Language Generation). On the NLG-Wiki I found a large table of NLG systems and picked two from the list that could be appropriate.

Have you worked with an NLG library, and do you have advice on which one to use for building short sentences with a letter-set restriction? Can you recommend a paper on this topic?


Solution

Grammatical correctness is a pretty fuzzy area, since grammar is not as strictly defined as one might think. What you really want here, though, is a part-of-speech tagger and a Markov chain.

Specifically, a Markov chain says that given a certain state (the first word, for instance) there's a certain chance of moving on to another state (the next word). Markov chains are relatively easy to write from scratch, but I've got a gist in Python that shows how they work if you want an example.
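
To make that concrete, here is a minimal sketch of a word-level Markov chain in Python. The tiny corpus and function names are invented for illustration (this is not the gist mentioned above); real training data would be much larger.

```python
import random
from collections import defaultdict

def train(tokens):
    # Record, for each word, the words observed to follow it.
    # Duplicates are kept, so sampling reproduces the frequencies.
    chain = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        chain[prev].append(nxt)
    return chain

def walk(chain, start, length=6):
    # Walk the chain: each next word is sampled from the successors
    # of the current word.
    word, sentence = start, [start]
    for _ in range(length - 1):
        if word not in chain:
            break
        word = random.choice(chain[word])
        sentence.append(word)
    return " ".join(sentence)

corpus = "tina malt enten tina nimmt alle tomaten mit".split()
print(walk(train(corpus), "tina"))
```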

Once you've got that, I would suggest a part-of-speech-based Markov chain, combined with simply checking whether words are constructed from your desired character set. In general the algorithm would go something like this (a sketch of the loop follows the list):

  1. Pick the first word at random, checking that it is constructed solely from your desired set of characters.
  2. Use the Markov chain to predict the next word.
  3. Check that the word is an appropriate part of speech and that it conforms to the desired character set.
  4. If not, predict another word until one passes both checks.
  5. If so, repeat from step 2 until the sentence is complete.
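
A rough sketch of that loop, assuming a chain trained as in the earlier example; the part-of-speech check from step 3 is stubbed out as a hypothetical is_valid_pos helper, since its details depend on which tagger you use:

```python
import random

ALLOWED = set("aeilmnost")

def fits(word):
    # The character test from steps 1 and 3: only allowed letters.
    return set(word.lower()) <= ALLOWED

def is_valid_pos(word, position):
    # Hypothetical stub: a real version would ask a POS tagger
    # whether `word` is an acceptable part of speech at this point.
    return True

def generate(chain, length=5):
    # Step 1: pick a conforming first word at random.
    word = random.choice([w for w in chain if fits(w)])
    sentence = [word]
    while len(sentence) < length:
        # Steps 2-4: sample successors until one passes both checks.
        candidates = [w for w in chain.get(word, [])
                      if fits(w) and is_valid_pos(w, len(sentence))]
        if not candidates:
            break  # dead end; in practice you might restart instead
        word = random.choice(candidates)
        sentence.append(word)  # step 5: repeat from step 2
    return " ".join(sentence)
```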

Hope that's what you're looking for. Let me know if you have any more questions.

Other Tips

As Slater Tyranus already said, Markov chains certainly form the basis of this task. I am going to suggest a more heavy-duty approach. It is considerably more work, but is likely to give much better results in terms of grammatical correctness.

Language model based on PCFG parse trees: A language model works by assigning a probability to a sequence of words; one based on a PCFG (probabilistic context-free grammar) additionally scores the parse trees behind those sequences, so sentences generated from it are grammatically well-formed by construction. It requires training data (e.g. a treebank), however, in order to be built first. In your case, the training process should disregard words containing letters outside the limited set.
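
As an illustration of the idea only: the miniature grammar below is invented for the example, whereas a real system would estimate rules and probabilities from a treebank, keeping only lexicon entries that fit the letter set.

```python
import random

# Toy grammar: nonterminal -> list of (right-hand side, probability).
# Invented for illustration; all terminals already fit "aeilmnost".
PCFG = {
    "S":    [(["NP", "VP"], 1.0)],
    "NP":   [(["Name"], 0.6), (["Det", "Noun"], 0.4)],
    "VP":   [(["Verb", "NP"], 1.0)],
    "Name": [(["Tina"], 0.5), (["Antonia"], 0.5)],
    "Det":  [(["alle"], 1.0)],
    "Noun": [(["Tomaten"], 0.5), (["Enten"], 0.5)],
    "Verb": [(["malt"], 0.5), (["nimmt"], 0.5)],
}

def sample(symbol):
    # Expand nonterminals by sampling a rule; terminals pass through.
    if symbol not in PCFG:
        return [symbol]
    rhss, probs = zip(*PCFG[symbol])
    rhs = random.choices(rhss, weights=probs)[0]
    return [word for part in rhs for word in sample(part)]

print(" ".join(sample("S")))
```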

While theoretically a language model based on parse trees is much more likely to serve your purpose, there is one caveat: given the kind of letter-based restriction you have, data sparsity will certainly rear its ugly head. Backoff techniques (e.g. Katz's backoff model) can help a bit, but success will essentially depend on whether or not you can train on enough data.
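
To illustrate the backoff idea, here is a sketch using the simpler "stupid backoff" scheme rather than Katz's discounting (which needs considerably more bookkeeping); the tokens and the alpha value are illustrative:

```python
from collections import Counter

def make_scorer(tokens, alpha=0.4):
    # Bigram score with a crude backoff: if the bigram was never
    # seen, fall back to the unigram estimate, discounted by alpha.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def score(prev, word):
        if (prev, word) in bigrams:
            return bigrams[(prev, word)] / unigrams[prev]
        return alpha * unigrams[word] / total

    return score

tokens = "tina nimmt alle tomaten mit tina malt enten".split()
score = make_scorer(tokens)
print(score("tina", "nimmt"))  # seen bigram
print(score("tina", "enten"))  # unseen: backed-off estimate
```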

As far as readily available parsers are concerned, the Stanford NLP group provides a German parser based on the Negra corpus, as mentioned on their home page.

Licensed under: CC-BY-SA with attribution