Question

I'm reading the book 'The Practice of Programming' by Brian W. Kernighan and Rob Pike. Chapter 3 provides the algorithm for a Markov chain approach that reads a source text and uses it to generate random text that "reads well" (meaning the output is closer to proper-sounding English than gibberish):

set w1 and w2 to the first two words in the source text
print w1 and w2
loop:
   randomly choose w3, one of the successors of prefix w1 and w2 in the source text
   print w3
   replace w1 and w2 by w2 and w3
   repeat loop
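
For concreteness, here is a minimal Python sketch of that loop as I understand it (the dictionary-of-lists representation and the function names are my own, not the book's):

    import random
    from collections import defaultdict

    def build_chain(words):
        # Map each (w1, w2) prefix to the list of words that follow it.
        chain = defaultdict(list)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            chain[(w1, w2)].append(w3)
        return chain

    def generate(words, n=100):
        chain = build_chain(words)
        w1, w2 = words[0], words[1]        # the first two words of the source
        out = [w1, w2]
        for _ in range(n):
            successors = chain.get((w1, w2))
            if successors is None:         # the case my question is about
                break
            w3 = random.choice(successors)
            out.append(w3)
            w1, w2 = w2, w3                # shift the prefix window
        return " ".join(out)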

My question is: what's the standard way to handle the situation where the new prefix (the old w2 and w3) has no successor in the source text?

Many thanks in advance!


Solution

Here are your options:

  1. Choose a word at random. (Always works.)
  2. Choose a new w2 at random. (Can conceivably still loop.)
  3. Back up to the previous w1 and w2. (Can conceivably still loop.)

I'd probably go with trying #2 or #3 once, then fall back to #1, which will always work.
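
A minimal Python sketch of that combined strategy, assuming the chain is a dict mapping (w1, w2) prefixes to successor lists and vocab is the list of all words in the source text (both names are illustrative):

    import random

    def next_word(chain, vocab, w1, w2):
        successors = chain.get((w1, w2))
        if successors:
            return random.choice(successors)
        # Option 2, tried once: replace w2 with a random word.
        successors = chain.get((w1, random.choice(vocab)))
        if successors:
            return random.choice(successors)
        # Option 1 as the final fallback: a random word always works.
        return random.choice(vocab)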

OTHER TIPS

The situation you are describing considers 3-grams, that is, the statistical frequency of 3-tuples in a given dataset. To create a Markov matrix with no absorbing states, that is, no points where drawing w3 from f_2(w1, w2) leaves you with an empty successor set f_2(w2, w3), you'll have to extend the possibilities. A generalized extension of @ThomasW's answer would be:

  1. If the set predictor f_2(w1, w2) is non-empty, draw from it
  2. If the set predictor f_1(w2) is non-empty, draw from it
  3. If the set predictor f_0() is non-empty, draw from it

That is, draw as usual from the 3-gram set, then from the 2-gram set, then from the 1-gram set. At the last step you'll simply be drawing a word at random, weighted by its statistical frequency.
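
In Python, that backoff might look like the following sketch (the names f2, f1 and f0 mirror the predictors above; keeping duplicate entries in the lists makes random.choice draw weighted by statistical frequency):

    import random
    from collections import defaultdict

    def build_predictors(words):
        f2, f1 = defaultdict(list), defaultdict(list)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            f2[(w1, w2)].append(w3)   # 3-gram predictor: (w1, w2) -> w3
            f1[w2].append(w3)         # 2-gram predictor: w2 -> w3
        f0 = list(words)              # 1-gram predictor: all words, by frequency
        return f2, f1, f0

    def draw(f2, f1, f0, w1, w2):
        # Back off: 3-gram set, then 2-gram set, then 1-gram set.
        if f2.get((w1, w2)):
            return random.choice(f2[(w1, w2)])
        if f1.get(w2):
            return random.choice(f1[w2])
        return random.choice(f0)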

I believe that this is a serious problem in NLP, one without a straightforward solution. One approach is to tag the parts of speech in addition to the actual words, in order to generalize the mappings. Using parts of speech, the program can at least predict which part of speech should follow w2 and w3 when there is no precedent for the exact word sequence. "Once this mapping has been performed on training examples, we can train a tagging model on these training examples. Given a new test sentence we can then recover the sequence of tags from the model, and it is straightforward to identify the entities identified by the model." (From Columbia NLP course notes.)
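
As a hypothetical illustration of that idea using NLTK (the function names are mine, the sketch assumes NLTK's tokenizer and tagger data are installed, and tagging a two-word fragment is less accurate than tagging a full sentence):

    import random
    from collections import defaultdict
    import nltk  # assumes the punkt and averaged_perceptron_tagger data are downloaded

    def build_tag_chain(text):
        # Map each pair of POS tags to the words that followed that tag pair.
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        chain = defaultdict(list)
        for (_, t1), (_, t2), (w3, _) in zip(tagged, tagged[1:], tagged[2:]):
            chain[(t1, t2)].append(w3)
        return chain

    def fallback_word(tag_chain, w1, w2):
        # Draw a word that followed the same tag pair in the training text.
        (_, t1), (_, t2) = nltk.pos_tag([w1, w2])
        candidates = tag_chain.get((t1, t2))
        return random.choice(candidates) if candidates else None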

Licensed under: CC-BY-SA with attribution