Question

I'm reading the book 'The Practice of Programming' by Brian W. Kernighan and Rob Pike. Chapter 3 provides the algorithm for a Markov chain approach that reads a source text and uses it to generate random text that "reads well" (meaning the output is closer to proper-sounding English than gibberish):

set w1 and w2 to the first two words in the source text
print w1 and w2
loop:
   randomly choose w3, one of the successors of prefix w1 and w2 in the source text
   print w3
   replace w1 and w2 by w2 and w3
   repeat loop
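
For concreteness, here is a minimal Python sketch of that loop as I understand it (the dictionary-of-lists representation and the function names are my own, not the book's):

    import random
    from collections import defaultdict

    def build_chain(words):
        # Map each (w1, w2) prefix to the list of words that follow it.
        chain = defaultdict(list)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            chain[(w1, w2)].append(w3)
        return chain

    def generate(words, n=100):
        chain = build_chain(words)
        w1, w2 = words[0], words[1]        # the first two words of the source
        out = [w1, w2]
        for _ in range(n):
            successors = chain.get((w1, w2))
            if successors is None:         # the case my question is about
                break
            w3 = random.choice(successors)
            out.append(w3)
            w1, w2 = w2, w3                # shift the prefix window
        return " ".join(out)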

My question is: what's the standard way to handle the situation where the new prefix (the old w2 and w3) has no successor in the source text?

Many thanks in advance!


Solution

Here are your options:

  1. Choose a word at random. (Always works.)
  2. Choose a new w2 at random. (Can conceivably still loop.)
  3. Back up to the previous w1 and w2. (Can conceivably still loop.)

I'd probably go with trying #2 or #3 once, then fall back to #1, which will always work.
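
A minimal Python sketch of that combined strategy, assuming the chain is a dict mapping (w1, w2) prefixes to successor lists and vocab is the list of all words in the source text (both names are illustrative):

    import random

    def next_word(chain, vocab, w1, w2):
        successors = chain.get((w1, w2))
        if successors:
            return random.choice(successors)
        # Option 2, tried once: replace w2 with a random word.
        successors = chain.get((w1, random.choice(vocab)))
        if successors:
            return random.choice(successors)
        # Option 1 as the final fallback: a random word always works.
        return random.choice(vocab)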

OTHER TIPS

The situation you are describing considers 3-grams, that is, the statistical frequency of 3-tuples in a given dataset. To create a Markov matrix with no absorbing states, that is, no points where drawing w3 from f_2(w1, w2) leaves you with an empty successor set f_2(w2, w3), you'll have to extend the possibilities. A generalized extension of @ThomasW's answer would be:

  1. If the set predictor f_2(w1, w2) is non-empty, draw from it
  2. If the set predictor f_1(w2) is non-empty, draw from it
  3. If the set predictor f_0() is non-empty, draw from it

That is, draw as usual from the 3-gram set, then from the 2-gram set, then from the 1-gram set. At the last step you'll simply be drawing a word at random, weighted by its statistical frequency.
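
In Python, that backoff might look like the following sketch (the names f2, f1 and f0 mirror the predictors above; keeping duplicate entries in the lists makes random.choice draw weighted by statistical frequency):

    import random
    from collections import defaultdict

    def build_predictors(words):
        f2, f1 = defaultdict(list), defaultdict(list)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            f2[(w1, w2)].append(w3)   # 3-gram predictor: (w1, w2) -> w3
            f1[w2].append(w3)         # 2-gram predictor: w2 -> w3
        f0 = list(words)              # 1-gram predictor: all words, by frequency
        return f2, f1, f0

    def draw(f2, f1, f0, w1, w2):
        # Back off: 3-gram set, then 2-gram set, then 1-gram set.
        if f2.get((w1, w2)):
            return random.choice(f2[(w1, w2)])
        if f1.get(w2):
            return random.choice(f1[w2])
        return random.choice(f0)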

I believe that this is a serious problem in NLP, one without a straightforward solution. One approach is to tag the parts of speech in addition to the actual words, in order to generalize the mappings. Using parts of speech, the program can at least predict which part of speech should follow w2 and w3 when there is no precedent for the exact word sequence. "Once this mapping has been performed on training examples, we can train a tagging model on these training examples. Given a new test sentence we can then recover the sequence of tags from the model, and it is straightforward to identify the entities identified by the model." (From Columbia NLP course notes.)
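
As a hypothetical illustration of that idea using NLTK (the function names are mine, the sketch assumes NLTK's tokenizer and tagger data are installed, and tagging a two-word fragment is less accurate than tagging a full sentence):

    import random
    from collections import defaultdict
    import nltk  # assumes the punkt and averaged_perceptron_tagger data are downloaded

    def build_tag_chain(text):
        # Map each pair of POS tags to the words that followed that tag pair.
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        chain = defaultdict(list)
        for (_, t1), (_, t2), (w3, _) in zip(tagged, tagged[1:], tagged[2:]):
            chain[(t1, t2)].append(w3)
        return chain

    def fallback_word(tag_chain, w1, w2):
        # Draw a word that followed the same tag pair in the training text.
        (_, t1), (_, t2) = nltk.pos_tag([w1, w2])
        candidates = tag_chain.get((t1, t2))
        return random.choice(candidates) if candidates else None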

Licensed under: CC-BY-SA with attribution