getting a grammar to read more than one keyword in the text

Question

I'm good with TDD, but here your whole testing and alternative-selecting infrastructure really gets in the way of seeing just where the grammar is and what's going on with it. If I strip away all the extra machinery, I see your grammar is just:

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
g1 = OneOrMore(Group(kw + body1))

The first issue I see is your definition of body1:

body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')

You are on the right track with a negative lookahead, but for it to work in pyparsing, you have to put it at the beginning of the expression, not at the end. Think of it as "before I match another valid word, I will first rule out that it is a keyword":

body1 = delimitedList(OneOrMore(~kw + Word(alphas + nums)))('Body')

(Why is this a delimitedList, by the way? delimitedList is usually reserved for true lists with delimiters, such as comma-delimited arguments to a program function. All this does is accept any commas that might be mixed into the body, which should be handled more straightforwardly using a list of punctuation.)
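To make the placement issue concrete, here is a minimal sketch on a made-up one-line input. The names word, body_end and body_start and the sample text are mine, for illustration only. With the lookahead at the end, OneOrMore has already consumed the keyword's word by the time ~kw is checked; with the lookahead first, the loop stops before the keyword:

```python
from pyparsing import Combine, Literal, OneOrMore, Word, alphas, nums

kw = Combine(Word(alphas + nums) + Literal(';'))
word = Word(alphas + nums)  # illustrative helper name, not in the original grammar

# Lookahead at the end: the words are consumed first, and ~kw is only
# checked once, after the repetition has already stopped at the ';'.
body_end = OneOrMore(word) + ~kw

# Lookahead at the start: every candidate word is tested against kw
# before it is consumed.
body_start = OneOrMore(~kw + word)

text = "one two STOP; three"  # made-up sample input

print(body_end.parseString(text).asList())    # ['one', 'two', 'STOP']
print(body_start.parseString(text).asList())  # ['one', 'two']
```

In the first version the word part of the keyword ("STOP") is swallowed into the body, so the following kw can never match.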

Here is my test version of your code:

from pyparsing import *

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))

msg = [  """NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
    """,
             '',
          ][0]

result = g1.parseString(msg)
# we expect multiple groups, each containing "KEY" and "Body" names,
# so iterate over groups, and dump the contents of each
for res in result:
    print(res.dump())

I still get the same results as you: only the first keyword matches. To see where the disconnect is happening, I use scanString, which returns not only the matched tokens, but also the start and end locations of the match:

result,start,end = next(g1.scanString(msg))
print(len(msg), end)

Which gives me:

320 161

We are stopping at location 161 in a string whose total length is 320, so I'll add one more print statement:

print(msg[end:end+10])

and I get:

.
lastly;

The trailing period in your body text is the culprit. If I remove it from the message and try parseString again, I now get:

['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- Body: ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- KEY: NOW;
['lastly;', 'another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- Body: ['another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- KEY: lastly;
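As an alternative to the scanString check, parseString also accepts parseAll=True, which raises a ParseException when the grammar cannot consume the whole input. The sketch below uses a shortened stand-in for the message (the sample string and the variable caught are mine, not from your code) and the body definition that has no punctuation support yet:

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, ParseException,
                       Word, alphas, nums)

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))

# Shortened, made-up stand-in for the original message, period included.
sample = "NOW; is the time.\nlastly; another day"

caught = None
try:
    # parseAll=True requires the grammar to match through the end of the string
    g1.parseString(sample, parseAll=True)
except ParseException as err:
    caught = err
    # err.loc is the offset at which parsing could not continue
    print(err.lineno, err.col, repr(sample[err.loc:err.loc + 10]))
```

The reported location should point at the period, which is the same conclusion the scanString check reaches.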

If you want to handle punctuation, I suggest you add something like:

PUNC = oneOf(". , ? ! : & $")

and add it to body1:

body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
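Putting the pieces together, here is a short sketch of the whole grammar with punctuation support, run on a made-up two-line sample (the sample text is mine and much shorter than your message). The period no longer stops the parse, so both keywords are found:

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, Word,
                       alphas, nums, oneOf)

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
PUNC = oneOf(". , ? ! : & $")
# Rule out a keyword first, then accept either a word or a punctuation mark.
body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
g1 = OneOrMore(Group(kw + body1))

# Made-up sample with punctuation inside and at the end of a body.
sample = """NOW; is the time.
lastly; another day, perhaps"""

result = g1.parseString(sample)
for res in result:
    print(res['KEY'], res['Body'].asList())
# NOW; ['is', 'the', 'time', '.']
# lastly; ['another', 'day', ',', 'perhaps']
```

Note that the punctuation marks are kept as tokens in Body; if you would rather drop them, wrapping PUNC in Suppress is one option.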