I'm good with TDD, but here your whole testing and alternative-selecting infrastructure really gets in the way of seeing just where the grammar is and what's going on with it. If I strip away all the extra machinery, I see your grammar is just:
kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
g1 = OneOrMore(Group(kw + body1))
The first issue I see is your definition of body1:
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
You are on the right track with a negative lookahead, but for it to work in pyparsing, you have to put it at the beginning of the expression, not at the end. Think of it as "before I match another valid word, I will first rule out that it is a keyword":
body1 = delimitedList(OneOrMore(~kw + Word(alphas + nums)))('Body')
(Why is this a delimitedList, by the way? delimitedList is usually reserved for true lists with delimiters, such as comma-delimited arguments to a program function. All this does is accept any commas that might be mixed into the body, which should be handled more straightforwardly using a list of punctuation.)
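To make the lookahead ordering concrete, here is a minimal sketch on a shortened input. The `sample` string and the names `body` and `grammar` are made up for illustration and are not from your code; the point is only that checking `~kw` before each word lets the body stop when the next keyword arrives.

```python
from pyparsing import Combine, Group, Literal, OneOrMore, Word, alphas, nums

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')

# Lookahead comes first: rule out a keyword, then accept the word.
body = OneOrMore(~kw + Word(alphas + nums))('Body')
grammar = OneOrMore(Group(kw + body))

# Made-up input: two keywords, each followed by a few plain words.
sample = "first; a b c second; d e"
groups = grammar.parseString(sample)

for g in groups:
    # Each group carries the keyword plus the words that follow it.
    print(g['KEY'], list(g['Body']))
```

With the lookahead at the end of the expression instead, the body has no way to refuse the word that starts the next keyword, which is why the ordering matters here.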
Here is my test version of your code:
from pyparsing import *
kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))
msg = [ """NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
""",
'',
][0]
result = g1.parseString(msg)
# we expect multiple groups, each containing "KEY" and "Body" names,
# so iterate over groups, and dump the contents of each
for res in result:
    print(res.dump())
I still get the same results as you: only the first keyword matches. To see where the disconnect is happening, I use scanString, which returns not only the matched tokens, but also the start and end locations of the match:
result,start,end = next(g1.scanString(msg))
print(len(msg), end)
Which gives me:
320 161
So I see that we are ending at location 161 in a string whose total length is 320, so I'll add one more print statement:
print(msg[end:end+10])
and I get:
.
lastly;
The trailing period in your body text is the culprit. If I remove it from the message and try parseString again, I now get:
['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- Body: ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- KEY: NOW;
['lastly;', 'another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- Body: ['another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- KEY: lastly;
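The scanString check I used to find the stray period can be wrapped up for reuse. This is only a sketch: `first_match_end` is a hypothetical helper name, and `sample` is a made-up input with a period in the body. It also shows parseString with parseAll=True, which raises a ParseException instead of silently returning a partial match.

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, ParseException,
                       Word, alphas, nums)

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body = OneOrMore(~kw + Word(alphas + nums))('Body')
grammar = OneOrMore(Group(kw + body))

def first_match_end(expr, text):
    """Hypothetical helper: end offset of the first match scanString finds."""
    tokens, start, end = next(expr.scanString(text))
    return end

# Made-up input with a stray period that the grammar does not accept.
sample = "first; a b. second; c d"

end = first_match_end(grammar, sample)
print(len(sample), end)              # total length vs. where matching stopped
print(repr(sample[end:end + 10]))    # the text the parser could not get past

# parseAll=True turns the silent early stop into an exception.
try:
    grammar.parseString(sample, parseAll=True)
except ParseException as err:
    print("stopped early:", err)
```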
If you want to handle punctuation, I suggest you add something like:
PUNC = oneOf(". , ? ! : & $")
and add it to body1:
body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
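Putting the pieces together, here is a short sketch of the grammar with punctuation allowed in the body. The `sample` text is a shortened, made-up variant of your message that keeps the comma and trailing period; everything else uses the definitions given above.

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, Word,
                       alphas, nums, oneOf)

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
PUNC = oneOf(". , ? ! : & $")

# A body item is either a plain word or a punctuation mark,
# but never the start of the next keyword.
body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
g1 = OneOrMore(Group(kw + body1))

# Made-up sample that keeps punctuation inside the first body.
sample = """NOW; is the time, when least expected.
lastly; another day"""

result = g1.parseString(sample)
for res in result:
    print(res.dump())
```

Note that the punctuation marks come back as tokens in Body; if you would rather drop them, wrapping PUNC in Suppress is one option.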