After a lot of tweaking I got it to work, but I'm not sure why it has to be done this way:
from pyparsing import *
from string import whitespace


def test(phrase):
    """
    @summary: try to grab a keyword+ "keyword+ " and free text following
    the keyword
    @param phrase: a phrase of text to parse
    @type phrase: str
    @date: 20140213
    """
    test = 1
    print('Phrase \n 1 2 3\n')
    print('123456789012345678901234567890\n')
    print('%s\n' % phrase)
    kw = Combine(Word(alphas + nums) + Literal(':'))('KEY')
    punc = printables.replace(':', '')
    p2 = oneOf(")[]()/.")
    # but punc now has '/' in it twice
    p3 = punc | p2
    kw.setDebug(True)
    s = OneOrMore(p3)
    # pepper handles body1
    # http://structure.usc.edu/pyparsing/pyparsing.Word-class.html
    # init char, body chars
    body1 = originalTextFor(OneOrMore(~kw + OneOrMore(Word(alphas +
        '/.' + nums))))('BODY1')
    body2 = originalTextFor(OneOrMore(~kw + (Word(alphas + punc + nums))))(
        'BODY2')
    body1.setDebug(True)
    body1.setName('BODY1')
    body2.setDebug(True)
    body2.setName('BODY2')
    grammar = OneOrMore(Group(kw + body1) | Group(kw + body2))
    print('============= %s =================' % test)
    # this grabs only the first one
    print("grammar %s" % grammar)
    output = grammar.parseString(phrase)
    print('XXXXXXXXXXXXX')
    print('XXXXXXXXXXXXX')
    print('XXXXXXXXXXXXX')
    result, start, end = next(grammar.scanString(phrase))
    print('%d %d' % (len(phrase), end))
    print('NOTICE:')
    print(phrase[end:end+10])
    print("Test %d output %s" % (test, output))
    for res in output:
        print(res.dump())


if __name__ == '__main__':
    phrase = """
COTTON2: (RAW) NEED HARVEST DATE.
SALAMI2: (COOKED) SOUTHERN VARIES; SUGGEST ALT.
PEPPER1: ON TREE/ROOTS UNDERGROUND REQUEST PERMISSION
TO DIG PLANT AND RELOCATE.
"""
    # when I run the output stops at '/' in the 'pepper' parsing.
    test(phrase)
I relabeled the COTTON, SALAMI and PEPPER keywords with numbers to show which rule each one triggers: COTTON2 and SALAMI2 seem to trigger body2, and PEPPER1 triggers body1. I don't really like this solution, as it is a bit hardcoded, and it doesn't make sense to me.
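As an alternative to renaming the keywords, the rule that matched can be read off each group by checking which results name is present. This is only a minimal sketch, assuming the same grammar and sample phrase as in the function above; the helper name `which_body` is hypothetical and not part of the original code.

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, Word,
                       alphas, nums, originalTextFor, printables)

kw = Combine(Word(alphas + nums) + Literal(':'))('KEY')
punc = printables.replace(':', '')

# Same two body rules as in the function above.
body1 = originalTextFor(
    OneOrMore(~kw + OneOrMore(Word(alphas + '/.' + nums))))('BODY1')
body2 = originalTextFor(
    OneOrMore(~kw + Word(alphas + punc + nums)))('BODY2')
grammar = OneOrMore(Group(kw + body1) | Group(kw + body2))


def which_body(entry):
    """Hypothetical helper: report which body rule filled this group."""
    return 'BODY1' if 'BODY1' in entry else 'BODY2'


phrase = """
COTTON2: (RAW) NEED HARVEST DATE.
SALAMI2: (COOKED) SOUTHERN VARIES; SUGGEST ALT.
PEPPER1: ON TREE/ROOTS UNDERGROUND REQUEST PERMISSION
TO DIG PLANT AND RELOCATE.
"""

# Pair each keyword with the name of the body rule that matched it.
matched = [(entry['KEY'], which_body(entry))
           for entry in grammar.parseString(phrase)]
for key, rule in matched:
    print('%s -> %s' % (key, rule))
```

With this, the sample keywords would not need a numeric suffix to show which alternative fired.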
When I run the function I get the following output (it doesn't seem to like the EOF condition):
Exception raised:Expected W:(abcd...) (at char 170), (line:11, col:1)
170 169
NOTICE:
Test 1 output [['COTTON2:', '(RAW) NEED HARVEST DATE.'], ['SALAMI2:', '(COOKED) SOUTHERN VARIES; SUGGEST ALT.'], ['PEPPER1:', 'ON TREE/ROOTS UNDERGROUND REQUEST PERMISSION\nTO DIG PLANT AND RELOCATE.']]
['COTTON2:', '(RAW) NEED HARVEST DATE.']
- BODY2: (RAW) NEED HARVEST DATE.
- KEY: COTTON2:
['SALAMI2:', '(COOKED) SOUTHERN VARIES; SUGGEST ALT.']
- BODY2: (COOKED) SOUTHERN VARIES; SUGGEST ALT.
- KEY: SALAMI2:
['PEPPER1:', 'ON TREE/ROOTS UNDERGROUND REQUEST PERMISSION\nTO DIG PLANT AND RELOCATE.']
- BODY1: ON TREE/ROOTS UNDERGROUND REQUEST PERMISSION
TO DIG PLANT AND RELOCATE.
- KEY: PEPPER1:
Process finished with exit code 0
But it does process all of the input, including the embedded '/' that it was bombing on earlier.
So there is still a question as to what is going on with the lookahead rule and the embedded free-text rule in body1 and body2:
body1 = originalTextFor(OneOrMore(~kw + OneOrMore(Word(alphas +
    '/.' + nums))))('BODY1')
body2 = originalTextFor(OneOrMore(~kw + (Word(alphas + punc + nums))))(
    'BODY2')
On a whim I added '/.' to body1, since PEPPER seemed to be handled by body1. I tried adding punc to body1 instead, but that didn't work; it actually made things worse.
However, the full function shown at the top of this post works.
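One possible reading of why adding punc to body1 made things worse, offered as a sketch rather than a confirmed diagnosis: in body1 the `~kw` lookahead is checked only before each inner `OneOrMore(Word(...))` run, not before every word. Once the character set is wide enough to match the text of the next entry, the inner loop can run straight into the next keyword's name. In body2 every word is preceded by its own lookahead. The example below assumes the same sample phrase; the names `free_word`, `unguarded` and `guarded` are illustrative.

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, Word,
                       alphanums, originalTextFor, printables)

kw = Combine(Word(alphanums) + Literal(':'))('KEY')
# Any run of non-whitespace characters that contains no ':'.
free_word = Word(printables.replace(':', ''))

# Shape of body1 with punc added: the lookahead runs once, then the
# inner loop consumes words without checking for the next keyword.
unguarded = originalTextFor(OneOrMore(~kw + OneOrMore(free_word)))('BODY')
# Shape of body2: the lookahead runs before every single word.
guarded = originalTextFor(OneOrMore(~kw + free_word))('BODY')

phrase = """
COTTON2: (RAW) NEED HARVEST DATE.
SALAMI2: (COOKED) SOUTHERN VARIES; SUGGEST ALT.
PEPPER1: ON TREE/ROOTS UNDERGROUND REQUEST PERMISSION
TO DIG PLANT AND RELOCATE.
"""

bad = OneOrMore(Group(kw + unguarded)).parseString(phrase)
good = OneOrMore(Group(kw + guarded)).parseString(phrase)

# The unguarded body swallows the name of the next keyword.
print('%d group(s), first body: %r' % (len(bad), bad[0]['BODY']))
for entry in good:
    print('%s -> %r' % (entry['KEY'], entry['BODY']))
```

On the sample phrase the unguarded variant stops after a single group whose body runs into `SALAMI2`, while the guarded variant returns all three entries. If that reading is right, a single guarded rule may be enough here, and the separate body1 with its hand-picked '/.' characters may not be needed.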