PyParsing: Not all tokens passed to setParseAction()
Question
I'm parsing sentences like "CS 2110 or INFO 3300". I would like to output a format like:
[[("CS", 2110)], [("INFO", 3300)]]
To do this, I thought I could use setParseAction(). However, the print statements in statementParse() suggest that only the last tokens are actually passed:
>>> statement.parseString("CS 2110 or INFO 3300")
Match [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}] at loc 7(1,8)
string CS 2110 or INFO 3300
loc: 7
tokens: ['INFO', 3300]
Matched [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}] -> ['INFO', 3300]
(['CS', 2110, 'INFO', 3300], {'Course': [(2110, 1), (3300, 3)], 'DeptCode': [('CS', 0), ('INFO', 2)]})
I expected all the tokens to be passed, but it's only ['INFO', 3300]. Am I doing something wrong? Or is there another way that I can produce the desired output?
Here is the pyparsing code:
from pyparsing import *
def statementParse(s, location, tokens):
    # Debug parse action: print what pyparsing passes in.
    print("string %s" % s)
    print("loc: %s " % location)
    print("tokens: %s" % tokens)
DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")
COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("CourseNumber")
OR_CONJ = Suppress("or")
COURSE_NUMBER.setParseAction(lambda s, l, toks : int(toks[0]))
course = DEPT_CODE + COURSE_NUMBER.setResultsName("Course")
statement = course + Optional(OR_CONJ + course).setParseAction(statementParse).setDebug()
Solution
In order to keep the token bits from "CS 2110" and "INFO 3300", I suggest you wrap your definition of course in a Group:
course = Group(DEPT_CODE + COURSE_NUMBER).setResultsName("Course")
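As a rough illustration of what Group changes, here is a minimal sketch that reuses the expressions from the question. The names nested and desired are introduced only for this example, and the final list comprehension is just one assumed way of reaching the tuple format asked for, not something pyparsing does on its own.

```python
from pyparsing import Group, Optional, Regex, Suppress

DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")
# Convert the matched digits to an int, as in the question.
COURSE_NUMBER = (Regex(r'[0-9]{4}')
                 .setParseAction(lambda s, l, toks: int(toks[0]))
                 .setResultsName("CourseNumber"))
OR_CONJ = Suppress("or")

# Group keeps each course's tokens together in their own sub-list.
course = Group(DEPT_CODE + COURSE_NUMBER)
statement = course + Optional(OR_CONJ + course)

result = statement.parseString("CS 2110 or INFO 3300")
nested = result.asList()
print(nested)   # [['CS', 2110], ['INFO', 3300]]

# Hypothetical post-processing step to get the format from the question.
desired = [[tuple(c)] for c in nested]
print(desired)  # [[('CS', 2110)], [('INFO', 3300)]]
```

Without Group, the same parse returns the flat list ['CS', 2110, 'INFO', 3300], which is why the two courses could not be told apart.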
It also looks like you are charging head-on at parsing out some kind of search expression, like "x and y or z". There is some subtlety to this problem, and I suggest you check out some of the examples at the pyparsing wiki on how to build up these kinds of expressions. Otherwise you will end up with a bird's nest of Optional("or" + this) and ZeroOrMore("and" + that) pieces. As a last resort, you may even just use something with operatorPrecedence, like:
DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")
COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("CourseNumber")
course = Group(DEPT_CODE + COURSE_NUMBER)
courseSearch = operatorPrecedence(course,
    [
    ("not", 1, opAssoc.RIGHT),
    ("and", 2, opAssoc.LEFT),
    ("or", 2, opAssoc.LEFT),
    ])
(You may have to download the latest 1.5.3 version from the SourceForge SVN for this to work.)
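To give a feel for the nested structure this produces, here is a small sketch. It assumes a newer pyparsing release in which the same helper is available under its later name, infixNotation; with an older release, operatorPrecedence should take the same arguments. The variables simple and mixed are names made up for this example.

```python
from pyparsing import Group, Regex, infixNotation, opAssoc

DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")
COURSE_NUMBER = (Regex(r'[0-9]{4}')
                 .setParseAction(lambda s, l, toks: int(toks[0]))
                 .setResultsName("CourseNumber"))
course = Group(DEPT_CODE + COURSE_NUMBER)

# Operators are listed from highest to lowest precedence.
courseSearch = infixNotation(course, [
    ("not", 1, opAssoc.RIGHT),
    ("and", 2, opAssoc.LEFT),
    ("or", 2, opAssoc.LEFT),
])

simple = courseSearch.parseString("CS 2110 or INFO 3300").asList()
print(simple)  # [[['CS', 2110], 'or', ['INFO', 3300]]]

# "and" binds tighter than "or", so the "and" pair is nested one level deeper.
mixed = courseSearch.parseString("CS 2110 and MATH 1920 or INFO 3300").asList()
print(mixed)
```

Note that the operator strings are kept in the results here, unlike the Suppress("or") used in the question, so the nesting records which operator joined which operands.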
OTHER TIPS
It works better if you set the parse action on both course and the Optional (you were setting it only on the Optional!):
>>> statement = (course + Optional(OR_CONJ + course)).setParseAction(statementParse).setDebug()
>>> statement.parseString("CS 2110 or INFO 3300")
gives
Match {Re:('[A-Z]{2,}') Re:('[0-9]{4}') [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}]} at loc 0(1,1)
string CS 2110 or INFO 3300
loc: 0
tokens: ['CS', 2110, 'INFO', 3300]
Matched {Re:('[A-Z]{2,}') Re:('[0-9]{4}') [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}]} -> ['CS', 2110, 'INFO', 3300]
(['CS', 2110, 'INFO', 3300], {'Course': [(2110, 1), (3300, 3)], 'DeptCode': [('CS', 0), ('INFO', 2)]})
though I suspect what you actually want is to set the parse action on each course, not on the statement (the output below implies that the parse action and setDebug() were attached to course itself):
>>> statement = course + Optional(OR_CONJ + course)
>>> statement.parseString("CS 2110 or INFO 3300")
Match {Re:('[A-Z]{2,}') Re:('[0-9]{4}')} at loc 0(1,1)
string CS 2110 or INFO 3300
loc: 0
tokens: ['CS', 2110]
Matched {Re:('[A-Z]{2,}') Re:('[0-9]{4}')} -> ['CS', 2110]
Match {Re:('[A-Z]{2,}') Re:('[0-9]{4}')} at loc 10(1,11)
string CS 2110 or INFO 3300
loc: 10
tokens: ['INFO', 3300]
Matched {Re:('[A-Z]{2,}') Re:('[0-9]{4}')} -> ['INFO', 3300]
(['CS', 2110, 'INFO', 3300], {'Course': [(2110, 1), (3300, 3)], 'DeptCode': [('CS', 0), ('INFO', 2)]})
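Since a parse action attached to course only ever sees that one course's tokens, it can also be used to reshape them. The following is a sketch under that assumption; to_pair and pairs are hypothetical names chosen for this example, not part of pyparsing.

```python
from pyparsing import Optional, Regex, Suppress

def to_pair(s, loc, toks):
    # toks holds only this course's tokens, e.g. ['CS', 2110].
    # Returning a one-element list replaces them with a single tuple token.
    return [(toks[0], toks[1])]

DEPT_CODE = Regex(r'[A-Z]{2,}')
COURSE_NUMBER = Regex(r'[0-9]{4}').setParseAction(lambda s, l, toks: int(toks[0]))
OR_CONJ = Suppress("or")

# The parentheses matter: the parse action is attached to the whole course
# expression, not just to COURSE_NUMBER.
course = (DEPT_CODE + COURSE_NUMBER).setParseAction(to_pair)
statement = course + Optional(OR_CONJ + course)

pairs = statement.parseString("CS 2110 or INFO 3300").asList()
print(pairs)  # [('CS', 2110), ('INFO', 3300)]
```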