Question

I'm parsing sentences like "CS 2110 or INFO 3300". I would like to output a format like:

[[("CS" 2110)], [("INFO", 3300)]]

To do this, I thought I could use setParseAction(). However, the print statements in statementParse() suggest that only the last tokens are actually passed:

>>> statement.parseString("CS 2110 or INFO 3300")
Match [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}] at loc 7(1,8)
string CS 2110 or INFO 3300
loc: 7 
tokens: ['INFO', 3300]
Matched [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}] -> ['INFO', 3300]
(['CS', 2110, 'INFO', 3300], {'Course': [(2110, 1), (3300, 3)], 'DeptCode': [('CS', 0), ('INFO', 2)]})

I expected all the tokens to be passed, but it's only ['INFO', 3300]. Am I doing something wrong? Or is there another way that I can produce the desired output?

Here is the pyparsing code:

from pyparsing import *

def statementParse(str, location, tokens):
    print "string %s" % str
    print "loc: %s " % location
    print "tokens: %s" % tokens

DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")
COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("CourseNumber")

OR_CONJ = Suppress("or")

COURSE_NUMBER.setParseAction(lambda s, l, toks : int(toks[0]))

course = DEPT_CODE + COURSE_NUMBER.setResultsName("Course")

statement = course + Optional(OR_CONJ + course).setParseAction(statementParse).setDebug()
Was it helpful?

Solution

In order to keep the token bits from "CS 2110" and "INFO 3300", I suggest you wrap your definition of course in a Group:

course = Group(DEPT_CODE + COURSE_NUMBER).setResultsName("Course")

It also looks like you are charging head-on at parsing out some kind of search expression, like "x and y or z". There is some subtlety to this problem, and I suggest you check out some of the examples at the pyparsing wiki on how to build up these kinds of expressions. Otherwise you will end up with a bird's nest of Optional("or" + this) and ZeroOrMore( "and" + that) pieces. As a last-ditch, you may even just use something with operatorPrecedence, like:

DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")        
COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("CourseNumber")
course = Group(DEPT_CODE + COURSE_NUMBER)

courseSearch = operatorPrecedence(course, 
    [
    ("not", 1, opAssoc.RIGHT),
    ("and", 2, opAssoc.LEFT),
    ("or", 2, opAssoc.LEFT),
    ])

(You may have to download the latest 1.5.3 version from the SourceForge SVN for this to work.)

OTHER TIPS

Works better if you set the parse action on both course and the Optional (you were setting only on the Optional!):

>>> statement = (course + Optional(OR_CONJ + course)).setParseAction(statementParse).setDebug()
>>> statement.parseString("CS 2110 or INFO 3300")    

gives

Match {Re:('[A-Z]{2,}') Re:('[0-9]{4}') [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}]} at loc 0(1,1)
string CS 2110 or INFO 3300
loc: 0 
tokens: ['CS', 2110, 'INFO', 3300]
Matched {Re:('[A-Z]{2,}') Re:('[0-9]{4}') [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}]} -> ['CS', 2110, 'INFO', 3300]
(['CS', 2110, 'INFO', 3300], {'Course': [(2110, 1), (3300, 3)], 'DeptCode': [('CS', 0), ('INFO', 2)]})

though I suspect what you actually want is to set the parse action on each course, not on the statement:

>>> statement = course + Optional(OR_CONJ + course)
>>> statement.parseString("CS 2110 or INFO 3300")                               Match {Re:('[A-Z]{2,}') Re:('[0-9]{4}')} at loc 0(1,1)
string CS 2110 or INFO 3300
loc: 0 
tokens: ['CS', 2110]
Matched {Re:('[A-Z]{2,}') Re:('[0-9]{4}')} -> ['CS', 2110]
Match {Re:('[A-Z]{2,}') Re:('[0-9]{4}')} at loc 10(1,11)
string CS 2110 or INFO 3300
loc: 10 
tokens: ['INFO', 3300]
Matched {Re:('[A-Z]{2,}') Re:('[0-9]{4}')} -> ['INFO', 3300]
(['CS', 2110, 'INFO', 3300], {'Course': [(2110, 1), (3300, 3)], 'DeptCode': [('CS', 0), ('INFO', 2)]})
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top