Question

I am a newbie to pyparsing and have been reading the examples, looking here and trying some things out. I created a grammar and provided a buffer. I do however have a heavy background in lex/yacc from the old days.

I have a general question or two.

I'm currently seeing

ParseException: Expected end of line (at char 7024), (line 213, col:2)

and then it terminates

Because newlines have meaning in my buffer, I did:

ParserElement.setDefaultWhitespaceChars('') # <-- zero len string

Does this error mean that somewhere in my productions I have a rule that is looking for a LineEnd(), and that rule happens to somehow be 'last'?

The location where it is dying is the 'end of file'. I tried using parseFile, but my file contains characters with code points above 127, so instead I am loading it into memory, filtering out all characters above 127, then calling parseString.

I tried turning on verbose_stacktrace=True for some of the elements of my grammar where I thought the problem originated.

Is there a better way to track down the exact ParserElement it is trying to recognize when an error such as this occurs? Or can I get a 'stack' or 'most recently recognized production' trace?

I didn't realize I could edit up here... My crash is this:

[centos@new-host /tmp/sample]$  ./zooparser.py 
!(zooparser.py) TEST test1: valid message type START
Ready to roll
Parsing This message: ( ignore leading>>> and trailing <<< ) >>>

ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//

<<<
Match {Group:({Group:({Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | " " | W:(!@#$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | "    " | W:(!@#$...)}}) "//"}) Group:({LineEnd "TOPIC" {Group:({[LineEnd]... Group:({"/" {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)}) | Group:({{{"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ'"}... | Group:({{"0123456789"}... ":"})} {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)})}}) | "-"}})})}... [LineEnd]... "//"})}) [Group:({LineEnd "FREE" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | "  " | W:(!@#$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | "    " | W:(!@#$...)}}) "//"})]...}) [LineEnd]... StringEnd} at loc 0(1,1)
Match Group:({Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | "  " | W:(!@#$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | "    " | W:(!@#$...)}}) "//"}) Group:({LineEnd "TOPIC" {Group:({[LineEnd]... Group:({"/" {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)}) | Group:({{{"ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ'"}... | Group:({{"0123456789"}... ":"})} {W:(abcd...) | Group:({W:(abcd...) [{W:(abcd...)}...]... W:(abcd...)})}}) | "-"}})})}... [LineEnd]... "//"})}) at loc 0(1,1)
Match Group:({[LineEnd]... "ZOO" Group:({[LineEnd]... "/" [Group:({{{W:(abcd...) | LineEnd | "://" | "  " | W:(!@#$...) | ":"}}... ["/"]...})]... {W:(abcd...) | LineEnd | "://" | "    " | W:(!@#$...)}}) "//"}) at loc 0(1,1)
Exception raised:None
Exception raised:None
Exception raised:None
Traceback (most recent call last):
  File "./zooparser.py", line 319, in <module>
    test1(pgm)
  File "./zooparser.py", line 309, in test1
    test(pgm, zooMsg, 'test1: valid message type' )
  File "./zooparser.py", line 274, in test
    tokens = zg.getTokensFromBuffer(fileName)
  File "./zooparser.py", line 219, in getTokensFromBuffer
    tokens = self.text.parseString(filteredBuffer,parseAll=True)
  File "/usr/local/lib/python2.7/site-packages/pyparsing-1.5.7-py2.7.egg/pyparsing.py", line 1006, in parseString
    raise exc
pyparsing.ParseException: Expected end of line (at char 148), (line:8, col:2)
[centos@new-host /tmp/sample]$  

source: see http://prj1.y23.org/zoo.zip


Solution

pyparsing takes a different view toward parsing than lex/yacc does. You have to let the classes do some of the work. Here's an example in your code:

    self.columnHeader = OneOrMore(self.aucc) \
                        | OneOrMore(nums) \
                        | OneOrMore(self.blankCharacter) \
                        | OneOrMore(self.specialCharacter)

You are equating OneOrMore with the '+' character of a regex. In pyparsing, this is true for ParserElements, but at the character level, pyparsing uses the Word class:

    self.columnHeader = Word(self.aucc + nums + self.blankCharacter + self.specialCharacter)

OneOrMore works with ParserElements, not characters. Look at:

    OneOrMore(nums)

nums is the string "0123456789". OneOrMore will accept a string argument, but implicitly converts it to a Literal, so OneOrMore(nums) will match "0123456789", "01234567890123456789", etc., but not "123". Matching a run of characters such as "123" is what Word is for.

This is a fundamental difference between using pyparsing and lex/yacc, and I think it is the source of much of the complexity in your code.
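A minimal sketch of the difference, assuming pyparsing is installed; the helper name `matches` is made up for this illustration and is not part of pyparsing:

```python
from pyparsing import OneOrMore, ParseException, Word, nums


def matches(expr, text):
    """Return True if expr parses the whole of text (hypothetical helper)."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False


# The string nums is implicitly wrapped as Literal("0123456789")
repeated_literal = OneOrMore(nums)
# Word matches any run of characters drawn from nums
digit_word = Word(nums)

print(matches(repeated_literal, "0123456789"))            # True
print(matches(repeated_literal, "01234567890123456789"))  # True
print(matches(repeated_literal, "123"))                   # False
print(matches(digit_word, "123"))                         # True
```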

Some other suggestions:

Your code has some premature optimizations in it - you write:

    aucc = ''.join(set([alphas.upper(),"'"]))

Assuming that this will be used for defining Words, just do:

    aucc = alphas.upper() + "'"

There is no harm in having duplicate characters in aucc; Word will convert this to a set internally.

Write a BNF for what you want to parse. It does not have to be as rigorous as it would be with lex/yacc. From your samples, it looks something like:
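As a quick check that the duplicates are harmless, here is a small sketch, assuming pyparsing is installed; the variable names other than aucc are illustrative only:

```python
from pyparsing import Word, alphas

# alphas.upper() contains every capital letter twice; the repeats are harmless
aucc = alphas.upper() + "'"
# A hand-deduplicated version of the same character set, for comparison
deduped = "".join(sorted(set(aucc)))

with_dupes = Word(aucc).parseString("DON'T STOP").asList()
without_dupes = Word(deduped).parseString("DON'T STOP").asList()

# Both Words stop at the first space and return the same single token
print(with_dupes, without_dupes)
```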

    # sample
    ZOO/STATUS/FOOD ALLOCATION//
    TOPIC/BIRD FEED IS RUNNING LOW//
    FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
    LONGER USABLE.//

    parser :: header topicEntry+
    header :: "ZOO" sep namedValue
    namedValue :: uppercaseWord sep valueBody
    valueBody :: (everything up to "//") "//"
    topicEntry :: topicHeader topicBody
    topicHeader :: "TOPIC" sep valueBody
    topicBody :: freeText
    freeText :: "FREE" sep valueBody
    sep :: "/"

Converting to pyparsing, this looks something like:

    from pyparsing import Keyword, Literal, OneOrMore, SkipTo, Word, alphas

    SEP = Literal("/")
    BODY_TERMINATOR = Literal("//")
    FREE_, TOPIC_, ZOO_ = map(Keyword, "FREE TOPIC ZOO".split())
    uppercaseWord = Word(alphas.upper())
    # SkipTo stops just before its target, so the "//" is consumed explicitly
    valueBody = SkipTo(BODY_TERMINATOR) + BODY_TERMINATOR.suppress() # adjust later, but okay for now...

    freeText = FREE_ + SEP + valueBody

    topicBody = freeText
    topicHeader = TOPIC_ + SEP + valueBody
    topicEntry = topicHeader + topicBody

    namedValue = uppercaseWord + SEP + valueBody
    zooHeader = ZOO_ + SEP + namedValue

    parser = zooHeader + OneOrMore(topicEntry)

(valueBody will have to get more elaborate when you add support for '://' embedded within a value, but save that for Round 2.)
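To try the idea end to end, here is a self-contained sketch run against the sample message. It keeps the names from the grammar above but makes two choices of its own, both assumptions for this illustration: since SkipTo stops just before its target, the "//" is consumed explicitly after each value, and the "/" separators are suppressed so that only the meaningful tokens come back.

```python
from pyparsing import (Keyword, Literal, OneOrMore, SkipTo, Suppress,
                       Word, alphas)

sample = """\
ZOO/STATUS/FOOD ALLOCATION//
TOPIC/BIRD FEED IS RUNNING LOW//
FREE/WE HAVE DISCOVERED MOTHS INFESTED THE BIRDSEED AND IT IS NO
LONGER USABLE.//
"""

SEP = Suppress("/")                      # assumption: drop separators from the results
BODY_TERMINATOR = Literal("//")
FREE_, TOPIC_, ZOO_ = map(Keyword, "FREE TOPIC ZOO".split())
uppercaseWord = Word(alphas.upper())
# SkipTo does not consume its target, so the terminator is matched (and dropped) here
valueBody = SkipTo(BODY_TERMINATOR) + Suppress(BODY_TERMINATOR)

freeText = FREE_ + SEP + valueBody
topicHeader = TOPIC_ + SEP + valueBody
topicEntry = topicHeader + freeText

namedValue = uppercaseWord + SEP + valueBody
zooHeader = ZOO_ + SEP + namedValue

parser = zooHeader + OneOrMore(topicEntry)

# Default whitespace skipping is left on, so newlines between entries are ignored
tokens = parser.parseString(sample, parseAll=True).asList()
for tok in tokens:
    print(repr(tok))
```

The multi-line FREE body comes back as a single string with its embedded newline intact, because SkipTo returns everything it skipped over.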

Don't make things super complicated until you get at least some simple stuff working.

Licensed under: CC-BY-SA with attribution