Question

So I've implemented a parser using PLY, but all of the PLY documentation deals with parse and tokenization errors by printing out error messages. I'm wondering what the best way is to implement non-fatal error reporting, at the API level, to the caller of the parser. Obviously the "non-fatal" restriction means exceptions are out, and it feels like I'd be misusing the warnings module for parse errors. Suggestions?

Solution

PLY has a t_error() function that you can override in your lexer to do whatever you want. The example in the documentation prints an error message and skips the offending character, but you could just as easily append to a list of encountered failures, enforce a threshold that gives up after a certain number of errors, and so on. See http://www.dabeaz.com/ply/ply.html:

4.9 Error handling

Finally, the t_error() function is used to handle lexing errors that occur when illegal characters are detected. In this case, the t.value attribute contains the rest of the input string that has not been tokenized. In the example, the error function was defined as follows:

# Error handling rule
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)
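
Instead of printing, t_error() can record the failure and enforce a cap. Here's a rough sketch of that idea (not from the PLY docs): errors and MAX_ERRORS are illustrative module-level names, and blowing past the cap is treated as fatal by design.

# Illustrative names; not part of PLY itself
MAX_ERRORS = 10
errors = []

def t_error(t):
    errors.append("Illegal character '%s' at line %d"
                  % (t.value[0], t.lexer.lineno))
    if len(errors) >= MAX_ERRORS:
        # Past the threshold, treat the input as hopeless and abort.
        raise RuntimeError("too many lexing errors, giving up")
    t.lexer.skip(1)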

You can make use of this by wrapping your lexer in a class and storing the error state on the instance. This is a fairly crude example, since if you wanted multiple lexers running concurrently you'd have to create multiple MyLexer instances, build() each of them, and then use them for parsing.

You could instead marry the error storage to the hash of the lexer instance itself, so you only have to build once; a rough sketch of that idea follows. I'm hazy on the details of running multiple lexer instances within one class, but this is really just meant to give a rough example of how you can capture and report non-fatal errors.
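
A minimal sketch of that per-lexer storage, where errors_by_lexer is a hypothetical name and not a PLY feature:

errors_by_lexer = {}

def t_error(t):
    # Key the error list by the lexer's hash so several lexers
    # can accumulate errors independently.
    errors_by_lexer.setdefault(hash(t.lexer), []).append(
        "Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)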

I've modified the simple calculator class example from PLY's documentation for this purpose.

#!/usr/bin/python

import ply.lex as lex

class MyLexer:

    errors = []   # reset to a fresh per-instance list in build() and test()

    # List of token names.   This is always required
    tokens = (
       'NUMBER',
       'PLUS',
       'MINUS',
       'TIMES',
       'DIVIDE',
       'LPAREN',
       'RPAREN',
    )

    # Regular expression rules for simple tokens
    t_PLUS    = r'\+'
    t_MINUS   = r'-'
    t_TIMES   = r'\*'
    t_DIVIDE  = r'/'
    t_LPAREN  = r'\('
    t_RPAREN  = r'\)'

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_NUMBER(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore  = ' \t'

    # Error handling rule
    def t_error(self,t):
        self.errors.append("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.errors = []
        self.lexer = lex.lex(module=self, **kwargs)

    # Tokenize a string and print the tokens
    def test(self,data):
        self.errors = []
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print(tok)

    def report(self):
        return self.errors

Usage:

# Build the lexer and try it out
m = MyLexer()
m.build()               # Build the lexer
m.test("3 + 4 + 5")     # Test it
print(m.report())
m.test("3 + A + B")
print(m.report())

Output:

LexToken(NUMBER,3,1,0)
LexToken(PLUS,'+',1,2)
LexToken(NUMBER,4,1,4)
LexToken(PLUS,'+',1,6)
LexToken(NUMBER,5,1,8)
[]
LexToken(NUMBER,3,1,0)
LexToken(PLUS,'+',1,2)
LexToken(PLUS,'+',1,6)
["Illegal character 'A'", "Illegal character 'B'"]

OTHER TIPS

Check out section 9.2:

9.2 Run-time Debugging

To enable run-time debugging of a parser, use the debug option to parse. This option can either be an integer (which simply turns debugging on or off) or an instance of a logger object. For example:

log = logging.getLogger()
parser.parse(input, debug=log)

If a logging object is passed, you can use its filtering level to control how much output gets generated. The INFO level is used to produce information about rule reductions. The DEBUG level will show information about the parsing stack, token shifts, and other details. The ERROR level shows information related to parsing errors.

The logging module is part of CPython's standard library.
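
A slightly fuller sketch of level-based filtering, where parser and data are assumed from context as in the quoted snippet:

import logging

# The logger's level controls how much of the parse trace you see:
# ERROR shows only parse errors, INFO adds rule reductions, and
# DEBUG adds the parsing stack and token shifts.
logging.basicConfig(format="%(levelname)s: %(message)s")
log = logging.getLogger()
log.setLevel(logging.INFO)

result = parser.parse(data, debug=log)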

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow