Question

Forgive me, I'm completely new to parsing and lex/yacc, and I'm probably in way over my head, but nonetheless:

I'm writing a pretty basic calculator with PLY, but its input might not always be an equation, and I need to determine whether it is one while parsing. At one extreme the input is a well-formed equation, which parses fine and gets calculated; at the other it is nothing like an equation, which fails to parse, and that's also fine.

The gray area is input that has equation-like parts, which the parser will grab and work out. This isn't what I want: I need to be able to tell if parts of the string didn't get picked up and tokenized, so I can raise an error, but I have no idea how to do this.

Does anyone know how I can define, basically, a 'catch anything that's left' token? Or is there a better way I can handle this?


Solution

There is a built-in error token in yacc. You would normally do something like:

line: goodline | badline ;

badline : error '\n' /* Error-handling action, if needed */

goodline : equation '\n' ;

Any line that doesn't match equation will be handled by badline.

You might want to call yyerrok in the error-handling action to ensure error processing is reset for the next line.
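
In PLY terms, a minimal sketch of the same idea might look like the code below. The token names and the tiny expression grammar are illustrative assumptions rather than anything from the question, and parser.errok() is PLY's counterpart to yyerrok.

import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'PLUS', 'MINUS', 'NEWLINE')

t_PLUS = r'\+'
t_MINUS = r'-'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_NEWLINE(t):
    r'\n'
    return t

def t_error(t):
    # Skip characters the lexer doesn't recognise; the parser's error
    # rule below rejects the surrounding line.
    t.lexer.skip(1)

def p_line_good(p):
    'line : expression NEWLINE'
    print("result:", p[1])

def p_line_bad(p):
    'line : error NEWLINE'
    # The built-in 'error' token soaks up any line that is not an expression.
    print("not an equation")
    parser.errok()  # PLY's equivalent of yyerrok: leave error-recovery mode

def p_expression_binop(p):
    '''expression : expression PLUS NUMBER
                  | expression MINUS NUMBER'''
    p[0] = p[1] + p[3] if p[2] == '+' else p[1] - p[3]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]

def p_error(p):
    pass  # recovery and reporting happen in p_line_bad

lexer = lex.lex()
parser = yacc.yacc()

parser.parse("1 + 1\n", lexer=lexer)        # prints: result: 2
parser.parse("1 + what? 1\n", lexer=lexer)  # prints: not an equation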

OTHER TIPS

Define an end-of-input token, and make your lexer emit it when it reaches the end of the input.

So before, if you had these tokens:

'1' 'PLUS' '1'

You'll now have:

'1' 'PLUS' '1' 'END_OF_INPUT'

Now, you can define your top-level rule in your parser. Instead of (for example):

Equation ::= EXPRESSION

You'll have

Equation ::= EXPRESSION END_OF_INPUT

Obviously you'll have to rewrite these in PLY syntax, but this should get you most of the way.
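
PLY's generated lexer does not emit such a token on its own, so one way to get it is to wrap the lexer and append one synthetic token after the real tokens run out. The sketch below is a rough illustration under that assumption; the END_OF_INPUT name and the EOFWrapper class are made up here, you would also need to list END_OF_INPUT in your tokens tuple and use it in the top-level rule ('equation : expression END_OF_INPUT').

import ply.lex as lex

class EOFWrapper:
    """Wraps a PLY lexer and emits one END_OF_INPUT token after its last token."""
    def __init__(self, lexer):
        self.lexer = lexer
        self.sent_eof = False

    def input(self, data):
        self.sent_eof = False
        self.lexer.input(data)

    def token(self):
        tok = self.lexer.token()
        if tok is None and not self.sent_eof:
            # Real tokens are exhausted; hand the parser one sentinel token.
            self.sent_eof = True
            tok = lex.LexToken()
            tok.type = 'END_OF_INPUT'
            tok.value = None
            tok.lineno = self.lexer.lineno
            tok.lexpos = len(self.lexer.lexdata or '')
        return tok

# Usage: parser.parse(text, lexer=EOFWrapper(lex.lex()))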

I typically use a separate 'command reader' to obtain a complete command - probably a line in your case - into a host variable string, and then arrange for the lexical analyzer to analyze that string, including telling me when it didn't reach the end. This is hard to set up, but it makes some classes of error reporting easier. One of the places where I've used this technique routinely has multi-line commands with three comment conventions, two sets of quoted strings, and some other nasties to set my teeth on edge (context-sensitive tokenization - yuck!).

Otherwise, Don's advice with the Yacc 'error' token is good.
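
For what it's worth, here is a rough PLY-flavoured sketch of that command-reader idea; the token set and the saw_garbage flag are assumptions for illustration. The lexer records whether it had to skip anything, so the caller knows when the line didn't fully tokenize.

import ply.lex as lex

tokens = ('NUMBER', 'PLUS')
t_PLUS = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.saw_garbage = True  # remember that something didn't tokenize
    t.lexer.skip(1)

lexer = lex.lex()

def read_command(line):
    """Tokenize one complete command; report whether all of it was recognised."""
    lexer.saw_garbage = False
    lexer.input(line)
    toks = list(lexer)  # a PLY lexer is iterable
    return toks, not lexer.saw_garbage

toks, clean = read_command("1 + 1 banana")
print(clean)  # False: 'banana' was never tokenized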

It looks like you've already found a solution, but I'll add another suggestion in case you or others are interested in an alternative approach.

You say you are using PLY, but is that because you want the compiler to run in a Python environment? If so, you might consider other tools as well. For jobs like this I often use ANTLR (http://www.antlr.org), which has a Python code generator. ANTLR has lots of tricks for doing things like eating a bunch of input at the lexer level so the parser never sees it (e.g. comments), the ability to call a sub-rule (e.g. equation) within a larger grammar (which should terminate once the rule has been matched without processing any more input, which sounds somewhat like what you want to do), and a very nice left-factoring algorithm.

ANTLR's parsing capability combined with the StringTemplate engine (http://www.stringtemplate.org) makes a nice combination, and both support Python (among many other languages).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow