Question

I have some experience writing parsers with ANTLR and I am trying (for self-education :) ) to port one of them to PEG (Parsing Expression Grammar).

As I am trying to get a feel for the idea, one thing strikes me as cumbersome, to the degree that I feel I have missed someting: How to deal with whitespace.

In ANTLR, the normal way to deal with whitespace and comments were to put the tokens in a hidden channel, but with PEG grammars there is no tokenization step. Considering languages such as C or Java, where comments are allowed almost everywhere, one would like to "hide" the comments right away, but since the comments may have semantic meaning (for example when generating code documentation, class diagrams, etc), one would not just like to discard them.

So, is there a way to deal with this?

Was it helpful?

Solution

Because there is no separate tokenization phase, there is no "time" to discard certain characters (or tokens).

Since you're familiar with ANTLR, think of it like this: let's say ANTLR handles only PEG. So you only have parser rules, no lexer rules. Now how would you discard, say, spaces? (you can't).

So, the answer to you question is: you can't, you'll have to litter your grammar with space-rules in the PEG:

ANTLR

add_expr
 : Num Add Num
 ;

Add   : '+';
Num   : '0'..'9'+;
Space : ' '+ {skip();};

PEG

add_expr
 : num _ '+' _ num
 ;

num : '0'..'9'+;
_   : ' '*;

OTHER TIPS

It is possible to nest PEG parsers. The idea is that the first parsers consumes characters and feeds tokens to the second parser. The second PEG parser consumes tokens and does the real work.

Of course this means that you give up one advantage of Parsing Expression Grammar compared to other parsing schemes: The simplicity of PEG.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top