Neat way to distinguish identifiers and variable names (ANTLR)?

https://stackoverflow.com/questions/8357860

27-10-2019

Question

How can we distinguish a variable name from an identifier in an ANTLR grammar?

VAR: ('A'..'Z')+ DIGIT*  ;
IDENT  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')*;

The piece of grammar (in ANTLR) does not work because the compiler will complain that IDENT may never be reached for some input. This seems to be a classic headache for compiler writers: the lexer hack.

For the ANTLR users: could you tell me your neat way to work around it? Thanks.

Solution

zell wrote:

The piece of grammar (in ANTLR) does not work because the compiler will complain that IDENT may never be reached for some input.

No, that is not correct. The following grammar:

grammar T;

parse
  :  .* EOF
  ;

VAR   : ('A'..'Z')+ DIGIT*  ;
IDENT : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')*;

fragment DIGIT : '0'..'9';

does not produce any error or warning. The lexer simply creates two types of tokens:

if something starts with one or more uppercase ASCII letters followed by zero or more digits, a VAR is created;
if something starts with a lowercase ASCII letter or an underscore, followed by ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')*, an IDENT is created.

Note that an IDENT can therefore never start with an uppercase ASCII letter: that will always become a VAR.

So, if you have a parser rule that looks like:

foo
  :  IDENT
  ;

and the entire input is "BAR", then there will be a parser error because the lexer will not produce an IDENT token, but a VAR token, even though the parser "asks" for an IDENT.

You must understand that no matter what the parser asks from the lexer, the lexer operates independently from the parser.
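
One possible way to live with this, sketched here as an assumption rather than as part of the original answer, is to let a parser rule accept either token type wherever any kind of name is allowed. The rule name `name` is hypothetical; the lexer rules are copied unchanged from the grammar above, and the syntax assumes the same ANTLR version as that grammar.

```antlr
grammar T;

parse
  :  name EOF
  ;

// Hypothetical parser rule: accept whichever token type the lexer
// decided on, since the parser cannot influence that decision.
name
  :  IDENT
  |  VAR
  ;

VAR   : ('A'..'Z')+ DIGIT* ;
IDENT : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')* ;

fragment DIGIT : '0'..'9' ;
```

With a rule like this, the input "BAR" should no longer cause a parser error: the lexer still hands over a VAR, but `name` accepts it. If the difference between the two kinds of names matters later, the token type is still available to the code that processes the matched token.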

Licensed under: CC-BY-SA with attribution
