Neat way to distinguish identifiers and variable names (ANTLR)?
27-10-2019
Question
How can we distinguish a variable name from an identifier in an ANTLR grammar?
VAR: ('A'..'Z')+ DIGIT* ;
IDENT : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')*;
The piece of grammar (in ANTLR) does not work because the compiler will complain that IDENT may never be reached for some input. This seems to be a classic headache for compiler writers: the lexer hack.
ANTLR users, could you tell me your neat way to work around it? Thanks.
Solution
zell wrote:
The piece of grammar (in ANTLR) does not work because the compiler will complain that IDENT may never be reached for some input.
No, that is not correct. The following grammar:
grammar T;
parse
: .* EOF
;
VAR : ('A'..'Z')+ DIGIT* ;
IDENT : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')*;
fragment DIGIT : '0'..'9';
does not produce any error or warning. The lexer simply creates two types of tokens:
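- if the input consists only of one or more uppercase ASCII letters, optionally followed by digits (for example BAR or BAR42), a VAR is created;
- if the input starts with a letter or underscore, followed by ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')*, and VAR cannot match all of it (for example bar, Bar or BAR_1), an IDENT is created.
Both rules can match input such as BAR; in that case VAR wins because it is defined first. Note that an IDENT can therefore never consist solely of uppercase ASCII letters optionally followed by digits: such input will always become a VAR.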
So, if you have a parser rule that looks like:
foo
: IDENT
;
and the entire input is "BAR", then there will be a parser error because the lexer will not produce an IDENT token, but a VAR token, even though the parser "asks" for an IDENT.
You must understand that no matter what the parser asks from the lexer, the lexer operates independently from the parser.
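One possible way to work around this, sketched here under the assumption that the parser should accept either token wherever a name is allowed, is to add a parser rule that matches both token types. The rule name `name` is hypothetical and not part of the original grammar; the lexer rules are unchanged.

```antlr
grammar T;

parse
 : name EOF
 ;

// Hypothetical helper rule: accept either token type wherever a name may appear.
// The lexer still decides between VAR and IDENT on its own; the parser simply
// allows both at this position.
name
 : IDENT
 | VAR
 ;

VAR   : ('A'..'Z')+ DIGIT* ;
IDENT : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')* ;
fragment DIGIT : '0'..'9' ;
```

With such a rule, the input "BAR" (a VAR token) and the input "bar" (an IDENT token) would both be accepted by name, and rules that really need a VAR can still reference VAR directly.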