Incorrect Result When ANTLR4 Lexer Action Invokes getText()

https://stackoverflow.com/questions/22576034

antlr4

19-06-2023

Question

It seems that getText() in a lexer action does not correctly return the text of the token being matched. Is this normal behaviour? For example, part of my grammar has these rules for parsing a C++-style identifier, which supports a \u sequence to embed Unicode characters in the identifier name:

grammar CPPDefine;
cppCompilationUnit: (id_token|ALL_OTHER_SYMBOL)+ EOF;
id_token:IDENTIFIER //{System.out.println($text);}
;
CRLF: '\r'? '\n' -> skip; 
ALL_OTHER_SYMBOL: '\\';
IDENTIFIER: (NONDIGIT (NONDIGIT | DIGIT)*) 
  {System.out.println(getText());}
;
fragment DIGIT: [0-9];
fragment NONDIGIT: [_a-zA-Z]  | UNIVERSAL_CHARACTER_NAME ;
fragment UNIVERSAL_CHARACTER_NAME: ('\\u' HEX_QUAD  | '\\U' HEX_QUAD HEX_QUAD ) ;
fragment HEX_QUAD: [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f];

I tested it with this one-line input, which contains an identifier with an invalid Unicode escape sequence:

dkk\uzzzz

The $text of the id_token parser rule action produces this correct result:

dkk
uzzzz

That is, the input is interpreted as two identifiers separated by the symbol '\' (which no parser rule action prints).

However, getText() in the IDENTIFIER lexer rule action produces this incorrect result:

dkk\u
uzzzz

Why is getText() in the IDENTIFIER lexer rule different from $text in the id_token parser rule? After all, the parser rule contains only this one lexer rule.

EDIT:

The issue is observed in ANTLR 4.1 but not in ANTLR 4.2, so it may already have been fixed.

Solution

It's hard to tell based on your example, but my instinct is you are using an old version of ANTLR. I am unable to reproduce this issue in ANTLR 4.2.
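
If you want to check which split the rules should produce independently of the ANTLR version, here is a rough plain-Java sketch. It is not ANTLR code: it only mimics the IDENTIFIER, CRLF and ALL_OTHER_SYMBOL rules from the question using java.util.regex. The class name IdentifierSketch and the method tokenize are made up for illustration, and throwing on unmatched input is a simplification of what a real lexer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not generated by ANTLR: a hand-written imitation of the lexer rules.
public class IdentifierSketch {
    // fragment UNIVERSAL_CHARACTER_NAME: '\\u' HEX_QUAD | '\\U' HEX_QUAD HEX_QUAD
    private static final String UCN = "\\\\u[0-9A-Fa-f]{4}|\\\\U[0-9A-Fa-f]{8}";
    // fragment NONDIGIT: [_a-zA-Z] | UNIVERSAL_CHARACTER_NAME
    private static final String NONDIGIT = "(?:[_a-zA-Z]|" + UCN + ")";
    // IDENTIFIER: NONDIGIT (NONDIGIT | DIGIT)*
    private static final Pattern IDENTIFIER =
            Pattern.compile(NONDIGIT + "(?:" + NONDIGIT + "|[0-9])*");
    // CRLF: '\r'? '\n' -> skip
    private static final Pattern CRLF = Pattern.compile("\\r?\\n");

    // Returns the text of every non-skipped token, in input order.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher id = IDENTIFIER.matcher(input);
        Matcher nl = CRLF.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            if (id.region(pos, input.length()).lookingAt()) {
                tokens.add(id.group());   // text an IDENTIFIER token should carry
                pos = id.end();
            } else if (nl.region(pos, input.length()).lookingAt()) {
                pos = nl.end();           // CRLF is skipped, so no token is recorded
            } else if (input.charAt(pos) == '\\') {
                tokens.add("\\");         // ALL_OTHER_SYMBOL: a lone backslash
                pos++;
            } else {
                // Simplification: a real lexer would report an error and carry on.
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same input as in the question; the backslash is doubled for the Java literal.
        System.out.println(tokenize("dkk\\uzzzz")); // prints [dkk, \, uzzzz]
    }
}
```

The first identifier ends at dkk because the escape after it is not followed by four hex digits, so the backslash can only be matched as ALL_OTHER_SYMBOL. That agrees with the $text output from the parser rule in the question and with what ANTLR 4.2 reports.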

Licensed under: CC-BY-SA with attribution
