Question

It seems that getText() in a lexer action does not return the text of the token being matched. Is this normal behaviour? For example, part of my grammar has these rules for parsing a C++-style identifier, which supports a \u sequence for embedding Unicode characters in the identifier name:

grammar CPPDefine;
cppCompilationUnit: (id_token|ALL_OTHER_SYMBOL)+ EOF;
id_token:IDENTIFIER //{System.out.println($text);}
;
CRLF: '\r'? '\n' -> skip; 
ALL_OTHER_SYMBOL: '\\';
IDENTIFIER: (NONDIGIT (NONDIGIT | DIGIT)*) 
  {System.out.println(getText());}
;
fragment DIGIT: [0-9];
fragment NONDIGIT: [_a-zA-Z]  | UNIVERSAL_CHARACTER_NAME ;
fragment UNIVERSAL_CHARACTER_NAME: ('\\u' HEX_QUAD  | '\\U' HEX_QUAD HEX_QUAD ) ;
fragment HEX_QUAD: [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f];

I tested it with this one-line input, which contains an identifier with an invalid Unicode escape sequence:

dkk\uzzzz

The $text in the id_token parser rule action produces this correct result:

dkk
uzzzz

That is, the input is interpreted as two identifiers separated by the symbol '\' (which is not printed by any parser rule).
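To make the expected split concrete, here is a minimal hand-written sketch of the longest-match behaviour the IDENTIFIER rule describes. It is not ANTLR's implementation and does not use the ANTLR runtime; the class name IdentifierSketch and all method names are hypothetical, and characters matched by neither rule are simply skipped, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone simulation of the IDENTIFIER / ALL_OTHER_SYMBOL rules.
final class IdentifierSketch {

    // Mirrors the [0-9A-Fa-f] set used by HEX_QUAD.
    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Length of a UNIVERSAL_CHARACTER_NAME starting at pos, or 0 if none matches.
    static int ucnLength(String s, int pos) {
        if (pos + 1 >= s.length() || s.charAt(pos) != '\\') return 0;
        char kind = s.charAt(pos + 1);
        int digits;
        if (kind == 'u') digits = 4;        // '\\u' HEX_QUAD
        else if (kind == 'U') digits = 8;   // '\\U' HEX_QUAD HEX_QUAD
        else return 0;
        if (pos + 2 + digits > s.length()) return 0;
        for (int i = 0; i < digits; i++) {
            if (!isHex(s.charAt(pos + 2 + i))) return 0;
        }
        return 2 + digits;
    }

    // Length of a NONDIGIT starting at pos, or 0 if none matches.
    static int nondigitLength(String s, int pos) {
        if (pos >= s.length()) return 0;
        char c = s.charAt(pos);
        if (c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return 1;
        return ucnLength(s, pos);
    }

    // Length of the longest IDENTIFIER starting at pos, or 0 if none matches.
    static int identifierLength(String s, int pos) {
        int first = nondigitLength(s, pos);
        if (first == 0) return 0;
        int end = pos + first;
        while (end < s.length()) {
            char c = s.charAt(end);
            int step = (c >= '0' && c <= '9') ? 1 : nondigitLength(s, end);
            if (step == 0) break; // a failed escape is not consumed
            end += step;
        }
        return end - pos;
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int len = identifierLength(input, pos);
            if (len > 0) {
                tokens.add(input.substring(pos, pos + len));
                pos += len;
            } else if (input.charAt(pos) == '\\') {
                tokens.add("\\"); // ALL_OTHER_SYMBOL
                pos++;
            } else {
                pos++; // simplification: skip anything else (e.g. line breaks)
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("dkk\\uzzzz")); // prints [dkk, \, uzzzz]
    }
}
```

The point of the sketch is that when the \u escape fails to match, the identifier token ends at the last position where the rule was satisfied, so the text of the first token is dkk and not dkk\u.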

However, getText() in the IDENTIFIER lexer rule action produces this incorrect result:

dkk\u
uzzzz

Why is getText() in the IDENTIFIER lexer rule different from $text in the id_token parser rule? After all, the parser rule contains only this lexer rule.

EDIT:

I observed this issue in ANTLR 4.1 but not in ANTLR 4.2, so it may already have been fixed.

Solution

It's hard to tell from your example, but my instinct is that you are using an old version of ANTLR. I am unable to reproduce this issue in ANTLR 4.2.

Licensed under: CC-BY-SA with attribution