ANTLR lexer rule consumes too much

https://stackoverflow.com/questions/23601038

20-07-2023
Question

ANTLR Lexer Rule Design

I have a requirement for the following token:

Allowable characters are uppercase letters, lowercase letters, digits, spaces, and hyphens
Variable length, with a minimum of two characters
Token must contain at least one space or hyphen
Token must start and end with an uppercase letter, lowercase letter, digit, or hyphen (it cannot begin or end with a space)
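
To make the requirements concrete, here is a minimal Python sketch that checks a candidate string against them exactly as stated. It is separate from the grammar and only meant for trying out inputs; the names `ALLOWED` and `is_valid_token` are made up for illustration.

```python
import string

# Characters the token may contain: letters, digits, space, hyphen.
ALLOWED = set(string.ascii_letters + string.digits + " -")


def is_valid_token(text: str) -> bool:
    """Return True if `text` satisfies the four stated requirements."""
    return (
        len(text) >= 2                        # at least two characters
        and all(ch in ALLOWED for ch in text)  # only allowable characters
        and any(ch in " -" for ch in text)     # at least one space or hyphen
        and text[0] != " "                     # must not begin with a space
        and text[-1] != " "                    # must not end with a space
    )
```

Under these rules "WATER TRANSPORTATION" is accepted, while "WATER" (no space or hyphen) and "WATER TRANSPORTATION " (trailing space) are rejected.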

The ANTLR lexer rule "AlphaNumericSpaceHyphen" in the grammar below almost works except for one case. Using the parser rule "sic" to test, the following input will parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION[4400]"

The following input fails to parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION [4400]"

The issue is that the lexer rule "AlphaNumericSpaceHyphen" consumes the space after "WATER TRANSPORTATION" and runs into the left square bracket before the lexer realizes that there is no match because it went too far.

I have experimented with various types of predicates and lookaheads without any luck. Any help is greatly appreciated.

grammar T;

sic: SICSpecifier AlphaNumericSpaceHyphen  LEFTBRACKET Digits RIGHTBRACKET;

LEFTBRACKET  
:   '[';  

RIGHTBRACKET 
:   ']';

SICSpecifier: 'STANDARD INDUSTRIAL CLASSIFICATION:';

WS : (' '|'\t')+ 
{   
  $channel = HIDDEN;  
};  

fragment UCASEALPHA : 'A'..'Z';
fragment LCASEALPHA : 'a'..'z';
fragment DIGIT : '0'..'9';
Digits: DIGIT+;

AlphaNumericSpaceHyphen 
:           (UCASEALPHA|LCASEALPHA |DIGIT|'-')+  (' ' (UCASEALPHA|LCASEALPHA |DIGIT|'-')+)+   
        |   (UCASEALPHA|LCASEALPHA |DIGIT)+ ('-')+  ((' '|UCASEALPHA|LCASEALPHA |DIGIT|'-')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?
        |   ('-')+ (UCASEALPHA|LCASEALPHA |DIGIT)+  ((UCASEALPHA|LCASEALPHA |DIGIT|'-'|' ')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?   
        ;

Solution

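Unfortunately, lexer rules do not backtrack: once the rule has consumed the space, the lexer does not give it back when the rest of the rule fails to match. Take a look at this related question:

ANTLR lexer rule consumes characters even if not matched?

You can try adapting your grammar so that the rule changes the type of the token it emits, as suggested in the solution to that question.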
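
To illustrate why a lookahead guard avoids the problem, here is a small hand-written scanner in Python. It is not ANTLR code and not the approach from the linked question, only a sketch of the underlying idea: a space is consumed only when the character after it can continue the token, so the scanner never has to back up. The names `WORD_CHARS` and `scan_phrase` are hypothetical.

```python
import string

# Characters that may appear inside a word: letters, digits, hyphen.
WORD_CHARS = set(string.ascii_letters + string.digits + "-")


def scan_phrase(text: str, start: int = 0) -> int:
    """Return the index just past the longest phrase beginning at `start`."""
    i = start
    while i < len(text):
        ch = text[i]
        if ch in WORD_CHARS:
            i += 1
        elif (
            ch == " "
            and i > start                      # a phrase cannot begin with a space
            and i + 1 < len(text)
            and text[i + 1] in WORD_CHARS      # one-character lookahead
        ):
            i += 1  # safe to consume: the phrase continues after the space
        else:
            break   # stop before anything that cannot belong to the phrase
    return i


text = "WATER TRANSPORTATION [4400]"
end = scan_phrase(text)
print(repr(text[:end]))  # 'WATER TRANSPORTATION'
```

The space before the bracket is left unconsumed, so it can still be matched as whitespace. In an ANTLR grammar, the equivalent guard would have to be written as a predicate in front of the space.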

Hope this helps.

Licensed under: CC-BY-SA with attribution