ANTLR lexer -- can one prefer the shorter match?

https://stackoverflow.com/questions/20601856

antlr
lexer

02-09-2022

Question

Here is a simple lexer grammar:

lexer grammar TextLexer;

@members
{
protected const int EOF = Eof;
protected const int HIDDEN = Hidden;
}

COMMENT: 'comment' .*? 'end' -> channel(HIDDEN);
WORD: [a-z]+ ;

WS
:   ' ' -> channel(HIDDEN)
;

For the most part, it behaves as expected: it picks the words out of the stream and ignores anything bounded by comment ... end. But not always. For example, take the following input:

quick brown fox commentandending

Here the lexer sees that the WORD match "commentandending" is longer than the COMMENT match "commentandend", so it produces the single token "commentandending" rather than a hidden comment followed by the token "ing".

Is there a way to change that behavior?
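
The behavior described above comes from the longest-match rule: at each position the lexer keeps whichever rule consumes the most input, and the non-greedy comment rule stops at the first "end". Below is a minimal Python sketch of that rule, using regular expressions as stand-ins for the three lexer rules. It is only an illustration, not ANTLR's implementation; RULES and tokenize_longest_match are hypothetical names.

```python
import re

# Hypothetical stand-ins for the grammar's rules, in declaration order.
RULES = [
    ("COMMENT", re.compile(r"comment.*?end", re.DOTALL)),  # non-greedy body
    ("WORD", re.compile(r"[a-z]+")),
    ("WS", re.compile(r" ")),
]

def tokenize_longest_match(text):
    """At each position keep the longest match; ties go to the earlier rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

if __name__ == "__main__":
    for token in tokenize_longest_match("quick brown fox commentandending"):
        print(token)
```

With the sample input, the last token is the 16-character WORD "commentandending", because it is longer than the 13-character COMMENT candidate "commentandend".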

Solution

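The following ANTLR 4 grammar addresses the problem with lexer modes. In the default mode, 'comment' (seven characters) is a longer match than the single letter accepted by WORD_BEGIN, so a comment always takes precedence over a word that starts at the same position. The more command carries the matched text into the token that is completed in the pushed mode. One caveat: WORD_BEGIN consumes the first letter and WORD then requires at least one more, so a one-letter word is not matched as written.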

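lexer grammar TextLexer;

// 'comment' (7 characters) beats the one-letter WORD_BEGIN, so a comment
// wins whenever both could start here. 'more' keeps the text in the pending token.
COMMENT_BEGIN: 'comment' -> more, pushMode(MCOMMENT);
WORD_BEGIN: [a-z] -> more, pushMode(MWORD);

WS: ' ' -> channel(HIDDEN);

mode MCOMMENT;
// Non-greedy: stop at the first 'end'. popMode undoes the pushMode above;
// channel(HIDDEN) keeps comments hidden, as in the question's grammar.
COMMENT: .+? 'end' -> popMode, channel(HIDDEN);

mode MWORD;
WORD: [a-z]+ -> popMode;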
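
To see how the mode switch changes the outcome, here is a rough Python simulation of the idea, assuming regular expressions in place of the lexer rules. It is a sketch rather than ANTLR's API: MODES, the "more:..." and "emit" action strings, and tokenize_with_modes are hypothetical names, and "emit" simply returns to the default mode.

```python
import re

# Hypothetical stand-ins for the grammar's modes; this is not ANTLR's API.
# Action "more:<MODE>" keeps the matched text and continues in <MODE>;
# "emit" finishes the pending token and returns to DEFAULT.
MODES = {
    "DEFAULT": [
        ("COMMENT_BEGIN", re.compile(r"comment"), "more:MCOMMENT"),
        ("WORD_BEGIN", re.compile(r"[a-z]"), "more:MWORD"),
        ("WS", re.compile(r" "), "emit"),
    ],
    "MCOMMENT": [("COMMENT", re.compile(r".+?end", re.DOTALL), "emit")],
    "MWORD": [("WORD", re.compile(r"[a-z]+"), "emit")],
}

def tokenize_with_modes(text):
    pos, start, mode, tokens = 0, 0, "DEFAULT", []
    while pos < len(text):
        best = None  # (end offset, token name, action)
        for name, pattern, action in MODES[mode]:
            m = pattern.match(text, pos)
            # Longest match wins; ties go to the earlier rule.
            if m and (best is None or m.end() > best[0]):
                best = (m.end(), name, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in mode {mode}")
        pos, name, action = best
        if action.startswith("more:"):
            mode = action.split(":", 1)[1]  # token start is left unchanged
        else:
            tokens.append((name, text[start:pos]))
            start, mode = pos, "DEFAULT"
    if mode != "DEFAULT":
        raise ValueError("input ended inside an unfinished token")
    return tokens

if __name__ == "__main__":
    for token in tokenize_with_modes("quick brown fox commentandending"):
        print(token)
```

For the sample input, the simulation yields the COMMENT "commentandend" followed by the WORD "ing". It also reproduces the grammar's limitation: an input containing a one-letter word raises an error, because the WORD pattern needs a second letter.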

Licensed under: CC-BY-SA with attribution
