Lexers w/ Phrase Tokens

https://stackoverflow.com/questions/23255740

antlr4

08-07-2023

Question

I'm experimenting with ANTLR4 on a grammar that would best be tokenized into phrases rather than words (i.e., most of the tokens may contain spaces). In some cases, however, I want to capture specific substring phrases as individual tokens. Consider the following example:

Occurrence A of Encounter Performed

The phrase "Occurrence A of" is special-- whenever I see it, I want to pull it out. The rest of the statement ("Encounter Performed") is fairly arbitrary and for the purposes of this example, could be anything.

For this example, I've whipped up this quick grammar:

grammar test;

stat: OCCURRENCE PHRASE;

OCCURRENCE: 'Occurrence' LABEL 'of' ;
fragment LABEL: [A-Z] ;
PHRASE: (WORD ' ')* WORD ;
fragment WORD: [a-zA-Z\-]+ ;
WS: [ \t\n\r]+ -> skip ;

If I test it against the statement above, it fails ("line 1:0 missing OCCURRENCE at 'Occurrence A of Encounter Performed'"). I believe this is because the lexer will match on the token that can consume the most consecutive characters (PHRASE, in this case).
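To make the longest-match behavior concrete, here is a minimal Python sketch. It is not ANTLR; it only imitates the "longest match wins, earlier rule wins a tie" selection with ordinary regular expressions. The names QUESTION_RULES and next_token are made up for illustration, and the regexes are hand-written approximations of the rules above. It also assumes that Python's greedy matching finds the longest match for each pattern, which holds for these patterns but not for regexes in general.

```python
import re

# Hypothetical regex stand-ins for the question's lexer rules, in grammar order.
# OCCURRENCE has no spaces because a lexer rule matches characters literally;
# the WS -> skip rule does not apply inside another rule.
QUESTION_RULES = [
    ("OCCURRENCE", re.compile(r"Occurrence[A-Z]of")),
    ("PHRASE", re.compile(r"(?:[a-zA-Z\-]+ )*[a-zA-Z\-]+")),
    ("WS", re.compile(r"[ \t\n\r]+")),
]

def next_token(text, pos, rules=QUESTION_RULES):
    """Return (rule_name, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the best so far,
        # so the earlier rule wins a tie.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("Occurrence A of Encounter Performed", 0))
```

In this sketch PHRASE wins with the entire line as its text. Even if OCCURRENCE were written to allow the spaces, it could match only the 15 characters of "Occurrence A of", while PHRASE matches all 35, so PHRASE would still win.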

So... I understand the problem-- I'm just not clear on the best solution. Is it possible? Or do I need to just live with a lexer that matches on word boundaries and a parser that puts them together into phrases? I prefer doing it in the lexer because the phrase (like "Encounter Performed") is really intended to be a single unit.

I'm new to ANTLR (and lexers/parsers in general), so please forgive me if the solution is easy! So far, however, I haven't been able to find an answer. Thanks for your help!

Solution

While there is a way to do what you wish in the lexer**, on such a simple grammar it is unlikely to be worth the effort. Also, by packing everything into a single token, you will eventually be forced to dig around in the token string by hand just to pick out the value of the LABEL.
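As a rough illustration of that digging, this is the kind of helper you might end up writing if the whole phrase arrived as one token. It is a sketch under assumptions: label_from_occurrence is a hypothetical name, and it assumes the token text is exactly three whitespace-separated parts.

```python
def label_from_occurrence(token_text):
    """Extract the label from text such as 'Occurrence A of' (hypothetical helper)."""
    parts = token_text.split()
    # Re-check by hand the structure the lexer already recognized once.
    if len(parts) != 3 or parts[0] != "Occurrence" or parts[2] != "of":
        raise ValueError(f"unexpected OCCURRENCE token text: {token_text!r}")
    return parts[1]

print(label_from_occurrence("Occurrence A of"))
```

The helper repeats work that a parser rule with a labeled element would do for you.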

You can still define semantically appropriate rules -- rules that reflect what you consider to be 'tokens' -- just as simple, 'lower-level' parser rules:

stat: occurrence phrase ;

occurrence: OCCURRENCE label=WORD OF ; 
phrase: WORD+ ; 

OCCURRENCE: 'Occurrence' ;
OF: 'of' ;
WORD: [a-zA-Z\-]+ ;
WS: [ \t\n\r]+ -> skip ;
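To see what the parser would receive from word-level lexer rules like these, here is a small Python imitation of the tokenizing step. Again, this is not ANTLR: RULES and tokenize are illustrative names, and the regexes are hand-written stand-ins for the four lexer rules, tried with "longest match wins, earlier rule wins a tie".

```python
import re

# Hypothetical regex stand-ins for the answer's lexer rules, in grammar order.
RULES = [
    ("OCCURRENCE", re.compile(r"Occurrence")),
    ("OF", re.compile(r"of")),
    ("WORD", re.compile(r"[a-zA-Z\-]+")),
    ("WS", re.compile(r"[ \t\n\r]+")),
]

def tokenize(text):
    """Split text into (rule_name, lexeme) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer matches win; on a tie the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m = best
        if name != "WS":  # imitate '-> skip'
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens

for token in tokenize("Occurrence A of Encounter Performed"):
    print(token)
```

The keywords come out as their own token types because they are listed before WORD and tie with it on length. A longer word such as "often" still becomes a WORD, since the longer match wins. The parser rules then regroup the words: "A" becomes the label, and "Encounter Performed" becomes the phrase.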

** If you really want to, you can implement a lexer mode and, using the 'more' command, consume the OCCURRENCE... string into a single token. Note that modes are only allowed in a lexer grammar, so this would have to live in a separate 'lexer grammar' file rather than in a combined grammar. This is untested -- I think "more" will work as shown, but if not you will need to pack the token text yourself. In any event, it illustrates the potential complexity of what you stated you wished to do.

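// Start collecting text; 'more' carries it into the token emitted later.
OCCURRENCE: 'Occurrence' -> pushMode(stuff), more ;

mode stuff ;

// No 'more' on the closing rule, otherwise the collected text would run into
// whatever matches next. type(...) labels the emitted token as OCCURRENCE.
OCCURRENCE_END: 'of' -> popMode, type(OCCURRENCE) ;
OTHER: . -> more ;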

Licensed under: CC-BY-SA with attribution
