ANTLR v4: Same character has different meaning in different contexts

https://stackoverflow.com/questions/18136624

24-06-2022

Question

This is my first crack at parser generators and, consequently, at ANTLR. I'm using ANTLR v4 to generate a simple practice parser for Morse code with the following extra rules:

A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
- ex.: ^... denotes a capital 'S'
Special characters can be embedded in parentheses
- ex.: (@)
Each encoded entity will be separated by whitespace

So I could encode the following sentence:

ABC a@b.com

as (with corresponding letters shown underneath):

^.- ^-... ^-.-. ( ) .- (@) -... (.) -.-. --- --
A   B     C     ' ' a  '@' b    '.' c    o   m

Note in particular the following two entities: ( ) (which denotes a space) and (.) (which denotes a period).

There is mainly one thing that I'm finding hard to wrap my head around: the same character can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR to discard whitespace, but not in the ( ) case. Also, a Morse code character consists of dots and dashes (periods and dashes), yet I don't want the period in (.) to be treated as part of a Morse code character; inside parentheses it should just be "any character".

Here is the grammar I have got so far:

grammar MorseCode;

file: entity*;

entity:
      special
    | morse_char;

special: '(' SPECIAL ')';

morse_char: '^'? (DOT_OR_DASH)+;

SPECIAL     : .; // match any character
DOT_OR_DASH : ('.' | '-');

WS          : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)

When I try it against the following input:

^... --- ...(@)

I get the following output (from grun ... -tokens):

[@0,0:0='^',<1>,1:0]
[@1,1:1='.',<4>,1:1]
...
[@15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH

It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?

Solution

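The lexer runs before the parser and knows nothing about parser context, so it cannot treat '.' one way inside parentheses and another way outside. In your grammar, SPECIAL and DOT_OR_DASH both match a single '.', and when two lexer rules match the same amount of input, the one defined first wins, so every '.' becomes a SPECIAL token. Your (@) syntax behaves like a quoted string in other programming languages, so I would match the whole parenthesized form as one token and start by defining SPECIAL as: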

SPECIAL : '(' .*? ')';

To ensure that ". ." (two characters) and ".." (one character) are actually different, match each unbroken run of dots and dashes as a single token:

SYMBOL : [.-]+;

Then you can define your ^ operator:

CARET : '^';

With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:

file
  : entity* EOF
  ;

entity
  : morse_char
  | SPECIAL
  ;

morse_char
  : CARET? SYMBOL
  ;
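
Putting the pieces together, the complete grammar might look like the sketch below. It is assembled from the rules above plus the WS rule from the question; the ordering of the lexer rules and the comments are assumptions added for illustration, not part of the original answer. Because the whole parenthesized form is matched as one SPECIAL token, the space in ( ) and the period in (.) never reach the WS or SYMBOL rules.

```antlr
grammar MorseCode;

// Parser rules

file
  : entity* EOF      // EOF makes the parser consume the whole input
  ;

entity
  : morse_char
  | SPECIAL
  ;

morse_char
  : CARET? SYMBOL    // optional '^' marks a capital letter
  ;

// Lexer rules

// Non-greedy match: '(' up to the first ')'. Whitespace and '.' inside
// the parentheses become part of this token.
SPECIAL : '(' .*? ')';

// One unbroken run of dots and dashes is one Morse code character.
// If your ANTLR version complains about the trailing hyphen,
// escaping it as [.\-]+ should work.
SYMBOL  : [.-]+;

CARET   : '^';

// Whitespace between entities only separates tokens and is discarded.
WS      : [ \t\r\n]+ -> skip;
```

With this grammar, the sample input ^... --- ...(@) from the question should tokenize as CARET, SYMBOL, SYMBOL, SYMBOL, SPECIAL, followed by EOF. Note that the text of a SPECIAL token still includes the surrounding parentheses, so code that consumes the parse tree would need to strip the first and last character to get the special character itself.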

Licensed under: CC-BY-SA with attribution
