Dealing with overloaded symbols in ambiguous grammars in ANTLR4

https://stackoverflow.com//questions/23062131

26-12-2019
|

Question

I am trying to write a parser for a dialect of Answer Set Programming (ASP) which, in terms of grammar, looks like Prolog with some extensions. One extension, for instance is expansion, meaning that fact(1..3). for instance is expanded in fact(1). fact(2). fact(3).. Notice that the language understands INT and FLOAT numbers and uses . also as a terminator.

In some cases the parser fails to distinguish between integers, floats, extensions and separators because I reckon the language is clearly ambiguous. In that cases, I have to explicitly separate tokens with white spaces. Any Prolog or ASP parser, however, correctly deals with such productions. I read that ANTLR4 can disambiguate problematic productions autonomously, but probably it needs some help but I don't know how to do! ;-) I read something like here and here, but apparently they did not help me.

Could somebody please tell me what to do to overcome this ambiguity? Please notice that I cannot change the language because it is quite standard. In order to simplify the experts' work, I created a minimal working example that follows.

grammar Test;

program: 
  statement* ;

statement: // DOT is the statement terminator
  range DOT |
  intNum DOT |
  floatNum DOT ;

intNum: // not needed, but helps in TestRig
  INT;

floatNum: // not needed, but helps in TestRig
  FLOAT;

range: // defines an expansion
  INT DOTS INT ;

DOTS: '..';
DOT: '.';

FLOAT: DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
INT: DIGIT+ ;

WS: [ \t\r\n]+ -> skip ;

fragment NONZERO     : [1-9] ;
fragment DIGIT       : [0] | NONZERO ;

I use the following input:

1 .
1. .
1.5 .
.5 .
1 .. 5 .

1.
1..
1.5.
.5.
1..5.

And I get the following errors which instead are parsed corrected by other tools:

line 8:0 extraneous input '1.' expecting '.'
line 11:2 extraneous input '.5' expecting '.'

Many thanks in advance!

Solution

Prolog does not accept 1. as a float. This feature makes your grammar significantly more ambiguous, so maybe try removing that feature.

OTHER TIPS

Before your DOTS rule, add a unique rule for the statement terminal dot and disambiguate the DOTS rule (and change your other rules to use the TERMINAL):

TERMINAL: DOT { isTerminal(1) }? ;
DOTS: DOT DOT { !isTerminal(2) }? ;
DOT: '.';

where the predicate method simply looks ahead on the _input character stream to see if, at the current token index, the next character is white space. Put something like this in an @member block in your grammar:

public boolean isTerminal(int la) {
    int offset = _tokenStartCharIndex + 1 + la;
    String s = _input.getText(Interval.of(offset, offset));
    if (Character.isWhitespace(s.charAt(0))) {
        return true;
    }
    return false;
}

May have to do a bit more work if whitespace is valid between a DOTS and the trailing INT.

I recommend shifting the work to the parser.

If the lexer can't decide if 1..2 is 1. .2 or 1 .. 2 leave if up to the parser.

Maybe there is a context in which it can be interpreted as the first alternative and another context in which it may be interpreted as the second alternative.

Btw: 1..2. could be interpreted as 1 .. 2 . (range) or as 1. . 2 . (floatNum, intNum). How do you want to deal with this?

The following grammar should parse everything. But note that . . is treated as dots as well as 1 . 23 is a floatNum! You can check these tough while parsing or after parsing (depending on whether it should influence the parsing or not).

grammar Test;

program: 
  statement* ;

statement: // DOT is the statement terminator
  range DOT |
  intNum DOT |
  floatNum DOT ;

intNum: // not needed, but helps in TestRig
  INT;

floatNum: 
    INT DOT INT? | DOT INT ;

range: // defines an expansion
  INT dots INT ;

dots : DOT DOT;  

DOT: '.';

INT: DIGIT+ ;

WS: [ \t\r\n]+ -> skip ;

fragment NONZERO     : [1-9] ;
fragment DIGIT       : [0] | NONZERO ;

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow