Unexpected ANTLR4 optimization with INT and FLOAT?

https://stackoverflow.com/questions/23411436

13-07-2023

Question

I am writing a parser for the output of Clasp with ANTLR 4. The typical output is like the following:

clasp version 3.0.3
Reading from stdin
Solving...
Answer: 1
bird(a) bird(b) bird(c) penguin(d) bird(d)
Optimization: 7 0
Answer: 2
bird(a) bird(b) bird(c) penguin(d) bird(d) flies_abd(b) flies(b)
Optimization: 6 5
Answer: 3
bird(a) bird(b) bird(c) penguin(d) bird(d) flies_abd(c) flies(c)
Optimization: 2 5
Answer: 4
bird(a) bird(b) bird(c) penguin(d) bird(d) flies_abd(a) flies_abd(c) flies(a) flies(c)
Optimization: 1 10
Answer: 5
bird(a) bird(b) bird(c) penguin(d) bird(d) flies_abd(a) flies_abd(b) flies_abd(c) flies(a) flies(b) flies(c)
Optimization: 0 15
OPTIMUM FOUND

Models       : 5     
  Optimum    : yes
Optimization : 0 15
Calls        : 1
Time         : 0.002s (Solving: 0.00s 1st Model: 0.00s Unsat: 0.00s)
CPU Time     : 0.000s

I have to check that clasp is version 3, so I am writing a grammar like the following:

/**
 * Define a grammar for Clasp 3's output.
 */
grammar Output;

@header {package ac.bristol.clasp.parser;}

output:
    version source solving answer* result separation statistics NEWLINE* EOF;

version: 'clasp version 3.' INT '.' INT NEWLINE;

source: 'Reading from stdin' NEWLINE # sourceSTDIN
    | 'Reading from ' path NEWLINE # sourceFile;

path:
    DRIVE? folder ( BSLASH folder )* filename # pathWindows
    | FSLASH? folder ( FSLASH folder )* filename # pathNIX;

folder:
    LETTER+ # genericFolder
    | DOTDOT # parentFolder
    | DOT # currentFolder;

solving: 'Solving...' NEWLINE;

filename:
    LETTER+ extension?;

extension:
    DOT LETTER*;

answer: 'Answer: ' INT NEWLINE // 
    model? NEWLINE // 
    'Optimization: ' INT ( SPACE INT )* NEWLINE;

model:
    fact ( SPACE fact )*;

fact:
    groundPredicate;

groundTermList:
    groundTerm ( COMMA groundTerm )*;

groundTerm:
    groundCompound | STRING | number | atom; // literal?

groundCompound:
    groundPredicate
    | groundExpression;

groundPredicate:
    IDENTIFIER ( LROUND groundTermList RROUND )?;

groundExpression:
    groundBits AND groundBits
    | groundBits OR groundBits
    | groundBits XOR groundBits;

groundBits:
    groundCompare GT groundCompare
    | groundCompare GE groundCompare
    | groundCompare LT groundCompare
    | groundCompare LE groundCompare;

groundCompare:
    groundItem EQ groundItem
    | groundItem NE groundItem;

groundItem:
    groundFactor PLUS groundFactor
    | groundFactor MINUS groundFactor;

groundFactor:
    groundUnary TIMES groundUnary
    | groundUnary DIVIDE groundUnary
    | groundUnary MOD groundUnary;

groundUnary:
    TILDE groundTerm
    | MINUS groundTerm;

atom:
    IDENTIFIER
    | QUOTED;

number:
    INT
    | FLOAT;

//------------------------------------------------------------------------------

result: 'OPTIMUM FOUND' NEWLINE
    | 'SATISFIABLE' NEWLINE
    | 'UNKNOWN' NEWLINE;

separation:
    NEWLINE;

statistics:
    models optimum? optimization calls time cputime;

models: 'Models       : ' INT SPACE* NEWLINE;

optimum: '  Optimum    : yes' NEWLINE
    | '  Optimum    : no' NEWLINE;

optimization: 'Optimization : ' INT ( SPACE INT )* NEWLINE;
calls: 'Calls        : ' INT NEWLINE;
time: 'Time         : ' FLOAT 's (Solving: ' FLOAT 's 1st Model: ' FLOAT 's Unsat: ' FLOAT 's)' NEWLINE;
cputime: 'CPU Time     : ' FLOAT 's';

//------------------------------------------------------------------------------

AND:       '&';
BSLASH:    '\\';
COLON:     ':';
COMMA:     ',';
DIVIDE:    '/';
DOT:       '.';
DOTDOT:    '..';
EQ:        '==';
FSLASH:    '/';
GE:        '>=';
GT:        '>';
LE:        '<=';
LROUND:    '(';
LT:        '<';
MINUS:     '-';
MOD:       '%';
NE:        '!=';
OR:        '?';
PLUS:      '+';
RROUND:    ')';
SEMICOLON: ';';
SPACE:     ' ';
TILDE:     '~';
TIMES:     '*';
XOR:       '^';

DRIVE:      ( LOWER | UPPER ) COLON BSLASH?;
IDENTIFIER: LOWER FOLLOW*;
INT:        DIGIT+;
FLOAT:      DIGIT+ DOT DIGIT+;
NEWLINE:    '\r'? '\n';
QUOTED:     '\'' ( ~[\'\\] | ESCAPE )+? '\'';
STRING:     '"' ( ~["\\] | ESCAPE )+? '"';

fragment DIGIT:      [0] | NONZERO;
fragment ESCAPE:     '\\' [btnr"\\] | '\\' [0-3]? [0-7]? [0-7] | '\\' 'u' [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F];
fragment FOLLOW:     LOWER | UPPER | DIGIT | UNDERSCORE;
fragment LETTER:     LOWER | UPPER | DIGIT | SPACE;
fragment LOWER:      [a-z];
fragment NONZERO:    [1-9];
fragment UNDERSCORE: [_];
fragment UPPER:      [A-Z];

Notice that there is no rule to skip parts of the input stream, because I want to check every single character. Also notice that I have one terminal rule for INT and one for FLOAT; INT is defined before FLOAT, and FLOATs are defined as in Prolog.

The rule that parses the first line of the above example is the following:

version: 'clasp version 3.' INT '.' INT NEWLINE;

I wrote it this way because I have to check that the clasp major version number is 3, and then consume the rest of the line: the minor version number, a dot, the build number and the newline (with no spaces or anything else in between). Unfortunately, I get the following message, which makes me think that ANTLR is recognizing the minor version number, the dot and the build number as a FLOAT:

line 1:16 mismatched input '0.3' expecting INT

Could you please explain to me what is going on?
Am I assuming something that I shouldn't?
Or is it ANTLR that is applying an unneeded optimization?

Solution

ANTLR first breaks your input into tokens, and only then parses those tokens. Your use of 'clasp version 3.' in a parser rule implicitly defines an anonymous token that matches that string of text. The text following that token starts with 0.3, which matches a FLOAT. The lexer has no idea that the parser will be in the version rule at that point; it simply chooses the longest token starting at the current position, and 0.3 as a FLOAT is longer than 0 as an INT. I recommend the following:

Separate your grammar into a parser grammar (parser grammar OutputParser;) and a lexer grammar (lexer grammar OutputLexer;). In your parser grammar, use the tokenVocab option to indicate which lexer defines your tokens. This separation forces you to define real tokens for everything the grammar uses.
```
options {
  tokenVocab = OutputLexer;
}
```
Either use FLOAT in place of INT '.' INT in the version rule, or create a new token to represent a version:
```
VERSION
  : DIGIT+ DOT DIGIT+ DOT DIGIT+
  ;
```
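
The longest-match behaviour described above can be illustrated with a small simulation. This is only a sketch in Python, not ANTLR's actual implementation: the `tokenize` helper, the `LITERAL` rule name and both rule tables are hypothetical, and the regular expressions merely approximate the question's lexer rules. The second table assumes that, once a three-part VERSION token exists, the literal in front of it is shortened to 'clasp version '.

```python
import re

def tokenize(text, rules):
    """Split text into (rule name, lexeme) pairs, always taking the longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the rule listed first wins a tie.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Rules resembling the question's grammar; LITERAL stands in for the
# implicit token created by 'clasp version 3.' in the parser rule.
ORIGINAL_RULES = [
    ("LITERAL", r"clasp version 3\."),
    ("DOT", r"\."),
    ("INT", r"[0-9]+"),
    ("FLOAT", r"[0-9]+\.[0-9]+"),
    ("NEWLINE", r"\r?\n"),
]

# Assumed variant of the fix: a shorter literal plus a dedicated VERSION token.
FIXED_RULES = [
    ("LITERAL", r"clasp version "),
    ("VERSION", r"[0-9]+\.[0-9]+\.[0-9]+"),
] + ORIGINAL_RULES[1:]

line = "clasp version 3.0.3\n"
print(tokenize(line, ORIGINAL_RULES))
print(tokenize(line, FIXED_RULES))
```

With the original rules, the text after the literal is emitted as a single FLOAT token '0.3', which is why the parser never sees the INT it expects. With the assumed fix, '3.0.3' comes out as one VERSION token; checking that the major number is 3 would then have to happen in code after parsing rather than in the grammar.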

Licensed under: CC-BY-SA with attribution
