ANTLR v4: Can I ignore mismatched input?

https://stackoverflow.com/questions/22717130

23-06-2023

Question

I'm attempting to make a parser that recognises units of measure and then, using a listener, converts units as needed. However, when parsing a test sentence I get a mismatched input error when the parser sees the units in other parts of the text, such as in the middle of words. Here is a cut-down version of my code.

UnitsOfMeasure.g4

grammar UnitsOfMeasure;

import
    ImperialUnitsParser;

/*------------------------------------------------------------------
 * UNITS OF MEASURE PARSER RULES
 *------------------------------------------------------------------*/
include_metric_units
    : imperial_types
    | include_metric_units imperial_types
    ;

imperial_types
    : i_area
    ;

i_area
    : QUANTITY square_inch
    | QUANTITY square_feet
    | QUANTITY square_mile
    | QUANTITY square_yard
    ;

/*------------------------------------------------------------------
 * UNITS OF MEASURE - LEXER RULES
 *------------------------------------------------------------------*/
SQUARE
    : [S|s]'quare'
    | [S|s]'q' '.'?
    ;

SQUARED
    : [S|s]'quared'
    | '^2'
    | '<sup>2</sup>'
    | '&#178'
    | '\u00B2'
    ;

fragment PLURAL
    : 's'  ?
    | 'es' ?
    ;

QUANTITY
    : '-'? FLOAT
    | '-'? DIGITS
    ;

FLOAT
    : DIGITS '.' DIGITS
    ;

fragment DIGITS
    : DIGIT+
    ;

fragment DIGIT
    : '0'..'9'
    ;

/*------------------------------------------------------------------
 * SKIP EVERYTHING ELSE
 *------------------------------------------------------------------*/ 
 EVERYTHING 
    : . -> skip 
    ;

ImperialUnitsParser.g4

parser grammar ImperialUnitsParser;

import ImperialUnitsLexer;

/*------------------------------------------------------------------
 * AREA
 *------------------------------------------------------------------*/
square_inch
    : SQUARE INCH
    | INCH SQUARED
    ;

/*------------------------------------------------------------------
 * LENGTH
 *------------------------------------------------------------------*/
inch
    : INCH
    ;

ImperialUnitsLexer.g4

lexer grammar ImperialUnitsLexer;

/*------------------------------------------------------------------
 * BASE UNITS
 *------------------------------------------------------------------*/
INCH
    : [I|i]'nch' PLURAL
    | [I|i]'n' '.'?
    ;

Convert.java

public static String includeMetricUnits(String parse) throws UnitsOfMeasureParserRuntimeException
{           
    StringBuilder builder = new StringBuilder(parse);

    ANTLRInputStream in = new ANTLRInputStream(builder.toString());
    UnitsOfMeasureLexer lexer = new UnitsOfMeasureLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);

    UnitsOfMeasureParser parser = new UnitsOfMeasureParser(tokens);
    parser.addParseListener(new UnitsOfMeasureParseListener(builder));
    parser.addErrorListener(new UnitsOfMeasureErrorListener());
    parser.include_metric_units(0);
    return builder.toString();
}

So the listener here does some editing of the builder as the stream is parsed. A working example of this is the following:

"A whiteboard with 1550 square inches of writing space" returns:

"A whiteboard with 1550in²(1m²) of writing space"

However when I make this a bit more complex by adding in more than one unit it reports the following:

line 1:44 mismatched input 'in' expecting {EOF, QUANTITY}

on:

"A whiteboard with 1550 square inches of writing space, and a touchscreen measuring 775 square inches" returns:

"A whiteboard with 1550in²(1m²) of writing space, and a touchscreen measuring 775 square inches"

Following the debugger, it performs the first conversion without error and then drops out after its lookahead. I probably haven't got the recursive part quite right, but essentially the grammar is supposed to keep looking until it finds a quantity followed by a unit of measure. If the quantity is not followed by a recognised unit, it should just ignore it and continue.

From the error I can see that it picked up the 'in' in 'writing', as I have a lexer rule to recognise this as inches, but because there is no quantity it throws an error.

Can anyone help me with this issue so that I can get the grammar to ignore inputs that don't match? And can anyone tell me if I'm getting the recursive bit right so that it continues until the end of the sentence?

Solution

When you don't want to match the token INCH when it's part of another word, you'll need to match words, and skip these:

WORD
 : [a-zA-Z]+ -> skip
 ;

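Just be sure to place this rule after your INCH rule: when two lexer rules match the same number of characters, ANTLR picks the one defined first, so a WORD rule placed earlier would also claim the input "in" as a word (which you obviously don't want). Longer input such as "writing" still goes to WORD, because the lexer always prefers the longest match. You'll also want to expand the set of characters this rule matches: ASCII letters alone won't suffice.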
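To see why the rule order matters, here is a minimal sketch in plain Java that imitates how the lexer chooses between rules: the longest match wins, and a tie goes to the rule listed first. It is a toy simulation built on java.util.regex, not ANTLR itself; the class name LongestMatchDemo, the simplified rule set, and the withWordRule switch are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy tokenizer (hypothetical, not ANTLR): longest match wins, ties go to the earlier rule. */
final class LongestMatchDemo {

    // Same order as in the grammar: INCH is listed before WORD.
    private static final String[] NAMES = {"QUANTITY", "INCH", "WORD", "OTHER"};
    private static final Pattern[] RULES = {
        Pattern.compile("-?[0-9]+(\\.[0-9]+)?"),
        Pattern.compile("[Ii]nch(es|s)?|[Ii]n\\.?"),
        Pattern.compile("[a-zA-Z]+"),
        Pattern.compile("(?s).")               // catch-all, one character at a time
    };

    static List<String> tokenize(String input, boolean withWordRule) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestRule = -1;
            int bestLen = 0;
            for (int i = 0; i < RULES.length; i++) {
                if (!withWordRule && NAMES[i].equals("WORD")) {
                    continue;                  // simulate the grammar without a WORD rule
                }
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                // Strictly greater: on a tie the earlier rule keeps the match.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = i;
                    bestLen = m.end() - pos;
                }
            }
            String name = NAMES[bestRule];
            // WORD and OTHER play the role of "-> skip" rules and emit nothing.
            if (!name.equals("WORD") && !name.equals("OTHER")) {
                tokens.add(name + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("of writing space", false)); // [INCH:in]
        System.out.println(tokenize("of writing space", true));  // []
        System.out.println(tokenize("775 square inches", true)); // [QUANTITY:775, INCH:inches]
    }
}
```

Without the WORD rule, the catch-all consumes "writing" one character at a time, so the "in" inside it surfaces as an INCH token. With the WORD rule, the whole word is swallowed as one longer match, while "inches" still becomes INCH because both rules match six characters and INCH comes first.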

Also, [I|i] matches the pipe char as well: do [Ii] instead.
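The same pitfall can be reproduced with Java regular expressions, whose bracketed character classes also treat | as a literal character. The snippet below is only an analogy using java.util.regex rather than the ANTLR lexer, and the class and method names are made up for the demonstration.

```java
import java.util.regex.Pattern;

/** Hypothetical demo: inside a character class, '|' is a literal, not an alternation. */
final class CharSetPipeDemo {

    // Builds "<charSet>nch" and checks whether it matches the whole input.
    static boolean matchesInch(String charSet, String input) {
        return Pattern.matches(charSet + "nch", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesInch("[I|i]", "Inch")); // true
        System.out.println(matchesInch("[I|i]", "|nch")); // true: the pipe is in the set
        System.out.println(matchesInch("[Ii]", "|nch"));  // false
    }
}
```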

Although correct:

include_metric_units
    : imperial_types
    | include_metric_units imperial_types
    ;

it's rather LR/Bison-esque. More readable would be to write:

include_metric_units
    : imperial_types+
    ;

And to handle tokens that might be in the token stream but are not matched by any of your productions, simply match any token in your top-level rule:

parse
  :  ( include_metric_units // match metrics
     | .                    // or any "dangling" single token
     )*                     // zero or more times
     EOF                    // end of the input
  ;

include_metric_units
  :  imperial_types+
  ;

Yes, that is correct: the . (DOT) inside a production/parser rule matches a single token, not a single character. It only matches a single character in lexer rules.

When I now parse the input

A whiteboard with 1550 square inches of writing space, and 
a touchscreen measuring 775 square inches and an in at the end...

(note the 'in' at the end!), I get the following parse tree:

[Image: parse tree for the input above]

OTHER TIPS

Using a parser for a freeform language is not a good idea. What you need instead is a kind of keyword spotting: look through your input, e.g. using regular expressions, for recognizable forms of input and extract the exact values from that part of the string.

A parser needs a well-defined language, that is, a language that you can put into rules in its entirety (unambiguously). With free input text, a slightly different grammar, a filler word, a typo, etc. will completely break your parsing.
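As a rough illustration of the keyword-spotting idea, the following sketch handles only the square-inch case from the question with a single regular expression. The class name SquareInchSpotter, the exact pattern, and the output format are assumptions; the conversion factor follows from 1 inch = 0.0254 m.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based spotter: finds "<number> square inch(es)" and appends square metres. */
final class SquareInchSpotter {

    private static final double SQ_METRES_PER_SQ_INCH = 0.0254 * 0.0254;

    // A quantity, then "square"/"sq"/"sq.", then "inch"/"inches"/"in"/"in.".
    // The negative lookahead rejects units that run on into a longer word.
    private static final Pattern AREA = Pattern.compile(
        "(-?\\d+(?:\\.\\d+)?)\\s+(?:[Ss]quare|[Ss]q\\.?)\\s+"
        + "(?:[Ii]nch(?:es)?|[Ii]n\\.?)(?![A-Za-z])");

    static String includeMetricUnits(String text) {
        Matcher m = AREA.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            double sqInches = Double.parseDouble(m.group(1));
            String metric = String.format(Locale.ROOT, "%.2f", sqInches * SQ_METRES_PER_SQ_INCH);
            // Copy the untouched text, then the rewritten measurement (\u00B2 is the superscript 2).
            out.append(text, last, m.start())
               .append(m.group(1)).append("in\u00B2 (").append(metric).append("m\u00B2)");
            last = m.end();
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        System.out.println(includeMetricUnits(
            "A whiteboard with 1550 square inches of writing space, "
            + "and a touchscreen measuring 775 square inches"));
    }
}
```

Because the pattern requires a quantity in front of the unit, the "in" inside "writing" is never considered, and text that matches nothing is copied through unchanged.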

Licensed under: CC-BY-SA with attribution
