Matching lexeme variants with Antlr3

https://stackoverflow.com/questions/3817958

26-09-2019

Question

I'm trying to match measurements in English input text, using Antlr 3.2 and Java 1.6. I've got lexer rules like the following:

fragment
MILLIMETRE
    :   'millimetre' | 'millimetres'
    |   'millimeter' | 'millimeters'
    |   'mm'
    ;

MEASUREMENT
    :   MILLIMETRE | CENTIMETRE | ... ;

I'd like to be able to accept any combination of upper- and lowercase input and - more importantly - just return a single lexical token for all the variants of MILLIMETRE. But at the moment, my AST contains 'millimetre', 'millimeters', 'mm' etc. just as in the input text.

After reading http://www.antlr.org/wiki/pages/viewpage.action?pageId=1802308, I think I need to do something like the following:

tokens {
    T_MILLIMETRE;
}

fragment
MILLIMETRE
    :   ('millimetre' | 'millimetres'
    |   'millimeter' | 'millimeters'
    |   'mm') { $type = T_MILLIMETRE; }
    ;

However, when I do this, I get the following compiler error in the Java code generated by Antlr:

cannot find symbol
_type = T_MILLIMETRE;

I tried the following instead:

MEASUREMENT
    :   MILLIMETRE  { $type = T_MILLIMETRE; }
    |   ...

but then MEASUREMENT is not matched anymore.

The more obvious solution with a rewrite rule:

MEASUREMENT
    :   MILLIMETRE  -> ^(T_MILLIMETRE MILLIMETRE)
    |   ...

causes an NPE:

java.lang.NullPointerException at org.antlr.grammar.v2.DefineGrammarItemsWalker.alternative(DefineGrammarItemsWalker.java:1555).

Making MEASUREMENT into a parser rule gives me the dreaded "The following token definitions can never be matched because prior tokens match the same input" error.

By creating a parser rule

measurement :  T_MILLIMETRE | ...

I get the warning "no lexer rule corresponding to token: T_MILLIMETRE". Antlr still runs, but the AST contains the input text rather than T_MILLIMETRE.

I'm obviously not yet seeing the world the way Antlr does. Can anyone give me any hints or advice please?

Steve

Solution

Here's a way to do that:

grammar Measurement;

options {
  output=AST;
}

tokens {
  ROOT;
  MM;
  CM;
}

parse
  :  measurement+ EOF -> ^(ROOT measurement+)
  ;

measurement
  :  Number MilliMeter -> ^(MM Number)
  |  Number CentiMeter -> ^(CM Number)
  ;

Number
  :  '0'..'9'+
  ;

MilliMeter
  :  'millimetre'
  |  'millimetres'
  |  'millimeter'
  |  'millimeters'
  |  'mm'
  ;

CentiMeter
  :  'centimetre'
  |  'centimetres'
  |  'centimeter'
  |  'centimeters'
  |  'cm'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
  ;

It can be tested with the following class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("12 millimeters 3 mm 456 cm");
        MeasurementLexer lexer = new MeasurementLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MeasurementParser parser = new MeasurementParser(tokens);
        MeasurementParser.parse_return returnValue = parser.parse();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}

which produces the following DOT file:

digraph {

    ordering=out;
    ranksep=.4;
    bgcolor="lightgrey"; node [shape=box, fixedsize=false, fontsize=12, fontname="Helvetica-bold", fontcolor="blue"
        width=.25, height=.25, color="black", fillcolor="white", style="filled, solid, bold"];
    edge [arrowsize=.5, color="black", style="bold"]

  n0 [label="ROOT"];
  n1 [label="MM"];
  n1 [label="MM"];
  n2 [label="12"];
  n3 [label="MM"];
  n3 [label="MM"];
  n4 [label="3"];
  n5 [label="CM"];
  n5 [label="CM"];
  n6 [label="456"];

  n0 -> n1 // "ROOT" -> "MM"
  n1 -> n2 // "MM" -> "12"
  n0 -> n3 // "ROOT" -> "MM"
  n3 -> n4 // "MM" -> "3"
  n0 -> n5 // "ROOT" -> "CM"
  n5 -> n6 // "CM" -> "456"

}

which corresponds to the tree (ROOT (MM 12) (MM 3) (CM 456)).

EDIT

Note that the following:

measurement
  :  Number m=MilliMeter {System.out.println($m.getType() == MeasurementParser.MilliMeter);}
  |  Number CentiMeter
  ;

will always print true, regardless of whether the text of the MilliMeter token is mm, millimetre, millimetres, ...
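To make the type-versus-text distinction concrete without the ANTLR runtime, here is a minimal plain-Java sketch. Everything in it (TokenTypeDemo, Type, Token, tokenize) is a hypothetical stand-in rather than ANTLR-generated code, and it assumes whitespace-separated input, which the real lexer does not require.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a lexer; this is NOT ANTLR-generated code.
final class TokenTypeDemo {
    // Token types, analogous to the Number, MilliMeter and CentiMeter lexer rules.
    enum Type { NUMBER, MILLIMETER, CENTIMETER }

    // A token keeps both its type and the exact text that was matched.
    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    private static final Set<String> MM = new HashSet<>(Arrays.asList(
            "millimetre", "millimetres", "millimeter", "millimeters", "mm"));
    private static final Set<String> CM = new HashSet<>(Arrays.asList(
            "centimetre", "centimetres", "centimeter", "centimeters", "cm"));

    // Simplification: splits on whitespace and classifies each word.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.matches("[0-9]+")) {
                tokens.add(new Token(Type.NUMBER, word));
            } else if (MM.contains(word)) {
                tokens.add(new Token(Type.MILLIMETER, word));
            } else if (CM.contains(word)) {
                tokens.add(new Token(Type.CENTIMETER, word));
            } else {
                throw new IllegalArgumentException("no rule matches: " + word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Every spelling variant carries the same type; only the text differs.
        for (Token t : tokenize("12 millimeters 3 mm 456 cm")) {
            System.out.println(t.type + " <- '" + t.text + "'");
        }
    }
}
```

Both "millimeters" and "mm" come back with type MILLIMETER, which is the same comparison the action above makes against MeasurementParser.MilliMeter.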

OTHER TIPS

Note that fragment rules only "live" inside the lexer and cease to exist in the parser. For example:

grammar Measurement;

options {
  output=AST;
}

parse
  :  (m=MEASUREMENT {
       String contents = $m.text;
       boolean isMeasurementType = $m.getType() == MeasurementParser.MEASUREMENT;
       System.out.println("contents="+contents+", isMeasurementType="+isMeasurementType);
     })+ EOF
  ;

MEASUREMENT
  :  MILLIMETRE
  ;

fragment
MILLIMETRE
  :  'millimetre' 
  |  'millimetres'
  |  'millimeter' 
  |  'millimeters'
  |  'mm'
  ;

SPACE
  :  (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
  ;

with input text:

"millimeters mm"

will print:

contents=millimeters, isMeasurementType=true
contents=mm, isMeasurementType=true

In other words, no token of type MILLIMETRE is ever created: every match reaches the parser as a token of type MEASUREMENT.
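If the single MEASUREMENT token is kept, the unit has to be recovered from the token text instead. The following is only a sketch of one way to do that in plain Java; UnitNormalizer and canonical are hypothetical names and not part of ANTLR or of the grammar above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of ANTLR or of the grammar above.
final class UnitNormalizer {
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        String[] millimetreSpellings = {
            "millimetre", "millimetres", "millimeter", "millimeters", "mm"
        };
        for (String spelling : millimetreSpellings) {
            CANONICAL.put(spelling, "MM");
        }
    }

    // Returns the canonical unit name for a token's text, or null if the text is unknown.
    // Lower-casing is a no-op for the grammar as written, which only matches lowercase;
    // it would only matter if the lexer were changed to accept mixed case.
    static String canonical(String tokenText) {
        return CANONICAL.get(tokenText.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(canonical("millimeters"));
        System.out.println(canonical("mm"));
    }
}
```

Inside the parse rule's action, a call such as UnitNormalizer.canonical($m.text) could then stand in for the missing MILLIMETRE type, assuming the helper class is on the classpath of the generated parser.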

Licensed under: CC-BY-SA with attribution
