ANTLR4 Accepting additional tokens as valid?

https://stackoverflow.com/questions/22305340

12-06-2023

Question

I'm building a small rule language to test and get used to ANTLR. I'm using ANTLR V4 and I have the following grammar split as follows:

Lexer.g4

lexer grammar Lexer;

/*------------------------------------------------------------------
 * LEXER RULES - GENERIC KEYWORDS
 *------------------------------------------------------------------*/
NOT
    : 'not'
    ;

NULL
    : 'null'
    ;

AND
    : 'and'
    | '&'
    ;

/*------------------------------------------------------------------
 * LEXER RULES - PATTERN MATCHING
 *------------------------------------------------------------------*/
DELIM
    : [\|\\/:,&@+><^]
    ;

WS 
    : [ \t\r\n]+ -> skip 
    ;

VALUE 
    : SQUOTE TEXT SQUOTE
    ;

fragment SQUOTE
    : '\'' 
    ;

fragment TEXT 
    : ( 'a'..'z' 
      | 'A'..'Z'
      | '0'..'9'
      | '-'
      )+ ;

Attribute.g4

grammar Attribute;

/*------------------------------------------------------------------
 * Semantic Predicate
 *
 * Attributes are capitalised words that may have spaces.  They're 
 * loaded from the database and set in the glue code so that
 * they can be cross checked here.  If the grammar passed in sees
 * an attribute it will pass so long as the attribute is in the 
 * database, otherwise the grammar will fail to parse.
 *------------------------------------------------------------------*/  
attr
    : a=ATTR {attributes.contains($a.text)}?
    ;

ATTR
    : ([A-Z][a-zA-Z0-9/]+([ ][A-Z][a-zA-Z0-9/]+)?)
    ;

ReplaceInWith.g4

grammar ReplaceInWith;

/*------------------------------------------------------------------
 * REPLACE IN WITH PARSER RULES
 *------------------------------------------------------------------*/
replace_in_with
    : rep in with {row.put($in.value    , $in.value.replace($rep.value, $with.value));}
    | repAtt with {row.put($repAtt.value, $with.value);}
    ;

rep returns[String value]
    : REPLACE v=VALUE {$value = trimQuotes($v.text);}
    ;

repAtt returns[String value]
    : REPLACE a=attr  {$value = $a.text;}
    ;

in returns[String value]
    : IN a=attr {$value = $a.text;}
    ;

with returns[String value]
    : WITH v=VALUE {$value = trimQuotes($v.text);}
    ;

/*------------------------------------------------------------------
 * LEXER RULES - KEYWORDS
 *------------------------------------------------------------------*/
REPLACE
    : 'rep'
    | 'replace'
    ;

IN
    : 'in'
    ;

WITH
    : 'with'
    ;

Parser.g4

grammar Parser;

/*------------------------------------------------------------------
 * IMPORTED RULES
 *------------------------------------------------------------------*/
 import //Essential imports
    Attribute,
    GlueCode,
    Lexer,

    //Actual Rules
    ReplaceInWith;

/*------------------------------------------------------------------
 * PARSER RULES
 * MUST ADD EACH TOP LEVEL RULE HERE FOR IT TO BE CALLABLE
 *------------------------------------------------------------------*/
eval
    : replace_in_with
    ;

GlueCode.g4

Java to supply static calling functionality to the grammar and to set the attributes up from the database.

ParserErrorListener.java

public class ParserErrorListener extends ParserBaseListener 
{
    /**
     * After every rule check to see if an exception was thrown, if so exit with a runtime exception to indicate a 
     * parser problem.<p>
     */
    @Override 
    public void exitEveryRule(@NotNull ParserRuleContext ctx) 
    { 
        super.exitEveryRule(ctx);

        if (ctx.exception != null)
        {
            throw new ParserRuntimeException(String.format("Error evaluating expression(s) '%s'", ctx.exception));
        } //if
    } //exitEveryRule
} //class

When I supply the following to the grammar it passes as expected:

"replace 'Acme' in Name with 'acme'",
"rep 'Acme' in Name with 'acme'",
"replace 'Acme' in Name with 'ACME'",
"rep 'Acme' in Name with 'ACME'",
"replace 'e' in Name with 'i'",
"rep 'e' in Name with 'i'",

"replace '-' in Number with ' '",
"rep '-' in Number with ' '",
"replace '555' in Number with '00555'",
"rep '555' in Number with '00555'"

Here, Name and Number are set up as attributes for the semantic predicate.

However, when I pass in the following statements, the grammar still accepts them, and I'm not sure why they match:

"replace any 'Acme' in Name with 'acme'",
"replaceany 'Acme' in Name with 'acme'",

Again, Name is passed in as an attribute to be matched by the semantic predicate, and that part of the grammar works in my tests. The part that's failing is the 'any' part. The grammar matches replace and then takes the next token to be 'Acme', ignoring 'any' in both examples above. I expected the grammar to fail here: in the listener's exit-rule method I added a check that should throw a runtime exception, which the GlueCode catches to indicate a failure.

Any ideas on how I can get my grammar to throw an error when this occurs?

Solution

First and foremost, lexer rules are always global in ANTLR. Every token in your input will be assigned one, and only one, token type. If you separate your lexer rules into multiple files, it becomes a maintenance nightmare to determine cases where tokens are ambiguous. The general rule is:

Avoid using import for lexer grammars which contain rules that are not marked with the fragment modifier.

The ATTR token will be assigned to inputs matching what looks like an ATTR, regardless of whether or not the predicate in the attr rule succeeds. This will prevent inputs which match the ATTR rule from being considered as another token type. You should move the semantic predicate from the attr rule to the ATTR rule to prevent the lexer from ever creating ATTR tokens for inputs which are not in the set of predefined attributes.

The ParserRuleContext.exception field is not guaranteed to be set in the event of a syntax error. The only way to determine that a syntax error did not occur is to call Parser.getNumberOfSyntaxErrors() after parsing, or add your own ANTLRErrorListener.

Your last lexer rule should resemble the following. Otherwise, input sequences which do not match a lexer rule will be silently dropped. This rule passes those inputs on to the parser for handling/reporting.
```
ErrorChar : . ;
```
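
To see why the 'any' in the question disappears, here is a toy longest-match tokenizer in plain Java. It is a rough sketch and not ANTLR's implementation: the class `ToyLexer`, the method `tokenize`, and the `catchAll` flag are hypothetical names made up for this illustration, and the regular expressions only approximate the question's lexer rules. With `catchAll` off, characters that no rule matches never reach the token list; with it on, each one becomes an `ErrorChar` token that a parser would have to deal with.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy longest-match lexer: an illustration only, NOT how ANTLR is implemented.
class ToyLexer {
    // Insertion order is kept so that, on equal length, the earlier rule wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // Approximations of the question's lexer rules (assumed, simplified).
        RULES.put("REPLACE", Pattern.compile("rep(lace)?"));
        RULES.put("IN", Pattern.compile("in"));
        RULES.put("WITH", Pattern.compile("with"));
        RULES.put("VALUE", Pattern.compile("'[a-zA-Z0-9-]+'"));
        RULES.put("ATTR", Pattern.compile("[A-Z][a-zA-Z0-9/]+( [A-Z][a-zA-Z0-9/]+)?"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    // Returns the token type names the parser would see.
    // catchAll = true mimics having `ErrorChar : . ;` as the last lexer rule.
    static List<String> tokenize(String input, boolean catchAll) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Keep the longest match starting exactly at pos.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestType = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestType == null) {
                // No rule matches this character: drop it, or surface it.
                if (catchAll) {
                    tokens.add("ErrorChar");
                }
                pos++;
            } else {
                if (!bestType.equals("WS")) { // whitespace is skipped
                    tokens.add(bestType);
                }
                pos += bestLen;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "replaceany 'Acme' in Name with 'acme'";
        System.out.println(tokenize(input, false));
        System.out.println(tokenize(input, true));
    }
}
```

Without the catch-all, the token list for `replaceany 'Acme' in Name with 'acme'` is `[REPLACE, VALUE, IN, ATTR, WITH, VALUE]`, which is exactly the shape the `rep in with` alternative accepts. With the catch-all, three `ErrorChar` entries follow `REPLACE`, so the unexpected input is no longer invisible to whatever consumes the tokens.
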

For complicated grammars, avoid using combined grammars. Instead, create separate lexer grammar and parser grammar files, where the parser grammars use the tokenVocab option to import the tokens. Combined grammars allow you to implicitly declare lexer rules by writing string literals in parser rules, which reduces the maintainability of large grammars.

ReplaceInWith.g4 contains many rules with embedded actions. These actions should be moved to a separate listener that you run after parsing is complete, and the returns clauses from these rules should be removed. This improves both the portability and reusability of your grammar. An example of how to do this can be seen in these commits which are part of a larger pull request showing conversion of an application using ANTLR 3 to ANTLR 4.

Licensed under: CC-BY-SA with attribution
