Question

The DSL I'm working on allows users to define a 'complete text substitution' variable. When parsing the code, we then need to look up the value of the variable and start parsing again from that code.

The substitution can be very simple (single constants) or entire statements or code blocks. This is a mock grammar which I hope illustrates my point.

grammar a;

entry
  : (set_variable
  | print_line)*
  ;

set_variable
  : 'SET' ID '=' STRING_CONSTANT ';'
  ;

print_line
  : 'PRINT' ID ';'
  ;

STRING_CONSTANT: '\'' ('\'\'' | ~('\''))* '\'' ;

ID: [a-z][a-zA-Z0-9_]* ;

VARIABLE: '&' ID;

BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;

Then the following statements, executed consecutively, should be valid:

SET foo = 'Hello world!';
PRINT foo;            

SET bar = 'foo;';
PRINT &bar                    // should be interpreted as 'PRINT foo;'

SET baz = 'PRINT foo; PRINT'; // one complete statement and one incomplete statement
&baz foo;                     // should be interpreted as 'PRINT foo; PRINT foo;'

Any time the & variable token is discovered, we immediately switch to interpreting the value of that variable instead. As above, this means you can write code that is invalid on its own, full of half-statements that are only completed when the variable's value is just right. The variables can be redefined at any point in the text.

Strictly speaking the current language definition doesn't disallow nesting &vars inside each other, but the current parsing doesn't handle this and I would not be upset if it wasn't allowed.

Currently I'm building an interpreter using a visitor, but this one I'm stuck on.

How can I build a lexer/parser/interpreter which will allow me to do this? Thanks for any help!

Solution

So I have found one solution to the issue. I think it could be better - as it potentially does a lot of array copying - but at least it works for now.

EDIT: I was wrong before, and my solution would consume ANY & that it found, including those in valid locations such as inside string constants. This seems like a better solution:

First, I extended the InputStream so that it is able to rewrite the input stream when an & is encountered. This unfortunately involves copying the array, which I can maybe resolve in the future:

MacroInputStream.java

    package preprocessor;

    import java.util.HashMap;

    import org.antlr.v4.runtime.ANTLRInputStream;

    public class MacroInputStream extends ANTLRInputStream {

      private HashMap<String, String> map;

      public MacroInputStream(String s, HashMap<String, String> map) {
        super(s);
        this.map = map;
      }

      public void rewrite(int startIndex, int stopIndex, String replaceText) {
        int length = stopIndex-startIndex+1;
        char[] replData = replaceText.toCharArray();
        if (replData.length == length) {
          for (int i = 0; i < length; i++) data[startIndex+i] = replData[i];
        } else {
          char[] newData = new char[data.length+replData.length-length];
          System.arraycopy(data, 0, newData, 0, startIndex);
          System.arraycopy(replData, 0, newData, startIndex, replData.length);
          System.arraycopy(data, stopIndex+1, newData, startIndex+replData.length, data.length-(stopIndex+1));
          data = newData;
          n = data.length;
        }
      }
    }
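
To make the index arithmetic in rewrite easier to check, here is a minimal stand-alone sketch of the same splice on a plain char array, with no ANTLR dependency. The class name SpliceSketch and the method splice are hypothetical names used only for illustration; the three arraycopy calls mirror the ones above (text before the token, the replacement, text after the token).

```java
// Hypothetical stand-alone helper; not part of ANTLR or the code above.
final class SpliceSketch {

    // Replaces data[start..stop] (inclusive) with repl and returns the new buffer.
    static char[] splice(char[] data, int start, int stop, String repl) {
        int removed = stop - start + 1;
        char[] r = repl.toCharArray();
        char[] out = new char[data.length - removed + r.length];
        System.arraycopy(data, 0, out, 0, start);            // text before the token
        System.arraycopy(r, 0, out, start, r.length);        // replacement text
        System.arraycopy(data, stop + 1, out, start + r.length,
                         data.length - (stop + 1));          // text after the token
        return out;
    }

    public static void main(String[] args) {
        char[] input = "PRINT &bar".toCharArray();
        // "&bar" occupies indices 6..9 of the input
        System.out.println(new String(splice(input, 6, 9, "foo;")));
    }
}
```

Every substitution allocates a fresh array, which is the copying cost mentioned above. A gap buffer or a stack of nested input streams might avoid it, but I have not tried either.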

Secondly, I extended the Lexer so that when a VARIABLE token is encountered, the rewrite method above is called:

MacroGrammarLexer.java

package language;

import java.util.HashMap;

import org.antlr.v4.runtime.Token;

import preprocessor.MacroInputStream;

// Extends the ANTLR-generated lexer (DSL_GrammarLexer, in the same package).
public class MacroGrammarLexer extends DSL_GrammarLexer {

  private HashMap<String, String> map;

  public MacroGrammarLexer(MacroInputStream input, HashMap<String, String> map) {
    super(input);
    this.map = map;
  }

  private MacroInputStream getInput() {
    return (MacroInputStream) _input;
  }

  @Override
  public Token nextToken() {
    Token t = super.nextToken();
    if (t.getType() == VARIABLE) {
      String name = t.getText().substring(1); // strip the leading '&'
      String value = map.get(name);
      if (value == null) {
        // fail with a clear message instead of a NullPointerException in rewrite()
        throw new IllegalStateException("Undefined variable: &" + name);
      }
      System.out.println("Encountered token " + t.getText() + " ===> rewriting!!!");
      getInput().rewrite(t.getStartIndex(), t.getStopIndex(), value);
      getInput().seek(t.getStartIndex()); // rewind to the start of the substituted text
      return super.nextToken();
    }
    return t;
  }

}
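
To show the substitute-and-continue idea in isolation, and why a & inside a string constant must be left alone, here is a rough sketch that works on a plain string instead of an ANTLR token stream. It is a simplification under stated assumptions: the class ExpandSketch and its expand method are hypothetical, the variable map is fixed up front (in the real DSL it changes as SET statements are parsed), and scanning resumes after the inserted text, so nested &vars are deliberately not expanded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; the real solution does this inside the lexer.
final class ExpandSketch {

    static String expand(String src, Map<String, String> vars) {
        StringBuilder buf = new StringBuilder(src);
        int i = 0;
        while (i < buf.length()) {
            char c = buf.charAt(i);
            if (c == '\'') {
                // Skip a whole string constant; '' inside it is an escaped quote.
                i++;
                while (i < buf.length()) {
                    if (buf.charAt(i) == '\'') {
                        if (i + 1 < buf.length() && buf.charAt(i + 1) == '\'') {
                            i += 2;
                            continue;
                        }
                        i++;
                        break;
                    }
                    i++;
                }
            } else if (c == '&') {
                int j = i + 1;
                while (j < buf.length()
                        && (Character.isLetterOrDigit(buf.charAt(j)) || buf.charAt(j) == '_')) {
                    j++;
                }
                if (j == i + 1) { // a lone '&' with no name: leave it untouched
                    i++;
                    continue;
                }
                String name = buf.substring(i + 1, j);
                String value = vars.get(name);
                if (value == null) {
                    throw new IllegalArgumentException("Undefined variable: &" + name);
                }
                buf.replace(i, j, value);
                i += value.length(); // resume after the inserted text (no nested expansion)
            } else {
                i++;
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("bar", "foo;");
        System.out.println(expand("PRINT &bar", vars));
    }
}
```

Resuming after the inserted text guarantees termination even if a variable's value mentions itself. The lexer version above instead seeks back to the start of the substituted text and lexes it again.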

Lastly, I modified the generated parser to set the variables at the time of parsing:

DSL_GrammarParser.java

    ...
    ...
    HashMap<String, String> map;  // same map as before, passed as a new argument.
    ...
    ...

public final SetContext set() throws RecognitionException {
    SetContext _localctx = new SetContext(_ctx, getState());
    enterRule(_localctx, 130, RULE_set);
    try {
        enterOuterAlt(_localctx, 1);
        {
        String vname = null; String vval = null;              // set up variables
        setState(1215); match(SET);
        setState(1216); vname = variable_name().getText();    // set vname
        setState(1217); match(EQUALS);
        setState(1218); vval = string_constant().getText();   // set vval
        System.out.println("Found SET " + vname + " = " + vval + ";");
        map.put(vname, vval);
        }
    }
    catch (RecognitionException re) {
        _localctx.exception = re;
        _errHandler.reportError(this, re);
        _errHandler.recover(this, re);
    }
    finally {
        exitRule();
    }
    return _localctx;
}
    ...
    ...

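Unfortunately the generated method is final, so I can't override it in a subclass and have to edit the generated parser directly. That edit has to be redone every time the parser is regenerated, which makes maintenance a bit more difficult, but it works for now.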

OTHER TIPS

The standard pattern for handling your requirements is to implement a symbol table. The simplest form is a key:value store. In your visitor, add variable declarations as they are encountered, and read out the values as variable references are encountered.

As described, your DSL does not define a scoping requirement on the variables declared. If you do require scoped variables, then use a stack of key:value stores, pushing and popping on scope entry and exit.
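
As a rough sketch of the scoped variant, assuming string-valued variables as in the question: the class ScopedSymbols and its method names are made up for illustration and do not come from ANTLR. A visitor would call define when it visits a SET statement and lookup when it meets a variable reference.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical symbol table: a stack of key:value stores, innermost scope on top.
final class ScopedSymbols {

    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    ScopedSymbols() {
        scopes.push(new HashMap<>()); // global scope
    }

    void enterScope() {
        scopes.push(new HashMap<>());
    }

    void exitScope() {
        if (scopes.size() == 1) {
            throw new IllegalStateException("Cannot pop the global scope");
        }
        scopes.pop();
    }

    // Defines (or redefines) a variable in the innermost scope.
    void define(String name, String value) {
        scopes.peek().put(name, value);
    }

    // Searches from the innermost scope outwards; returns null if undefined.
    String lookup(String name) {
        for (Map<String, String> scope : scopes) { // ArrayDeque iterates newest-first after push
            String value = scope.get(name);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

Without a scoping requirement, the single global map is all that is used, and this reduces to the plain key:value store.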

See this related StackOverflow answer.

Separately, since your strings may contain commands, you can simply parse the contents as part of your initial parse. That is, expand your grammar with a rule that includes the full set of valid contents:

set_variable
   : 'SET' ID '=' stringLiteral ';'
   ;

stringLiteral: 
   Quote Quote? ( 
     (    set_variable
        | print_line
        | VARIABLE
        | ID
     )
     | STRING_CONSTANT  // redefine without the quotes
   )
   Quote
   ;
Licensed under: CC-BY-SA with attribution