Why is there no viable alternative for #include statement in ANTLR 4 with C grammar?

https://stackoverflow.com//questions/25010496

20-12-2019
Question

I'm just getting started with ANTLR v4 and I am a bit confused...

I am using the C grammar file (C.g4) from the ANTLR project to work with the following bit of C:

#include <stdio.h>

int main()
{
   printf("Hello");
   return 0;
}

(saved as C:\Users\Public\t.c).

I generated the C parser like so:

java -cp lib/antlr-4.4-complete.jar org.antlr.v4.Tool -o src/cparser src/C.g4

And I edited the generated files to put a package statement at the top.

I then whipped up a little Java project that includes these generated files and references antlr-runtime-4.4.jar, with a main class that looks like so:

package antlrtest;

import java.io.IOException;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import cparser.CLexer;
import cparser.CParser;
import cparser.CParser.CompilationUnitContext;

public class AntlrTestMain {
    public static void main(String[] arguments) {
        try {           
            CParser parser = new CParser(
                new CommonTokenStream(
                        new CLexer(
                                new ANTLRFileStream("C:\\Users\\Public\\t.c"))));

            parser.setBuildParseTree(true);

            // This line prints the error
            CompilationUnitContext ctx = parser.compilationUnit();

            MyListener listener = new MyListener();
            ParseTreeWalker.DEFAULT.walk(listener, ctx);            
        } catch (IOException e) {
            e.printStackTrace();
        }
    }   
}

And for completeness, though I don't think it is important, the listener looks like this (just empty, I plan to put something in here of course):

package antlrtest;

import cparser.CBaseListener;

public class MyListener extends CBaseListener {
}

When I run this, the call to the compilationUnit method prints the following errors to the console:

line 1:0 token recognition error at: '#i'
line 1:9 no viable alternative at input 'nclude<'

I'm pretty sure the C code is valid, and I have not edited the C.g4 file at all, so what am I doing wrong here? Why do I get these errors?

Is calling compilationUnit() perhaps the wrong thing to do? If so, what should I call to get something to pass into the tree walker?

Solution

The problem is:

In general, you cannot parse a C file unless it has been preprocessed first. That's probably why the grammar covers preprocessor directives only to a very limited extent. A simple example, where the macro expands to only part of a statement:

#define FOO  if (a
void main ()
{
    int a;
    FOO );
}

So you have to create a preprocessor grammar first. I've done something similar and did it this way:

1. Tokenize the complete file.
2. Let the preprocessor parser do its job and replace preprocessor tokens with "virtual" tokens that stand for the preprocessor macro's replacement (here: if, (, a).
3. Run the regular parser on the modified token stream.
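
The three steps above can be sketched in plain Java without ANTLR. This is only a toy illustration, not the answerer's implementation: the class name MacroExpansionSketch, the regex tokenizer, and the restriction to object-like #define macros (no arguments, no nested expansion, no conditionals) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy preprocessor: tokenizes line by line, records object-like
// macros, and substitutes each macro name with its replacement tokens.
final class MacroExpansionSketch {

    // Very rough tokenizer: identifiers, integers, or any single non-space character.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z_]\\w*|\\d+|\\S");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Returns the token list that the "regular" parser would see. */
    static List<String> preprocess(String source) {
        Map<String, List<String>> macros = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String line : source.split("\\R")) {
            List<String> tokens = tokenize(line);
            boolean directive = !tokens.isEmpty() && tokens.get(0).equals("#");
            if (directive && tokens.size() >= 3 && tokens.get(1).equals("define")) {
                // "#define NAME body...": remember the body tokens under NAME.
                macros.put(tokens.get(2), new ArrayList<>(tokens.subList(3, tokens.size())));
                continue;
            }
            if (directive) {
                continue; // every other directive is simply dropped in this sketch
            }
            for (String token : tokens) {
                List<String> replacement = macros.get(token);
                if (replacement != null) {
                    out.addAll(replacement); // the "virtual" tokens
                } else {
                    out.add(token);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String source = "#define FOO  if (a\nvoid main ()\n{\n    int a;\n    FOO );\n}\n";
        // prints: void main ( ) { int a ; if ( a ) ; }
        System.out.println(String.join(" ", preprocess(source)));
    }
}
```

Only after this substitution does the line with FOO become a well-formed if statement, which is why the regular parser has to run on the modified token stream rather than on the raw file.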

What you can do is the following:

Add a lexer rule for includes at the end of the grammar file (so that the preprocessor-related rules already in the grammar are still matched first where possible):

SomePreprocessorStuff
     :   '#' ~[\r\n]*
          -> skip
     ;
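
To get a feel for what this rule throws away, here is a rough Java emulation of it. It is an approximation under stated assumptions: the class name SkipDirectiveSketch is made up, and a plain regex also removes a '#' that appears inside a string literal or comment, which the real lexer would not do because its string and comment rules match first.

```java
import java.util.regex.Pattern;

// Hypothetical emulation of the skip rule: drop '#' and the rest of that line.
final class SkipDirectiveSketch {

    // Same shape as the lexer rule: '#' followed by any characters except CR/LF.
    private static final Pattern DIRECTIVE = Pattern.compile("#[^\\r\\n]*");

    static String skipDirectives(String source) {
        return DIRECTIVE.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // The file from the question: only the #include line disappears.
        String hello = "#include <stdio.h>\n\nint main()\n{\n   printf(\"Hello\");\n   return 0;\n}\n";
        System.out.println(skipDirectives(hello));

        // The macro example: the #define is gone, but "FOO );" is left unexpanded.
        String macro = "#define FOO  if (a\nvoid main ()\n{\n    int a;\n    FOO );\n}\n";
        System.out.println(skipDirectives(macro));
    }
}
```

For the file in the question this is enough to get past the #include line, but in the macro example the unexpanded FOO remains in the input, so the parser still sees something that is not valid C.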

OTHER TIPS

The C grammar included with the ANTLR project requires preprocessed source files as input. The grammar does not perform any file inclusion, macro expansion, or any other feature provided by the preprocessor. If you do not perform preprocessing prior to using this grammar, the parse tree it produces will not be an accurate representation of the compilation unit.

Note that skipping "preprocessor stuff" is not an alternative to using the preprocessor in advance, since file inclusion is only one part of the preprocessor.
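
One way to preprocess in advance is to run an external C preprocessor and hand its output to the lexer. The following is a minimal sketch under assumptions that are not part of the original answer: that a gcc executable is on the PATH, that its -E (preprocess only) and -P (suppress line markers) options are used, and that Java 9 or later is available for readAllBytes. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that shells out to an external preprocessor.
final class ExternalPreprocessorSketch {

    /** Builds the command line; the compiler name and flags are assumptions. */
    static List<String> command(String compiler, Path source) {
        // -E: stop after preprocessing; -P: do not emit line markers in the output.
        return Arrays.asList(compiler, "-E", "-P", source.toString());
    }

    /** Runs the preprocessor and returns the preprocessed source text. */
    static String preprocess(String compiler, Path source)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(compiler, source))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show diagnostics on stderr
                .start();
        // Read all of stdout before waiting, so the child cannot block on a full pipe.
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("preprocessor exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        Path source = Paths.get(args.length > 0 ? args[0] : "t.c");
        System.out.println(preprocess("gcc", source));
    }
}
```

The returned string could then be passed to the lexer in place of the raw file, for example via new ANTLRInputStream(text) in ANTLR 4.4.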

As an update, I had a look at the JCPP preprocessor and got it working by wrapping it in a Reader, using the CppReader class that is included in JCPP.

This is not the most efficient approach, because the input is lexed twice: once by JCPP so that it can preprocess, and then again by ANTLR. You should probably build a TokenStream from JCPP's token stream instead. As a way to get going, though, it works, and at least in my basic test it seems to preprocess correctly.

So, anyway, here's the code from the question, updated, to preprocess using JCPP:

public class AntlrTestMain {

    public static void main(String[] args) {

        String mainFileName = "C:\\Users\\Public\\t.c";

        try {
            // Construct the preprocessor with the main file to look at
            Preprocessor pp = new Preprocessor(new File(mainFileName));

            // Set up the preprocessor - you probably want to set more stuff
            // here than just the include path - have a look in the javadoc
            List<String> systemInclude = new ArrayList<String>();
            systemInclude.add("C:\\MYCPPCOMPILER\\include");            
            pp.setSystemIncludePath(systemInclude);

            // Get the parser by wrapping up the preprocessor in a reader
            CParser parser = new CParser(
                new CommonTokenStream(
                    new CLexer(
                        new ANTLRInputStream(new CppReader(pp)))));

            // Use ANTLR to do whatever you want...
            parser.setBuildParseTree(true);
            MyListener listener = new MyListener();
            ParseTreeWalker.DEFAULT.walk(listener, parser.compilationUnit());

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

You will need these imports for the above code:

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.anarres.cpp.CppReader;
import org.anarres.cpp.Preprocessor;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import cparser.CLexer;
import cparser.CParser;

I don't think there is anything wrong with your code. The grammar file just does not have a rule defined for #include <foo.h>.

So what you could do is extend the grammar (which could be rather complicated if you are not familiar with ANTLR) or delete the #include statement for now to get ANTLR working with your grammar.

Licensed under: CC-BY-SA with attribution
