Java ANTLR: how to ignore the part of a line that follows a subrule?

https://stackoverflow.com/questions/12114429

28-06-2021

Question

I'm trying to create a compiler using ANTLR and Java. I have a rule, but I can't work out how to use only part of what it matches. Given a command such as 0: HALT 0,0,0, I want to ignore everything that follows it on the same line.

For example, in 0: HALT 0,0,0 blah blah blah, I want to ignore the blah blah blah.

My rule is:

    rule returns [String value]
    :
    INTEGER':' ro=rocommand i1=INTEGER',' i2=INTEGER ',' i3=INTEGER rest {$value = $ro.text+" "+$i1.text+","+$i2.text+","+$i3.text;   }
    | INTEGER':' rm=rmcommand j1=INTEGER ',' j2=INTEGER '('j3=INTEGER')' rest {$value = $rm.text+" "+$j1.text+","+$j2.text+"("+$j3.text+")"; }
;

and the code I have is:

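import org.antlr.runtime.*; // ANTLR 3 runtime: ANTLRStringStream, CommonTokenStream, ...

// strLine holds the text to parse
CharStream charStream = new ANTLRStringStream(strLine);
simulatorLexer lexer = new simulatorLexer(charStream);
TokenStream tokenStream = new CommonTokenStream(lexer);
simulatorParser parser = new simulatorParser(tokenStream);
// rule() returns the String declared in "returns [String value]"
System.out.println(parser.rule());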

What I get is:

0: rule:IN 0,0,0
1: rule:LDC 1,1,0
line 1:15 no viable alternative at character 'r'
line 1:18 no viable alternative at character '='
line 1:15 no viable alternative at character 'i'

for the text:

0: rule:IN 0,0,0
1: rule:LDC 1,1,0 r1=0

So it should parse the first line completely and the second line up to the final 0, and then ignore r1=0. The values come out correctly, but a number of errors are printed and I want to get rid of them. Please help me!

EDIT

I'm posting the whole grammar so you can help me better. I just want to recognize the rule part.

program:
    rule+
;


rocommand:
    'HALT'|'IN'|'OUT'|'ADD'|'SUB'|'MUL'|'DIV'|'LDC'
;

rmcommand:
    'LD'|'LDA'|'LDC'|'ST'|'JLT'|'JLE'|'JGE'|'JGT'|'JEQ'|'JNE' 
;

rest:
  ~('\n'|'\r')* '\r'? ('\n'|EOF)
;

rule returns [String value]
    :
    INTEGER':' ro=rocommand i1=INTEGER',' i2=INTEGER ',' i3=INTEGER rest {$value = $ro.text+" "+$i1.text+","+$i2.text+","+$i3.text;   }
    | INTEGER':' rm=rmcommand j1=INTEGER ',' j2=INTEGER '('j3=INTEGER')' rest {$value = $rm.text+" "+$j1.text+","+$j2.text+"("+$j3.text+")"; }
;

WS  : (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;};
INTEGER : '0'..'9'+;
IGNORELINE : '*' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;};

Alex

Answer

There are a couple of things wrong with the rule:

rest:
  ~('\n'|'\r')* '\r'? ('\n'|EOF)
;

Inside parser rules, ~ negates over the set of tokens the lexer produces, not over characters. So ~('\n'|'\r') does not match a single character other than '\n' or '\r'. It matches any single token other than the tokens that match '\r' or '\n'.
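To make the token-versus-character point concrete, here is a minimal plain-Java sketch (it is not ANTLR-generated code). The TokType names are hypothetical stand-ins for whatever token types a lexer defines. The only point is that the complement is taken over token types, so the result is a set of tokens rather than a set of characters.

```java
import java.util.EnumSet;

class TokenNegationSketch {
    // Hypothetical token types, for illustration only.
    enum TokType { INTEGER, COLON, COMMA, COMMAND, CR, LF }

    // Models a parser-level ~(CR | LF): every token type except CR and LF.
    static EnumSet<TokType> notLineEnd() {
        return EnumSet.complementOf(EnumSet.of(TokType.CR, TokType.LF));
    }

    public static void main(String[] args) {
        System.out.println(notLineEnd()); // prints [INTEGER, COLON, COMMA, COMMAND]
    }
}
```

If CR and LF never reach the parser at all, such a negated set effectively matches every token the parser can see.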

Also, since your lexer puts '\n' and '\r' on the hidden channel, these tokens are not available to your parser. This means that the '\n' in the rest rule can never be matched.

In short: you can't "tell" your parser what the end of a line is since these characters are discarded by your WS rule. This means you have no way to properly write such a rest parser rule.
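The effect of hiding line breaks can be sketched in plain Java. The tokenizer below is a hand-written stand-in, not your generated lexer, and its regular expression is an assumption made for the demo. It drops all whitespace, as the WS rule does, and shows that the parser then sees the same token sequence whether the two commands are on separate lines or on one line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HiddenChannelSketch {
    // Stand-in lexer: integers, words, runs of whitespace, or any single character.
    private static final Pattern TOKEN = Pattern.compile("\\d+|[A-Za-z]+|\\s+|.");

    // Returns only the tokens a parser would see when whitespace is hidden.
    static List<String> visibleTokens(String input) {
        List<String> visible = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String text = m.group();
            if (text.trim().isEmpty()) {
                continue; // whitespace, including \r and \n, never reaches the parser
            }
            visible.add(text);
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> twoLines = visibleTokens("0: IN 0,0,0\n1: LDC 1,1,0");
        List<String> oneLine = visibleTokens("0: IN 0,0,0 1: LDC 1,1,0");
        System.out.println(twoLines.equals(oneLine)); // prints true
    }
}
```

Because both inputs yield identical visible tokens, no parser rule can recover where a line ended.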

For your input:

0: IN 0,0,0
1: LDC 1,1,0 r1=0

(note that I removed the 'rule:' prefixes)

the following tokens are produced by your lexer:

token[type=INTEGER text='0']
token[type=':'     text=':']
token[type='IN'    text='IN']
token[type=INTEGER text='0']
token[type=','     text=',']
token[type=INTEGER text='0']
token[type=','     text=',']
token[type=INTEGER text='0']
token[type=INTEGER text='1']
token[type=':'     text=':']
token[type='LDC'   text='LDC']
token[type=INTEGER text='1']
token[type=','     text=',']
token[type=INTEGER text='1']
token[type=','     text=',']
token[type=INTEGER text='0']
token[type=INTEGER text='1']
token[type=INTEGER text='0']

So these are the tokens available in your parser rules.

Note that the two characters 'r' and '=' cannot be matched by any lexer rule, which is what the errors report:

line 2:13 no viable alternative at character 'r'
line 2:15 no viable alternative at character '='

A possible solution would be to create a lexer rule that matches an integer and a colon:

START : INTEGER ':';

and let your rule start with this token:

rule
 : START ro=rocommand i1=INTEGER ',' i2=INTEGER ',' i3=INTEGER rest ...
 | ...
 ;

That way, your rest can match zero or more tokens other than that START token:

rest
 : ~START*
 ;

And to capture the '=' and 'r' characters, create an ANY rule and put this rule at the end of your lexer rules:

ANY : . ; // match any char

That way, the parser will create the following parse tree:

[Figure: parse tree (image not included)]
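The START idea can be simulated without ANTLR. The following is a rough plain-Java sketch under several assumptions: the class, enum, and method names are made up, the regular expression only approximates the lexer rules, and only the first alternative of rule (the three-integer form) is handled. It tokenizes the input, reads one command after each START token, and then skips tokens until the next START, which is what rest : ~START* does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartTokenSketch {
    enum Type { START, INTEGER, COMMAND, COMMA, ANY }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
    }

    // Alternatives are tried in order: START (digits plus ':') before INTEGER,
    // and ANY last as the catch-all. WS has no Type, so it is simply dropped.
    private static final Pattern LEXER = Pattern.compile(
        "(?<START>\\d+:)|(?<INTEGER>\\d+)"
        + "|(?<COMMAND>HALT|IN|OUT|ADD|SUB|MUL|DIV|LDC)"
        + "|(?<COMMA>,)|(?<WS>\\s+)|(?<ANY>.)");

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            for (Type t : Type.values()) {
                if (m.group(t.name()) != null) {
                    tokens.add(new Token(t, m.group()));
                }
            }
        }
        return tokens;
    }

    // Shape of: START COMMAND INTEGER ',' INTEGER ',' INTEGER
    private static final Type[] SHAPE = {
        Type.START, Type.COMMAND, Type.INTEGER, Type.COMMA,
        Type.INTEGER, Type.COMMA, Type.INTEGER };

    static List<String> parse(List<Token> tokens) {
        List<String> values = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            StringBuilder value = new StringBuilder();
            for (int k = 0; k < SHAPE.length; k++, i++) {
                if (i >= tokens.size() || tokens.get(i).type != SHAPE[k]) {
                    throw new IllegalArgumentException("unexpected token at index " + i);
                }
                if (k == 1) {
                    value.append(tokens.get(i).text).append(' '); // command name
                } else if (k > 1) {
                    value.append(tokens.get(i).text); // integers and commas
                }
            }
            values.add(value.toString());
            // rest : ~START*  (skip everything up to the next START token)
            while (i < tokens.size() && tokens.get(i).type != Type.START) {
                i++;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "0: IN 0,0,0\n1: LDC 1,1,0 r1=0";
        System.out.println(parse(tokenize(input))); // prints [IN 0,0,0, LDC 1,1,0]
    }
}
```

One consequence of this design is that ignored text which itself contains digits followed by a colon (say, 7:) is lexed as START and ends rest early.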

Another solution would be to create a LINE_BREAK token:

LINE_BREAK : '\r'? '\n' | '\r';

(and remove \r and \n from WS, of course!)

And do something like this:

rule
 : INTEGER ':' ro=rocommand i1=INTEGER ',' i2=INTEGER ',' i3=INTEGER rest LINE_BREAK ...
 | ...
 ;

rest
 : ~LINE_BREAK*
 ;
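The LINE_BREAK variant can be simulated in the same spirit. Again this is only a plain-Java sketch with made-up names and a simplified tokenizer. It also makes two assumptions the grammar above does not: blank lines are skipped, and the last command may end at end of input instead of a LINE_BREAK. Line breaks are kept as visible tokens, other whitespace is dropped, and after each command everything up to the next LINE_BREAK is skipped, which mirrors rest : ~LINE_BREAK*.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineBreakSketch {
    // Group 1: LINE_BREAK ('\r'? '\n' | '\r'); group 2: other whitespace (hidden);
    // group 3: any other token (number, word, or single character).
    private static final Pattern LEXER =
        Pattern.compile("(\\r?\\n|\\r)|([ \\t\\f]+)|(\\d+|[A-Za-z]+|.)");
    private static final String LINE_BREAK = "<LINE_BREAK>";

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(LINE_BREAK);
            } else if (m.group(3) != null) {
                tokens.add(m.group(3));
            }
        }
        return tokens;
    }

    // rule : INTEGER ':' command INTEGER ',' INTEGER ',' INTEGER rest LINE_BREAK
    static List<String> parse(List<String> tokens) {
        List<String> values = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (tokens.get(i).equals(LINE_BREAK)) { // blank line (assumption: tolerated)
                i++;
                continue;
            }
            if (i + 8 > tokens.size()) {
                throw new IllegalArgumentException("incomplete command");
            }
            List<String> t = tokens.subList(i, i + 8);
            if (!String.join(" ", t).matches("\\d+ : [A-Z]+ \\d+ , \\d+ , \\d+")) {
                throw new IllegalArgumentException("malformed command at token " + i);
            }
            values.add(t.get(2) + " " + t.get(3) + "," + t.get(5) + "," + t.get(7));
            i += 8;
            // rest : ~LINE_BREAK*
            while (i < tokens.size() && !tokens.get(i).equals(LINE_BREAK)) {
                i++;
            }
            i++; // consume the LINE_BREAK itself (or step past the end of input)
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "0: IN 0,0,0\n1: LDC 1,1,0 r1=0\n";
        System.out.println(parse(tokenize(input))); // prints [IN 0,0,0, LDC 1,1,0]
    }
}
```

Compared with the START approach, the ignored tail here may contain anything except a line break, including text such as 2: junk.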

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow