how to resolve ambiguity for phone number in Boolean expression grammar

https://stackoverflow.com/questions/21126661

antlr3

28-09-2022
|

Question

I am trying to write a Boolean expression grammar that can treat WHITE_SPACE as an implicit logic AND. e.g., "A B" means "A AND B".

However, I would also like to treat the US-formatted phone number as a single toke, e.g., (123) 456-7890. My grammar can cover most cases, but is still facing grammar ambiguity on the AREA_CODE.

Here is my grammar:

grammar myBooleanExpr;

options
{
  language = Java;
  output = AST;
}

tokens {
  AND;
}

fragment DIGIT : '0'..'9';
fragment AREA_CODE : LPAREN DIGIT+ RPAREN;
fragment NUMBER : ( DIGIT | '-' )+;
LPAREN : '(' ;
RPAREN : ')' ;
WS : ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; };
L_AND: 'AND'| 'And' | 'and';
OR : 'OR' | 'Or' | 'or';
NOT : 'NOT' | 'Not' | 'not';
NAME : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"') )*;
PHONE : AREA_CODE ' '? NUMBER?;
QUOTED_NAME : '"'.*'"';

expression : orexpression;
orexpression : andexpression (OR^ andexpression)*;
andexpression : notexpression (L_AND? notexpression)* -> ^(AND notexpression+); 
notexpression : NOT^ atom | atom;
atom :  NAME | PHONE | QUOTED_NAME | LPAREN! orexpression RPAREN!;

Input vs. Expected Output:

(123) 456-7890 -> (123) 456-7890 // single token

(123) abc -> 123 AND abc // two tokens

(123456) 789 -> 123456 AND 789 // two tokens ### currently failed

(12 34) -> 12 AND 34 // two tokens ### currently failed

(123) 456-aaaa -> 123 AND 456-aaaa // two tokens ### currently failed

abc efg AND hij -> abc AND efg AND hij // three tokens

It is very difficult for me to understand the usage of input.LA(1) or so. Very appreciated if someone could jump in helping me on this issue.

Solution

I think you are trying to put too much into lexer rules. Parsing telephone numbers like that needs more flexibility, e.g. a single space char might not be enough and what about tabs? Instead you should lex all individual tokens (numbers, punctuation etc.) as usual and do a semantic check once you have the syntax tree from the parser run.

It's up to you to decide if a the space between two tokens is just that or can be interpreted as logical operation (here AND). Neither the parser nor the lexer can know that, it depends on the context. This is why you cannot make that grammar free of ambiquities.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow