Question

I would like to parse a set of expressions: R[3]C, R[2]C, R[3]C-R[2]C... There is a conflict I cannot solve...

Here is a part of lexer.mll:

  rule token = parse
  | 'R'            { R }
  | 'C'            { C }
  | "RC"           { RC }
  | ['0'-'9']+ as lxm { INTEGER (int_of_string lxm) }
  | '+'            { PLUS }
  | '-'            { MINUS }
  | '['            { LBRACKET }
  | ']'            { RBRACKET }
  | eof            { EOF }
  ...

A part of parser.mly:

main:
   e_expression EOF                { $1 };

e_expression:
| ec = e_cell { EE_rc (Rc.Cell ec) }
| e_expression MINUS e_expression { EE_string_EEL ("MINUS", [$1; $3]) }

e_cell:
| R LBRACKET r = index RBRACKET C c = index { (Rc.I_relative r, Rc.I_absolute c) }
| R LBRACKET r = index RBRACKET C { (Rc.I_relative r, Rc.I_relative 0) }

index:
| INTEGER { $1 }
| MINUS INTEGER { Printf.printf "%n\n" 8; 0 - $2 }

This code curiously does not work with R[3]C-R[2]C, here is parser.conflicts, that I can not really understand.

If I comment the line line | R LBRACKET r = index RBRACKET C c = index ... in e_cell, the code can parse R[3]C-R[2]C, where 3 and 2 are index, `R[3]C and R[2]C are e_cell, and R[3]C-R[2]C is e_expression.

Could anyone help?

Was it helpful?

Solution 2

So the problem seems to be that when it sees a "-" token after a ], the parser is unsure whether it is making an index, or if it is separating two expressions.

i.e. When the parser reaches R[3]C-, it isn't sure whether it needs to wait for an INTEGER to complete the e_cell and reduce, or reduce now and start work on another e_expression.

The best way to solve this would probably be to move the negative integer code into the lexer. I don't have an ocamllex install handy, but I think changing

['0'-'9']+

to

'-'? ['0'-'9']+ 

would work, and then remove the negative integer case from index (obviously this would cause a problem with the Printf statement, but you can make the internal logic more complex to account for this.

OTHER TIPS

Your grammar is not LALR(1). In fact, it's not even LR(1).

Consider the following two valid e_expressions:

R[1]C-R[2]C
R[1]C-1-R[2]C

In the first case, after we've shifted the C, we will reach the following:

R [ index ] C -R[2]C

and we'll then want it to reduce:

e_cell -R[2]C

and reduce again to

e_expression -R[2]C

and then

e_expression - e_expression

In the second case, we'll get to:

R [ index ] C -1-R[2]C 

and then

R [ index ] C - 1-R[2]C
R [ index ] C index -R[2]C
e_cell -R[2]C

(at this point, we've reached a similar position to the first input, so I'll leave out the next steps).

So, after we shift the C, the lookahead is -, and we need to either:

  1. reduce R [ index ] C to e_cell, or

  2. shift the -, giving R [ index ] C -

We can't tell which without one more lookahead: the following token must be either R (case 1) or INTEGER (case 2).

So we might say that the grammar is LALR(2), except that there is another shift-reduce conflict with respect to the minus sign which makes the grammar ambiguous and hence not LALR(k) for any k. It's possible that you've already dealt with this one using operator precedence declarations, but just in case:

Suppose you've reached:

e_expression - e_expression

and the lookahead is -. Now it could reduce e_expression - e_expression to e_expression and then shift the -, resulting in:

e_expression -

Or it could simply shift the -:

e_expression - e_expression -

No matter how much forward context we read, it's impossible to decide between these two because they both lead to valid parses. The first parse will make - left-associative, and the second one right-associative.

If you don't solve this with precedence declarations, you could choose exactly one of the following, instead of e_expression: e_expression MINUS e_expression:

e_expression: e_cell MINUS e_expression
e_expression: e_expression MINUS e_cell

Now, how to solve the original problem :)

The easiest solution, if the - in -1 can be considered just part of a negative integer, is to let the lexer handle it. Then the parser won't see a MINUS in R[-1]C-1, so it won't attempt to reduce R[-1]C.

Another solution would be to use a GLR parser (apparently there is one for OCaml, but I know nothing about it).

Finally, it is possible to mechanically create an LR(1) grammar given an LR(2) grammar, along with a mechanism to extract the original parse tree. The resulting grammar is usually bloated and painful to write by hand, but the translation could be done automatically. Unfortunately, I don't know of any OCaml tools which do this. The basic idea is to split every non-terminal into a set of pairs which become the new non-terminals. You can easily expand all the existing rules into the new set of nonterminals. Now, since each nonterminal effectively includes one lookahead token, a one-token lookahead is equivalent to a two-token lookahead in the original language.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top