Your grammar is not LALR(1)
. In fact, it's not even LR(1)
.
Consider the following two valid e_expression
s:
R[1]C-R[2]C
R[1]C-1-R[2]C
In the first case, after we've shifted the C
, we will reach the following:
R [ index ] C -R[2]C
and we'll then want it to reduce:
e_cell -R[2]C
and reduce again to
e_expression -R[2]C
and then
e_expression - e_expression
In the second case, we'll get to:
R [ index ] C -1-R[2]C
and then
R [ index ] C - 1-R[2]C
R [ index ] C index -R[2]C
e_cell -R[2]C
(at this point, we've reached a similar position to the first input, so I'll leave out the next steps).
So, after we shift the C
, the lookahead is -
, and we need to either:
reduce R [ index ] C
to e_cell
, or
shift the -
, giving R [ index ] C -
We can't tell which without one more lookahead: the following token must be either R
(case 1) or INTEGER
(case 2).
So we might say that the grammar is LALR(2), except that there is another shift-reduce conflict with respect to the minus sign which makes the grammar ambiguous and hence not LALR(k) for any k. It's possible that you've already dealt with this one using operator precedence declarations, but just in case:
Suppose you've reached:
e_expression - e_expression
and the lookahead is -
. Now it could reduce e_expression - e_expression
to e_expression
and then shift the -
, resulting in:
e_expression -
Or it could simply shift the -
:
e_expression - e_expression -
No matter how much forward context we read, it's impossible to decide between these two because they both lead to valid parses. The first parse will make -
left-associative, and the second one right-associative.
If you don't solve this with precedence declarations, you could choose exactly one of the following, instead of e_expression: e_expression MINUS e_expression
:
e_expression: e_cell MINUS e_expression
e_expression: e_expression MINUS e_cell
Now, how to solve the original problem :)
The easiest solution, if the -
in -1
can be considered just part of a negative integer, is to let the lexer handle it. Then the parser won't see a MINUS
in R[-1]C-1
, so it won't attempt to reduce R[-1]C
.
Another solution would be to use a GLR parser (apparently there is one for OCaml, but I know nothing about it).
Finally, it is possible to mechanically create an LR(1) grammar given an LR(2) grammar, along with a mechanism to extract the original parse tree. The resulting grammar is usually bloated and painful to write by hand, but the translation could be done automatically. Unfortunately, I don't know of any OCaml tools which do this. The basic idea is to split every non-terminal into a set of pairs which become the new non-terminals. You can easily expand all the existing rules into the new set of nonterminals. Now, since each nonterminal effectively includes one lookahead token, a one-token lookahead is equivalent to a two-token lookahead in the original language.