I'll do these backwards, because that way we go from easiest to hardest. In fact, I don't really have a solution for the first conflict.
The third conflict is the result of an actual ambiguity in the grammar. You need to get rid of the ambiguity:
Rule 96 enclosure -> parenth_form
Rule 97 enclosure -> set_display
Rule 99 parenth_form -> LPAREN expression RPAREN
Rule 102 set_display -> LPAREN argument_list RPAREN
Rule 133 argument_list -> expression
Consequently, if we're looking for an enclosure
and we find a simple parenthesized expression, it could either be a parenth_form
or it could be a set_display
containing an argument_list
of exactly one expression. I suspect that the intention here is that a simple parenthesized expression would be a parenth_form
, but there's no way to tell from the grammar.
The simplest solution would be to just get rid of parenth_form
altogether, and check for the case of a one-element argument_list
when you build the AST node for a set_display
corresponding to rule 102. Another possibility is to be explicit about it; change Rule 102 to require the set_display
to have at least two expressions:
set_display -> LPAREN expression COMMA argument_list RPAREN
That still requires you to juggle the AST, though, because you have to prepend the expression
to the argument_list
when you build the set_display
node.
The second S/R conflict is actually quite similar. It arises because of:
Rule 104 list_display -> LBRACKET expression COMMA expression RANGE expression RBRACKET
Rule 105 list_display -> LBRACKET argument_list RBRACKET
So:
LBRACKET expression COMMA expression ...
would require reduction by Rule 104 if the following symbol is RANGE
; by Rule 105 if the following symbol is RBRACKET
; and by Rule 134 if the following symbol is COMMA
. (That's a rough approximation, since it assumes that we already know the end of the second expression
.) As written, though, the grammar needs to commit to one of these paths as soon as it sees the first COMMA
, because it needs to decide at that moment whether to create an argument_list
or not.
The solution is to delay the parser's decision, which is easy but ugly:
list_display -> LBRACKET expression RANGE expression RBRACKET
list_display -> LBRACKET expression COMMA expression RANGE expression RBRACKET
list_display -> LBRACKET expression RBRACKET
list_display -> LBRACKET expression COMMA argument_list RBRACKET
Now, the first COMMA
is always shifted and the decision on what type of list_display
to reduce is delayed until the end of the second expression
(if there are two expression
s), but it's necessary to juggle the AST for the last two productions to correct the argument_list
.
The first S/R conflict arises because IN
is used both as an operator and as a syntactic part of an iterator
:
Rule 44 comparison -> sum IN sum
Rule 137 iterator -> target IN expression
But because target
is just an expression
, and expression
can derive sum
, it's not possible (most of the time) for the parser to know which IN
it's looking at until much later in the parse.
The previous technique of deferring the decision won't work here, because you need to know which type of IN
you're looking at in order to correctly apply operator precedence. Suppose we're in a context where we need an iterator
and the input is:
atom1 AND atom2 IN atom3
If that's the iterator (i.e., the next symbol is COMMA
or RPAREN
), then that is, effectively:
( atom1 AND atom2 ) IN atom3
However, if that's the left-hand side of an iterator, then it needs to be parsed completely differently:
( atom1 AND ( atom2 IN atom3 ) ) IN expression
Moreover, atom3
could have been an arbitrary expression, perhaps atom3 AND atom4
, leading to the two parses:
( atom1 AND atom2 ) IN ( atom3 AND atom4 )
( atom1 AND ( atom2 IN atom3 ) AND atom4 ) IN expression
This is why puns are bad in language design.
I strongly suspect that there is no LR(k)
grammar which will be able to parse that particular corner of your language, although that's just based on intuition; I have no proof. However, a GLR parser would have no trouble with it, because it is not actually ambiguous. I don't know if there is a GLR parser generator in Python; if you're not tied to Python, you could certainly use bison
.
The GLR parser would also have solved the second conflict, which is also not the result of an ambiguity.