Resolving shift/reduce conflicts with PLY

Question

I'll do these backwards, because that way we go from easiest to hardest. In fact, I don't really have a solution for the first conflict.

The third conflict is the result of an actual ambiguity in the grammar. You need to get rid of the ambiguity:

Rule 96    enclosure -> parenth_form
Rule 97    enclosure -> set_display
Rule 99    parenth_form -> LPAREN expression RPAREN
Rule 102   set_display -> LPAREN argument_list RPAREN
Rule 133   argument_list -> expression

Consequently, if we're looking for an enclosure and we find a simple parenthesized expression, it could either be a parenth_form or it could be a set_display containing an argument_list of exactly one expression. I suspect that the intention here is that a simple parenthesized expression would be a parenth_form, but there's no way to tell from the grammar.

The simplest solution would be to just get rid of parenth_form altogether, and check for the case of a one-element argument_list when you build the AST node for a set_display corresponding to rule 102. Another possibility is to be explicit about it; change Rule 102 to require the set_display to have at least two expressions:

set_display -> LPAREN expression COMMA argument_list RPAREN

That still requires you to juggle the AST, though, because you have to prepend the expression to the argument_list when you build the set_display node.

The second S/R conflict is actually quite similar. It arises because of:

Rule 104   list_display -> LBRACKET expression COMMA expression RANGE expression RBRACKET
Rule 105   list_display -> LBRACKET argument_list RBRACKET

So:

LBRACKET expression COMMA expression ...

would require reduction by Rule 104 if the following symbol is RANGE; by Rule 105 if the following symbol is RBRACKET; and by Rule 134 if the following symbol is COMMA. (That's a rough approximation, since it assumes that we already know the end of the second expression.) As written, though, the grammar needs to commit to one of these paths as soon as it sees the first COMMA, because it needs to decide at that moment whether to create an argument_list or not.

The solution is to delay the parser's decision, which is easy but ugly:

list_display -> LBRACKET expression RANGE expression RBRACKET
list_display -> LBRACKET expression COMMA expression RANGE expression RBRACKET
list_display -> LBRACKET expression RBRACKET
list_display -> LBRACKET expression COMMA argument_list RBRACKET

Now, the first COMMA is always shifted and the decision on what type of list_display to reduce is delayed until the end of the second expression (if there are two expressions), but it's necessary to juggle the AST for the last two productions to correct the argument_list.

The first S/R conflict arises because IN is used both as an operator and as a syntactic part of an iterator:

Rule 44    comparison -> sum IN sum
Rule 137   iterator -> target IN expression

But because target is just an expression, and expression can derive sum, it's not possible (most of the time) for the parser to know which IN it's looking at until much later in the parse.

The previous technique of deferring the decision won't work here, because you need to know which type of IN you're looking at in order to correctly apply operator precedence. Suppose we're in a context where we need an iterator and the input is:

atom1 AND atom2 IN atom3

If that's the iterator (i.e., the next symbol is COMMA or RPAREN), then that is, effectively:

( atom1 AND atom2 ) IN atom3

However, if that's the left-hand side of an iterator, then it needs to be parsed completely differently:

( atom1 AND ( atom2 IN atom3 ) ) IN expression

Moreover, atom3 could have been an arbitrary expression, perhaps atom3 AND atom4, leading to the two parses:

( atom1 AND atom2 ) IN ( atom3 AND atom4 )
( atom1 AND ( atom2 IN atom3 ) AND atom4 ) IN expression

This is why puns are bad in language design.

I strongly suspect that there is no LR(k) grammar which will be able to parse that particular corner of your language, although that's just based on intuition; I have no proof. However, a GLR parser would have no trouble with it, because it is not actually ambiguous. I don't know if there is a GLR parser generator in Python; if you're not tied to Python, you could certainly use bison.

The GLR parser would also have solved the second conflict, which is also not the result of an ambiguity.