سؤال

I reformulate a question I asked previously. The purpose is to understand how precedence works in parsing.

I would like to parse a statement a(3).value = 100. parser.mly as follows stops after reading ., and returns an error.

However, if I move the part dedicated to argument_list (en-globed by begin and end) to the end of the file (thus it is after l_expression), the parsing works well.

In the case where the statement is parsed, it is reduced to a let_statement of data_manipulation_statement; a(3).value is reduced to member_access_expression; a(3) is reduced to a index_expression; and 3 is reduced to argument_list.

In the case where the statement cannot be parsed, it seems that it tries to reduce the beginning of the statement to a call_statement of control_statement... it ends by an error.

I always thought that, no matter what the precedence are, parsing always rejects the reductions which cannot end by succeeding. But there it seems that it tried and failed, and refused to try other possibilities...

Could anyone help to clarify?

statement:
| control_statement { $1 }
| data_manipulation_statement  { BS_DMS $1 }

control_statement: | control_statement_except_multiline_if { BS_CSEMI $1 }
control_statement_except_multiline_if: | call_statement { $1 }
call_statement: | simple_name_expression argument_list { CSEMI_SNE_AL ($1, $2) }
data_manipulation_statement: | let_statement { $1 }
let_statement: | l_expression EQUAL expression { DMS_let (None, $1, $3) } 

simple_name_expression: | name { $1 } 
index_expression: | l_expression LPAREN argument_list RPAREN { IE_LE_AL ($1, $3) }
member_access_expression: | l_expression DOT unrestricted_name { MAE_LE_UN ($1, $3) }
literal_expression: | INTEGER { LIE_INT $1 }

unrestricted_name: | name { UN_N $1 }
name: | untyped_name { N_UN $1 }
untyped_name: | IDENTIFIER { UN_I $1 } 

expression: 
| l_expression { E_LE $1 }
| value_expression { E_VE $1 }

value_expression:
| literal_expression { VE_LIE $1 }
| parenthesized_expression { VE_PE $1 }

(***** begin argument_list *****)
argument_list: | positional_or_named_argument_list { AL_PONAL $1 }

positional_or_named_argument_list:
| positional_argument_COMMAs required_positional_argument { PONAL_PAs_RPA (List.rev $1, $2) }
positional_argument_COMMAs:
  /* empty */ { [] }
| positional_argument_COMMAs positional_argument COMMA { $2 :: $1 }

positional_argument: | argument_expression { $1 }
required_positional_argument: | argument_expression { $1 }
argument_expression: | expression { AE_expression (None, $1) }
(***** end argument_list *****)

l_expression:
| simple_name_expression { LE_SNE $1 } 
| index_expression { LE_IE $1 } 
| member_access_expression { LE_MAE $1 } 
هل كانت مفيدة؟

المحلول

There is no absolute rule about how parsing proceeds. It depends on the parser. Arguably, a correct parser should "always rejects the reductions which cannot end by succeeding", but you cannot in general do that with a linear-time left-to-right parser. A GLR parser (which bison can generate) could do that, possibly in up to cubic time, depending on the grammar. A backtracking parser -- such as most parsing combinator libraries -- can do that, but it's not easy to estimate the complexity of the algorithm. But an *LR(1) parser, which I think you are using, based on your tags, can only parser *LR(1) grammars, and your grammar is not one of those.

An LR(1) parser proceeds left-to-right (L) over the input, one token at a time. After reading each token, it "reduces" (that is, matches a production) all productions which end at the rightmost token (R) of the input. To do this, it uses as much information as it can keep about the progress of the parse ("the parse stack") and the next (1) token ("the lookahead"). They work with a finite-state pushdown automaton, and there are several ways of constructing these automata, which provide different approximations to the ideal PDA, but for the purposes of your grammar that doesn't matter. I believe ocamlyacc generates LALR(1) parsers, like the original yacc, but your grammar isn't even LR(1), so it certainly isn't parsable by a simplified LR(1) parser.

In the above description, I have not used the word "precedence" because it does not apply to LR parsing. There are such things as precedence parsers -- which are generally less powerful than LR parsers, but also easier to construct -- but they are not so common any more except in the form of hand-written parsers. (I'm not sure if there were ever precedence-grammar parser generators; once LALR(1) parsing was discovered, it rapidly took over the parser-generator market because it is more powerful than either precedence or LL(1) parsing.) However, "precedence" was added to yacc at some early stage in its development in order to cope with ambiguous grammars; or, often, grammars which could have been written unambiguously but which are more compact in their ambiguous form.

When an LR parser is deciding what to do, it is possible that there is more than one alternative. It could be because there is more than one possible reduction (a "reduce-reduce" conflict) or because it's not clear whether the available reduction is correct (a "shift-reduce" conflict). The conflicts arise because the grammar is ambiguous, or because it cannot be LR parsed with only the specified lookahead.

In either event, a pedantic LR parser generator would just report failure. But a pragmatic parser generator might try to figure out which of the alternatives is the desired one, and proceed accordingly. A simple way of doing this is to give one alternative precedence over the other one(s), based on some programmer-provided declaration (a "precedence declaration") or based on some heuristic ("prefer whatever comes first in the grammar file"). Such heuristics are dangerous in the sense that they can result in a parser which does not actually parse the language you expect to be parsed; IMHO no parser generator should apply a heuristic without at least warning you that it has done so.

Now, the practical issue: why your grammar is not LR(1).

Let's start with the following extract.

call_statement: simple_name_expression argument_list
let_statement: l_expression EQUAL expression
l_expression: index_expression
l_expression: simple_name_expression
index_expression: l_expression LPAREN argument_list RPAREN

Now consider the input: a (. That is, we've just read the token a and the lookahead token is (. a must be reduced to simple_name_expression (through several unit productions but the details are irrelevant). Now at this point, the parser state would include: (· indicates the current "point" in the productions):

call_statement:   simple_name_expression · argument_list
index_expression: l_expression · LPAREN argument_list RPAREN
l_expression:     simple_name_expression ·

That is, we could leave simple_name_expression as it is and go on to collect the argument_list, or we could reduce simple_name_expression to l_expression in order to go on to collect the rest of index_expression. Which one is it?

Unfortunately, we can't know the answer until at least when we reach the matching RPAREN. Even then, we might not know the answer, since the next token might be something which extends an l_expression (such as a . or another LPAREN), in which case we'll need to continue scanning. There's no telling when we'll find the answer, so no finite lookahead will suffice and this grammar -- although possibly unambiguous -- is not LR(k) for any k. (And precedence heuristics won't work, either.)

In passing, it's not clear to me that your grammar for call_statement is what you wanted. An argument_list can start with an ( since it must start with an expression and an expression can be parenthesized, but it doesn't have to start with a parenthesis. So the following would be a legitimate call_statement:

print a(3), value

If that's what you wanted, fine.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top