Question

The question I wish to ask is succinctly given in the title. Let me give an example of the grammar in question:

identifier_list
    : identifier
    | identifier_list identifier;

lambda_arguments
    : '(' identifier_list ')'
    | identifier;

lambda
    : lambda_arguments '=>' expression;

Then we add in the normal C expression grammar, in particular:

primary_expression
    : '(' expression ')'
    | identifier
    | lambda;

The real question is, is this grammar LALR(1) parsable, i.e., capable of being parsed by automated parser generators? Or does it require a hand-rolled or GLR parser? Note that I wish to know specifically about this subsection, not the context-sensitive keywords stuff or any other section.

What I'm thinking right now is that '(' identifier ')' has two valid parses, so when the parser has seen identifier and looks ahead to ')', it won't be able to decide which parse tree to go down. This could just be a shift/reduce conflict, though, that I could eliminate by assigning some arbitrary precedence (probably favouring '(' identifier ')' ).

Edit: Actually, I was considering borrowing this subsection of the grammar for a similar feature in a new language. I already have anonymous functions grammatically similar to JavaScript's, but my guinea pigs squeak feedback complaining that they are too verbose for many uses, and pointed out C# lambda expressions as a more ideal solution. I was concerned about potential ambiguity resulting from this solution. So, really, I was only interested in that subsection. The other stuff like generics and casts are non-issues for me.

Previous editions of my grammar are mechanically parsable and I wouldn't want to lose this property, and my previous experience with a mechanical generator tells me that it's best to check here rather than try myself. For my hand-rolled parser, I could of course simply special case '(' identifier to look ahead a bit further than normal.
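
For instance (a rough sketch only; peek(), parse_lambda(), and parse_parenthesized_expression() are made-up names, not part of any existing parser), the special case could peek a token or two further than usual:

/* Hypothetical hand-rolled special case: after '(' IDENTIFIER, peek further
   before committing to either a lambda or a parenthesized expression.       */
if (peek(0) == '(' && peek(1) == IDENTIFIER) {
    if (peek(2) == ')' && peek(3) == ARROW)       /* ( x ) => ...       : lambda    */
        return parse_lambda();
    if (peek(2) == IDENTIFIER)                    /* ( x y ... ) => ... : lambda    */
        return parse_lambda();
    return parse_parenthesized_expression();      /* otherwise          : ( expr )  */
}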

Was it helpful?

Solution 2

An expression grammar augmented with C#-style lambdas is not LALR(1), but it's probably LALR(2). Consequently, it is possible (though not necessarily trivial) to produce an equivalent LALR(1) grammar: see edit below.

You're going to get a reduce/reduce conflict on the input:

( id )

because id can be reduced either to identifier_list or to expression (indirectly, in the second case), and the parser cannot tell which one is correct based on one token of lookahead (the ')').

It could tell based on two tokens of lookahead, since the identifier_list reduction is only possible if the token after the ')' is =>, and, as long as => is not an operator in your language, the expression reduction is not possible in that case. So I think it's probably LALR(2), although I can't say that with certainty.
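
For example, with these two hypothetical inputs the fate of the id inside the parentheses is decided only by the token after the ')':

( x ) => x      /* id must reduce towards identifier_list: lambda parameters    */
( x ) + y       /* id must reduce towards expression: parenthesized expression  */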

The case where there is more than one identifier is not problematic, since in

( id1 id2 )

id1 id2 cannot be reduced to an expression (in most expression languages; yours may, of course, differ). The case where a single unparenthesized identifier is immediately followed by => is also not problematic provided that '=>' is not a valid operator.

Edit

I neglected to mention in my original answer that there is no such thing as an LALR(2) language. The language recognized by a LALR(2) grammar is also recognized by some LALR(1) grammar. In fact, there is a constructive proof of this assertion, which allows the mechanical creation of such an LALR(1) grammar, along with a procedure to recover the original parse tree.

In this case, it's even simpler to generate an LALR(1) grammar, since as mentioned above there is only one production which requires additional lookahead. The solution is to delay the reduction by one token. In other words, the original grammar includes something like:

primary:           '(' expression ')'
lambda_parameters: '(' id_list ')'

where both id_list and expression derive the terminal ID. Aside from ID, the derivations of these two non-terminals are disjoint, so we could resolve the issue as follows:

primary:           '(' expression_not_id ')'
       |           '(' ID ')'


lambda_parameters: '(' id_list_not_id ')'
                 | '(' ID ')'

It remains only to divide the productions for expression and id_list so as to separate out the ID case, which turns out to be not very difficult. Below is a simplified example, which could be easily extended; it's restricted to addition, multiplication and function application (which I included to demonstrate that the two comma-separated lists are not a problem):

%token ID LITERAL RIGHT_ARROW
%start expr
%%
primary: primary_not_id | ID ;
term:    term_not_id    | ID ;
sum:     sum_not_id     | ID ;
expr:    expr_not_id    | ID ;

expr_list: expr         | expr_list ',' expr ;
arguments: '(' ')'      | '(' expr_list ')' ;

ids: ID ',' ID          | ids ',' ID ;
parameters: '(' ID ')'  | '(' ids ')' ;

primary_not_id: LITERAL
              | '(' expr_not_id ')'
              | '(' ID ')'
              | primary arguments
              ;

term_not_id: primary_not_id
           | term '*' primary
           ;

sum_not_id: term_not_id
          | sum '+' term
          ;

expr_not_id: sum_not_id
           | parameters RIGHT_ARROW expr
           ;

Note: The grammar in the OP produces lambdas with multiple parameters as a sequence of identifiers not separated by commas: (a b) => a + b. I think that the actual intention was to use commas: (a, b) => a + b, and that's what I did in the above grammar. The difference is important if your language has a comma operator, as the C family does, because in that case an expression might be '(' expression_list ')', which conflicts with a lambda parameter list. A naïve implementation would result in a reduce/reduce conflict on the first expression in the expression_list which cannot be resolved with finite lookahead, since the expression_list could be arbitrarily long.
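
For example (hypothetical inputs, assuming the language does have a C-style comma operator), the first a must be reduced differently in each of these lines, yet the token that settles the question can be arbitrarily far away:

( a , b , c ) => a      /* a starts an id_list: lambda parameter list        */
( a , b , c )           /* a starts an expression_list: comma expressions    */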

There is a solution for this case as well, though: it consists of separating id_list from expression_list, something like the following:

id_list:         ID
       |         id_list ',' ID
       ;
expression_list_not_id_list: expression_not_id
                           | id_list ',' expression_not_id
                           | expression_list_not_id_list ',' expression
                           ;
expression_list: expression_list_not_id_list
               | id_list
               ;

I didn't do a full grammar, though, since I have no idea what the target language requires.

Other tips

First off, parser theory was always one of my weak points. I mostly work on semantic analyzers.

Second, all the C# parsers I've ever worked on have been hand-generated recursive descent parsers. One of my former colleagues who does have a strong background in parser theory did build his own parser generator and fed the C# grammar into it successfully, but I do not know what sort of egregious hacks doing so entailed.

So what I'm saying here is to take this answer with the appropriate amount of skepticism.


As you note, lambdas are slightly vexing because you've got to be careful about that parenthesized expression -- it might be a parenthesized expression, a cast operator or a lambda parameter list, and the lambda parameter list could be in several different forms. But all things considered, adding lambdas to C# 3.0 was relatively easy, grammatically; hacking up the parser was not too difficult -- it was the semantic analysis that was a bear for lambdas.

The real vexing problems in the C# grammar as far as look-ahead goes are generics and casts.

Generics were added in C# 2, after the language already had >>, > and < operators, all of which can cause weird problems when you throw generics into the mix.

The classic problem is of course A ( B < C, D > ( E ) ). Does the invocation of method A take two arguments, B < C and D > (E), or one, B<C,D>( E )?

The rule to disambiguate is:

If a sequence of tokens can be parsed as a simple-name, member-access, or pointer-member-access ending with a type-argument-list, the token immediately following the closing > token is examined. If it is one of ( ) ] : ; , . ? == != then the type-argument-list is retained as part of the simple-name, member-access or pointer-member-access and any other possible parse of the sequence of tokens is discarded. Otherwise, the type-argument-list is not considered part of the simple-name, member-access or pointer-member-access, even if there is no other possible parse of the sequence of tokens.
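
Applied to the classic example above: the token following the closing > is (, which is in that list, so the type-argument-list is retained and the call has a single argument:

A ( B < C , D > ( E ) )    /* one argument: the invocation B<C,D>( E ) */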

The second problem with the grammar goes back to C# 1.0, and that's the cast operator. The problem is that (x)-y could mean "cast -y to type x" or it could mean to subtract y from x. The rule here is:

A sequence of one or more tokens enclosed in parentheses is considered the start of a cast-expression only if at least one of the following are true:

The sequence of tokens is correct grammar for a type, but not for an expression.

The sequence of tokens is correct grammar for a type, and the token immediately following the closing parentheses is the token "~", the token "!", the token "(", an identifier, a literal, or any keyword except as and is.
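
Applying those rules to the example above and to one more (both hypothetical snippets):

(x)-y        /* '-' is not in the list above: parsed as the subtraction x - y     */
(double)-y   /* "double" is a type but not an expression: parsed as a cast of -y  */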

The rules that disambiguate both cases involve potentially large look-aheads in theory, but in practice you very rarely have to back up the parser very far.

Yes, this situation is a straightforward reduce/reduce conflict.

%token identifier ARROW

%%

program
: expression
| program expression
;

identifier_list
: identifier
| identifier_list identifier;

lambda_arguments
: '(' identifier_list ')'
| identifier;

lambda
: lambda_arguments ARROW expression;

primary_expression
: '(' expression ')'
| identifier
| lambda;


expression : primary_expression ;


$ yacc -v test.6.y 
conflicts: 1 reduce/reduce

This comes exactly from not knowing which reduction to make when the next symbol is ')': are we reducing toward a lambda_arguments list or toward a primary_expression?

The parser generator has resolved it in the wrong way, by favoring the lambda list. But that means that a parenthesized expression can never be produced.

There are several ways out of this mess. Here is probably the easiest approach, a modified grammar which contains no conflicts:

%token identifier ARROW

%%

program
: expression
| program expression
;

identifier_list
: identifier
| identifier_list identifier
;

lambda_arguments
: '(' identifier identifier_list ')'
| identifier
;

primary_expression
: '(' expression ')'
| '(' expression ')' ARROW expression
| lambda_arguments ARROW expression
| identifier
;

expression : primary_expression ;

We fold the lambda syntax into primary_expression, and lambda_arguments is now either a single unparenthesized identifier, or a list of at least two identifiers.

Furthermore, there are two syntactic cases now for lambdas:

| '(' expression ')' ARROW expression
| lambda_arguments ARROW expression

So two semantic action rules have to be written. Some of the logic will be common, so it can be farmed out to a helper function which builds the syntax tree node for a lambda.

The action for the first syntactic variant has to inspect the $2 right-hand symbol and check that it is a simple primary expression consisting of an identifier token. If that is the case, the action cracks open the expression, takes out the identifier, builds a lambda list out of that identifier, and uses that list to generate the lambda syntactic node that ends up as the rule's output (the $$ value, in Yacc terms). If $2 is any other kind of expression, then a diagnostic is issued: it is bad lambda syntax, such as ( 2 + 2 ) => foo. Of course, this was accepted by the parser, which is how the rule got invoked, but it is now being semantically rejected (where "semantically" refers to a low-calorie version of the word "semantics").

The action for the second variant is straightforward: take the lambda list, body expression and make a lambda node, like before.
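
For illustration only (make_lambda, single_param_list, expr_is_identifier, and identifier_of are hypothetical AST helpers, not from the original answer), the actions for the modified grammar might be sketched roughly like this:

primary_expression
    : '(' expression ')'
        { $$ = $2; }
    | '(' expression ')' ARROW expression
        {
          /* first variant: only legal if the parenthesized expression is a bare identifier */
          if (expr_is_identifier($2))
              $$ = make_lambda(single_param_list(identifier_of($2)), $5);
          else {
              yyerror("bad lambda parameter list");   /* e.g. ( 2 + 2 ) => foo */
              $$ = NULL;
          }
        }
    | lambda_arguments ARROW expression
        { $$ = make_lambda($1, $3); }    /* second variant: the list is already built */
    | identifier
    ;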

Simply put, the lambda syntax is so tightly integrated into expression syntax, that it cannot be easily farmed out into completely separate rules that are brought in via a single production that calls for lambda being reduced to primary_expression. That is wishful thinking, because rules for a shift-reduce parser are not function calls.

I don't think the lambda-expression grammar question is interesting by itself, unless one knows the rest of the language is LALR(1).

If you want to know the answer, feed your subgrammar to an LALR(1) parser generator. If it complains about shift/reduce or reduce/reduce conflicts, it isn't LALR(1). Whether a grammar is LALR(1) is decided, by definition, by whether its LALR(1) parse tables can be built without conflicts.

Mostly one wants a parser for the whole language.

There are two interesting questions here.

1) Is C# 4.5 as a language LALR(1) at all? (That is, is there some grammar for it which is LALR(1)?) Note that a particular grammar not being LALR(1) doesn't mean there isn't another one that is.

2) Are any of the C# grammars published by Microsoft (in their many forms) LALR(1)?

I think Eric has told us that 1) isn't true. That suggests 2) isn't true either.

C++ requires infinite lookahead to resolve its templates, caused largely by the local ambiguity of ">" being interpreted either as "end of template arguments" or as "greater than". Since C# copied this, I'd expect it to have infinite lookahead requirements on generic-argument resolution too. That's definitely not LALR(1). [There's an additional mess as to whether ">>" can be treated as a shift operator while "> >" cannot, which you can't fix in the grammar because it can't see the whitespace.]

My company uses GLR for its language-processing tools, and we have a C# 4.5 grammar that works just fine. GLR means we don't have to think about how to convert a context-free grammar to an LALR(1)-compatible form (e.g., bend, twist, left/right factor, shuffle) or code ad hoc lookaheads, and thus we can focus on the problems of processing the code, not parsing it.

It does mean that casts and other constructs produce ambiguous trees at parse time, but these are easily resolved if you have type information.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow