Question

I'm working on the design of a compiler for a language in which I have to use curly brace for two different purposes. I am currently stuck in writing a non-ambiguous grammar but I realized that some other language have the same behaviour. That is why I'm trying to understand how a part JS/Ecmascript grammar is built.

More precisely, I would like to understand how the parser mangage to correctly identify statement blocks like function body from objects definitions since they are both surrounded with curly braces and their content can be syntaxically the same.

I have searched for JS grammar on the web but most documentation is very messy. I found this, which is very clear : http://hepunx.rl.ac.uk/~adye/jsspec11/llr.htm but the only rule using curly braces is the following :

CompoundStatement:
      { Statements }

Does anyone could help me with this

Was it helpful?

Solution

The grammar you linked to looks incomplete and buggy (it doesn't mention object literals). Let's talk about this one instead. Here, braces occur in four rules:

ObjectLiteral ⇒
   { }
|  { FieldList }

...

Block ⇒ { BlockStatements }

...

SwitchStatement ⇒
   switch ParenthesizedExpression { }
|  switch ParenthesizedExpression { CaseGroups LastCaseGroup }

...

NamedFunction ⇒ function Identifier FormalParametersAndBody
FormalParametersAndBody ⇒ ( FormalParameters ) { TopStatements }

The function and switch cases will never clash with the block and object literal, because these are preceded by the function and switch keywords. As soon as the parser recognizes these keywords, all other alternative productions can be discarded.

However, there appears to be a ambiguity between blocks and object literals. How should the statement {} be parsed? This could either be an empty object literal, or an empty block.

The grammar prevents this by tracking whether an expression is the initial expression of the statement, and provides different productions for each case:

PrimaryExpression(normal) ⇒
   SimpleExpression
|  FunctionExpression
|  ObjectLiteral
PrimaryExpression(initial) ⇒ SimpleExpression

This disallows object literals in the first expression, and also function definitions – if a function occurs at the beginning of a statement, it always is a function definition statement, never a function expression.

This makes the JavaScript grammar unambiguous. This is important for LR grammars, since it avoids reduce/reduce conflicts.

For top-down parsers (e.g. hand-written recursive descent), we could leave the grammar ambiguous but assign priorities to each alternative. E.g. in a Statement and TopStatement, the Block and FunctionDefinition alternatives would always have to be attempted before the ExpressionStatement alternative.

For generalized parsers that can parse all context-free grammars (not just LL or LR subsets), ambiguities do not hinder the parser. Depending on the parser, it either returns one arbitrary parse tree of all valid parse trees (undesirable in practice), or iterates over all possible parse trees. An actually ambiguous grammar is not good for programming languages, since we usually want to write programs that cannot be misunderstood by the compiler.

However, there's an interesting class of unambiguous grammars that cannot be parsed by LR parsers, and requires a generalized parser. Here, we have a reduce/reduce conflict between the A and B rules, but it is just a local ambiguity and is resolved later in the program:

Top = X | Y
X = A Balanced x
Y = B Balanced y
A = a b
B = a b
Balanced = x | y | () | ( Balanced ) | ( Balanced Balanced )

Since the Balanced rule is a context-free grammar in itself, no amount of lookahead will resolve this. The input ab((y()))x has exactly one parse tree (is unambiguous), but is not parseable by common parsers. When designing a language, even local ambiguities should therefore be avoided.

Licensed under: CC-BY-SA with attribution
scroll top