Question

I am writing a parser for a pet project, and for educational purposes, would like to do it by hand instead of using a parser generator. Unfortunately, many online resources (and the compiler course I took at university) assume some sort of LL(k) grammar. I don't want to left factor the grammar though since left-recursive non-terminals such as

E ::= E * E
    | E / E
    | E + E
    | E - E
    | ...

that are possible in bison-like parser generators seem to greatly simplify the grammar.

What would be the simplest way to hand-parse such a grammar? My research so far tells me that recursive descent is not an option for reasons explained here, and that LR parsing using recursive ascent may be a good choice, though several sites pause to mention that it is non-intuitive.

Was it helpful?

Solution

If you want to build a mostly recursive descent parser for a toy language whose only left recursion is in more-or-less standard expression syntax, then you should consider a Pratt parser (Java) (Javascript). Alternatively (and equivalently, as it turns out), you can use an operator precedence parser; the algorithm is well described in the Dragon book and there are lots of available examples using the shunting yard algorithm (but beware; many people who enthusiastically discover this algorithm and immediately write it up for their blog are far from reliable authorities.)


Note for loose parsing:

If you want to parse an expression grammar, and you don't care about operator precedence (for example, if you only need to syntax colour the code), you can easily reframe the expression grammar to avoid left-recursion.

The starting point is this, using * for the Kleene star, ? for optional, and ( ) for grouping:

expression: term (INFIX term)*
term: PREFIX* operand postfix*
operand: ID
       | CONSTANT 
       | '(' expression ')'
       ;
postfix: POSTFIX
       | '[' expression ']'
       | '(' ( expression (',' expression)* )? ')' 
       ;

Recursive descent parsers can easily deal with the * and ? operators in the above, and there is no left-recursion, so that should suffice.

Note that C has the awkwardness of cast expressions, which are syntactically indistinguishable from function calls unless you know that the first parenthesized expression contains a type instead of an expression. However, for loose parsing purposes, it's fine to parse them as though they were function calls, using the above grammar.

C++'s use of angle brackets both as operators and as template parameters is harder to deal with. Many syntax colourers I've seen just ignore the template parameter case altogether; getting them right is a challenge, but might be worthwhile.


Editorial comment:

Ignore this if you like, but I personally think that learning to work with bottom-up LR parsers, particularly GLR parsers, is a far more enriching educational experience than trying to shoehorn a language into a restricted parsing algorithm, particularly one in which the details of the grammar are obscured as code paths. But having said that, it may be valuable to do both a bison and hand-generated parser, if only to see how much more elegant and simple the bison parser can be.

OTHER TIPS

Recursive descent parsers are easy, once you understand that they are just the inverse (or is that the converse?) of BNF.

A recent chunk of one that I wrote:

/**
 * Parse an addition or subtraction expression.
 *
 * <p/><code><pre>
 * AdditiveExpression
 *     MultiplicativeExpression
 *     AdditiveExpression "+" MultiplicativeExpression
 *     AdditiveExpression "-" MultiplicativeExpression
 * </pre></code>
 *
 * @return an expression
 */
Expression parseAdditiveExpression()
    {
    Expression expr = parseMultiplicativeExpression();
    while (peek().getId() == Id.ADD || peek().getId() == Id.SUB)
        {
        expr = new RelOpExpression(expr, current(), parseMultiplicativeExpression());
        }
    return expr;
    }

/**
 * Parse a multiplication / division / modulo expression.
 *
 * <p/><code><pre>
 * MultiplicativeExpression
 *     PrefixExpression
 *     MultiplicativeExpression "*" PrefixExpression
 *     MultiplicativeExpression "/" PrefixExpression
 *     MultiplicativeExpression "%" PrefixExpression
 *     MultiplicativeExpression "/%" PrefixExpression
 * </pre></code>
 *
 * @return an expression
 */
Expression parseMultiplicativeExpression()
    {
    Expression expr = parsePrefixExpression();
    while (true)
        {
        switch (peek().getId())
            {
            case MUL:
            case DIV:
            case MOD:
            case DIVMOD:
                expr = new RelOpExpression(expr, current(), parsePrefixExpression());
                break;

            default:
                return expr;
            }
        }
    }

I know lots of people who swear by the tools that allow the automation of parser generation, but I personally enjoy understanding the details of how a language is lexed and parsed, so I never let the tools do this fun work for me.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top