Question

In the appendices of the Dragon-book, a LL(1) front end was given as a example. I think it is very helpful. However, I find out that for the context free grammar below, a at least LL(2) parser was needed instead.

statement : variable ':=' expression
          | functionCall

functionCall : ID'(' (expression ( ',' expression )*)? ')'
             ;
variable : ID
         | ID'.'variable
         | ID '[' expression ']'
         ;

How could I adapt the lexer for LL(1) parser to support k look ahead tokens? Are there some elegant ways?

I know I can add some buffers for tokens. I'd like to discuss some details of programming.


this is the Parser:

class Parser
{
    private Lexer lex;
    private Token look;
    public Parser(Lexer l)
    {
        lex = l;
        move();
    }

    private void move()
    {
        look = lex.scan();
    }
}

and the Lexer.scan() returns the next token from the stream.

Was it helpful?

Solution

In effect, you need to buffer k lookahead tokens in order to do LL(k) parsing. If k is 2, then you just need to extend your current method, which buffers one token in look, using another private member look2 or some such. For larger k, you could use a ring buffer.

In practice, you don't need the full lookahead all the time. Most of the time, one-token lookahead is sufficient. You should structure the code as a decision tree, where future tokens are only consulted if necessary to resolve ambiguity. (It's often useful to provide a special token type, "unknown", which can be assigned to the buffered token list to indicate that the lookahead hasn't reached that point yet. Alternatively, you can just always maintain k tokens of lookahead; for handbuilt parsers, that can be simpler.)

Alternatively, you can use a fallback structure where you simply try one alternative and if that doesn't work, instead of reporting a syntax error, restore the state of the parser and lexer to the next alternative. In this model, the lexer takes as an explicit argument the current input buffer position, and the input buffer needs to be rewindable. However, you can use a lookahead buffer to effectively memoize the lexer function, which can avoid rewinding and rescanning. (Scanning is usually fast enough that occasional rescans don't matter, so you might want to put off adding code complexity until your profiling indicates that it would be useful.)

Two notes:

1) I'm skeptical about the rule:

functionCall : ID'(' (expression ( ',' expression )*)* ')'
             ;

That would allow, for example:

function(a[3], b[2] c[x] d[y], e.foo)

which doesn't look right to me. Normally, you'd mark the contents of the () as optional instead of repeatable, eg. using an optional marker ? instead of the second Kleene star *:

functionCall : ID'(' (expression ( ',' expression )*)? ')'
             ;

2) In my opinion, you really should consider using bottom-up parsing for an expression language, either a generated LR(1) parser or a hand-built Pratt parser. LL(1) is rarely adequate. Of course, if you're using a parser generator, you can use tools like ANTLR which effectively implement LL(∞); that will take care of the lookahead for you.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top