Question

I'm working on a compiler (for a language close to C) and I have to implement it in C. My main question is how to choose the right parsing method so that I can be efficient while coding my compiler.

Here's my current grammar: http://img11.hostingpics.net/pics/273965Capturedcran20130417192526.png

I was thinking about writing a top-down LL(1) parser, as described here: http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/07-Top-Down-Parsing.pdf

Would that be an efficient choice for this grammar, knowing that I first have to remove the left-recursive rules? Do you have any other advice?

Thank you, Mentinet

Solution 4

The most efficient parsers are LR parsers, but LR parsers are a bit difficult to implement by hand. You can go with the recursive descent parsing technique, as it is easier to implement in C.
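
As a rough illustration of what that looks like (a toy grammar, not the asker's; the token kinds and the tiny string lexer are made up for this sketch), a recursive-descent parser in C is just one function per nonterminal, each deciding what to do by looking at the current token:

#include <stdio.h>
#include <stdlib.h>

/* Toy grammar:  Exp -> Term { ('+'|'-') Term }    Term -> NUMBER | '(' Exp ')' */
typedef enum { TOK_NUM, TOK_PLUS, TOK_MINUS, TOK_LPAREN, TOK_RPAREN, TOK_EOF } TokKind;

static const char *src = "1 + (2 - 3)";     /* stand-in for a real scanner's input */

static TokKind next_token(void) {
    while (*src == ' ') src++;
    switch (*src) {
    case '\0': return TOK_EOF;
    case '+':  src++; return TOK_PLUS;
    case '-':  src++; return TOK_MINUS;
    case '(':  src++; return TOK_LPAREN;
    case ')':  src++; return TOK_RPAREN;
    default:
        if (*src >= '0' && *src <= '9') {
            while (*src >= '0' && *src <= '9') src++;
            return TOK_NUM;
        }
        fprintf(stderr, "lexical error at '%c'\n", *src);
        exit(1);
    }
}

static TokKind cur;                              /* one token of lookahead */
static void advance(void) { cur = next_token(); }

static void expect(TokKind k, const char *what) {
    if (cur != k) { fprintf(stderr, "syntax error: expected %s\n", what); exit(1); }
    advance();
}

static void parse_exp(void);

static void parse_term(void) {                   /* Term -> NUMBER | '(' Exp ')' */
    if (cur == TOK_NUM) {
        advance();
    } else if (cur == TOK_LPAREN) {
        advance();
        parse_exp();
        expect(TOK_RPAREN, "')'");
    } else {
        fprintf(stderr, "syntax error in term\n");
        exit(1);
    }
}

static void parse_exp(void) {                    /* Exp -> Term { ('+'|'-') Term } */
    parse_term();
    while (cur == TOK_PLUS || cur == TOK_MINUS) {
        advance();
        parse_term();
    }
}

int main(void) {
    advance();                                   /* prime the lookahead */
    parse_exp();
    expect(TOK_EOF, "end of input");
    puts("input parses");
    return 0;
}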

OTHER TIPS

The most efficient way to build a parser is to use a dedicated tool whose purpose of existence is to build parsers. They used to be called compiler compilers, but nowadays the focus has shifted (broadened) to language workbenches, which provide you with more aid in building your own language. For instance, almost any language workbench will provide you with IDE support and syntax highlighting for your language right off the bat, just by looking at a grammar. They also help immensely with debugging your grammar and your language (you didn't expect left recursion to be the biggest of your problems, did you?).

There are several well-supported and actively developed language workbenches you could pick from.

If you are really so inclined, or if you are considering writing a parser yourself just for amusement and experience, the best modern algorithms are SGLR, GLL and Packrat. Each of those is the quintessence of half a century of algorithmic research, so do not expect to understand them fully in a flash, and do not expect any good to come out of the first couple of "fixes" you come up with. If you do come up with a nice improvement, though, do not hesitate to share your findings with the authors, or publish them otherwise!

Lots of answers here, but they get things confused. Yes, there are LL and LR parsers, but that isn't really your choice.

You have a grammar. There are tools that automatically create a parser for you, given a grammar. The venerable Yacc and Bison do this. They create an LR parser (LALR, actually). There are also tools that create an LL parser for you, like ANTLR. The downside of tools like this is that they are inflexible. Their automatically generated syntax error messages suck, error recovery is hard, and the older ones encourage you to factor your code in one particular way - which happens to be the wrong way. The right way is to have your parser spit out an Abstract Syntax Tree, and then have the compiler generate code from that. The tools want you to mix parser and compiler code.
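
To make the "spit out an Abstract Syntax Tree" point concrete, here is a hypothetical sketch in C (the node layout and names are illustrative, not from any particular compiler): the parser's only output is a tree of nodes like this, and the code generator is a separate pass that walks it.

/* Hypothetical AST node: the parser allocates and links these, nothing more. */
typedef enum { AST_INT, AST_ID, AST_BINOP, AST_CALL, AST_IF, AST_WHILE } AstKind;

typedef struct Ast {
    AstKind     kind;
    int         op;         /* operator for AST_BINOP: '+', '*', ...    */
    long        value;      /* literal value for AST_INT                */
    const char *name;       /* identifier for AST_ID / AST_CALL         */
    struct Ast *kids[3];    /* children: operands, condition, body, ... */
} Ast;

/* The compiler proper is then a separate pass over the finished tree,
 * not a pile of actions buried inside grammar rules: */
void check_types(const Ast *root);
void gen_code(const Ast *root);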

When you are using automated tools like this, the differences in power between LL, LR and LALR really do matter. You can't "cheat" to extend their power. (Power in this case means being able to generate a parser for a valid context-free grammar. A valid context-free grammar is one that generates a unique, correct parse tree for every input, or correctly says it doesn't match the grammar.) We currently have no parser generator that can create a parser for every valid grammar. However, LR can handle more grammars than any other sort. Not being able to handle a grammar isn't a disaster, as you can rewrite the grammar in a form the parser generator can accept. However, it isn't always obvious how that should be done, and worse, it affects the Abstract Syntax Tree generated (for example, removing left recursion for an LL tool turns a naturally left-leaning expression tree into a right-leaning one, and later passes have to restore the associativity), which means weaknesses in the parser ripple down through the rest of your code - like the compiler.

The reason there are LL, LALR and LR parsers is that, a long time ago, the job of generating an LR parser was taxing for the computers of the day, both in terms of time and memory. (Note this is the time it takes to generate the parser, which only happens when you write it. The generated parser runs very quickly.) But that was a looong time ago. Generating an LR(1) parser takes far less than 1 GB of RAM for a moderately complex language, and on a modern computer it takes under a second. For that reason you are far better off with an automatic LR parser generator, like Hyacc.

The other option is to write your own parser. In this case there is really only one choice: an LL parser. When people here say writing an LR parser is hard, they understate the case. It is near impossible for a human to create an LR parser manually. You might think this means that if you write your own parser you are constrained to LL(1) grammars. But that isn't quite true. Since you are writing the code, you can cheat. You can look ahead an arbitrary number of symbols, and because you don't have to output anything until you are good and ready, the Abstract Syntax Tree doesn't have to match the grammar you are using. This ability to cheat makes up for all of the power lost between LL(1) and LR(1), and often then some.
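
A sketch of that kind of cheating, assuming the scanner has already produced an array of tokens (the names here - Token, toks, peek and the parse_* functions - are made up for the example): peeking two tokens ahead costs no more than peeking one, so a decision that isn't LL(1) is still easy to make.

/* Arbitrary lookahead over a pre-scanned token array (hypothetical names). */
typedef enum { TOK_EOF, TOK_ID, TOK_LPAREN, TOK_ASSIGN, TOK_OTHER } TokKind;
typedef struct { TokKind kind; const char *text; } Token;

static Token *toks;     /* filled in by the scanner (not shown) */
static int    ntoks;
static int    pos;      /* index of the current token           */

static TokKind peek(int k) {            /* kind of the token k positions ahead */
    int i = pos + k;
    return i < ntoks ? toks[i].kind : TOK_EOF;
}

/* Assumed to exist elsewhere in the parser: */
void parse_funcall(void);
void parse_assignment(void);
void parse_other_stmt(void);

/* "id ("  starts a call and  "id ="  starts an assignment: a single token of
 * lookahead cannot tell them apart, but two tokens can. */
void parse_stmt(void) {
    if (peek(0) == TOK_ID && peek(1) == TOK_LPAREN)
        parse_funcall();
    else if (peek(0) == TOK_ID && peek(1) == TOK_ASSIGN)
        parse_assignment();
    else
        parse_other_stmt();
}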

Hand-written parsers have their downsides, of course. There is no guarantee that your parser actually matches your grammar, and for that matter no check that your grammar is valid (i.e. recognises the language you think it does). They are longer, and they are even worse at encouraging you to mix parsing code with compiler code. They are also obviously implemented in only one language, whereas parser generators often spit out their results in several different languages. Even if they don't, an LR parse table can be represented in a data structure containing only constants (say, in JSON), and the actual parser is only 100 lines of code or so. But there are also upsides to a hand-written parser. Because you wrote the code, you know what is going on, so it is easier to do error recovery and generate sane error messages.
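
To back up the "100 lines of code or so" claim, here is a sketch of a table-driven shift-reduce driver in C for the deliberately tiny grammar E -> E '+' n | n. The tables below were worked out by hand for this toy grammar; a generator would emit the same kind of constant data (and a lot more of it) for a real language.

#include <stdio.h>

/* Terminals, and the two productions:  1: E -> E '+' n    2: E -> n */
enum { T_n, T_plus, T_end, NTERMS };
enum { ACC = 100 };                          /* marker for the accept action */

/* ACTION[state][terminal]:  >0 shift to that state,  <0 reduce by production
 * -value,  0 syntax error,  ACC accept.   GOTO_E[state]: goto on nonterminal E. */
static const int ACTION[5][NTERMS] = {
    /*           n    +    $   */
    /* 0 */  {   2,   0,   0  },
    /* 1 */  {   0,   3,  ACC },
    /* 2 */  {   0,  -2,  -2  },
    /* 3 */  {   4,   0,   0  },
    /* 4 */  {   0,  -1,  -1  },
};
static const int GOTO_E[5]  = { 1, 0, 0, 0, 0 };
static const int RHS_LEN[3] = { 0, 3, 1 };   /* lengths of the right-hand sides */

int main(void) {
    static const int input[] = { T_n, T_plus, T_n, T_plus, T_n, T_end };
    int stack[64];                           /* stack of parser states */
    int sp = 0, ip = 0;
    stack[0] = 0;                            /* start in state 0       */

    for (;;) {
        int act = ACTION[stack[sp]][input[ip]];
        if (act == ACC) {                    /* accept */
            puts("accepted");
            return 0;
        } else if (act > 0) {                /* shift  */
            stack[++sp] = act;
            ip++;
        } else if (act < 0) {                /* reduce */
            sp -= RHS_LEN[-act];             /* pop the right-hand side      */
            stack[sp + 1] = GOTO_E[stack[sp]];
            sp++;                            /* push the goto state for E    */
        } else {
            puts("syntax error");
            return 1;
        }
    }
}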

In the end, the tradeoff often works like this:

  • For one-off jobs, you are far better off using an LR(1) parser generator. The generator will check your grammar, save you work, and modern ones can spit out the Abstract Syntax Tree directly, which is exactly what you want.
  • For highly polished tools like mcc or gcc, use a hand-written LL parser. You will be writing lots of unit tests to guard your back anyway, error recovery and error messages are much easier to get right, and they can recognise a larger class of languages.

The only other question I have is: why C? Compilers aren't generally time-critical code. There are very nice parsing packages out there that will allow you to get the job done in half the code, if you are willing to have your compiler run a fair bit slower - my own Lrparsing, for instance. Bear in mind that "a fair bit slower" here means "hardly noticeable to a human". I guess the answer is "the assignment I am working on specifies C". To give you an idea of how simple getting from your grammar to a parse tree becomes when you relax that requirement, this program:

#!/usr/bin/python

from lrparsing import *

class G(Grammar):
  Exp = Ref("Exp")
  int = Token(re='[0-9]+')
  id = Token(re='[a-zA-Z][a-zA-Z0-9_]*')
  ActArgs = List(Exp, ',', 1)
  FunCall = id + '(' + Opt(ActArgs) + ')'
  Exp = Prio(
      id | int | Tokens("[]", "False True") | Token('(') + List(THIS, ',', 1, 2) + ')' |
      Token("! -") + THIS,
      THIS << Tokens("* / %") << THIS,
      THIS << Tokens("+ -") << THIS,
      THIS << Tokens("== < > <= >= !=") << THIS,
      THIS << Tokens("&&") << THIS,
      THIS << Tokens("||") << THIS,
      THIS << Tokens(":") << THIS)
  Type = (
      Tokens("", "Int Bool") |
      Token('(') + THIS + ',' + THIS + ')' |
      Token('[') + THIS + ']')
  Stmt = (
      Token('{') + THIS * Many + '}' |
      Keyword("if") + '(' + Exp + ')' << THIS + Opt(Keyword('else') + THIS) |
      Keyword("while") + '(' + Exp + ')' + THIS |
      id + '=' + Exp + ';' |
      FunCall + ';' |
      Keyword('return') + Opt(Exp) + ';')
  FArgs = List(Type + id, ',', 1)
  RetType = Type | Keyword('void')
  VarDecl = Type + id + '=' + Exp + ';'
  FunDecl = (
      RetType + id + '(' + Opt(FArgs) + ')' +
      '{' + VarDecl * Many + Stmt * Some + '}')
  Decl = VarDecl | FunDecl
  Prog = Decl * Some
  COMMENTS = Token(re="/[*](?:[^*]|[*][^/])*[*]/") | Token(re="//[^\n\r]*")
  START = Prog

EXAMPLE = """\
Int factorial(Int n) {
  Int result = 1;
  while (n > 1) {
    result = result * n;
    n = n - 1;
  }
  return result;
}
"""

parse_tree = G.parse(EXAMPLE)
print G.repr_parse_tree(parse_tree)

Produces this output:

(START (Prog (Decl (FunDecl
  (RetType (Type 'Int'))
  (id 'factorial') '('
  (FArgs
    (Type 'Int')
    (id 'n')) ')' '{'
  (VarDecl
    (Type 'Int')
    (id 'result') '='
    (Exp (int '1')) ';')
  (Stmt 'while' '('
    (Exp
      (Exp (id 'n')) '>'
      (Exp (int '1'))) ')'
    (Stmt '{'
      (Stmt
        (id 'result') '='
        (Exp
          (Exp (id 'result')) '*'
          (Exp (id 'n'))) ';')
      (Stmt
        (id 'n') '='
        (Exp
          (Exp (id 'n')) '-'
          (Exp (int '1'))) ';') '}'))
  (Stmt 'return'
    (Exp (id 'result')) ';') '}'))))

Thank you for all the advice, but we finally decided to build our own recursive-descent parser, using the same method as described here: http://www.cs.binghamton.edu/~zdu/parsdemo/recintro.html

Indeed, we changed the grammar to remove the left-recursive rules, and because the grammar I showed in my first message isn't LL(1), we use the token list (built by our scanner) to look ahead further than one token. It seems to work quite well.

Now we have to build an AST within those recursive functions. Would you have any suggestions or tips? Thank you very much.
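
One common pattern for that (just a sketch - Ast, mk_binop, peek, advance and parse_term are made-up names here, to be adapted to your own token interface): each parse function returns the node it has just built, and the loop that replaced a left-recursive rule builds the left-leaning tree that rule described.

#include <stdlib.h>

typedef struct Ast Ast;
struct Ast {
    int  kind;                  /* AST_BINOP, AST_INT, AST_ID, ...       */
    int  op;                    /* operator character for AST_BINOP      */
    Ast *left, *right;
};
enum { AST_INT, AST_ID, AST_BINOP };

/* Assumed to exist elsewhere in your parser (hypothetical names): */
enum { TOK_PLUS = '+', TOK_MINUS = '-' };
int  peek(int k);               /* kind of the token k positions ahead   */
void advance(void);             /* consume the current token             */
Ast *parse_term(void);          /* parses the next tighter-binding level */

static Ast *mk_binop(int op, Ast *l, Ast *r) {
    Ast *n = calloc(1, sizeof *n);
    n->kind = AST_BINOP;
    n->op = op;
    n->left = l;
    n->right = r;
    return n;
}

/* Exp -> Term { ('+'|'-') Term }: the loop builds a left-leaning tree, which
 * is the same shape the original left-recursive rule described. */
Ast *parse_exp(void) {
    Ast *node = parse_term();
    while (peek(0) == TOK_PLUS || peek(0) == TOK_MINUS) {
        int op = peek(0);
        advance();
        node = mk_binop(op, node, parse_term());
    }
    return node;
}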

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow