Question

OK, I understand this question may sound quite opinion-based; however, since I have several specific criteria in mind, I think it makes a nice fit for SO. So, here I am...

I've worked with compiler/interpreter construction in the past quite a lot (mostly as a hobby, obviously) and for some reason I stuck with Lex/Yacc (or Flex/Bison; I'm quite confused as to what they're called these days... lol).

However, since I currently find myself playing with yet another hobbyist interpreter project, I thought I should maybe give something different a try, in order to avoid what I don't like about Lex/Yacc.

So, namely:

  • More C++-friendly (than C-friendly)
  • Good documentation (preferably with some existing grammars already implemented + instructions on how to compile/use them - sounds rather obvious, huh?)
  • Could be LALR, LL(*), recursive descent; I don't really care (note: input as to which type you would prefer, and for what kinds of implementations, would be great though; I've never really understood their pros and cons, to be perfectly honest, even though I do know what they refer to)
  • Combining the lexer part and the parser grammar in one file wouldn't be bad at all; I never really got why it has to be split in two.
  • Last but not least: I've always had issues with... issues. I mean, at least as far as Lex/Yacc go, parsing error messages are more or less cryptic (Syntax Error... Yuhuu!) and rarely do they help diagnose an issue. (Well, unless you are the one who developed the interpreter... lol.) So, is there anything better than Lex/Yacc regarding error reporting?

OK, I hope that wasn't too verbose. I'm all ears! :-)


Solution

I've been building parser generators and parsers since 1969.

Recursive descent, YACC and JavaCC are the typical answers you hear.

These are your grandpa's parser generators, and they suffer from limitations in the grammars they will accept. Invariably (especially on Stack Overflow), some poor soul asks "how do I solve this shift/reduce conflict?" (for LR parser generators like YACC) or "how do I eliminate left recursion?" (for recursive descent or LL parser generators like JavaCC). Worse, they can't handle grammars that really have syntactic ambiguity, as occurs in most complex languages.
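To make that last complaint concrete, here is a small illustration of my own (not part of the original answer), sketched in Yacc-style notation; the rule names are made up:

/* The natural, left-recursive form: fine for LR and GLR tools, but an
   LL or recursive-descent generator such as JavaCC rejects it as is. */
expr : expr '+' term
     | term
     ;

/* The usual LL workaround: factor the left recursion into a
   right-recursive "tail" rule, which contorts the grammar and changes
   the shape of the tree you get back.                                 */
expr      : term expr_tail ;
expr_tail : '+' term expr_tail
          | %empty
          ;

(The two versions of expr are shown side by side for comparison; they would not live in the same grammar file. %empty is Bison's explicit marker for an empty alternative.)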

GLR (and GLL) parsers allow you to write context-free grammars ... and parse them, with no fuss or muss. This is a real productivity enhancement. There's a price: you can end up with ambiguous parses, but there are ways to handle that. (See this discussion of C++ parsing troubles that neither YACC nor JavaCC can handle by themselves.)

Bison (widely available) has a GLR option; use it! Recent multilingual program-manipulation tools all seem to use GLL or GLR. Our DMS Software Reengineering Toolkit uses GLR and parses C++ (full C++14 in MS and GNU variants!), Java, COBOL, and a raft of other complicated languages; GLR has been one of the best technical choices I've made in my career. Stratego uses GLR. I think RascalMPL uses GLL. Scott McPeak's Elkhound GLR parser generator is C++-based and generates, I'm pretty sure, C++ code (the OP asked for a C++-based answer).
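If you already have a Bison grammar, trying the GLR option is nearly a one-line change. Below is a deliberately ambiguous skeleton of my own (not from the answer above) showing %glr-parser together with %dprec, Bison's documented way of choosing between parses when more than one survives; the token and rule names are invented for the example:

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }
%}

%glr-parser          /* generate a GLR parser instead of LALR(1) */
%token ID

%%

/* "ID ID ;" is genuinely ambiguous here: it can be reduced both as a
   declaration and as an expression.  Bison still reports the conflicts
   at generation time, but the GLR parser splits its stack at run time,
   and %dprec (higher value wins) prefers the declaration reading if
   both parses survive to the end.                                      */
stmt : decl ';'   %dprec 2
     | expr ';'   %dprec 1
     ;

decl : ID ID ;

expr : ID
     | expr ID
     ;

A scanner and a driver still have to be supplied before this links, of course; the point is only how little the grammar itself has to change.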

Hot topics these days are PEG and ANTLR4. These are better than LL or LR parsers but still give one grief in trying to shape the grammar. (With PEG, you have to order the productions, assuming you can find such an order, to handle ambiguous rules with priorities. With ANTLR4, you still have to specify lookaheads to resolve ambiguity; I don't know how it handles infinite lookahead.) AFAIK, nobody has built practical C++ parsers with either of these technologies, so they aren't living up to their reputation.

I think GLR and GLL are much, much better answers.

OTHER TIPS

I'm just going to answer the last question, with a slight edit:

at least as far as Lex/Yacc go, parsing error messages are more or less cryptic (Syntax Error... Yuhuu!) and rarely do they help diagnose an issue. (Well, unless you are the one who developed the interpreter... lol.) So, is there a better way to use Lex/Yacc regarding error reporting?

Well, start by using a modern version of bison, which has a reasonably complete manual online (and quite possibly installed with the executable, depending on how you install bison). In particular, start with these declarations:

%define parse.error verbose
%define parse.lac full

That will at least replace the mysterious "syntax error" error with a list of "expected" token types.

Then make sure that your token types have meaningful names, because they will be presented to the user as part of the error message. If you're used to using IDENTIFIER as a terminal, then you're probably OK, but the message "Expected TOK_YY_ID" is a bit geeky. You can declare a readable name for a terminal in its %token declaration:

%token TOK_YY_ID "identifier"
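Putting those two pieces of advice together, the top of a grammar might look something like this; this is my sketch of the idea rather than code from the answer, and the token names and the toy rule are invented:

%define parse.error verbose
%define parse.lac full

%token TOK_YY_ID     "identifier"
%token TOK_YY_NUM    "number"
%token TOK_YY_ASSIGN ":="

%%

/* A toy statement, just to give the tokens somewhere to appear. */
stmt : TOK_YY_ID TOK_YY_ASSIGN TOK_YY_NUM ';' ;

With that in place, a syntax error names the offending token as "identifier" or ":=" rather than TOK_YY_ID or TOK_YY_ASSIGN, and lists what would have been acceptable instead of printing a bare "syntax error".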

That will only take you so far. In many cases, knowing what was "expected" is sufficient to understand a syntax error, but sometimes it's useful to be more explicit. In such cases, it helps to actually define error rules. Getting these right is more an art than a science, but that's true of all approaches to error reporting/recovery; the key is to try to be as specific as possible about what the erroneous syntax looks like, and not more specific than necessary.
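For example (again my sketch, not the answer's): a statement-level recovery rule that resynchronizes on the next semicolon, assuming stmt and expr rules exist elsewhere in the grammar:

stmt : expr ';'
       /* On a syntax error inside a statement, discard input up to the
          next ';', then leave error-recovery mode so that the next real
          error is reported rather than silently swallowed.             */
     | error ';'   { yyerrok; }
     ;

The error token and yyerrok are standard Yacc/Bison machinery; the "art" mentioned above is in deciding how many of these rules to write and how specific to make them.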

One interesting approach to error reporting is to use the current parser state and lookahead token (both of which are visible at the moment of error reporting) to look up a custom error message, if one exists. I think this approach has been part of compiler folklore for a long time, and I'm sure I've seen several articles about it over the decades. Here is a relatively recent article by Russ Cox.

Interesting question - not sure I have a great answer to your actual question, but my "comment" got a bit too long for a comment...

I'm working on a Pascal compiler, and I've pretty much written a lexer, tokenizer and parser (including producing an AST that goes into a code generator for LLVM) in around 1,100 lines of, if I may say so myself, pretty "nice" C++ code, all by hand. It's much friendlier towards generating good error messages, and it helps when debugging. There are several bits missing, and I still have plenty of work left before my compiler is complete, but I can already compile some fairly complex code.

I admit I've never used Lex/Yacc or Flex/Bison for anything real. I have looked at them from time to time, but I find these tools hard to use: you either end up taking the generated code and modifying it (a bad idea with autogenerated code), or you end up with bad error handling and hard-to-debug code on top of that. But then, I just spent about two hours trying to find an error caused by "eating" a semicolon too early, which in turn caused the parser to get lost in the token flow...

Licensed under: CC-BY-SA with attribution