Question

OK, I understand this question may sound quite opinion-based; however, since I have several specific criteria in mind, I think it makes a nice fit for SO. So, here I am...

I've worked with compiler/interpreter construction in the past quite a lot (mostly as a hobby, obviously) and for some reason I stuck with Lex/Yacc (or Flex/Bison; I'm quite confused as to what they're called these days... lol).

However, since I currently find myself playing with yet another hobbyist interpreter project, I thought I should maybe give something different a try, in order to avoid what I don't like about Lex/Yacc.

So, namely:

  • More C++-friendly (than C-friendly)
  • Good documentation (preferably with some existing grammars already implemented + instructions on how to compile/use them - sounds rather obvious, huh?)
  • Could be LALR, LL(*), recursive descent; I don't really care (note: input as to which type you would prefer, and for what kinds of implementations, would be great though; I've never really understood their pros and cons, to be perfectly honest, even though I do know what they refer to)
  • Combining the lexer part and the parser grammar in one file wouldn't be bad at all; I never really got why it has to be split in two.
  • Last but not least: I've always had issues with... issues. I mean, at least as far as Lex/Yacc go, parsing error messages are more or less cryptic (Syntax Error... Yuhuu!) and rarely do they help diagnose an issue. (Well, unless you are the one who developed the interpreter... lol.) So, is there anything better than Lex/Yacc regarding error reporting?

OK, I hope that wasn't too verbose. I'm all ears! :-)


Solution

I've been building parser generators and parsers since 1969.

Recursive descent, YACC and JavaCC are the typical answers you hear.

These are your grandpa's parser generators, and they suffer from limitations in the grammars they will accept. Invariably (especially on Stack Overflow), some poor soul asks "how do I solve this shift/reduce conflict?" (for LR parser generators like YACC) or "how do I eliminate left recursion?" (for recursive descent or LL parser generators like JavaCC). Worse, they can't handle grammars that really have syntactic ambiguity, as occurs in most complex languages.
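To make that last complaint concrete, here is a small illustration of my own (not part of the original answer), sketched in Yacc-style notation; the rule names are made up:

/* The natural, left-recursive form: fine for LR and GLR tools, but an
   LL or recursive-descent generator such as JavaCC rejects it as is. */
expr : expr '+' term
     | term
     ;

/* The usual LL workaround: factor the left recursion into a
   right-recursive "tail" rule, which contorts the grammar and changes
   the shape of the tree you get back.                                 */
expr      : term expr_tail ;
expr_tail : '+' term expr_tail
          | %empty
          ;

(The two versions of expr are shown side by side for comparison; they would not live in the same grammar file. %empty is Bison's explicit marker for an empty alternative.)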

GLR (and GLL) parsers allow you to write context-free grammars ... and parse them, with no fuss or muss. This is a real productivity enhancement. There's a price: you can end up with ambiguous parses, but there are ways to handle that. (See this discussion of C++ parsing troubles that neither YACC nor JavaCC can handle by themselves.)

Bison (widely available) has a GLR option; use it! Recent multilingual program-manipulation tools all seem to use GLL or GLR. Our DMS Software Reengineering Toolkit uses GLR and parses C++ (full C++14 in MS and GNU variants!), Java, COBOL, and a raft of other complicated languages; GLR has been one of the best technical choices I've made in my career. Stratego uses GLR. I think RascalMPL uses GLL. Scott McPeak's Elkhound GLR parser generator is C++-based and generates, I'm pretty sure, C++ code (the OP asked for a C++-based answer).
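If you already have a Bison grammar, trying the GLR option is nearly a one-line change. Below is a deliberately ambiguous skeleton of my own (not from the answer above) showing %glr-parser together with %dprec, Bison's documented way of choosing between parses when more than one survives; the token and rule names are invented for the example:

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }
%}

%glr-parser          /* generate a GLR parser instead of LALR(1) */
%token ID

%%

/* "ID ID ;" is genuinely ambiguous here: it can be reduced both as a
   declaration and as an expression.  Bison still reports the conflicts
   at generation time, but the GLR parser splits its stack at run time,
   and %dprec (higher value wins) prefers the declaration reading if
   both parses survive to the end.                                      */
stmt : decl ';'   %dprec 2
     | expr ';'   %dprec 1
     ;

decl : ID ID ;

expr : ID
     | expr ID
     ;

A scanner and a driver still have to be supplied before this links, of course; the point is only how little the grammar itself has to change.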

Hot topics these days are PEG and ANTLR4. These are better than LL or LR parsers but still give one grief in trying to shape the grammar. (With PEG, you have to order the productions, assuming you can find such an order, to handle ambiguous rules with priorities. With ANTLR4, you still have to specify lookaheads to resolve ambiguity; I don't know how it handles infinite lookahead.) AFAIK, nobody has built practical C++ parsers with either of these technologies, so they aren't living up to their reputation.

I think GLR and GLL are much, much better answers.

OTHER TIPS

I'm just going to answer the last question, with a slight edit:

at least as far as Lex/Yacc go, parsing error messages are more or less cryptic (Syntax Error... Yuhuu!) and rarely do they help diagnose an issue. (Well, unless you are the one who developed the interpreter... lol.) So, is there a better way to use Lex/Yacc regarding error reporting?

Well, start by using a modern version of bison, which has a reasonably complete manual online (and quite possibly installed with the executable, depending on how you install bison). In particular, start with these declarations:

%define parse.error verbose
%define parse.lac full

That will at least replace the mysterious "syntax error" error with a list of "expected" token types.

Then make sure that your token types have meaningful names, because they will be presented to the user as part of the error message. If you're used to using IDENTIFIER as a terminal, then you're probably OK, but the message "Expected TOK_YY_ID" is a bit geeky. You can declare a readable name for a terminal in its %token declaration:

%token TOK_YY_ID "identifier"
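Putting those two pieces of advice together, the top of a grammar might look something like this; this is my sketch of the idea rather than code from the answer, and the token names and the toy rule are invented:

%define parse.error verbose
%define parse.lac full

%token TOK_YY_ID     "identifier"
%token TOK_YY_NUM    "number"
%token TOK_YY_ASSIGN ":="

%%

/* A toy statement, just to give the tokens somewhere to appear. */
stmt : TOK_YY_ID TOK_YY_ASSIGN TOK_YY_NUM ';' ;

With that in place, a syntax error names the offending token as "identifier" or ":=" rather than TOK_YY_ID or TOK_YY_ASSIGN, and lists what would have been acceptable instead of printing a bare "syntax error".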

That will only take you so far. In many cases, knowing what was "expected" is sufficient to understand a syntax error, but sometimes it's useful to be more explicit. In such cases, it helps to actually define error rules. Getting these right is more an art than a science, but that's true of all approaches to error reporting/recovery; the key is to try to be as specific as possible about what the erroneous syntax looks like, and not more specific than necessary.
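For example (again my sketch, not the answer's): a statement-level recovery rule that resynchronizes on the next semicolon, assuming stmt and expr rules exist elsewhere in the grammar:

stmt : expr ';'
       /* On a syntax error inside a statement, discard input up to the
          next ';', then leave error-recovery mode so that the next real
          error is reported rather than silently swallowed.             */
     | error ';'   { yyerrok; }
     ;

The error token and yyerrok are standard Yacc/Bison machinery; the "art" mentioned above is in deciding how many of these rules to write and how specific to make them.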

One interesting approach to error reporting is to use the current parser state and lookahead token (both of which are visible at the moment of error reporting) to look up a custom error message, if one exists. I think this approach has been part of compiler folklore for a long time, and I'm sure I've seen several articles about it over the decades. Here is a relatively recent article by Russ Cox.

Interesting question - not sure I have a great answer to your actual question, but my "comment" got a bit too long for a comment...

I'm working on a Pascal compiler, and I've pretty much written a lexer, tokenizer and parser (including producing an AST that goes into a code generator for LLVM) in around 1,100 lines of, if I may say so myself, pretty "nice" C++ code, all by hand. It's much friendlier towards generating good error messages, and it helps when debugging. There are several bits missing, and I still have plenty of work left before my compiler is complete, but I can already compile some fairly complex code.

I admit I've never used Lex/Yacc or Flex/Bison for anything real. I have looked at them from time to time, but I find these tools hard to use: you either end up taking the generated code and modifying it (a bad idea with autogenerated code), or you end up with bad error handling and hard-to-debug code on top of that. But then, I just spent about two hours trying to find an error caused by "eating" a semicolon too early, which in turn caused the parser to get lost in the token flow...

Licensed under: CC-BY-SA with attribution