Question

Prologue: Although the set of languages recognized by parsers (context-free grammars) is strictly larger than that of scanners (regular grammars), most parser generators still need a scanner.

(Please don't try to explain the reasons behind it, I know them quite well).

I've seen parsers that do not require a scanner.

There are certain advantages of using no scanner:

  • Just one concept (context-free grammars) instead of two
  • Parse multiple languages in one file (like HTML/Javascript)
  • Parse languages without reserved keywords (like PL/1)

Often, "workarounds" are used like switching the scanner on the parser's request.

Question: Do you know any other scannerless parser generators (any language)? Are these practical in use (or purely academic)? Are there any other approaches than Tomita/GLR?

Answers:


Solution

Two others:

  • Bryan Ford's Parsing Expression Grammars (PEGs) require no scanner. An efficient, lazy "packrat parser" is optional. I've had nothing but good experience with the Lua LPEG version, which compiles to an efficient bytecode machine. Quite practical. (A small character-level sketch follows this list.)

  • YAKKER looks very intriguing although it is still clearly in a pre-release state. They are using what they claim to be an efficient variation on Earley's parsing algorithm.
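
To make the character-level nature of scannerless PEG parsing concrete, here is a minimal hand-written sketch in plain C++ (no parser library; the toy rule S <- 'a' S 'b' / "" is made up for illustration) showing how an ordered-choice alternative consumes individual characters and backtracks on failure:

    #include <cstddef>
    #include <iostream>
    #include <string_view>

    // Hand-translation of the PEG rule  S <- 'a' S 'b' / "" .
    // Each call either advances `pos` on success or restores it and
    // falls through to the next alternative (ordered choice).
    static bool parseS(std::string_view in, std::size_t& pos) {
        std::size_t saved = pos;
        if (pos < in.size() && in[pos] == 'a') {          // first alternative: 'a' S 'b'
            ++pos;
            if (parseS(in, pos) && pos < in.size() && in[pos] == 'b') {
                ++pos;
                return true;
            }
        }
        pos = saved;                                      // backtrack...
        return true;                                      // ...to the empty alternative
    }

    int main() {
        std::string_view input = "aabb";
        std::size_t pos = 0;
        bool ok = parseS(input, pos) && pos == input.size();
        std::cout << (ok ? "matches a^n b^n\n" : "no match\n");
    }

A real packrat parser adds memoization of (rule, position) results on top of this, which is what keeps the backtracking linear-time.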

I'm actually a huge fan of scannerless parsers; they simplify configuration enormously. And typical scanner generators, to put it mildly, are not much fun to use. From the man page for Lex:

The asteroid to kill this dinosaur is still in orbit.

Finally, I have no personal experience with Elkhound, but the secondhand reports I hear are impressive. I would say there's no question but that some scannerless parser generators are very practical.

OTHER TIPS

Parser generators don't need a scanner. But you are pretty much crazy if you don't use one.

Parsers built by parser generators don't care what you feed them, as long as you call them tokens.

To use a parser generator without a scanner, simply define your grammar down to the character level and feed individual characters to the parser as tokens.
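
As a sketch of that trick, assuming a bison/yacc-style interface (yylex/yyparse; the character-level grammar itself isn't shown and the input string is made up), the "scanner" degenerates into handing each character through as its own token:

    #include <cstdio>

    // Sketch: no scanner at all.  yylex() returns each input character as a
    // token code, so the grammar's terminals are character literals like
    // '+' and '1'.  The generated parser (yyparse) is assumed and not shown.
    static const char* cursor = "1+2";

    int yylex() {
        // One token per character; 0 means "end of input" to the parser.
        return *cursor ? static_cast<unsigned char>(*cursor++) : 0;
    }

    int main() {
        // Without the generated parser present, just print the token stream
        // it would receive: one token per character.
        for (int tok = yylex(); tok != 0; tok = yylex())
            std::printf("token %d ('%c')\n", tok, tok);
        return 0;
    }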

The reason this is crazy is that parsing is a more complex activity than lexing. You can build lexers as finite state machines, which translate to machine code pretty much as "compare and jump to next state". For speed, that's really hard to beat. Parser generators construct parsers that do recursive-descent predictive parsing (most LL generators such as ANTLR) or table-driven parsing with lookups by hashing, binary or linear search, etc. So a parser spends a lot more energy on a token than a lexer spends on a character.
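
To see how little work that per-character state machine does, here is a minimal hand-rolled sketch (the token names and codes are invented for illustration) in which each lexer state boils down to a character-class test plus an advance:

    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <string_view>

    // Minimal hand-rolled DFA-style lexer: integers, identifiers, and
    // single-character tokens.  Each step is a cheap classify-and-advance.
    enum Tok { END = 0, INT = 1, IDENT = 2, OTHER = 3 };

    Tok nextToken(std::string_view in, std::size_t& pos) {
        while (pos < in.size() && std::isspace((unsigned char)in[pos])) ++pos;
        if (pos == in.size()) return END;
        if (std::isdigit((unsigned char)in[pos])) {                 // "in number" state
            while (pos < in.size() && std::isdigit((unsigned char)in[pos])) ++pos;
            return INT;
        }
        if (std::isalpha((unsigned char)in[pos])) {                 // "in identifier" state
            while (pos < in.size() && std::isalnum((unsigned char)in[pos])) ++pos;
            return IDENT;
        }
        ++pos;                                                      // punctuation etc.
        return OTHER;
    }

    int main() {
        std::string_view in = "count1 + 42";
        std::size_t pos = 0;
        for (Tok t = nextToken(in, pos); t != END; t = nextToken(in, pos))
            std::cout << t << ' ';                                  // prints: 2 3 1
        std::cout << '\n';
    }

Generated scanners from lex or flex are table-driven versions of the same idea.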

If you feed characters to a parser as tokens, it will spend correspondingly more energy on each character than an equivalent lexer would. If you process a lot of input text, this will eventually matter, whether you do it for zillions of small input streams or a few really large ones.

The so-called scannerless GLR parsers suffer from this performance problem, relative to GLR parsers that are designed to use tokens.

My company builds a tool, the DMS Software Reengineering Toolkit, which uses a GLR parser (and successfully parses all the common languages you know and a lot of odder ones you don't, because it has a GLR parser). We knew about scannerless parsers and chose not to implement them because of the speed differential; we have a classically styled (but extremely powerful) LEX-like subsystem for defining lexical tokens. In the one case where DMS went nose-to-nose against an XT-based tool (XT has a scannerless GLR parser) processing the same input, DMS appeared to be 10x as fast as the XT package. To be fair, the experiment was ad hoc and not repeated, but since it matched my suspicions I saw no reason to repeat it. YMMV. And of course, if we want to go scannerless, well, it's pretty easy to write a grammar with character terminals, as I have already pointed out.

Scannerless GLR parsers do have another very nice property that won't matter to most people: you can take two separate grammars for a scannerless parser, literally concatenate them, and still get a parser (often with a lot of ambiguities). This matters a lot when you are building one language embedded inside another. If that's not what you are doing, it's just an academic curiosity.

And, AFAIK, Elkhound isn't scannerless. (I could be wrong on this.) (EDIT 2/10: Looks like I was wrong. Won't be the first time in my life.)

boost::spirit::qi doesn't need a lexer, although you can use boost::spirit::lex as a front-end. From the introduction of boost::spirit::lex:

In fact, Spirit.Qi allows you to write parsers without using a lexer, parsing the input character stream directly, and for the most part this is the way Spirit has been used since its invention.

boost::spirit (the original one) actually worked exactly as you wanted, but it has been moved to boost::spirit::classic. spirit::qi and spirit::lex are the new design of Spirit, so I left the classic version out :)
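
As an illustration of the quoted point, here is a minimal scannerless Spirit.Qi sketch (assuming Boost is installed; the grammar, integers separated by '+', is made up for the example). The parser consumes the character range directly, and the skip parser handles the whitespace a scanner would otherwise deal with:

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>

    namespace qi = boost::spirit::qi;

    int main() {
        std::string input = "1 + 2 + 3";
        auto first = input.cbegin();
        // No lexer: the grammar operates on the character stream itself,
        // with qi::space acting as the skip parser between tokens.
        bool ok = qi::phrase_parse(first, input.cend(),
                                   qi::int_ >> *('+' >> qi::int_),
                                   qi::space);
        std::cout << ((ok && first == input.cend()) ? "parsed\n" : "failed\n");
        return 0;
    }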

Waxeye: a scannerless parser generator based on parsing expression grammars (PEGs).

Sorry for the necropost. You can try this out: an embeddable PEG/packrat implementation for .NET:

http://www.meta-alternative.net/pfdoc.pdf

http://www.meta-alternative.net/mbase.html
