Why do on-line parsers seem to stop at regexps?

https://stackoverflow.com/questions/803515

03-07-2019
|

Question

I've been wondering for long why there doesn't seem to be any parsers for, say, BNF, that behave like regexps in various libraries.

Sure, there's things like ANTLR, Yacc and many others that generate code which, in turn, can parse a CFG, but there doesn't seem to be a library that can do that without the intermediate step.

I'm interested in writing a Packrat parser, to boot all those nested-parenthesis-quirks associated with regexps (and, perhaps even more so, for the sport of it), but somehow I have this feeling that I'm just walking into another halting problem -like class of swamps.

Is there a technical/theoretical limitation for these parsers, or am I just missing something?

Solution

I think it's more of a cultural thing. The use of context-free grammars is mostly confined to compilers, which typically have code associated with each production rule. In some languages, it's easier to output code than to simulate callbacks. In others, you'll see parser libraries: parser combinators in Haskell, for example. On the other hand, regular expressions see wide use in tools like grep, where it's inconvenient to run the C compiler every time the user gives a new regular expression.

OTHER TIPS

Boost.Spirit looks like what you are after.

If you are looking to make your own, I've used BNFC for my latest compiler project and it provides the grammar used in its own implementation. This might be a good starting point...

There isn't and technical/theoretical limitation lurking in the shadows. I can't say why they aren't more popular, but I know of at least one library that provides this sort of "on-line" parsing that you seek.

SimpleParse is a python library that lets you simply paste your hairy EBNF grammar into your program and use it to parse things right away, no itermediate steps. I've used it for several projects where I wanted a custom input language but really didn't want to commit to any formal build process.

Here's a tiny example off the top of my head:

decl = r"""
    root := expr
    expr := term, ("|", term)*
    term := factor+
    factor := ("(" expr ")") / [a-z]
"""
parser = Parser(decl) 
success, trees, next = parser.parse("(a(b|def)|c)def")

The parser combinator libraries for Haskell and Scala also let your express your the grammar for your parser in the same chunk of code that uses it. However you can't, say, let the user type in a grammar at runtime (which might only be of interest to people making software to help people understand grammars anyway).

Pyparsing (http://pyparsing.wikispaces.com) has built-in support for packrat parsing and it is pure Python, so you can see the actual implementation.

Because full-blown context-free grammars are confusing enough as they are without some cryptically dense and incomprehensible syntax to make them even more confusing?

It's hard to know what you're asking. Are you trying to create something like a regular expression, but for context-free grammars? Like, using $var =~ /expr = expr + expr/ (in Perl) and having that match "1 + 1" or "1 + 1 + 1" or "1 + 1 + 1 + 1 + 1 + ..."? I think one of the limitations of this is going to be syntax: Having more than about three rules is going to make your "grammar-expression" even more unreadable than any modern-day regular expression.

Side effect are the only thing I see thing that will get you. Most of the parser generators include embedded code for processing and you would need an eval to make that work.

One way around that would be to name actions and then make an "action" function that takes the name of the action to do and the args to do it with.

You could theoretically do it with Boost Spirit in C++, but it is mainly made for static grammars. I think the reason this is not common is that CFGs are not as commonly used as regexs. I've never had to use a grammar except for compiler construction, but I have used regexs many times. CFGs are generally much more complex than regexs, so it makes sense to generate code statically with a tool like YACC or ANTLR.

tcllib has something like that, if you can put up with Parse Expression Grammars and also TCL. If Perl is your thing CPAN has Parse::Earley. Here's a pure Perl variation which looks promising. PLY seems to be a plausible solution for Python

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow