Il miglior generatore di parser per analizzare molti piccoli testi in C ++?

https://stackoverflow.com/questions/8356206

27-10-2019
|

Domanda

Per motivi di prestazioni, sto eseguendo il porting di una libreria C # in C ++. Durante il normale funzionamento, questa libreria ha bisogno, tra le altre cose, di analizzare circa 150'000 espressioni matematiche (si pensi alle formule Excel) con una lunghezza media inferiore a 150 caratteri.

Nella versione C #, ho utilizzato il parser GOLD per generare il codice di analisi. Può analizzare tutte le 150.000 espressioni in meno di un secondo.

Poiché stavamo pensando di estendere il nostro linguaggio, ho pensato che il passaggio a C ++ potrebbe essere una buona occasione per passare a ANTLR. Ho portato la (semplice) grammatica su ANTLR e ne ho generato il codice C. L'analisi delle 150'000 espressioni richiede più di 12 secondi, perché per ogni espressione, ho bisogno di creare un nuovo ANTL3_INPUT_STREAM, token stream, lexer e parser - non c'è, almeno nella versione 3.4, alcun modo per riutilizzarli.

Sarei grato se qualcuno potesse suggerirmi cosa usare invece - GOLD è ovviamente un'opzione anche se la generazione di codice C ++ o C sembra molto più complicata della varietà C #. La mia grammatica è compatibile con LALR e LL (1). La preoccupazione principale è analizzare le prestazioni su piccoli input.

Soluzione

I would try boost::spirit. It is often extreamly fast (even for parsing simple things like an integer it can be faster than the C function atoi http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html)

http://boost-spirit.com/home/

It has nice things : header only, so dependency hell, liberal licence.

However be warned that the learning curve is difficult. It's modern C++ (no pointer, but a lot of template and very frustrating compiling errors), so coming from C or C#, you might not be very comfortable.

Altri suggerimenti

If the grammar to be parsed is simple, you might just write the parser by hand.

Most parser generators are designed to make it easy to whip up a working parser, and execution time often suffers as a result.

The best performance I have seen in parsing came from Boost.Spirit.Qi which expresses the grammar in C++ using meta-template programming. It is not for the faint of heart though.

This will need be well insulated, and the compilation time of the file containing the parser will increase to several seconds (so better make sure there is as few as possible there).

If the syntax of your expression is simple enough, consider making a hand-written recursive descent parser. It can run really fast, and give you the ability (with enough care) to report nicely and specifically syntax errors.

You could also use bison, but I believe a hand-written recursive parser would probably go faster.

And you can do lexing with a flex generated lexer, and do the parsing manually, in a recursive descent way.

For your information the GCC compiler has its own recursive-descent parsers for C++ & C at least. It does not use parser generators (like bison or ANTLR) anymore.

Instead of expr make you grammar recognize sequence-of-expr.

EDIT:

Instead of having (bison syntax):

start: expr { process_expr ($1); }
     ;

have:

start: expr_seq ;

expr_seq:   expr          { process_expr ($1); }
          | expr_seq expr { process_expr ($2); }
          ;

I've written many parsers, and hand-coded recursive-descent is the way I do it. They are easy to write and pretty much optimal.

That said, if speed is what you're after, no matter what you write there will be plenty of room to speed it up. These will be in ways that could surprise you, because anything you could think of, you would have already done.

Here's a slide set showing how to do it.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow