Question

I am (too) often confronted with the task of having to parse textual data files -- the kind of textual structured data representation you used before "everyone" used XML -- that are some kind of industry standard. (There are too many of these.)

Anyways, the basic task is always taking a text file and stuffing what's in there into some kind of data structure so that our C++ code can do something with the info.

Now, I have implemented a few simple (and oh so buggy) parsers by hand, and there is little I despise more. :-)

So - I was wondering what the current state of the art is when I want to "parse" structured textual data into an in-memory representation (think: XML data binding for an arbitrary language).

What I found so far was "What parser generator do you recommend", but I'm not so sure I'm after a parser generator (like ANTLR).

Obvious candidates seem to be pegtl and Boost.Spirit but they both seem rather complicated (but at least they're in-language) and last time I tried Spirit, the compiler errors drove me nuts. (And pegtl needs a C++11 compatible compiler which is still a problem here (VC++ 2005).)

So am I missing a simpler solution for just getting something like

/begin COMPU_METHOD
  DEC "  Decimal value"
  RAT_FUNC
  "%3.0"
  "dec"
  COEFFS 0 1.000000 0.000000 0 0.000000 1.000000
/end COMPU_METHOD

into a C++ data structure? (This is just an arbitrary example of how part of such a file may look. For this format I could (and probably should) buy a library to parse it, as it is widespread enough -- which is not the case for all formats I encounter.)

-- or should I just go for the complexity of, say Boost.Spirit?

Solution

  • Boost Spirit

  • Coco/R (C++)

    I have had good results with this very pragmatic parser generator, which supports many languages/platforms using a common grammar format. The speed of parsing is comparable to Boost Spirit (although the processing of parsed data may be more efficient using generic programming).

Edit: To make things perfectly clear, there has never been anything that I wasn't able to do with Coco/R.

However, I'm really addicted to the ease with which Spirit generically deduces attribute types (and conversions) for me; that is the main time-saver (see the sketch after the list below). There is a cost involved, though:

  • learning curve, maintenance
  • compile time (but parsers don't often change)
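
For illustration, here is a minimal sketch of that attribute deduction, using Boost.Spirit Qi on the COMPU_METHOD snippet from the question. The struct name CompuMethod, its member names, and the grammar are my own guesses at a plausible mapping, not the actual file-format specification.

#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/adapt_struct.hpp>
#include <iostream>
#include <string>
#include <vector>

struct CompuMethod {
    std::string name;            // e.g. DEC
    std::string description;     // e.g. "  Decimal value"
    std::string conversion_type; // e.g. RAT_FUNC
    std::string format;          // e.g. "%3.0"
    std::string unit;            // e.g. "dec"
    std::vector<double> coeffs;  // the COEFFS values
};

// Expose the struct to Fusion so the parser's synthesized attribute
// can be assigned to it member by member.
BOOST_FUSION_ADAPT_STRUCT(CompuMethod,
    (std::string, name)(std::string, description)(std::string, conversion_type)
    (std::string, format)(std::string, unit)(std::vector<double>, coeffs))

int main() {
    namespace qi = boost::spirit::qi;

    std::string input =
        "/begin COMPU_METHOD\n"
        "  DEC \"  Decimal value\"\n"
        "  RAT_FUNC\n"
        "  \"%3.0\"\n"
        "  \"dec\"\n"
        "  COEFFS 0 1.000000 0.000000 0 0.000000 1.000000\n"
        "/end COMPU_METHOD\n";

    // Two small reusable sub-parsers: a quoted string and a bare identifier.
    qi::rule<std::string::iterator, std::string(), qi::space_type> quoted =
        qi::lexeme['"' >> *(qi::char_ - '"') >> '"'];
    qi::rule<std::string::iterator, std::string(), qi::space_type> ident =
        qi::lexeme[+qi::char_("A-Za-z0-9_")];

    CompuMethod cm;
    std::string::iterator it = input.begin();
    bool ok = qi::phrase_parse(it, input.end(),
        qi::lit("/begin") >> "COMPU_METHOD"
            >> ident >> quoted >> ident >> quoted >> quoted
            >> "COEFFS" >> *qi::double_
            >> "/end" >> "COMPU_METHOD",
        qi::space, cm);                       // cm is filled automatically

    if (ok)
        std::cout << cm.name << ": " << cm.coeffs.size() << " coefficients\n";
}

No semantic actions are needed: the five strings and the vector of doubles are matched to the adapted struct members by position.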

Other tips

I highly recommend biting the bullet and using Boost.Spirit. Although the error messages can be enough to drive one out of one's skull, it's been worth it for me. I have used it to implement parsers for under- (or un-) documented custom file formats in a matter of hours, instead of days.

I found that the best way to approach it was to view it as an "std::istream on steroids", since it uses the same double-angle (>>) notation, there to chain sub-parsers into a sequence.
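
To make the analogy concrete, here is a tiny sketch (assuming Spirit 2 / Qi): the first >> chain is ordinary istream extraction, the second uses the same operator to compose two primitive parsers into a sequence.

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    namespace qi = boost::spirit::qi;

    // Plain istream extraction: >> pulls values out of the stream.
    std::istringstream in("42 3.14");
    int i; double d;
    in >> i >> d;

    // Spirit Qi: the same >> now composes parsers into a sequence.
    std::string text = "42 3.14";
    std::string::iterator it = text.begin();
    int pi = 0; double pd = 0;
    qi::phrase_parse(it, text.end(), qi::int_ >> qi::double_, qi::space, pi, pd);

    std::cout << i << " " << d << " / " << pi << " " << pd << "\n";
}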

You do not mention how sophisticated the parsers you created by hand were. But I believe such simple files could definitely be parsed by hand-crafted routines, as long as you split the work into lexical and syntactic analysis performed by dedicated state machines. The first recognizes tokens (in your example: keywords, numbers and strings) and feeds them to the second, which tries to recognize longer sentences and creates the corresponding data structures. With simple files that follow a regular grammar with no ambiguities or other conflicts, this should be really simple and manageable. A sketch of that split follows.
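
For what it's worth, here is a rough sketch of that lexer/parser split, assuming a hypothetical three-token vocabulary (keywords, numbers, quoted strings) and shown only on the COEFFS line of your example; a real format would need more token kinds and proper error handling.

#include <cctype>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct Token {
    enum Kind { Keyword, Number, String, End } kind;
    std::string text;
};

// Lexical stage: turn raw characters into keyword, number and string tokens.
std::vector<Token> lex(std::istream& in) {
    std::vector<Token> tokens;
    char c;
    while (in.get(c)) {
        if (std::isspace(static_cast<unsigned char>(c)))
            continue;
        Token t;
        if (c == '"') {                                   // quoted string
            t.kind = Token::String;
            while (in.get(c) && c != '"') t.text += c;
        } else if (std::isdigit(static_cast<unsigned char>(c)) || c == '-') {
            t.kind = Token::Number;                       // (signed) number
            t.text += c;
            while (in.get(c) &&
                   (std::isdigit(static_cast<unsigned char>(c)) || c == '.'))
                t.text += c;
            in.unget();                                   // give back the delimiter
        } else {                                          // keyword: /begin, COEFFS, ...
            t.kind = Token::Keyword;
            t.text += c;
            while (in.get(c) && !std::isspace(static_cast<unsigned char>(c)))
                t.text += c;
        }
        tokens.push_back(t);
    }
    Token eof;
    eof.kind = Token::End;
    tokens.push_back(eof);
    return tokens;
}

// Syntactic stage: walk the token stream and build the data structure.
int main() {
    std::istringstream in("COEFFS 0 1.000000 0.000000 0 0.000000 1.000000");
    std::vector<Token> tokens = lex(in);
    std::vector<double> coeffs;
    std::size_t i = 0;
    if (tokens[i].kind == Token::Keyword && tokens[i].text == "COEFFS")
        for (++i; tokens[i].kind == Token::Number; ++i)
            coeffs.push_back(std::atof(tokens[i].text.c_str()));
    std::cout << "parsed " << coeffs.size() << " coefficients\n";
}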

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow