A lightweight XML parser efficient for large files?

https://stackoverflow.com/questions/1006543

06-07-2019

Question

I need to parse potentially huge XML files, so I guess this rules out DOM parsers.

Is there a good lightweight SAX parser for C++, comparable to TinyXML in footprint? The XML structure is very simple; nothing advanced like namespaces or DTDs is needed. Just elements, attributes and CDATA.

I know about Xerces, but its sheer size of over 50 MB gives me shivers.

Thanks!

Solution

If you are using C, then you can use LibXML from the GNOME project. You can choose between DOM and SAX interfaces to your document, plus lots of additional features that have been developed over the years. If you really want C++, then you can use libxml++, which is a C++ OO wrapper around LibXML.

The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.

OTHER TIPS

I like ExPat
http://expat.sourceforge.net/

It is C-based, but there are several C++ wrappers around to help.

RapidXML is quite a fast parser for XML written in C++.

http://sourceforge.net/projects/wsdlpull is a straight C++ port of the Java XmlPull API (http://www.xmlpull.org/).

I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support), but I have found it to be very fast with very little overhead. I had to write my own string and vector classes, and even with those it compiles to about 60 KB on Windows.

I think that pull parsing is a lot more intuitive than something like SAX. The code much more closely mirrors the XML document, making it easy to correlate the two.

The one downside is that it is forward-only, meaning that you need to parse the elements as they come. We have a fairly messed up design for reading our config files, and I need to parse a whole subtree, make some checks, set some defaults, then parse again. With this parser the only real way to handle something like that is to make a copy of the state, parse with that, then continue on with the original. It still ends up being a big win in terms of resources compared with our old DOM parser.
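To make the pull style and the "copy the state" trick concrete, here is a minimal sketch using only the standard library (C++17 assumed). It is not the wsdlpull or XmlPull API: the names PullParser, Event, subtree_contains and read_port are made up for illustration, and the parser handles only plain elements and text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical event kinds for this sketch; not the XmlPull/wsdlpull API.
enum class Event { StartTag, EndTag, Text, EndDocument };

// Forward-only pull parser over an in-memory buffer. It handles only plain
// elements and text: no attributes, self-closing tags, comments, CDATA
// sections or entities. Its state is a few small fields, so copying the
// object is a cheap checkpoint.
class PullParser {
public:
    explicit PullParser(std::string_view input) : in_(input) {}

    // Advance to the next event; the caller decides when to call this.
    Event next() {
        if (pos_ >= in_.size()) return Event::EndDocument;
        if (in_[pos_] == '<') {
            const std::size_t close = in_.find('>', pos_);
            if (close == std::string_view::npos) {  // malformed: stop here
                pos_ = in_.size();
                return Event::EndDocument;
            }
            const bool is_end = pos_ + 1 < close && in_[pos_ + 1] == '/';
            const std::size_t name_start = pos_ + (is_end ? 2 : 1);
            value_ = in_.substr(name_start, close - name_start);
            pos_ = close + 1;
            return is_end ? Event::EndTag : Event::StartTag;
        }
        const std::size_t lt = in_.find('<', pos_);
        const std::size_t end = (lt == std::string_view::npos) ? in_.size() : lt;
        value_ = in_.substr(pos_, end - pos_);
        pos_ = end;
        return Event::Text;
    }

    // Tag name for StartTag/EndTag, character data for Text.
    std::string_view value() const { return value_; }

private:
    std::string_view in_;
    std::string_view value_;
    std::size_t pos_ = 0;
};

// Checks whether the element whose start tag was just read contains a <name>
// element. The parser is taken by value, so the caller's parser does not move.
bool subtree_contains(PullParser lookahead, std::string_view name) {
    assert(!name.empty());
    int depth = 1;  // we are inside the element that was just opened
    while (depth > 0) {
        switch (lookahead.next()) {
            case Event::StartTag:
                if (lookahead.value() == name) return true;
                ++depth;
                break;
            case Event::EndTag: --depth; break;
            case Event::Text: break;
            case Event::EndDocument: return false;
        }
    }
    return false;
}

// Reads <config>...</config> and returns the text of <port>, or the default
// "8080" when the subtree has no <port>. Returns "" for unexpected input.
std::string read_port(std::string_view xml) {
    PullParser p(xml);
    if (p.next() != Event::StartTag || p.value() != "config") return "";
    if (!subtree_contains(p, "port")) return "8080";  // check pass on a copy
    // Real pass on the original parser, which is still just after <config>.
    for (Event e = p.next(); e != Event::EndDocument; e = p.next()) {
        if (e == Event::StartTag && p.value() == "port" && p.next() == Event::Text)
            return std::string(p.value());
    }
    return "";
}
```

For example, read_port("<config><host>a</host><port>9000</port></config>") yields "9000", while the same document without the port element yields the default. With a file-backed parser the checkpoint would also need to save the stream position, so it is less cheap than it is here.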

If your XML structure is very simple, you can consider building a simple lexer/scanner with lex/yacc (flex/bison). The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.

See also the SAX2 interface in libxml

firstobject's CMarkup is a C++ class that works as a lightweight pull parser for huge files (I recommend a pull parser rather than SAX), and as a huge XML file writer too. It adds up to about 250 KB to your executable. When used in-memory it has 1/3 the footprint of TinyXML, by one user's report. When used on a huge file it only holds a small buffer (like 16 KB) in memory. CMarkup is currently a commercial product, so it is supported, documented, and designed to be easy to add to your project with a single cpp and h file.

The easiest way to try it out is with a script in the free firstobject XML editor such as this:

ParseHugeXmlFile()
{
  CMarkup xml;
  xml.Open( "HugeFile.xml", MDF_READFILE );
  while ( xml.FindElem("//record") )
  {
    // process record...
    str sRecordId = xml.GetAttrib( "id" );
    xml.IntoElem();
    xml.FindElem( "description" );
    str sDescription = xml.GetData();
  }
  xml.Close();
}

From the File menu, select New Program, paste this in and modify it for your elements and attributes, then press F9 to run it or F10 to step through it line by line.

You can try https://github.com/thinlizzy/die-xml. It seems to be very small and easy to use.

This is a recently made open-source C++0x XML SAX parser, and the author welcomes feedback.

It parses an input stream and generates events on callbacks compatible with std::function.

The stack machine uses finite automata as a backend, and some events (start tags and text nodes) use iterators to minimize buffering, making it pretty lightweight.
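As a rough illustration of that approach (not die-xml's actual interface), here is a standard-library-only sketch of a SAX-style scanner that reads a stream and fires std::function callbacks. Handlers, parse_stream and collect_ids are hypothetical names, and only plain elements and text are handled.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical callback set for this sketch; die-xml's real interface differs.
struct Handlers {
    std::function<void(const std::string& name)> on_start;
    std::function<void(const std::string& name)> on_end;
    std::function<void(const std::string& text)> on_text;
};

// Push-style (SAX-like) scan of a stream. A two-state machine walks the input
// one character at a time and buffers only the token currently being read, so
// memory use is bounded by the largest single tag or text node, not by the
// file size. Attributes, comments, CDATA and entities are not handled.
// Returns false if the input ends inside a tag.
bool parse_stream(std::istream& in, const Handlers& h) {
    enum class State { Text, Tag } state = State::Text;
    std::string token;
    char c;
    while (in.get(c)) {
        if (state == State::Text) {
            if (c == '<') {
                if (!token.empty() && h.on_text) h.on_text(token);
                token.clear();
                state = State::Tag;
            } else {
                token += c;
            }
        } else {  // State::Tag
            if (c == '>') {
                if (!token.empty() && token[0] == '/') {
                    if (h.on_end) h.on_end(token.substr(1));
                } else if (h.on_start) {
                    h.on_start(token);
                }
                token.clear();
                state = State::Text;
            } else {
                token += c;
            }
        }
    }
    if (state == State::Tag) return false;
    if (!token.empty() && h.on_text) h.on_text(token);
    return true;
}

// Collects the text of every <id> element, keeping nothing else in memory.
std::vector<std::string> collect_ids(std::istream& in) {
    std::vector<std::string> ids;
    bool in_id = false;
    Handlers h;
    h.on_start = [&](const std::string& name) { in_id = (name == "id"); };
    h.on_end = [&](const std::string&) { in_id = false; };
    h.on_text = [&](const std::string& text) {
        if (in_id) ids.push_back(text);
    };
    parse_stream(in, h);
    return ids;
}

// Usage sketch: the same call works with a std::ifstream for a large file.
void example() {
    std::istringstream in("<r><id>1</id><x>skip</x><id>2</id></r>");
    const std::vector<std::string> ids = collect_ids(in);
    assert(ids.size() == 2 && ids[0] == "1" && ids[1] == "2");
}
```

Unlike a pull parser, the scanner drives the loop here, so any state the application needs (such as in_id above) has to live outside the callbacks.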

I'd look at tools that generate a DTD/Schema-specific parser if you want small and fast. These are very good for huge documents.

I highly recommend pugixml

pugixml is a light-weight C++ XML processing library.

"pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings."

I have tested a few XML parsers including a few expensive ones before choosing and using pugixml in a commercial product.

pugixml was not only the fastest parser but also had the most mature and friendly API. I highly recommend it. It is a very stable product! I have been using it since version 0.8; it is now at 1.7.

A great bonus of this parser is its XPath 1.0 implementation! For more complex tree queries, XPath is a godsend!

The DOM-like interface with rich traversal/modification capabilities is extremely useful for tackling real-life "heavy" XML files.

It is a small, fast parser. It is a good choice even for an iOS or Android app if you do not mind linking C++ code.

Benchmarks can tell a lot. See: http://pugixml.org/benchmark.html

A few examples (x86):

pugixml is more than 38 times faster than TinyXML,
4.1 times faster than CMarkup, and
2.7 times faster than expat or libxml.

On x64, pugixml is the fastest parser that I know of.

Also check your XML parser's memory usage. Some parsers just gobble precious memory!

Licensed under: CC-BY-SA with attribution
