Question

I have a set of tools that index large XML files (MediaWiki dump files) and use those indexes for random access to the individual records stored in each file. This works very well, but I'm "parsing" the XML with string functions and/or regular expressions rather than a real XML parser, which is a fragile solution should the way the files are created change in the future.

Do some or most XML parsers have ways to do such things?

(I have versions of my tools written in C, Perl, and Python. Parsing the entire files into some kind of database or mapping them into memory are not options.)

UPDATE

Here are rough statistics for comparison: the files I am using are mostly published each week or so, and the size of the current one is 1,918,212,991 bytes. The C version of my indexing tool takes a few minutes on my netbook and only has to be run once for each new XML file published. Less often, I use the same tools on another XML file, whose current size is 30,565,654,976 bytes and which was updated only 8 times in 2010.


Solution 3

VTD-XML looks to be the first serious attempt at addressing this problem:

The world's most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.

(VTD-XML even has its own tag here on Stack Overflow, so you can follow questions about it.)

OTHER TIPS

I think you should store this data in an XML database such as eXist-db, rather than creating your own tools to do a very small subset of what an XML database gives you.

If you're using Python, try lxml - it's very fast and flexible, and it compares quite well with regexes for speed. It is also much faster than most alternative parsers.

Use iterparse to step through the Wikipedia articles.

Note that this does not give you random access to the articles in your dump (which is a perfectly reasonable request!) - but iterparse will give you a fast and easy-to-use 'forward-only' cursor... and lxml might be the right tool for parsing chunks that you have fseek'd to through other means.
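To make the forward-only idea concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree provides an iterparse with a very similar interface. The function names (local_name, iter_titles) are made up for illustration, and the "page" and "title" element names assume the usual MediaWiki dump layout.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix, since dumps declare a default namespace."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(stream):
    """Yield each page title while keeping roughly one record in memory."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            # Look for a direct <title> child of the finished <page>.
            title = next(
                (child.text for child in elem if local_name(child.tag) == "title"),
                None,
            )
            yield title
            root.clear()  # discard finished records so memory use stays flat


if __name__ == "__main__":
    sample = (
        b'<mediawiki xmlns="http://example.org/ns">'
        b"<page><title>A</title></page>"
        b"<page><title>B</title></page>"
        b"</mediawiki>"
    )
    for title in iter_titles(io.BytesIO(sample)):
        print(title)
```

Clearing the root after each record is what keeps this usable on multi-gigabyte files; without it the tree keeps growing as the parser advances. With lxml the same loop should work after swapping the import.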

Here's the best documentation I've found for it:

http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/index.html

(try the pdf version)

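Note that lxml itself is a third-party package that has to be installed separately; the similar ElementTree API (xml.etree.ElementTree, including iterparse) is the one that is now part of the standard Python distribution.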

XML is a structured format. As such, random access does not really make much sense - you must know where you are going.

Regular expressions also need the whole string to be loaded into memory. This is still better than DOM, since a DOM usually takes 3-4 times more memory than the size of the XML file.

The typical solution for these cases is a SAX parser, which has a really small memory footprint but behaves like a forward-only cursor: you are not accessing records randomly, you have to traverse the document to get where you need. If you are using .NET, you can use XmlTextReader.
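As a rough illustration of the SAX style, here is a small sketch using Python's standard xml.sax module. The handler only keeps the text of the element it is currently collecting, which is why the footprint stays small. The class and function names (TitleHandler, collect_titles) are hypothetical, and the "title" element name assumes a MediaWiki-style dump.

```python
import io
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element seen during a single pass."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # a list of text pieces while inside <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def collect_titles(stream):
    """Run a forward-only SAX pass over a file-like object."""
    handler = TitleHandler()
    xml.sax.parse(stream, handler)
    return handler.titles


if __name__ == "__main__":
    sample = b"<mediawiki><page><title>A</title></page></mediawiki>"
    print(collect_titles(io.BytesIO(sample)))
```

For a real dump you would act on each record inside endElement instead of collecting everything in a list, but the shape of the handler stays the same.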

Indexes are also useful if the XML does not update often, since creating such indexes can be expensive.
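One way to build such an index with a real parser instead of regexes is sketched below. It relies on the CurrentByteIndex attribute of Python's xml.parsers.expat parser, which reports the byte offset of the current parse event. The function names (build_page_index, read_record) and the "page" element name are assumptions for illustration, not part of any existing tool, and the sketch assumes records are never written as an empty `<page/>` element.

```python
import io
import xml.etree.ElementTree as ET
from xml.parsers import expat


def build_page_index(stream, tag="page"):
    """Return (start, end_tag_start) byte offsets for every <tag> element."""
    parser = expat.ParserCreate()
    open_starts = []  # stack of start offsets for elements still open
    offsets = []

    def on_start(name, attrs):
        if name == tag:
            open_starts.append(parser.CurrentByteIndex)

    def on_end(name):
        if name == tag:
            # At an end event the index points at the '<' of the closing tag.
            offsets.append((open_starts.pop(), parser.CurrentByteIndex))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.ParseFile(stream)  # stream must be opened in binary mode
    return offsets


def read_record(stream, start, end_tag_start, tag="page"):
    """Seek to one record, read only its bytes, and parse that small chunk."""
    stream.seek(start)
    body = stream.read(end_tag_start - start)
    # Re-append the closing tag rather than guessing its exact byte length.
    return ET.fromstring(body + b"</" + tag.encode("ascii") + b">")


if __name__ == "__main__":
    data = (
        b"<mediawiki><page><title>A</title></page>\n"
        b"<page><title>B</title></page></mediawiki>"
    )
    index = build_page_index(io.BytesIO(data))
    record = read_record(io.BytesIO(data), *index[1])
    print(record.findtext("title"))
```

The indexing pass still reads the whole file once, but the parser decides where records begin and end, so the index should survive changes in whitespace or attribute order. Because the extracted chunk no longer carries the root element's namespace declaration, its tags come back as plain names such as "title".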

XPath is far better than string/regex "parsing", but XPath works on XML documents that have first been parsed into an in-memory DOM; if your documents are really large, you might run into memory problems.
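The memory concern applies to parsing the whole dump; a single record is small, so path queries on one extracted record are cheap. A minimal sketch, using the limited XPath subset in the standard library's ElementTree (lxml offers fuller XPath support); the function name usernames is hypothetical, and the revision/contributor/username path assumes the usual MediaWiki dump layout:

```python
import xml.etree.ElementTree as ET


def usernames(record_xml):
    """Return contributor usernames from the XML of a single <page> record."""
    record = ET.fromstring(record_xml)
    # Only this one record is held in memory, not the whole dump.
    return [e.text for e in record.findall("./revision/contributor/username")]


if __name__ == "__main__":
    page = (
        b"<page><title>A</title><revision><contributor>"
        b"<username>Alice</username></contributor></revision></page>"
    )
    print(usernames(page))
```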

Licensed under: CC-BY-SA with attribution