Question

I have a couple of gigantic XML files (10GB-40GB) that have a very simple structure: just a single root node containing multiple row nodes. I'm trying to parse them using SAX in Python, but the extra processing I have to do for each row means that the 40GB file takes an entire day to complete. To speed things up, I'd like to use all my cores simultaneously. Unfortunately, it seems that the SAX parser can't deal with "malformed" chunks of XML, which is what you get when you seek to an arbitrary line in the file and try parsing from there. Since the SAX parser can accept a stream, I think I need to divide my XML file into eight different streams, each containing [number of rows]/8 rows and padded with fake opening and closing tags. How would I go about doing this? Or — is there a better solution that I might be missing? Thank you!

Was it helpful?

Solution

You can't easily split the SAX parsing into multiple threads, and you don't need to: if you just run the parse without any other processing, it should run in 20 minutes or so. Focus on the processing you do to the data in your ContentHandler.

OTHER TIPS

My suggested way is to read the whole XML file into an internal format and do the extra processing afterwards. SAX should be fast enough to read 40GB of XML in no more than an hour.

Depending on the data you could use a SQLite database or HDF5 file for intermediate storage.

By the way, Python is not really multi-threaded (see GIL). You need the multiprocessing module to split the work into different processes.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top