Question

I am using the expat parser to parse an XML file of around 15 GB. The problem is that it throws an "Out of Memory" error and the program aborts.

Has anybody faced a similar issue with the expat parser? Or is it a known bug that has been fixed in later versions?


Solution

I've used expat to parse large files before and never had any problems. I'm assuming you're using expat's SAX-style event callbacks directly and not one of the DOM wrappers built on top of expat. If you are using DOM, then that's your problem right there: it would essentially be trying to load the whole file into memory.

Are you allocating objects as you parse the XML and maybe not deallocating them? That would be the first thing I would check. One way to tell whether the problem is really with expat: if you reduce the program to a simple version that has empty tag handlers (i.e. it just parses the file and does nothing with the results), does it still run out of memory?
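As a rough illustration of that "empty handlers" test, here is a minimal sketch using Python's standard-library binding to expat (`xml.parsers.expat`). The question doesn't say which language or binding is in use, so treat this as an assumption; the function name `parse_with_empty_handlers` and the chunk size are made up for the example. It feeds the input to the parser in fixed-size chunks and does nothing in the callbacks, so any memory growth you see would not be coming from your own handlers.

```python
import io
import xml.parsers.expat


def parse_with_empty_handlers(stream, chunk_size=1 << 20):
    """Parse a binary stream chunk by chunk with no-op handlers.

    Returns the number of bytes fed to the parser. The name and the
    default chunk size are illustrative, not taken from the question.
    """
    parser = xml.parsers.expat.ParserCreate()

    # Handlers that deliberately keep nothing.
    parser.StartElementHandler = lambda name, attrs: None
    parser.EndElementHandler = lambda name: None
    parser.CharacterDataHandler = lambda data: None

    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)  # more data will follow
        total += len(chunk)

    parser.Parse(b"", True)  # signal end of input
    return total


if __name__ == "__main__":
    sample = io.BytesIO(b"<root><a>1</a><a>2</a></root>")
    print(parse_with_empty_handlers(sample, chunk_size=4))
```

For the real file you would pass something like `open(path, "rb")` instead of the `BytesIO` sample. If this stripped-down version gets through the whole file with flat memory usage, the problem is most likely in what the real handlers keep around.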

OTHER TIPS

I don't know expat at all, but I'd guess that it's having to hold too much state in memory for some reason. Is the XML malformed in some way? Do you have handlers registered for end tags of large blocks?

I'm thinking that if you have a handler registered for the end of a large block, and expat is expected to pass the whole block to the handler, then expat could be running out of memory before it's able to gather that block completely. As I said, I don't know expat, so this might not be possible; I'm just asking.

Alternatively, are you sure that expat is where the memory is going? I could imagine a situation where you are keeping some information about the contents of the XML file in your own data structures, and those structures cause the out-of-memory condition, either because the data is so large or because of memory leaks in your code.
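To make the point about your own data structures concrete, here is a hedged sketch (again assuming the Python `xml.parsers.expat` binding, with a hypothetical `record` element name and a made-up `summarize_records` function). It buffers text only for the record currently being parsed and throws the buffer away when the record's end tag arrives, so memory stays bounded by the largest single record rather than by the size of the file.

```python
import io
import xml.parsers.expat


def summarize_records(stream, record_tag="record", chunk_size=1 << 16):
    """Return (record_count, total_text_chars) without retaining records.

    `record_tag` is a hypothetical element name; adjust it to the real data.
    """
    parser = xml.parsers.expat.ParserCreate()
    state = {"depth": 0, "buffer": [], "count": 0, "chars": 0}

    def start(name, attrs):
        if name == record_tag:
            state["depth"] += 1

    def chars(data):
        # Only buffer text that belongs to the record being parsed.
        if state["depth"] > 0:
            state["buffer"].append(data)

    def end(name):
        if name == record_tag:
            state["depth"] -= 1
            if state["depth"] == 0:
                # "Process" the finished record, then release its data.
                state["count"] += 1
                state["chars"] += sum(len(piece) for piece in state["buffer"])
                state["buffer"] = []

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)
    parser.Parse(b"", True)  # signal end of input
    return state["count"], state["chars"]
```

The contrast is with a handler that appends every record to a list that lives for the whole parse: with a 15 GB input, that list is what runs out of memory, not the parser.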

Expat is an event-driven parser which does not construct large in-memory structures. So it's probably not expat (which is very widely used for parsing large files) that is the problem - much more likely it is your own code.

Expat has leaks - I've started using it in a long-running server, and am finding that it consistently leaks memory, whether the parser is freed or not. More recent versions of xmlparse.c do not resolve this problem, only hide existing leaks.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow