Question

I am parsing files using the XOM library. The Java application works well, but I run into a memory problem when I parse large files of more than 200 MB.

I get a heap-space error (OutOfMemoryError) when I build the document using the piece of code below:

        Builder profileFileBuilder = new Builder(profileFileXMLReader);
        Document profileFileDocument = profileFileBuilder.build(profileFile);

What are my alternatives for building files of that size? I tried to allocate more memory to the JVM, but it doesn't accept more than 1024 MB.

Thank you in advance


Solution

You can use XOM as a streaming parser by extending NodeFactory so that it doesn't keep the whole XML document in memory, but processes it and then forgets about it. This works well for XML that consists of many smaller nodes wrapped in a container element. For instance, XML like:

    <records>
      <record><a_little_xml/></record>
      <record><a_little_xml/></record>
      <record><a_little_xml/></record>
      <record><a_little_xml/></record>
      <record><a_little_xml/></record>
    </records>

There is an example in the XOM documentation of how to extend the NodeFactory: http://www.xom.nu/tutorial.xhtml#Lister

You basically parse the content (at whatever level in the document you are interested in) and then don't add it to the in-memory tree: http://www.xom.nu/tutorial.xhtml#d0e1424
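
Here is a minimal sketch of that pattern, assuming the records/record structure above (the class name, the process() helper, and the main() wiring are mine, not from the tutorial): override finishMakingElement(), handle each completed record, and return an empty Nodes so the record is never attached to the in-memory tree.

    import nu.xom.Builder;
    import nu.xom.Element;
    import nu.xom.NodeFactory;
    import nu.xom.Nodes;

    // Streams <record> elements: each one is handled as soon as it is
    // complete and then discarded, so the tree never holds more than one.
    public class RecordStreamer extends NodeFactory {

        @Override
        public Nodes finishMakingElement(Element element) {
            if ("record".equals(element.getLocalName())) {
                process(element);    // the record's full subtree is available here
                return new Nodes();  // empty result: the record is not added to the tree
            }
            return super.finishMakingElement(element); // keep the root and anything else
        }

        private void process(Element record) {
            // Placeholder -- replace with your real per-record logic
            System.out.println(record.toXML());
        }

        public static void main(String[] args) throws Exception {
            Builder builder = new Builder(new RecordStreamer());
            builder.build(new java.io.File(args[0])); // memory use stays roughly flat
        }
    }

The Document that build() returns then contains little more than the empty root element, which is why a 200 MB file can be processed without a comparably sized heap.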

OTHER TIPS

One alternative, depending on what you're doing with the document, might be to switch from DOM-based processing to SAX-based processing (or another event-driven parse/serialize interface). That would let you use an internal memory model tuned to your needs, and thus more efficient than the general DOM, and perhaps avoid building an in-memory model at all if you can serialize from existing data models or are generating content on the fly.
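
As an illustration only (the record element name is borrowed from the example above, and the counting is just a stand-in for real processing), a SAX handler sees one event at a time and never builds a tree:

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Event-driven processing: no tree is built, so heap use stays small
    // regardless of file size.
    public class RecordCounter extends DefaultHandler {

        private int count;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("record".equals(qName)) {
                count++; // react to each record as it streams past
            }
        }

        public static void main(String[] args) throws Exception {
            RecordCounter handler = new RecordCounter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new File(args[0]), handler);
            System.out.println(handler.count + " records");
        }
    }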

The Xalan XSLT processor, for example, uses a SAX parser to build a custom data model internally rather than the DOM (XSLT, in general, requires random access to the document's contents, so some in-memory model is required), and produces output directly to a SAX serializer whenever possible.

Taking that further, you could set up a data model which explicitly pages portions of the document in and out rather than counting on the operating system's swapper. I'm not sure it would be a net win, though.

The DOM's a fine thing, mind you (he says, as one of its authors) -- but as a general-purpose tool it's not the ideal answer for all tasks.

BTW, when debugging Xalan on some of the more complex problems I fairly often set -Xmx higher than 1024m. Whether Java will let you use higher values depends on the JVM and your operating system configuration, but I'd say it's worth double-checking whether you can push that up a bit.
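
For what it's worth, a hard stop at 1024 MB usually points to a 32-bit JVM, which can't address much beyond that; a 64-bit JVM on a machine with enough RAM will generally accept a much larger setting, e.g. (the jar name here is just a placeholder):

    java -Xmx4g -jar profile-parser.jar profile.xml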

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow