Question

I am parsing files using the XOM library. The Java application works well, but I run into a memory problem when I parse large files of more than 200 MB.

I get a heap-space error (OutOfMemoryError) when I build the document using the piece of code below:

        Builder profileFileBuilder = new Builder(profileFileXMLReader);
        Document profileFileDocument = profileFileBuilder.build(profileFile);

What are my alternatives for building files of that size? I tried to allocate more memory to the JVM, but it doesn't accept more than 1024 MB.

Thank you in advance


Solution

You can use XOM as a streaming parser by extending NodeFactory so that it doesn't keep the whole XML document in memory, but processes it and then forgets about it. This works well for XML that consists of many smaller nodes wrapped in a container element. For instance, XML like:

    <records>
      <record><a_little_xml/></record>
      <record><a_little_xml/></record>
      <record><a_little_xml/></record>
      <record><a_little_xml/></record>
      <record><a_little_xml/></record>
    </records>

There is an example in the XOM documentation of how to extend the NodeFactory: http://www.xom.nu/tutorial.xhtml#Lister

You basically parse the content (at whatever level in the document you are interested in) and then don't add it to the in-memory tree: http://www.xom.nu/tutorial.xhtml#d0e1424
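
Here is a minimal sketch of that pattern, assuming the records/record structure above (the class name, the process() helper, and the main() wiring are mine, not from the tutorial): override finishMakingElement(), handle each completed record, and return an empty Nodes so the record is never attached to the in-memory tree.

    import nu.xom.Builder;
    import nu.xom.Element;
    import nu.xom.NodeFactory;
    import nu.xom.Nodes;

    // Streams <record> elements: each one is handled as soon as it is
    // complete and then discarded, so the tree never holds more than one.
    public class RecordStreamer extends NodeFactory {

        @Override
        public Nodes finishMakingElement(Element element) {
            if ("record".equals(element.getLocalName())) {
                process(element);    // the record's full subtree is available here
                return new Nodes();  // empty result: the record is not added to the tree
            }
            return super.finishMakingElement(element); // keep the root and anything else
        }

        private void process(Element record) {
            // Placeholder -- replace with your real per-record logic
            System.out.println(record.toXML());
        }

        public static void main(String[] args) throws Exception {
            Builder builder = new Builder(new RecordStreamer());
            builder.build(new java.io.File(args[0])); // memory use stays roughly flat
        }
    }

The Document that build() returns then contains little more than the empty root element, which is why a 200 MB file can be processed without a comparably sized heap.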

OTHER TIPS

One alternative, depending on what you're doing with the document, might be to switch from DOM-based processing to SAX-based processing (or another event-driven parse/serialize interface). That would let you use an internal memory model tuned to your needs, and thus more efficient than the general DOM, and perhaps avoid building an in-memory model at all if you can serialize from existing data models or are generating content on the fly.
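
As an illustration only (the record element name is borrowed from the example above, and the counting is just a stand-in for real processing), a SAX handler sees one event at a time and never builds a tree:

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Event-driven processing: no tree is built, so heap use stays small
    // regardless of file size.
    public class RecordCounter extends DefaultHandler {

        private int count;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("record".equals(qName)) {
                count++; // react to each record as it streams past
            }
        }

        public static void main(String[] args) throws Exception {
            RecordCounter handler = new RecordCounter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new File(args[0]), handler);
            System.out.println(handler.count + " records");
        }
    }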

The Xalan XSLT processor, for example, uses a SAX parser to build a custom data model internally rather than the DOM (XSLT, in general, requires random access to the document's contents, so some in-memory model is required), and produces output directly to a SAX serializer whenever possible.

Taking that further, you could set up a data model which explicitly pages portions of the document in and out rather than counting on the operating system's swapper. I'm not sure it would be a net win, though.

The DOM's a fine thing, mind you (he says, as one of its authors) -- but as a general-purpose tool it's not the ideal answer for all tasks.

BTW, when debugging Xalan on some of the more complex problems I fairly often set -Xmx higher than 1024m. Whether Java will let you use higher values depends on the JVM and your operating system configuration, but I'd say it's worth double-checking whether you can push that up a bit.
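
For what it's worth, a hard stop at 1024 MB usually points to a 32-bit JVM, which can't address much beyond that; a 64-bit JVM on a machine with enough RAM will generally accept a much larger setting, e.g. (the jar name here is just a placeholder):

    java -Xmx4g -jar profile-parser.jar profile.xml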

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow