Question

I have a gigantic XML file, like this:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
   </book>
   [... one gazillion more entries ...]
</catalog>

I want to iterate over this file in a streaming fashion, so that I never have to load the whole thing into memory, something like:

InputStream stream = new FileInputStream("gigantic-book-list.xml");
String nodeName = "book";
Iterator<Document> it = new StreamingXmlIterator(stream, nodeName);
Document bk101 = it.next();
Document bk102 = it.next();

Also, I'd like this to work with different XML input files, without having to create specific objects (e.g. Book.java).

@McDowell has a promising approach that uses XMLStreamReader and StreamFilter at https://stackoverflow.com/a/16799693/13365, but it only extracts a single node.

Also, Camel's .tokenizeXML does exactly what I want, so I guess I should look into the source code.

Solution

@XmlRootElement
public class Book {
  // TODO: getters/setters
  public String author;
  public String title;
}

Assuming you want to process the data as strongly typed objects, you can combine StAX and JAXB using two utility types:

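  import javax.xml.stream.StreamFilter;
  import javax.xml.stream.XMLStreamException;
  import javax.xml.stream.XMLStreamReader;
  import javax.xml.stream.util.StreamReaderDelegate;

  // Passes through only the events from <book> up to and including </book>.
  class ContentFinder implements StreamFilter {
    private boolean capture = false;

    @Override
    public boolean accept(XMLStreamReader xml) {
      if (xml.isStartElement() && "book".equals(xml.getLocalName())) {
        capture = true;
      } else if (xml.isEndElement() && "book".equals(xml.getLocalName())) {
        capture = false;
        return true; // still emit the closing tag itself
      }
      return capture;
    }
  }

  // Reports end-of-input once the wrapped reader sits on </book>,
  // so each transform consumes exactly one book.
  class Limiter extends StreamReaderDelegate {
    Limiter(XMLStreamReader xml) {
      super(xml);
    }

    @Override
    public boolean hasNext() throws XMLStreamException {
      return !(getParent().isEndElement()
               && "book".equals(getParent().getLocalName()));
    }
  }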

Usage:

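import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;

// inputStream is the InputStream over the XML file.
XMLInputFactory inFactory = XMLInputFactory.newFactory();
XMLStreamReader reader = inFactory.createXMLStreamReader(inputStream);
reader = inFactory.createFilteredReader(reader, new ContentFinder());
Unmarshaller unmar = JAXBContext.newInstance(Book.class)
    .createUnmarshaller();
Transformer tformer = TransformerFactory.newInstance().newTransformer();
while (reader.hasNext()) {
  // Copy one <book> subtree into a small DOM, then unmarshal it.
  XMLStreamReader limiter = new Limiter(reader);
  Source src = new StAXSource(limiter);
  DOMResult res = new DOMResult();
  tformer.transform(src, res);
  Book book = (Book) unmar.unmarshal(res.getNode());
  System.out.println(book.title);
}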
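
For the generic case in the question, where no class like Book exists, one option is to skip JAXB and copy each matching element into a small DOM Document by hand with the StAX cursor. The following is only a minimal sketch under some assumptions: StreamingXmlIterator is the hypothetical class name from the question (not an existing library type), namespaces are ignored, and comments and processing instructions inside a record are dropped.

```java
import java.io.InputStream;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical class, named after the question's pseudocode. */
class StreamingXmlIterator implements Iterator<Document> {
  private final XMLStreamReader xml;
  private final String nodeName;
  private final DocumentBuilder builder;
  private Document next; // look-ahead record; null once the input is exhausted

  StreamingXmlIterator(InputStream in, String nodeName) throws Exception {
    this.xml = XMLInputFactory.newFactory().createXMLStreamReader(in);
    this.nodeName = nodeName;
    this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    this.next = readNext();
  }

  @Override
  public boolean hasNext() {
    return next != null;
  }

  @Override
  public Document next() {
    if (next == null) throw new NoSuchElementException();
    Document result = next;
    try {
      next = readNext();
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
    return result;
  }

  /** Advances to the next matching start tag, or returns null at end of input. */
  private Document readNext() throws XMLStreamException {
    while (xml.hasNext()) {
      if (xml.next() == XMLStreamConstants.START_ELEMENT
          && nodeName.equals(xml.getLocalName())) {
        return copySubtree();
      }
    }
    return null;
  }

  /** Copies the element the cursor is on, with its descendants, into a new Document. */
  private Document copySubtree() throws XMLStreamException {
    Document doc = builder.newDocument();
    Node current = doc;
    int event = xml.getEventType(); // START_ELEMENT of the record
    while (true) {
      if (event == XMLStreamConstants.START_ELEMENT) {
        Element e = doc.createElement(xml.getLocalName());
        for (int i = 0; i < xml.getAttributeCount(); i++) {
          e.setAttribute(xml.getAttributeLocalName(i), xml.getAttributeValue(i));
        }
        current.appendChild(e);
        current = e;
      } else if (event == XMLStreamConstants.CHARACTERS
          || event == XMLStreamConstants.CDATA) {
        current.appendChild(doc.createTextNode(xml.getText()));
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        current = current.getParentNode();
        if (current == doc) return doc; // closed the record's root element
      }
      event = xml.next();
    }
  }
}
```

With this in place, the question's snippet reads as Iterator<Document> it = new StreamingXmlIterator(stream, "book"). At most two records are held at once: the one just returned and the look-ahead.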

OTHER TIPS

Isn't this precisely what the SAX API achieves?

SAX parsers have some benefits over DOM-style parsers. A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).

I think you simply need to track each book startElement() call and record the incoming elements and attributes from there, then process the book when the corresponding endElement() call arrives. Remember that characters() can be called multiple times for the same text node.
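
The bookkeeping described above could be sketched roughly as follows. This is an illustration rather than a definitive implementation: ElementSplitter and its split() method are hypothetical names, the parser is assumed to be the default non-namespace-aware one (so element names arrive in qName), and each record is handed to a callback as a small DOM Document so that no Book class is needed.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: emits one small DOM Document per matching element. */
class ElementSplitter extends DefaultHandler {
  private final String nodeName;
  private final Consumer<Document> sink;
  private final DocumentBuilder builder;
  private Document doc;  // non-null only while inside a matching element
  private Node current;  // node that new children are appended to

  ElementSplitter(String nodeName, Consumer<Document> sink) throws Exception {
    this.nodeName = nodeName;
    this.sink = sink;
    this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    if (doc == null) {
      if (!nodeName.equals(qName)) return; // outside a record: ignore
      doc = builder.newDocument();
      current = doc;
    }
    Element e = doc.createElement(qName);
    for (int i = 0; i < atts.getLength(); i++) {
      e.setAttribute(atts.getQName(i), atts.getValue(i));
    }
    current.appendChild(e);
    current = e;
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // May fire several times for one text node, so append every chunk.
    if (doc != null) {
      current.appendChild(doc.createTextNode(new String(ch, start, length)));
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (doc == null) return;
    current = current.getParentNode();
    if (current == doc) { // just closed the record's root element
      sink.accept(doc);
      doc = null;         // drop the reference so the record can be collected
    }
  }

  /** Parses the stream and passes each matching element to the callback. */
  static void split(InputStream in, String nodeName, Consumer<Document> sink) throws Exception {
    SAXParserFactory.newInstance().newSAXParser().parse(in, new ElementSplitter(nodeName, sink));
  }
}
```

A call such as ElementSplitter.split(stream, "book", doc -> process(doc)) then sees one book at a time, where process is whatever handling you need. Because SAX pushes events, this gives you a callback rather than the pull-style Iterator from the question.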

Use a SAX parser, then. See the SAX parser tutorial from Oracle.

You need to describe what the desired output of your process is, and what your technology constraints are.

Streaming in XSLT 3.0 is still bleeding edge, but many transformations can be expressed very easily. For example, with Saxon-EE 9.5 you could compute the average price of the books in a streamed transformation as follows:

<xsl:template name="main">
  <xsl:stream href="books.xml">
    <xsl:value-of select="avg(/books/book/price)"/>
  </xsl:stream>
</xsl:template>
Licensed under: CC-BY-SA with attribution