Question

I need to efficiently parse potentially very large XML files (too large to hold the whole file in memory). I've looked into streaming techniques like XMLStreamReader, but these appear to be very low-level and lead to very hard-coded code:

   event = parser.next();
   switch (event)
   {
       case XMLStreamConstants.START_ELEMENT:
           elementName = parser.getLocalName();
           if (elementName.equals("name")) {
               state = FOUND_A_NAME;
           } else if (elementName.equals("address")) {
               state = FOUND_AN_ADDRESS;
           }
           break;
       // ETC...
   }

I am looking for a way to do this without so tightly coupling the parser with the thing to parse, and in addition, this code just does not feel right. It seems like this should be more truly event-oriented.

Any advice?


Solution

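SAX has events that do exactly what you think they should. :) You register a handler, and the parser calls back into it as it reaches each element, so the whole document never has to sit in memory. The quickstart at http://www.saxproject.org/quickstart.html shows a simple example. Am I missing something?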
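For reference, a minimal sketch of what such a handler might look like for the "name" element from the question. The class name NameCollector and its parse helper are made up for illustration; only the JDK's built-in SAX classes are used, and the factory is left at its default (non-namespace-aware) setting, which is why the code matches on qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <name> element.
class NameCollector extends DefaultHandler {
    final List<String> names = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <name> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("name".equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append rather than assign.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName) && text != null) {
            names.add(text.toString());
            text = null;
        }
    }

    // Convenience entry point for this sketch; real code would pass a File or InputStream.
    static List<String> parse(String xml) {
        try {
            NameCollector handler = new NameCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

Calling NameCollector.parse(...) on a document returns the collected names. The parser drives the handler, so the handler only holds the text of the element it is currently inside.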

OTHER TIPS

If you're looking for a higher-level language for processing XML in streaming mode, and if you don't mind being at the bleeding edge, consider the streaming facilities in Saxon-EE 9.3 XSLT - a partial implementation of the draft XSLT 3.0 specification.

http://www.saxonica.com/documentation/sourcedocs/streaming.xml

This can be written generically. For example, I have a properties file that maps each XML element name to a class field name (or HashMap key).

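if (event.isStartElement()) {
    String name = event.asStartElement().getName().getLocalPart();
    if (name.equals(XMLElementName)) {
        // getElementText() reads up to the matching end tag, so it also
        // copes with empty elements (where the next event is not text)
        fields.put(classFieldName, eventReader.getElementText());
        continue;
    }
}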

This lets us use one parser for different XML messages. It is just an idea; there is more that could be done with it.
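To make the idea concrete, here is a rough, self-contained sketch of the whole loop. The class name MappedFieldParser and the elementToField parameter are invented for this example; the mapping is passed in as a plain Map, which could just as well be filled from a properties file. It assumes the mapped elements contain text only.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical generic parser: knows nothing about specific element names.
class MappedFieldParser {

    // elementToField maps an XML element name to the key used in the result map.
    static Map<String, String> parse(String xml, Map<String, String> elementToField) {
        Map<String, String> fields = new HashMap<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue;
                }
                String element = event.asStartElement().getName().getLocalPart();
                String field = elementToField.get(element);
                if (field != null) {
                    // Assumption: mapped elements are text-only;
                    // getElementText() throws if it meets a child element.
                    fields.put(field, reader.getElementText());
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return fields;
    }
}
```

Supporting a new message type then means supplying a different mapping rather than writing a new parser. Elements that are not in the mapping are simply skipped.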

I don't think the tightly-coupled nature of your code is anything to do with StAX, that's just the way you've chosen to write it.

You could easily refactor that code to delegate handling of the events to handler objects, using a lookup table of, for example, element names to handler objects. This mechanism could be entirely generic and reusable.
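One possible shape for that refactoring, as a sketch only: ElementHandler and DispatchingParser are hypothetical names, not part of StAX, and the lookup table here is keyed by the element's local name.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical callback type: invoked with the parser positioned on a start tag.
interface ElementHandler {
    void handle(XMLStreamReader parser) throws XMLStreamException;
}

// Hypothetical reusable dispatcher: the parse loop contains no element names.
class DispatchingParser {
    private final Map<String, ElementHandler> handlers = new HashMap<>();

    // Register a handler for one element name; returns this to allow chaining.
    DispatchingParser on(String elementName, ElementHandler handler) {
        handlers.put(elementName, handler);
        return this;
    }

    void parse(String xml) {
        try {
            XMLStreamReader parser = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.START_ELEMENT) {
                    ElementHandler handler = handlers.get(parser.getLocalName());
                    if (handler != null) {
                        handler.handle(parser);
                    }
                }
            }
            parser.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

Usage might look like this, with the knowledge about "name" and "address" living entirely in the registered handlers:

```java
java.util.List<String> names = new java.util.ArrayList<>();
new DispatchingParser()
        .on("name", p -> names.add(p.getElementText()))
        .on("address", p -> System.out.println("address: " + p.getElementText()))
        .parse("<person><name>Ann</name><address>1 Main St</address></person>");
```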

Licensed under: CC-BY-SA with attribution