Question

I have a log file with the following structure.

unstructured raw text 
unstructured raw text 
..
..
..

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<message>
...
...
</message> 

unstructured raw text 
..
..


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<message>
...
...
</message> 

unstructured raw text 
..
..

As you can see, there are multiple XML documents embedded in a single log file. I was wondering whether there is a generic utility or library that I can reuse before I start writing something of my own. I need it in Java.

Thanks.


Solution

I would favour one of the StAX-based parsers; the Woodstox ones perform particularly well. If you then need a different type of XML parser, you can shunt the events from the parser to a generator and feed that XML into, for example, a DOM-based or SAX-based parser (the latter only if you are a masochist, since SAX is a pain to use).

You will have pseudo-code that looks a little like this:

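import java.io.BufferedReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

BufferedReader br = ...
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
// Matches a line that holds only an XML declaration, ignoring surrounding whitespace
Pattern startOfXml = Pattern.compile("\\s*<\\?xml.*\\?>\\s*");
String line;
while (null != (line = br.readLine())) {
    if (startOfXml.matcher(line).matches()) {
        // The declaration line is already consumed, so the parser starts at the root element
        XMLEventReader xr = inputFactory.createXMLEventReader(br);
        XMLEvent event;
        while (!(event = xr.nextEvent()).isEndDocument()) {
            // do whatever you want with the event
        }
    } else {
        // do whatever you want with the plain-text
    }
}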

Some StAX parsers, in certain modes, may object to the isEndDocument() check. In that case you will have to track the element depth while parsing the document and break out once you reach the end element of the root. Also, some parsers may buffer a few characters past the end of the document. In the worst case you just need to catch the exception for a "malformed" document that is thrown when the parser notices text after the root end element.
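The depth-counting fallback could be sketched roughly as follows. This is only an illustration: it uses the cursor API (XMLStreamReader) instead of XMLEventReader, the class and method names (DepthBoundedReader, readOneDocument) are made up for this example, and it assumes the reader is positioned at or before the root element. It stops as soon as the root element closes, so the parser is never asked to look at the text that follows.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
public class DepthBoundedReader {

    /**
     * Reads a single XML document from the reader and stops at the end tag
     * of the root element. Returns the local names of the elements seen.
     */
    public static List<String> readOneDocument(Reader in) throws XMLStreamException {
        XMLStreamReader xr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> names = new ArrayList<>();
        int depth = 0; // number of currently open elements
        try {
            while (xr.hasNext()) {
                int event = xr.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    names.add(xr.getLocalName());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                    if (depth == 0) {
                        break; // root element closed: do not request further events
                    }
                }
            }
        } finally {
            xr.close(); // releases parser resources; does not close the underlying reader
        }
        return names;
    }
}
```

Note that the parser may still have buffered characters beyond the root end tag, so the underlying reader's position after this call should not be relied upon.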

OTHER TIPS

You can use the XML parsers that are built into Java, but you have to give them only XML as input. So read the parts of the file that are XML into a String, and then parse each String. If you don't know how to parse a String as XML, see here: In Java, how do I parse XML as a String instead of a file?
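A minimal sketch of that approach, under some assumptions taken from the sample log in the question: each document starts with an XML declaration, the root element is named message, and its end tag is the last markup of the document. The class and method names (EmbeddedXmlExtractor, extract, parse) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class EmbeddedXmlExtractor {

    /** Collects every embedded XML document in the log as a String. */
    public static List<String> extract(Reader log) throws IOException {
        List<String> documents = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an XML document
        BufferedReader br = new BufferedReader(log);
        String line;
        while ((line = br.readLine()) != null) {
            if (current == null) {
                int start = line.indexOf("<?xml");
                if (start < 0) {
                    continue; // plain log text, skipped in this sketch
                }
                current = new StringBuilder();
                line = line.substring(start); // the declaration must come first
            }
            current.append(line).append('\n');
            // Assumption: the root element is <message>, as in the sample log.
            if (line.contains("</message>")) {
                documents.add(current.toString());
                current = null;
            }
        }
        return documents;
    }

    /** Parses one extracted document with the JDK's built-in DOM parser. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Each String returned by extract can be passed to parse on its own, so a malformed document in the log only affects that one entry.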

Licensed under: CC-BY-SA with attribution