Question

While reading an XML file using StAX and XMLStreamReader, I encountered a weird problem. I'm not sure if it's a bug or if I am doing something wrong; I'm still learning StAX.

The problem is:

  1. In the XMLStreamConstants.CHARACTERS event, I collect the node text with XMLStreamReader.getText().
  2. If the node text contains &, <, > or even something hidden, getText() returns only the first part of the string, e.g. ABC & XYZ returns only ABC.

Simplified Java Source:

    // Start the StAX reader
    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    try {
        XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(inStream);
        int event = xmlStreamReader.getEventType();
        while (true) {
            switch (event) {
                case XMLStreamConstants.START_ELEMENT:
                    switch (xmlStreamReader.getLocalName()) {
                        case "group":
                        // Do something
                            break;
                        case "source":
                            isSource = true;
                            break;
                        case "target":
                            // Use the same flag that the CHARACTERS branch checks
                            isTrans = true;
                            break;
                        default:
                            isSource = false;
                            isTrans = false;
                            break;
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (srcData != null) {
                        String srcTrns = xmlStreamReader.getText();
                        if (srcTrns != null) {
                            if (isSource) {
                                // Set source text
                                isSource = false;
                            } else if (isTrans) {
                                // Set target text
                                isTrans = false;
                            }
                        }
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (xmlStreamReader.getLocalName().equals("group")) {
                        // Add to return list
                    }
                    break;
            }
            if (!xmlStreamReader.hasNext()) {
                break;
            }
            event = xmlStreamReader.next();
        }
    } catch (XMLStreamException ex) {
        LOG.log(Level.WARNING, ex.getMessage(), MessageFormat.format("{0} {1}", ex.getCause(), ex.getLocation()));
    }

I am not quite sure what exactly I am doing wrong or how to collect the complete text of the node.

Any suggestions or tips would be a great help as I keep learning StAX. :-)


Solution

I have solved the problem after struggling and researching a bit.

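It was a problem reading text that contains escaped entity references. You need to set the IS_COALESCING property to true on the XMLInputFactory instance, before creating the reader:

    xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, true);

Basically, this tells the parser to coalesce adjacent character data into a single CHARACTERS event. Without it, the parser is free to report the text of one element as several CHARACTERS events, typically splitting at each entity reference, so getText() on the first event returns only the text up to the first reference. The references themselves are still replaced with their replacement text (something like decoding) and read as normal characters.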
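
To see the effect in isolation, here is a minimal sketch that parses a small hard-coded document with and without coalescing. The class name CoalescingDemo, the helper textEvents, and the sample XML are made up for illustration and are not part of the original code; how many chunks you get without coalescing depends on the StAX implementation in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original question.
class CoalescingDemo {

    /** Returns the text of every CHARACTERS event, one list entry per event. */
    static List<String> textEvents(String xml, boolean coalescing) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Must be set on the factory before the reader is created.
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        List<String> chunks = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    chunks.add(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped only to keep the demo's method signature simple.
            throw new IllegalStateException("Could not parse XML", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): one element whose text contains an entity reference.
        String xml = "<source>ABC &amp; XYZ</source>";
        // Without coalescing the text may arrive as several chunks.
        System.out.println("not coalescing: " + textEvents(xml, false));
        // With coalescing the whole element text arrives as one chunk.
        System.out.println("coalescing:     " + textEvents(xml, true));
    }
}
```

If you cannot change the factory settings, an alternative is to append every CHARACTERS chunk to a StringBuilder and only use the result at END_ELEMENT, or to call XMLStreamReader.getElementText() at START_ELEMENT for elements that contain text only.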

Licensed under: CC-BY-SA with attribution