Question

I am trying to parse a giant (> 1GB) XML file using Java's XMLStreamReader. I use the getText() method to pull the contents of a node. The XML file is encoded as ISO-8859-1, and some characters are escaped as entity references; for example, & is written as &amp;amp; in the file.

So if the file contains, for example:

<person>Jack</person>
<person>Jill</person>
<persons>Jack &amp; Jill</persons>

And I try to get the contents of each node using getText(), the 3rd node only returns Jack. Any time a &xxx; character is encountered, no characters after it (in the same node) are parsed or returned.

Where is the problem? Is the xml file encoded correctly? Am I using the Java parser correctly?

Thanks!


Solution

I suspect that the parser has split the contents of the third element (persons) into multiple events, typically one for the text before the entity reference and further events for the rest. (This behaviour of next() is documented.) getText() returns only the text of the current event, which is why you see just "Jack".

Try using getElementText() instead.
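
A minimal sketch of that approach, assuming the three elements from the question sit inside a wrapper root element (the name people is made up here, since the snippet shows no root) and that the elements of interest contain only text. The class and method names are hypothetical as well.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ElementTextDemo {

    // Returns the complete text of every element except the assumed root.
    static List<String> readTexts(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> texts = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // "people" is the hypothetical wrapper element; skip it and
                // read the text-only children underneath it.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && !"people".equals(reader.getLocalName())) {
                    // getElementText() gathers all character events up to the
                    // matching end tag, so text split around &amp; is joined.
                    texts.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<people>"
                + "<person>Jack</person>"
                + "<person>Jill</person>"
                + "<persons>Jack &amp; Jill</persons>"
                + "</people>";
        for (String text : readTexts(xml)) {
            System.out.println(text);
        }
    }
}
```

For the real file you would probably pass an InputStream to createXMLStreamReader instead of a StringReader, so the parser can pick up the ISO-8859-1 encoding from the XML declaration. Note that getElementText() throws an XMLStreamException if the element contains child elements. If you would rather keep calling getText(), another option that may help is setting XMLInputFactory.IS_COALESCING to Boolean.TRUE on the factory, which asks the parser to merge adjacent character data into one event.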

Licensed under: CC-BY-SA with attribution