I am trying to parse a very large (> 1 GB) XML file using Java's XMLStreamReader. I use the getText() method to read the contents of a node. The file is encoded as ISO-8859-1, and some characters are escaped as entity references; for example, & is written as &amp;amp; in the file.

So if the file contains, for example:

<person>Jack</person>
<person>Jill</person>
<persons>Jack &amp; Jill</persons>

When I try to get the contents of each node using getText(), the third node only returns Jack. Whenever an &xxx; entity is encountered, none of the characters after it (in the same node) are returned.

Where is the problem? Is the xml file encoded correctly? Am I using the Java parser correctly?

Thanks!


Solution

I suspect that the problem is that the parser has split the contents of the third element (persons) into multiple events, typically one for the text before the entity reference and further ones for the entity and the text after it. (This behaviour of next() is documented.) Calling getText() only gives you the text for the current event.

Try using getElementText() instead.
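
To illustrate the suggestion, here is a minimal sketch that collects the text of each element with getElementText(). The class name ElementTextDemo, the method readLeafTexts and the wrapping root element are assumptions made for the example, not part of the original question; the input is read from a string rather than a 1 GB file to keep it short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ElementTextDemo {

    // Hypothetical helper: returns the full text of every element except
    // the wrapper named "root" (the wrapper name is an assumption).
    static List<String> readLeafTexts(String xml) {
        List<String> texts = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && !"root".equals(reader.getLocalName())) {
                        // getElementText() concatenates all character and
                        // entity-reference events up to the matching end tag,
                        // and leaves the reader positioned on that END_ELEMENT.
                        texts.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<root><person>Jack</person><person>Jill</person>"
                + "<persons>Jack &amp; Jill</persons></root>";
        System.out.println(readLeafTexts(xml)); // prints [Jack, Jill, Jack & Jill]
    }
}
```

Note that getElementText() only works on text-only elements and throws an XMLStreamException if it meets a child element. If you need to keep calling getText(), another option that may help is setting XMLInputFactory.IS_COALESCING to Boolean.TRUE on the factory, which asks the parser to merge adjacent character data into a single event.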

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow