Question

I have an XML file which starts with <?xml version="1.0" encoding="iso-8859-2"?>. I read it the following way:

SAXParserFactory.newInstance().newSAXParser().parse(is, handler);

where "is" is an InputStream and "handler" is some arbitrary handler. Parsing then fails with this exception:

org.apache.harmony.xml.ExpatParser$ParseException: At line 41152, column 17: not well-formed (invalid token)

At that position there is a degree sign, enclosed in a CDATA section like this:

<![CDATA[something °]]>

With the charset iso-8859-2 declared, the parser should accept almost any character, including this one, but it apparently does not. What am I doing wrong?

EDIT

I'm doing all this on Android.

Weird: it seems that the parser completely ignores the encoding attribute. I converted the file to UTF-8 while leaving the header unchanged, and now my program reads it without error. Why is that?

(I create the InputStream like this: new BufferedInputStream(new FileInputStream(filename)), i.e. without a Reader, so a wrongly configured Reader cannot be the cause.)


Solution

I worked around the error by detecting the encoding manually. I peeked at the XML header and looked for the encoding attribute (if present), extracted its value as a String, created a Java Charset object from it with Charset.forName(), then created a Reader with that charset and an InputSource over that Reader, like this:

String encoding; // encoding name extracted from the XML header
Charset charset; // created from it with Charset.forName(encoding)
[...]
Reader reader = new BufferedReader(new InputStreamReader(inputStream, charset));
InputSource inputSource = new InputSource(reader);
inputSource.setEncoding(encoding);
SAXParserFactory.newInstance().newSAXParser().parse(inputSource, myHandler);

Unfortunately, I still don't know why the parser could not recognize the encoding automatically.
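
The header-peeking step is elided in the snippet above, so here is one way it might look. This is only a sketch and not the code from the answer: the class XmlEncodingSniffer, its methods detectCharset and parse, and the PEEK_LIMIT constant are all made-up names for illustration. It assumes the file uses an ASCII-compatible encoding (so the XML declaration can be read byte by byte) and does not handle byte order marks or UTF-16.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the names here are illustrative, not from the original answer.
final class XmlEncodingSniffer {
    // How many bytes to peek at; assumed to be enough to cover the XML declaration.
    private static final int PEEK_LIMIT = 1024;

    // Matches encoding="..." (or '...') inside the <?xml ...?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Peeks at the start of the stream, then rewinds it so the parser sees every byte.
    static Charset detectCharset(BufferedInputStream in, Charset fallback) throws IOException {
        in.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int total = 0;
        int n;
        while (total < PEEK_LIMIT && (n = in.read(head, total, PEEK_LIMIT - total)) > 0) {
            total += n;
        }
        in.reset();

        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail here.
        String prolog = new String(head, 0, total, Charset.forName("ISO-8859-1"));
        Matcher m = ENCODING.matcher(prolog);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    // Decodes the bytes ourselves and hands the parser a character stream.
    static void parse(InputStream raw, DefaultHandler handler)
            throws IOException, SAXException, ParserConfigurationException {
        BufferedInputStream in = new BufferedInputStream(raw);
        Charset charset = detectCharset(in, Charset.forName("UTF-8"));
        Reader reader = new InputStreamReader(in, charset);
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(reader), handler);
    }
}
```

With such a helper, the call from the question would become something like XmlEncodingSniffer.parse(new FileInputStream(filename), handler). Falling back to UTF-8 when no encoding attribute is found follows the XML default, but that choice is an assumption of this sketch as well.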

Licensed under: CC-BY-SA with attribution