Missing characters with XmlPullParser

https://stackoverflow.com/questions/9505103

14-11-2019

Question

I'm parsing a file with XmlPullParser on Android. Everything works fine except for some special HTML characters in the text, like these:

&iacute; it should be í
&eacute; it should be é

but they are missing from the Strings I extract:

cami&oacute;n should be camión, but I get camin

and the same with other similar characters.

I don't know exactly where the problem is: in XmlPullParser.getText() or in Java's String handling.

How can I solve this?

Solution

The problem is that plain XML does not have HTML entities: &eacute; is not defined for XML. You either have to use an HTML parser (as suggested in other answers) or translate the entities yourself while reading with XmlPullParser.

Your loop would have to be driven by nextToken() rather than next(), and you would have to handle the XmlPullParser.ENTITY_REF event yourself by mapping the entity name to the character it stands for.
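
A minimal sketch of such a loop follows. It assumes the XmlPull API (org.xmlpull.v1, as shipped with Android) is on the classpath and that setInput() has already been called on the parser. The class name EntityAwareTextReader, the method collectText and the small entity table are made up for illustration, and the handling of undeclared entities may differ between parser implementations.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;

final class EntityAwareTextReader {
    // Illustrative subset of HTML entity names; extend it to cover your input.
    private static final Map<String, String> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("aacute", "\u00E1");
        HTML_ENTITIES.put("eacute", "\u00E9");
        HTML_ENTITIES.put("iacute", "\u00ED");
        HTML_ENTITIES.put("oacute", "\u00F3");
        HTML_ENTITIES.put("uacute", "\u00FA");
    }

    /** Concatenates the character data of the whole document, resolving entity references by hand. */
    static String collectText(XmlPullParser xpp) throws XmlPullParserException, IOException {
        StringBuilder text = new StringBuilder();
        // nextToken() reports each entity reference as its own ENTITY_REF event.
        for (int token = xpp.nextToken();
             token != XmlPullParser.END_DOCUMENT;
             token = xpp.nextToken()) {
            if (token == XmlPullParser.TEXT || token == XmlPullParser.CDSECT) {
                text.append(xpp.getText());
            } else if (token == XmlPullParser.ENTITY_REF) {
                String name = xpp.getName();   // entity name, e.g. "oacute"
                String value = xpp.getText();  // non-null for entities the parser knows, e.g. amp
                if (value == null) {
                    value = HTML_ENTITIES.get(name);
                }
                // Keep the raw reference if the name is not in the table either.
                text.append(value != null ? value : "&" + name + ";");
            }
        }
        return text.toString();
    }
}
```

In a real parser you would fold the ENTITY_REF branch into your existing event loop, appending to whatever buffer holds the text of the current element.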

Of course, if you can change your input file to encode the characters directly in UTF-8 or ISO-8859-1 instead of using HTML entities, that would work too.

OTHER TIPS

I found a solution, but it's expensive in terms of app size and performance, so please let me know if something is wrong or if it could be done in a better way or in fewer steps.

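First, read the file into a String:
```
// Sketch: file is the java.io.File to parse, assumed to be UTF-8; close the Scanner in real code.
String content = new java.util.Scanner(file, "UTF-8").useDelimiter("\\A").next();
```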

http://snippets.dzone.com/posts/show/1335

Add commons-lang3-3.1.jar from Apache Commons to your project, then pass that String through it to unescape the HTML entities:
```
// needs: import org.apache.commons.lang3.StringEscapeUtils;
String output = StringEscapeUtils.unescapeHtml4(content);
```
Feed that unescaped string to your XmlPullParser:
```
xpp.setInput(new StringReader(output));
```

And that's it.
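
One caveat: unescapeHtml4 also decodes &amp;, &lt; and &gt;, so a file that contains those in its text may no longer be well-formed XML after unescaping. Below is a minimal sketch of a narrower pre-processing step that uses only the JDK. It replaces a small table of HTML entity names with their characters and copies everything else, including XML's predefined entities, through unchanged. The class name HtmlEntityPreprocessor, the method replaceHtmlEntities and the table contents are illustrative assumptions, and the sketch does not treat CDATA sections or comments specially.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlEntityPreprocessor {
    // Illustrative subset of HTML entity names; extend it to cover your input.
    private static final Map<String, String> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("aacute", "\u00E1");
        HTML_ENTITIES.put("eacute", "\u00E9");
        HTML_ENTITIES.put("iacute", "\u00ED");
        HTML_ENTITIES.put("oacute", "\u00F3");
        HTML_ENTITIES.put("uacute", "\u00FA");
        HTML_ENTITIES.put("ntilde", "\u00F1");
    }

    // Matches named references such as &oacute; but not numeric ones such as &#243;.
    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String replaceHtmlEntities(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = HTML_ENTITIES.get(m.group(1));
            // Names missing from the table (amp, lt, gt, quot, apos, ...) are left as they are.
            String result = replacement != null ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(result));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The result can then be handed to the parser in the same way as before: xpp.setInput(new StringReader(HtmlEntityPreprocessor.replaceHtmlEntities(content))).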

As far as HTML unescaping is concerned, some people use Html.fromHtml, which comes built in with the Android platform. In terms of application size this could be the better solution. In terms of performance you may need to do a bit of profiling, since Apache Commons unescaping is reported to be much faster than the Android built-in alternative.

Licensed under: CC-BY-SA with attribution
