Question

I'm parsing a file with XmlPullParser on Android. Everything works fine except for some special HTML character entities in the text, like these:

&iacute; should be í
&eacute; should be é

but they are missing from the Strings I extract:

cami&oacute;n should be camión, but I get camin

and the same with other similar characters.

I don't know exactly where the problem is: in xmlpullparser.getText() or in Java's String handling.

How can I solve this?

Solution

The problem is that plain XML does not define HTML entities: &eacute; is not defined in XML. You either have to use an HTML parser (as in the suggestions above) or translate the entities yourself when using XmlPullParser.

Your loop would have to be driven by nextToken() rather than next(), and you would have to handle the XmlPullParser.ENTITY_REF event.
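As a rough sketch of that loop (not a definitive implementation), the class below collects the character data of a document and translates unresolved entity references by name. The class name EntityAwarePullDemo, the method collectText and the small HTML_ENTITIES table are hypothetical and only there for illustration. It assumes an XmlPullParser implementation is on the classpath, as it is on Android.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

public class EntityAwarePullDemo {

    // Hypothetical, deliberately tiny lookup table; extend it with the
    // HTML entity names that actually occur in your input.
    private static final Map<String, String> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("aacute", "\u00E1"); // á
        HTML_ENTITIES.put("eacute", "\u00E9"); // é
        HTML_ENTITIES.put("iacute", "\u00ED"); // í
        HTML_ENTITIES.put("oacute", "\u00F3"); // ó
        HTML_ENTITIES.put("uacute", "\u00FA"); // ú
        HTML_ENTITIES.put("ntilde", "\u00F1"); // ñ
    }

    /** Concatenates the character data of the document, translating entity references. */
    public static String collectText(String xml)
            throws XmlPullParserException, IOException {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));

        StringBuilder text = new StringBuilder();
        // nextToken() reports each entity reference as its own ENTITY_REF event;
        // next() never returns that event type.
        for (int event = parser.nextToken();
                event != XmlPullParser.END_DOCUMENT;
                event = parser.nextToken()) {
            if (event == XmlPullParser.TEXT || event == XmlPullParser.CDSECT) {
                text.append(parser.getText());
            } else if (event == XmlPullParser.ENTITY_REF) {
                // getText() holds the replacement text when the parser could
                // resolve the reference (e.g. &amp;), otherwise it is null.
                String resolved = parser.getText();
                if (resolved == null) {
                    resolved = HTML_ENTITIES.get(parser.getName());
                }
                if (resolved != null) {
                    text.append(resolved);
                } else {
                    // Unknown name: keep the reference visible instead of dropping it.
                    text.append('&').append(parser.getName()).append(';');
                }
            }
        }
        return text.toString();
    }
}
```

With this in place, an input such as `<w>cami&oacute;n</w>` should come back as camión instead of camin. A real parser loop would also handle START_TAG and END_TAG; they are left out here to keep the entity handling in focus.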

Of course, if you can change your input file to encode the characters directly in UTF-8 or ISO-8859-1 instead of using HTML entities, that would work too.

OTHER TIPS

I found a solution, but it's expensive in terms of app size and performance, so please let me know if something is wrong or could be done better or in fewer steps.

  1. First, read the file into a String. Pseudocode:

     String content = File to string;

http://snippets.dzone.com/posts/show/1335

  2. Add commons-lang3-3.1.jar from Apache Commons to your project, then pass that String through it to unescape the HTML entities:

     String output = StringEscapeUtils.unescapeHtml4(content);

  3. Feed the unescaped String to your XmlPullParser:

     xpp.setInput(new StringReader(output));
    

And that's it.
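One caveat with unescaping the whole document: unescapeHtml4 also turns &amp; and &lt; into raw & and <, which can leave the XML malformed if the file contains them. If the extra jar is a concern, a dependency-free stand-in for the unescaping step could look like the sketch below. It is not the Commons Lang implementation; the class name HtmlEntityPreprocessor, the method replaceHtmlEntities and the small entity table are hypothetical. It replaces only the named entities it knows and leaves everything else, including XML's own amp, lt, gt, quot and apos, untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlEntityPreprocessor {

    // Matches named references such as &eacute; (numeric ones like &#233;
    // are valid XML already and are left to the parser).
    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Hypothetical, deliberately small table; add the names your files use.
    // The five XML built-ins are intentionally absent so they stay escaped.
    private static final Map<String, String> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("aacute", "\u00E1"); // á
        HTML_ENTITIES.put("eacute", "\u00E9"); // é
        HTML_ENTITIES.put("iacute", "\u00ED"); // í
        HTML_ENTITIES.put("oacute", "\u00F3"); // ó
        HTML_ENTITIES.put("uacute", "\u00FA"); // ú
        HTML_ENTITIES.put("ntilde", "\u00F1"); // ñ
    }

    /** Replaces known HTML entity names with their characters; leaves the rest as-is. */
    public static String replaceHtmlEntities(String xml) {
        Matcher matcher = NAMED_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = HTML_ENTITIES.get(matcher.group(1));
            if (replacement == null) {
                // Unknown name (or an XML built-in): keep the reference unchanged.
                replacement = matcher.group();
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

The result can then be fed to the parser in the same way as in the last step, for example `xpp.setInput(new StringReader(HtmlEntityPreprocessor.replaceHtmlEntities(content)));`.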

As for HTML unescaping, some people use Html.fromHtml, which is built into the Android platform. In terms of application size this could be a good solution; in terms of performance, however, you may need to do some profiling, since the Apache Commons unescaping is reportedly much faster than the built-in Android alternative.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow