Problem with charset and rome (rss/atom feeds)

https://stackoverflow.com/questions/5365270

27-10-2019
|

Question

I'm trying to create a feed aggregator using rome (1.0). Everything is working, but I'm facing problems with feed's charset. I'm developing it using java 1.6 over a mac os x (netbeans 6.9.1).

I'm using the following code to retrieve feeds:

InputStream is = new URL(_source).openConnection().getInputStream();
SyndFeed feed = (SyndFeed) input.build(new InputStreamReader(is, Charset.forName(_charset)));

Where _source is a rss source (like http://rss.cnn.com/rss/edition.rss) and _charset is UTF-8 or ISO-8859-1.

It works, but some sites with latin characters (like portuguese) it doesn't even if I use both encodings.

For instance, feeds read from http://oglobo.globo.com/rss/plantaopais.xml will always return dummy characters as following:

Secret�rio de S�o Paulo (UTF-8)
SecretÃ¡rio de SÃ£o Paulo (ISO-8859-1)

Why? Am I missing something?

If I try to use something like UTF-16, rome throws an error: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 1: Content is not allowed in prolog.

I've tried other encodings, like US-ASCII with no lucky...

Another question: is rome the best solution to deal with feeds (using java)? The most recent version from rome is 1.0 that is dated from 2009. Seems to be dead...

TIA,

Bob

Solution

I don't know rome (you could have put a link in your question). ISO-8859-1 should be the right encoding to use for the feed you linked. But doesn't your library supports an InputStream as a source (so it would itself look up the right encoding by the XML preamble)?

Could it be that the output is garbled after it's processing by the output of your program? Could you write

System.out.println("S\u00e3o Paulo");

in your program and report its output? (It should be "São Paulo" if your Java + console combination is configured right.)

So, I now downloaded and compiled Rome (which took half an hour of downloading of other stuff by Maven), and I can reproduce the problem. Looks like the build method taking a Reader has problems.

Here is a variant that works (if rome, jdom and xerces are in the class path):

package de.fencing_game.paul.examples.rome;

import org.xml.sax.InputSource;

import java.nio.charset.Charset;
import java.io.*;
import java.net.*;

import com.sun.syndication.io.*;
import com.sun.syndication.feed.synd.*;

public class RomeTest {

    public static void main(String[] ignored)
        throws IOException, FeedException
    {
        String charset = "UTF-8";
        String url = "http://oglobo.globo.com/rss/plantaopais.xml";


        InputStream is = new URL(url).openConnection().getInputStream();
        InputSource source = new InputSource(is);

        SyndFeedInput input = new SyndFeedInput();
        SyndFeed feed = input.build(source);

        System.out.println("description: " + feed.getDescription());
    }


}

By using an InputSource with an InputStream instead of a Reader, the parser itself finds out the right charset, and gets it right.

Digging a bit around in the source, it seems our SyndFeed passes the Reader or InputSource to JDOM, which in turn passes it to the SAX XMLReader, which seems to get confused if confronted with a Reader which presents itself with <?xml ... encoding="ISO-8859-1" ?>. I then dug around in the source of Xerces (which seem to be the one used here), but didn't find anything suspicious which would cause this.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow