Newly-developed encoding issue with an Oracle Java program

https://stackoverflow.com/questions/23604115

20-07-2023

Question

My Java program (or rather, a part of it) sends a request to a web service and receives RDF strings that include ancient Greek words in Unicode. I wrote the program in NetBeans, and until now there had been no problems at run time, either inside NetBeans or as a standalone jar under Linux and Windows XP. Now, all of a sudden, the Greek words in the RDF come back garbled like this:

á¼€

At first, I thought this was a Windows XP problem, but when checking under Windows 7 the problem persisted. I found out that I was running OpenJDK under Linux, and was since able to reproduce the issue using Oracle Java. This is the relevant code (of course, I may have tunnel vision, so please tell me if you need more):

try {
        HttpClient client = new DefaultHttpClient();
        HttpGet get;
        get = new HttpGet(URL+URLEncoder.encode(form, "UTF-8"));

        HttpResponse response = client.execute(get);
        if (201 == response.getStatusLine().getStatusCode()) {
            HttpEntity respEnt = response.getEntity();
            BufferedReader reader = new BufferedReader(new InputStreamReader(respEnt.getContent()));
            StringBuilder sb = new StringBuilder();
            char[] cbuffer = new char[256];
            int read;

            while ((read = reader.read(cbuffer)) != -1) {
                sb.append(cbuffer,0,read);
            }
            //System.out.println(sb.toString());
            rdf = new String(sb.toString().getBytes("UTF-8"),"UTF-8");

        } else {
            System.err.println("HTTP request failed.");
        }         

    } catch (IOException e) {
        System.err.println("Problem with the HTTP request.");
    }

The web service is the Perseus morphology service, which can be found here: http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=grc&engine=morpheusgrc&word=. Try "word=μῆνιν", for example. I really don't know how or when the RDF is generated.

I would be very grateful for further insights!

Solution

Make sure the encoding of your strings is consistent from client to server and back again. In your case, the server's response (the RDF strings) matters most: it is encoded on the server side and decoded in your client code.
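To illustrate what "consistent" means here, the following is a minimal sketch using only the standard library. The class name RoundTripDemo and the helper serverEncode are made up for this example, and it assumes the server sends UTF-8, which the question does not confirm. It also suggests why the line rdf = new String(sb.toString().getBytes("UTF-8"),"UTF-8") in the posted code cannot repair anything: encoding and decoding with the same charset just returns the string it started with.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the posted program.
class RoundTripDemo {

    // Stand-in for the server side: turns text into bytes using UTF-8
    // (assumed here; the real service's encoding is not stated in the question).
    static byte[] serverEncode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u03BC\u1FC6\u03BD\u03B9\u03BD"; // "μῆνιν"
        byte[] wire = serverEncode(original);

        // Consistent: decoded with the same charset that was used to encode.
        String consistent = new String(wire, StandardCharsets.UTF_8);

        // Inconsistent: the same bytes decoded with a different charset.
        String inconsistent = new String(wire, StandardCharsets.ISO_8859_1);

        // Re-encoding and decoding with one charset is an identity operation,
        // so it cannot undo a wrong decode that has already happened.
        String reencoded = new String(
                inconsistent.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

        System.out.println(consistent.equals(original));     // true
        System.out.println(inconsistent.equals(original));   // false
        System.out.println(reencoded.equals(inconsistent));  // true
    }
}
```

The damage happens at the moment bytes are turned into characters, so that is the place to fix it.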

One thing concerning the client code you posted: you are using the one-argument constructor of InputStreamReader in this line:

BufferedReader reader = new BufferedReader(new InputStreamReader(respEnt.getContent()));

It will read from the input stream using the default charset of the JVM (and of the system it runs on), so the outcome will depend on the machine and JVM your client application is running on. Try setting the charset explicitly using this constructor:

new InputStreamReader(respEnt.getContent(), "UTF-8")
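As a rough, self-contained sketch of the difference an explicit charset makes: the class ExplicitCharsetDemo and the helper readAll are hypothetical names, the byte array stands in for the HTTP response body, and the sketch assumes the service sends UTF-8 and that the JRE provides the windows-1252 charset (typical, but not guaranteed by the Java specification). It uses the InputStreamReader constructor that takes a Charset, which avoids charset names passed as strings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the posted program.
class ExplicitCharsetDemo {

    // Reads the whole stream into a String, decoding with the given charset
    // instead of relying on the JVM default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[256];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the response body: U+1F00 (Greek small alpha with psili)
        // encoded as UTF-8, i.e. the bytes E1 BC 80.
        byte[] body = "\u1F00".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        String asCp1252 = readAll(new ByteArrayInputStream(body), Charset.forName("windows-1252"));

        System.out.println(asUtf8.equals("\u1F00"));                 // true
        System.out.println(asCp1252.equals("\u00E1\u00BC\u20AC"));   // true: "á¼€"
    }
}
```

The second result is the same three characters shown in the question, which is consistent with UTF-8 bytes being decoded on a machine whose default charset is windows-1252.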