Reading Windows-1252 data from Oracle and writing it to an XML file with Latin-1 characters encoded as UTF-8

https://stackoverflow.com/questions/22685230

22-06-2023

Question

I am trying to read from an Oracle database that stores its data in Windows-1252 encoding. I read the data using JDBC and write it to an XML file that should be UTF-8 encoded.

When writing to these files, I get '?' characters instead of the Latin characters, e.g. instead of í, I get a ?

'Coquí' is being written to XML as 'Coqu?'

I use this file to upload to Solr later on. I have only included the relevant code here rather than the whole method, since it is long and complicated (legacy code that I have inherited).

BufferedWriter result = new BufferedWriter(new FileWriter(OUTPUT_FILE));

stmt = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
rst = stmt.executeQuery(sql);
if (rst.getFetchSize() < 1)
    return;

rst.beforeFirst();

while (rst.next()) {

    Profile p = new Profile();
    p.business_name = rst.getString("business_name");
    p.business_name_sort = rst.getString("business_name_sort");

    result.write(p.business_name);
    result.write(p.business_name_sort);

}

Solution

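Judging by the code you posted, the character set conversion is most likely going wrong on the way out rather than on the way in. A Java String is always Unicode (UTF-16) internally, so provided the JDBC driver decodes the Windows-1252 data correctly, rst.getString() already returns 'Coquí'. The conversion that matters happens when those characters are turned back into bytes, and Java only uses the encoding you want there if you ask for it: new FileWriter(OUTPUT_FILE) encodes with the platform default charset. If that default cannot represent í (US-ASCII, for example), the encoder substitutes a ?.

Create the writer with an explicit encoding instead:

BufferedWriter result = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(OUTPUT_FILE), "UTF-8"));

Note that a round trip such as new String(originalText.getBytes("UTF-8"), "UTF-8") does not convert anything: it encodes the String to UTF-8 bytes and decodes them straight back, which yields the same String. If the text is already wrong in the String returned by getString(), the problem lies in the JDBC driver or database character set configuration rather than in the writer.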
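To see which step can turn í into ?, here is a minimal sketch of the two encoding steps involved. It assumes the String coming back from getString() already holds the right characters. The class name Utf8WriterSketch and the helpers roundTrip and writeUtf8 are made up for illustration and are not part of the original code.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8WriterSketch {

    // Encodes the text with the given charset and decodes it again,
    // which is what effectively happens when a file is written and read back.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Writes the text as UTF-8, independent of the platform default charset.
    public static void writeUtf8(Path file, String text) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // "Coquí", written with an escape so the source file's encoding does not matter.
        String name = "Coqu\u00ED";

        // A charset that cannot represent í replaces it with '?'.
        System.out.println(roundTrip(name, StandardCharsets.US_ASCII)); // Coqu?

        // A UTF-8 round trip returns an equal String, so it converts nothing.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name)); // true

        // With an explicit UTF-8 writer, í is stored as the two bytes 0xC3 0xAD.
        Path file = Files.createTempFile("profile", ".xml");
        writeUtf8(file, name);
        System.out.println(Files.readAllBytes(file).length); // 6
        Files.delete(file);
    }
}
```

If the first line of output matches the symptom in the XML file, the writer's charset is the likely culprit, and passing UTF-8 explicitly when creating the writer should be enough.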

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow