Question

I have a data provider that sends me data that is supposed to be encoded in UTF-8. The data contains this sequence of bytes:

28 49 4e 54 e2 80 99 4c 29 20  (INT’L) => "(INT’L)"

For some reason, when my Java program fetches this data and stores it in the database, the above sequence becomes:

28 49 4e 54 19 4c 29 20        (INT.L) => "(INT\u0019L)"

The Java program is built on top of Hibernate. It first fetches the data from the provider, stores it in an entity, and then this entity is persisted in the database (PostgreSQL).

Why am I losing the bytes (e2 80 99 becomes 19)?
How can I avoid this?

Here is the core method used to transfer the data fetched from the provider to the entity:

import java.sql.Clob;

//...

public static String convertStreamToString(Clob clob) throws SQLException {
    if (clob == null) {
        return "";
    }

    BufferedReader br = null;
    StringBuilder result = new StringBuilder();

    try {
        br = new BufferedReader(new InputStreamReader(clob.getAsciiStream(), Charset.forName("UTF-8")));
        String lig;
        int n = 0;
        while ((lig = br.readLine()) != null) {
            if (n > 0) {
                result.append("\n");
            }
            result.append(lig);
            n++;
        }
    } catch (IOException ioe) {
         // Exception handling code ...
    } catch (SQLException sqlex) {
         // Exception handling code ...
    } finally {
        IOUtil.close(br);
    }

    return result.toString();
}

// ...

MyEntity entity = ...
oracle.sql.NCLOB clob = ...
entity.setProperty(convertStreamToString(clob));


@Entity
class MyEntity {
     @Column(name="prop", length=100000)
     private String prop;  

     public void setProperty(String value) {
          this.prop=value;
     }
}

Solution

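You are using getAsciiStream() to read the contents of a CLOB. As the name says, this method is only suitable for ASCII data; it corrupts non-ASCII characters. In your data, ’ is U+2019, and the 0x19 you end up with is the low byte of that code point, which suggests each character is being cut down to a single byte before your UTF-8 decoder ever sees it.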
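To make the byte arithmetic concrete, here is a small sketch that reproduces the corruption from the question. It assumes the ASCII stream keeps only the low byte of each UTF-16 character, which is what the observed bytes suggest but is not something confirmed from the driver itself. The class and method names (LowByteDemo, truncateToLowBytes, toHex) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class LowByteDemo {

    // Assumption: mimics an "ASCII" view that keeps only the low 8 bits of each char.
    static byte[] truncateToLowBytes(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i); // narrowing cast drops the high byte
        }
        return out;
    }

    // Formats bytes as space-separated lowercase hex, like the dumps in the question.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The provider's bytes: "(INT" + e2 80 99 (U+2019 in UTF-8) + "L) "
        byte[] utf8 = {0x28, 0x49, 0x4e, 0x54, (byte) 0xe2, (byte) 0x80, (byte) 0x99, 0x4c, 0x29, 0x20};
        String text = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(toHex(truncateToLowBytes(text)));
        // prints: 28 49 4e 54 19 4c 29 20
    }
}
```

The three bytes e2 80 99 decode to the single character U+2019, and truncating that character to one byte leaves 19, matching what ends up in the database.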

Use the getCharacterStream() method instead. It returns a Reader, so no InputStreamReader or charset is needed:

BufferedReader br = null;
StringBuilder result = new StringBuilder();

try {
    br = new BufferedReader(clob.getCharacterStream());
    // ... the rest of the method stays the same
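For completeness, a self-contained version of the whole conversion might look like the sketch below. It is one possible rewrite, not the asker's exact method: it copies characters with a buffer instead of readLine(), so the original line terminators are kept rather than normalized to "\n", and it lets IOException propagate instead of swallowing it. The class name ClobText and the use of SerialClob as a stand-in for a real database CLOB are assumptions for the demo.

```java
import java.io.IOException;
import java.io.Reader;
import java.sql.Clob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialClob;

final class ClobText {

    // Reads the CLOB as characters; no byte-to-char decoding happens on our side.
    static String clobToString(Clob clob) throws SQLException, IOException {
        if (clob == null) {
            return "";
        }
        StringBuilder result = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (Reader reader = clob.getCharacterStream()) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                result.append(buffer, 0, read);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws Exception {
        // SerialClob is an in-memory Clob from the JDK, used here only for the demo.
        String text = "(INT\u2019L) ";
        Clob clob = new SerialClob(text.toCharArray());
        System.out.println(clobToString(clob).equals(text)); // prints: true
    }
}
```

With a real driver the Clob would come from the result set (the oracle.sql.NCLOB in the question), and the call site stays the same: entity.setProperty(ClobText.clobToString(clob)).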

OTHER TIPS

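I think the DB column type should support UTF-8. In Oracle this is NVARCHAR2. In PostgreSQL the encoding is set per database rather than per column, so a plain varchar column is enough as long as the database encoding is UTF8: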

create table test (
    utf8fld varchar(50)
);
Licensed under: CC-BY-SA with attribution