Java 1.6 Windows-1252 encoding fails on 3 characters

https://stackoverflow.com/questions/2147845

23-09-2019
|

Question

EDIT: I've been convinced that this question is somewhat non-sensical. Thanks to those who responded. I may post a follow-up question that is more specific.

Today I was investing some encoding problems and wrote this unit test to isolate a base repro case:

int badCount = 0;
for (int i = 1; i < 255; i++) {
    String str = "Hi " + new String(new char[] { (char) i });

    String toLatin1  = new String(str.getBytes("UTF-8"), "latin1");
    assertEquals(str, new String(toLatin1.getBytes("latin1"), "UTF-8"));

    String toWin1252 = new String(str.getBytes("UTF-8"), "Windows-1252");
    String fromWin1252 = new String(toWin1252.getBytes("Windows-1252"), "UTF-8");

    if (!str.equals(fromWin1252)) {
        System.out.println("Can't encode: " + i + " - " + str + 
                           " - encodes as: " + fromWin1252);
        badCount++;
    }
}

System.out.println("Bad count: " + badCount);

The output:

    Can't encode: 129 - Hi ? - encodes as: Hi ??
    Can't encode: 141 - Hi ? - encodes as: Hi ??
    Can't encode: 143 - Hi ? - encodes as: Hi ??
    Can't encode: 144 - Hi ? - encodes as: Hi ??
    Can't encode: 157 - Hi ? - encodes as: Hi ??
    Can't encode: 193 - Hi Á - encodes as: Hi ??
    Can't encode: 205 - Hi Í - encodes as: Hi ??
    Can't encode: 207 - Hi Ï - encodes as: Hi ??
    Can't encode: 208 - Hi ? - encodes as: Hi ??
    Can't encode: 221 - Hi ? - encodes as: Hi ??
    Bad count: 10

JDK 1.6.0_07 on Mac OS 10.6.2

My observation:

Latin1 symmetrically encodes all 254 characters. Windows-1252 does not. The three printable characters (193, 205, 207) are the same codes in Latin1 and Windows-1252, so I wouldn't expect any issues.

Can anyone explain this behavior? Is this a JDK bug?

-- James

Solution

In my opinion the testing program is deeply flawed, because it makes effectively useless transformations between Strings with no semantic meaning.

If you want to check if all byte values are valid values for a given encoding, then something like this might be more like it:

public static void tryEncoding(final String encoding) throws UnsupportedEncodingException {
    int badCount = 0;
    for (int i = 1; i < 255; i++) {
        byte[] bytes = new byte[] { (byte) i };

        String toString = new String(bytes, encoding);
        byte[] fromString = toString.getBytes(encoding);

        if (!Arrays.equals(bytes, fromString)) {
            System.out.println("Can't encode: " + i + " - in: " + Arrays.toString(bytes) + "/ out: "
                    + Arrays.toString(fromString) + " - result: " + toString);
            badCount++;
        }
    }

    System.out.println("Bad count: " + badCount);
}

Note that this testing program tests inputs using the (usnigned) byte values from 1 to 255. The code in the question uses the char values (equivalent to Unicode codepoints in this range) from 1 to 255.

Try printing the actual byte arrays handled by the program in the example and you see that you're not actually checking all byte values and that some of your "bad" matches are duplicates of others.

Running this with "Windows-1252" as the argument produces this output:

Can't encode: 129 - in: [-127]/ out: [63] - result: �
Can't encode: 141 - in: [-115]/ out: [63] - result: �
Can't encode: 143 - in: [-113]/ out: [63] - result: �
Can't encode: 144 - in: [-112]/ out: [63] - result: �
Can't encode: 157 - in: [-99]/ out: [63] - result: �
Bad count: 5

Which tells us that Windows-1252 doesn't accept the byte values 129, 1441, 143, 144 and 157 as valid values. (Note: I'm talking about unsigned byte values here. The code above shows -127, -115, ... because Java only knows unsigned bytes).

The Wikipedia article on Windows-1252 seems to verify this observation by stating this:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused

OTHER TIPS

What your code does (String->byte[]->String, twice) is pretty much the opposite of transcoding, and makes no sense at all (it's virtually guaranteed to lose data). Transcoding means byte[]->String->byte[]:

public byte[] transcode(byte[] input, String inputEnc, String targetEnc)
{
    return new String(input, inputEnc).getBytes(targetEnc);
}

And of course, it will lose data when the input contains characters that the target encoding does not support.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow