Question

I'm attempting to convert a string of characters from ASCII to EBCDIC using an IBM codepage. The conversion is correct except for lower case 'a' which is converted to an unprintable character.

Here is a piece of a Groovy script, running on Windows 7, that illustrates the problem.

groovy:000> letters='abcdABCD'
===> abcdABCD
groovy:000> String.format("%04x", new BigInteger(1, letters.getBytes()))
===> 6162636441424344
groovy:000> lettersx=new String(letters.getBytes('IBM500'))
===> ?éâä┴┬├─
groovy:000> String.format("%04x", new BigInteger(1, lettersx.getBytes()))
===> 3f828384c1c2c3c4

After converting to EBCDIC, all the characters in the string are valid except the first one, the lower case 'a'. Try as I might, I can't find any information on this problem. I've tried a number of IBM code pages (IBM01140, IBM1047, etc.) with the same results.


Solution

The problem is in this expression:

new String(letters.getBytes('IBM500'))

letters.getBytes creates a byte-array containing (in hexadecimal):

 81 82 83 84 C1 C2 C3 C4

but then you're immediately converting that back to a Unicode String using your platform default encoding:

 new String( <byte-array> );

If you want the ordinal values of the characters in your String to be equal to the byte values, you must specify an encoding that does that, for example ISO-8859-1:

new String(letters.getBytes('IBM500'), "ISO-8859-1")

The default encoding you're using does not define a character for byte 0x81, so it is replaced with ? (0x3F). You're most likely using Windows-1252.

Strings contain characters, not bytes. Java will always apply an encoding conversion when going from one to the other.
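To make that identity mapping concrete, here is a small standalone sketch (the class name IdentityDecode is my own). Decoding with ISO-8859-1 preserves every byte value, so both the bytes and the original text can be recovered:

```java
import java.util.Arrays;

public class IdentityDecode {
    public static void main(String[] args) throws Exception {
        // 'a'..'d' are 0x81..0x84 and 'A'..'D' are 0xC1..0xC4 in IBM500
        byte[] ebcdic = "abcdABCD".getBytes("IBM500");

        // ISO-8859-1 maps every byte 0x00-0xFF to the char with the same
        // code point, so nothing is lost or replaced with '?'
        String asChars = new String(ebcdic, "ISO-8859-1");
        System.out.printf("%04x%n", (int) asChars.charAt(0)); // 0081

        // The identity mapping is reversible: the original bytes come back
        byte[] back = asChars.getBytes("ISO-8859-1");
        System.out.println(Arrays.equals(ebcdic, back)); // true

        // ...and decoding those bytes as IBM500 restores the text
        System.out.println(new String(back, "IBM500")); // abcdABCD
    }
}
```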

EDIT: responding to @mister270's comment:

Here's a program in Java to demonstrate:

public class Ebcdic
{
    public static void main(String[] args) throws Exception
    {
        String letters = "abcdABCD";

        byte[] ebcdic = letters.getBytes("IBM500");

        System.out.print("Ebcdic bytes:");
        for (byte b: ebcdic)
        {
            System.out.format(" %02X", b & 0xFF);
        }
        System.out.println();

        String lettersEbcdic = new String(ebcdic, "ISO-8859-1");

        System.out.print("Ebcdic bytes stored in chars:");
        for (char c: lettersEbcdic.toCharArray())
        {
            System.out.format(" %04X", (int) c);
        }
        System.out.println();

        System.out.println("Ebcdic bytes in chars printed using my default platform encoding: " + lettersEbcdic);
    }
}

Output is:

Ebcdic bytes: 81 82 83 84 C1 C2 C3 C4
Ebcdic bytes stored in chars: 0081 0082 0083 0084 00C1 00C2 00C3 00C4
Ebcdic bytes in chars printed using my default platform encoding: ????��ǎ

What this shows is that

  • the EBCDIC conversion into the byte-array is occurring correctly using "IBM500"
  • the "identity" conversion of bytes to chars using "ISO-8859-1" is occurring correctly
  • my system has no mapping from Unicode characters U+0081 etc. to my default platform character encoding, so it displays them as ?

Java (and therefore Groovy) stores characters internally as Unicode, UTF-16 to be precise. If you want to encode them as EBCDIC, then they stop being characters and should no longer be held in Strings. EBCDIC is an 8-bit encoding, so each character can be stored in a single byte. If you need to interface with a system that expects a particular encoding (in your case, EBCDIC), then that system really should accept bytes, not Strings; otherwise you end up with exactly this sort of confusion.

If you must use Strings to hold EBCDIC bytes, then you must specify the ISO-8859-1 encoding whenever you cross an InputStream or OutputStream boundary (including System.out), so that your EBCDIC codes are not "translated" from bytes to characters.
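For example, here is a sketch of the output side (ByteArrayOutputStream stands in for whatever file or socket stream you are actually writing to): wrapping the stream in an OutputStreamWriter with ISO-8859-1 writes each char 0x00-0xFF back out as the identical byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class EbcdicOut {
    public static void main(String[] args) throws Exception {
        // EBCDIC bytes held in a String via the ISO-8859-1 identity mapping
        String lettersEbcdic = new String("abcdABCD".getBytes("IBM500"), "ISO-8859-1");

        // ByteArrayOutputStream stands in for a real file or socket stream;
        // the Writer's ISO-8859-1 encoding turns each char back into the
        // identical byte value, untranslated
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, "ISO-8859-1")) {
            w.write(lettersEbcdic);
        }

        byte[] written = sink.toByteArray();
        System.out.printf("%02X%n", written[0] & 0xFF); // 81 -- EBCDIC 'a' intact
    }
}
```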

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow