Question

As the Java documentation states:

char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

But when I have a String (containing just ASCII characters) and convert it to a byte array, every character of the String is stored in one byte, which is less than the 16 bits the Java docs state. How does that work? I could imagine that the Java compiler/interpreter uses just one byte per char for an ASCII character for performance reasons.

Furthermore, what happens if I have a String with just ASCII characters plus one non-ASCII character and convert it to a byte array? Does every character of the String use 2 bytes now?


Solution

Converting characters to bytes and vice versa is done using a character encoding.

The character encoding determines how characters are represented as bytes. For example, ASCII is a character encoding that uses 7 bits per character. Obviously, it can only represent 128 characters, far fewer than the 65,536 possible char values in Java.

Other character encodings are UTF-8 and UTF-16. In fact, a Java char is really a UTF-16 code unit: if you cast it directly to an int, you get the UTF-16 code for the character.
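
A minimal sketch of that cast (the variable names are just for illustration):

char a = 'A';
char euro = '\u20AC'; // the euro sign
System.out.println((int) a);    // 65, the UTF-16 code for 'A'
System.out.println((int) euro); // 8364 (0x20AC)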

Here's a longer tutorial on character encodings: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

If you call getBytes() on a String, it will use the default character encoding of the system to convert the characters in the string to bytes. It's better to use the version of getBytes() that takes a character set name as an argument, so that you know what character set is used. For example:

byte[] bytes = str.getBytes("UTF-8");
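
Note that this overload throws a checked UnsupportedEncodingException. Since Java 7 there is an overload taking a java.nio.charset.Charset, which cannot fail that way; a sketch using the standard constants (assuming str is the String from the question):

import java.nio.charset.StandardCharsets;

byte[] bytes = str.getBytes(StandardCharsets.UTF_8); // no checked exception to handle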

OTHER TIPS

The internal format of a String uses 16 bits per character. When you convert it to a byte array, you use a certain character encoding which is either specified explicitly or the default platform encoding. The encoding may use fewer bits per character.

For example, the ASCII encoding will store each character in a single byte, but it can only represent 128 different characters.
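
For instance, a rough sketch of what US-ASCII does with a character it cannot represent (getBytes(Charset) substitutes the charset's replacement byte, which is '?' for US-ASCII):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] ascii = "café".getBytes(StandardCharsets.US_ASCII);
System.out.println(ascii.length);           // 4 -- one byte per character
System.out.println(Arrays.toString(ascii)); // [99, 97, 102, 63] -- 'é' became '?' (63)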

Another often-used encoding is UTF-8, which uses a variable number of bytes per character. The first 128 characters (corresponding to the characters available in ASCII) can be stored in one byte each. Characters with code points of 128 or higher need two or more bytes.
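
A small sketch of that variable-length behavior, which also answers the second question from above: ASCII characters still take one byte each, and only the non-ASCII character takes more:

import java.nio.charset.StandardCharsets;

System.out.println("abc".getBytes(StandardCharsets.UTF_8).length);  // 3 -- one byte per ASCII character
System.out.println("é".getBytes(StandardCharsets.UTF_8).length);    // 2 -- U+00E9 needs two bytes
System.out.println("abc€".getBytes(StandardCharsets.UTF_8).length); // 6 -- only '€' (U+20AC) grows, to three bytes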

getBytes() "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array."

Your platform's default charset is probably UTF-8. Hence, getBytes() will use one byte per character for characters in the ASCII range.

String.getBytes() "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array". The platform's default charset (Charset.defaultCharset()) is probably UTF-8.
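
You can check what your platform actually uses; since Java 18 (JEP 400) the default charset is specified to be UTF-8:

import java.nio.charset.Charset;

System.out.println(Charset.defaultCharset()); // e.g. UTF-8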

As for the second question: strings aren't actually required to be stored as UTF-16 internally. The way a JVM stores strings internally is irrelevant here. The few occurrences of UTF-16 in the JVM spec apply only to individual chars.
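
If you do want UTF-16 bytes, you ask for them explicitly at the encoding step, independently of the internal storage. A sketch; note that the plain "UTF-16" charset prepends a byte order mark when encoding:

import java.nio.charset.StandardCharsets;

System.out.println("abc".getBytes(StandardCharsets.UTF_16).length);   // 8 -- 2-byte BOM plus 2 bytes per char
System.out.println("abc".getBytes(StandardCharsets.UTF_16BE).length); // 6 -- big-endian, no BOM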

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow