Question

As the Java documentation states:

char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

But when I have a String (containing just ASCII characters) and convert it to a byte array, every character of the String is stored in one byte, which is less than the 16 bits the Java docs state. How does that work? I could imagine that the Java compiler/interpreter uses just one byte per char for an ASCII character for performance reasons.

Furthermore, what happens if I have a String with just ASCII characters plus one non-ASCII character and convert it to a byte array? Does every character of the String use 2 bytes now?


Solution

Converting characters to bytes and vice versa is done using a character encoding.

The character encoding determines how characters are represented as bytes. For example, ASCII is a character encoding that uses 7 bits per character. Obviously, it can only represent 128 characters, far fewer than the 65,536 possible char values in Java.

Other character encodings are UTF-8 and UTF-16. In fact, a Java char is really a UTF-16 code unit: if you cast it directly to an int, you get the UTF-16 code for the character.
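
A minimal sketch of that cast (the variable names are just for illustration):

char a = 'A';
char euro = '\u20AC'; // the euro sign
System.out.println((int) a);    // 65, the UTF-16 code for 'A'
System.out.println((int) euro); // 8364 (0x20AC)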

Here's a longer tutorial on character encodings: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

If you call getBytes() on a String, it will use the default character encoding of the system to convert the characters in the string to bytes. It's better to use the version of getBytes() that takes a character set name as an argument, so that you know what character set is used. For example:

byte[] bytes = str.getBytes("UTF-8");
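
Note that this overload throws a checked UnsupportedEncodingException. Since Java 7 there is an overload taking a java.nio.charset.Charset, which cannot fail that way; a sketch using the standard constants (assuming str is the String from the question):

import java.nio.charset.StandardCharsets;

byte[] bytes = str.getBytes(StandardCharsets.UTF_8); // no checked exception to handle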

OTHER TIPS

The internal format of a String uses 16 bits per character. When you convert it to a byte array, you use a certain character encoding which is either specified explicitly or the default platform encoding. The encoding may use fewer bits per character.

For example, the ASCII encoding will store each character in a single byte, but it can only represent 128 different characters.
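
For instance, a rough sketch of what US-ASCII does with a character it cannot represent (getBytes(Charset) substitutes the charset's replacement byte, which is '?' for US-ASCII):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] ascii = "café".getBytes(StandardCharsets.US_ASCII);
System.out.println(ascii.length);           // 4 -- one byte per character
System.out.println(Arrays.toString(ascii)); // [99, 97, 102, 63] -- 'é' became '?' (63)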

Another often-used encoding is UTF-8, which uses a variable number of bytes per character. The first 128 characters (corresponding to the characters available in ASCII) can be stored in one byte each. Characters with code points of 128 or higher need two or more bytes.
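
A small sketch of that variable-length behavior, which also answers the second question from above: ASCII characters still take one byte each, and only the non-ASCII character takes more:

import java.nio.charset.StandardCharsets;

System.out.println("abc".getBytes(StandardCharsets.UTF_8).length);  // 3 -- one byte per ASCII character
System.out.println("é".getBytes(StandardCharsets.UTF_8).length);    // 2 -- U+00E9 needs two bytes
System.out.println("abc€".getBytes(StandardCharsets.UTF_8).length); // 6 -- only '€' (U+20AC) grows, to three bytes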

getBytes() "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array."

Your platform's default charset is probably UTF-8. Hence, getBytes() will use one byte per character for characters in the ASCII range.

String.getBytes() "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array". The platform's default charset (Charset.defaultCharset()) is probably UTF-8.
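
You can check what your platform actually uses; since Java 18 (JEP 400) the default charset is specified to be UTF-8:

import java.nio.charset.Charset;

System.out.println(Charset.defaultCharset()); // e.g. UTF-8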

As for the second question: strings aren't actually required to be stored as UTF-16 internally. The way a JVM stores strings internally is irrelevant here. The few occurrences of UTF-16 in the JVM spec apply only to individual chars.
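
If you do want UTF-16 bytes, you ask for them explicitly at the encoding step, independently of the internal storage. A sketch; note that the plain "UTF-16" charset prepends a byte order mark when encoding:

import java.nio.charset.StandardCharsets;

System.out.println("abc".getBytes(StandardCharsets.UTF_16).length);   // 8 -- 2-byte BOM plus 2 bytes per char
System.out.println("abc".getBytes(StandardCharsets.UTF_16BE).length); // 6 -- big-endian, no BOM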

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow