Question

The code below prints the number of bytes needed to store a String that contains a double-byte Japanese character. As I understand it, the output should be 2, but it is 3. Why is this the case?

String j = "大";     
System.out.println(j.getBytes().length);

If this is always the case, can I assume the following?

1. For a single-byte character, the output will always be 1.

2. For a double-byte character, the output will always be 3.


Solution

A character encoded in UTF-8 takes between 1 and 4 bytes. Your code is printing the correct UTF-8 byte length for the Japanese character you gave it.

OTHER TIPS

I believe the code point for that character is U+5927, which is represented in UTF-8 as the three bytes E5 A4 A7. (Not all non-ASCII characters take 3 bytes in UTF-8, only those with code points in the range 0x0800 to 0xFFFF.)
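To make the code point ranges concrete, here is a minimal sketch that measures the UTF-8 length of one sample character from each range. The class name `Utf8LengthDemo` and the helper `utf8Length` are made up for illustration, and the characters are written as Unicode escapes so the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class Utf8LengthDemo {

    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, ASCII range
        System.out.println(utf8Length("\u00E9"));       // U+00E9, range 0x0080 to 0x07FF
        System.out.println(utf8Length("\u5927"));       // U+5927, range 0x0800 to 0xFFFF
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, above 0xFFFF (surrogate pair)
    }
}
```

The four calls should report 1, 2, 3, and 4 bytes respectively, so the answer to the question's assumption is that the byte count depends on the code point, not on a fixed "single-byte" or "double-byte" class.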

The no-argument .getBytes() method uses the platform's default charset (on Linux this is usually UTF-8, and since Java 18 it is UTF-8 on every platform unless overridden).

Since you mentioned "one-byte" and "two-byte" Japanese characters, I guess you want the Shift_JIS (SJIS) encoding. You can request it like this:

String j = "大";     
System.out.println(j.getBytes("SJIS").length);

This prints 2.
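As a side-by-side comparison, the following sketch encodes the same character with three charsets. It assumes the JRE ships the Shift_JIS charset, which a standard JDK does but a trimmed-down runtime may not, in which case `Charset.forName` throws `UnsupportedCharsetException`. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class CharsetComparisonDemo {

    // Byte length of the text in the named charset.
    // Charset.forName throws UnsupportedCharsetException if the
    // runtime does not provide that charset.
    static int lengthIn(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String j = "\u5927"; // the character from the question, as an escape

        System.out.println("Shift_JIS: " + lengthIn(j, "Shift_JIS"));
        System.out.println("UTF-8:     " + j.getBytes(StandardCharsets.UTF_8).length);
        // UTF_16BE is used instead of UTF_16 because UTF_16 prepends a byte order mark.
        System.out.println("UTF-16BE:  " + j.getBytes(StandardCharsets.UTF_16BE).length);
    }
}
```

Passing a `Charset` object rather than a charset name also avoids the checked `UnsupportedEncodingException` that `getBytes(String)` declares.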

As a guideline, never call .getBytes() without specifying an encoding, and avoid any other method or class that relies on the default system encoding. Otherwise the code may stop working when it runs on a computer with a different default.
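One way to follow that guideline is to name the charset on both the encoding and the decoding side. The sketch below assumes UTF-8 is the encoding you want; `ExplicitCharsetDemo`, `encode`, and `decode` are hypothetical names, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class ExplicitCharsetDemo {

    // Encode with an explicit charset so the result is the same on every machine.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset that was used for encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Shown for diagnosis only; the value varies by platform and Java version.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String j = "\u5927";
        byte[] bytes = encode(j);
        System.out.println("UTF-8 length: " + bytes.length);
        System.out.println("Round trip ok: " + decode(bytes).equals(j));
    }
}
```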

Licensed under: CC-BY-SA with attribution