Double byte character in Java

https://stackoverflow.com/questions/22901666

28-06-2023

Question

The code below prints the number of bytes needed to store the following String, which contains a double-byte Japanese character. As I understand it, the output of this program should be 2, but it comes out as 3. Why is this the case?

String j = "大";     
System.out.println(j.getBytes().length);

If this is always the case, can I assume the following?

1. For a single-byte character, the output of the program will always be 1.

2. For a double-byte character, the output of the program will always be 3.

Solution

In UTF-8, a single character is encoded in 1 to 4 bytes. Your code is therefore printing the correct UTF-8 byte length for the Japanese character you gave it.
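To illustrate the 1-to-4-byte range, here is a minimal sketch that encodes one sample character from each length class. The class name Utf8Lengths and the helper utf8Length are made up for this example, and the characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Returns how many bytes the UTF-8 encoding of s occupies.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, prints 1
        System.out.println(utf8Length("\u00E9"));       // U+00E9 (e with acute accent), prints 2
        System.out.println(utf8Length("\u5927"));       // U+5927 (the character from the question), prints 3
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600 (a surrogate pair in Java), prints 4
    }
}
```

So neither "always 1" nor "always 3" holds in general: the length depends on the code point and on the encoding.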

Other tips

The code point for that character is U+5927, which UTF-8 represents as the three bytes E5 A4 A7. (Not all non-ASCII characters take 3 bytes in UTF-8, only those with code points in the range 0x0800 to 0xFFFF.)
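The bytes E5 A4 A7 can be derived by hand from the standard three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx). The sketch below does that bit manipulation and compares it with what the JDK produces; Utf8ThreeByte, encodeThreeByte and hex are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8ThreeByte {
    // Encodes a code point from the three-byte range (0x0800..0xFFFF, excluding surrogates).
    static byte[] encodeThreeByte(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),         // 1110xxxx: top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)), // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (cp & 0x3F))         // 10xxxxxx: low 6 bits
        };
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeThreeByte(0x5927)));                      // E5 A4 A7
        System.out.println(hex("\u5927".getBytes(StandardCharsets.UTF_8)));    // E5 A4 A7
    }
}
```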

The no-argument .getBytes() method uses the platform's default charset (on Linux this is usually UTF-8; since Java 18, UTF-8 is the default on every platform unless it is overridden).

Since you mentioned "one-byte" and "two-byte" Japanese characters, I guess you want the Shift_JIS (SJIS) encoding. You can request it like this:

String j = "大";     
System.out.println(j.getBytes("SJIS").length);

prints 2.

As a guideline, never call .getBytes without specifying an encoding, and avoid any other method or class that silently relies on the default system encoding. Otherwise the code may behave differently, or break, when it runs on a machine with a different default.
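One way to follow this guideline is to pass a Charset object instead of relying on the default. The sketch below is only an illustration (ExplicitCharsets and byteLength are made-up names); it also guards the Shift_JIS lookup, since that charset is not among the ones every Java implementation is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsets {
    // Byte length of s in the given charset; never depends on the platform default.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String j = "\u5927"; // the character from the question

        System.out.println(byteLength(j, StandardCharsets.UTF_8));    // prints 3
        System.out.println(byteLength(j, StandardCharsets.UTF_16BE)); // prints 2

        // Shift_JIS is an optional charset, so check that this JVM provides it.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.println(byteLength(j, Charset.forName("Shift_JIS"))); // prints 2
        }
    }
}
```

Unlike getBytes(String), the getBytes(Charset) overload does not throw the checked UnsupportedEncodingException, which keeps the calling code simpler.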

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow