Bytes count in UTF-16 strings

https://stackoverflow.com/questions/20151274

03-08-2022

Question

Why does a UTF-16 string of 2 characters take only 6 bytes in memory, while a UTF-16 string of 1 character takes 4 bytes?

Here's an SSCCE in Java to demonstrate this behavior:

public class UTF16Test{
    public static void main(String[] args) throws Exception {
        System.out.println("A".getBytes("UTF-16").length);
        System.out.println("AB".getBytes("UTF-16").length);
    }
}

Output:

4
6

Solution

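You have to take into account the 2-byte byte order mark (BOM) that the UTF-16 encoding adds at the start of the output.

In general, the first 2 bytes of BOM-prefixed UTF-16 data are either FE FF (big-endian) or FF FE (little-endian). Java's "UTF-16" charset always encodes as big-endian with a BOM, regardless of the machine's native byte order, so here they will be FE FF. You should check...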

I did, and it's [-2, -1, 0, 65, 0, 66]. (As signed Java bytes, -2 and -1 are 0xFE and 0xFF, i.e. the big-endian BOM.)
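The byte values above can be reproduced with a short sketch that prints the encoded bytes in hex. The class name Utf16BomDemo and the helpers toHex and encodeHex are made up for illustration; only String.getBytes(Charset) and the StandardCharsets constants come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    // Hypothetical helper: formats each byte as two uppercase hex digits,
    // separated by single spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes such as -2 print as FE.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hypothetical helper: encodes the text with the given charset
    // and returns the hex dump of the result.
    static String encodeHex(String text, Charset charset) {
        return toHex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        // "UTF-16" writes a big-endian BOM followed by big-endian code units.
        System.out.println(encodeHex("AB", StandardCharsets.UTF_16));   // FE FF 00 41 00 42
        // The explicit-endianness charsets write no BOM when encoding.
        System.out.println(encodeHex("AB", StandardCharsets.UTF_16BE)); // 00 41 00 42
        System.out.println(encodeHex("AB", StandardCharsets.UTF_16LE)); // 41 00 42 00
    }
}
```

If the BOM is not wanted in the output, encoding with UTF-16BE or UTF-16LE is one way to avoid it, at the cost of the reader having to know the byte order in advance.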

You should also consider that String values do not take this extra space in memory; the byte order mark is added only when the String is encoded as a byte[]. The String "AB" is held as char[2] in memory (on JDK 8 and earlier; newer JVMs may store it more compactly) until it is byte-encoded as byte[6].
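To make the arithmetic concrete, here is a minimal sketch comparing the number of chars in a String with its encoded sizes. The class name Utf16LengthDemo and its helper methods are hypothetical names chosen for this example. For the non-empty strings in the question, the encoded size works out to 2 bytes of BOM plus 2 bytes per char.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LengthDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Size when encoded with the "UTF-16" charset, which prepends a 2-byte BOM.
    static int bytesWithBom(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }

    // Size when encoded with "UTF-16BE", which writes no BOM.
    static int bytesWithoutBom(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "A", "AB" }) {
            System.out.println(s + ": chars=" + codeUnits(s)
                    + ", UTF-16=" + bytesWithBom(s)
                    + ", UTF-16BE=" + bytesWithoutBom(s));
        }
        // A: chars=1, UTF-16=4, UTF-16BE=2
        // AB: chars=2, UTF-16=6, UTF-16BE=4
    }
}
```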

Licensed under: CC-BY-SA with attribution
