Question

I have created a file with UTF-8 encoding, but I don't understand the rules for the size it takes up on disk. Here is my complete research:

  1. First I created the file with a single Hindi letter 'क', and the file size on Windows 7 was 8 bytes.

  2. Then with two letters, 'कक', the file size was 11 bytes.

  3. Then with three letters, 'ककक', the file size was 14 bytes.

Can someone please explain to me why it shows these sizes?

Solution

The first three bytes are used for the BOM (Byte Order Mark) EF BB BF.

Then, the bytes E0 A4 95 encode the letter क.

Then the bytes 0D 0A encode a Windows line break: a carriage return (0D) followed by a line feed (0A).

Total: 8 bytes. For each letter क you add, you need three more bytes.
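The arithmetic above can be checked with a short Python sketch. It assumes the file contains exactly a BOM, the letters, and one trailing CR LF, as described in this answer; the helper name saved_bytes is made up for illustration.

```python
def saved_bytes(text):
    """Bytes of a file holding `text` plus one CR LF, saved as UTF-8 with a BOM.

    Assumption: the layout is BOM + letters + "\r\n", as described above.
    """
    # The "utf-8-sig" codec prepends the BOM (EF BB BF) when encoding.
    return (text + "\r\n").encode("utf-8-sig")


for n in (1, 2, 3):
    data = saved_bytes("क" * n)
    # Show the size and the raw bytes in hex, two digits per byte.
    print(n, len(data), " ".join(f"{b:02X}" for b in data))
```

The sizes printed for one, two, and three letters are 8, 11, and 14 bytes, matching the question: 3 (BOM) + 3 per letter + 2 (CR LF).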

OTHER TIPS

On Linux-based systems, you can use hexdump to get the hexadecimal dump (as used by Tim in his answer) and see how many bytes a character occupies.

    echo -n a | hexdump -C
    echo -n क | hexdump -C

In the output of these two commands, a takes a single byte (61), while क takes three bytes (e0 a4 95).
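The same check can be done without hexdump, for example on Windows. This is a minimal Python sketch; the helper name utf8_hex is hypothetical, not part of any library.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of `text` as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


for ch in ("a", "क"):
    encoded = ch.encode("utf-8")
    # Print the character, its byte count, and the bytes in hex.
    print(ch, len(encoded), utf8_hex(ch))
```

Unlike the file in the question, this encodes only the character itself, so no BOM or line-break bytes appear in the result.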

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow