Storage space requirements for a varchar(40) column to store 'abcd' with charset 'ucs2' and 'utf8mb4'?

dba.stackexchange https://dba.stackexchange.com/questions/273592

  •  06-03-2021
  •  | 
  •  

Question

I understand that if the length is > 255 it will require 2 bytes to store the length prefix. I just need to verify my assumptions. Please tell me the length prefix and the length of string('abcd') for charsets(latin1, ucs2, utf8mb4). For a varchar(20) column. The documentaton says:

[...] For the string 'abcd', L is 4 and the storage requirement is five bytes. If the same column is instead declared to use the ucs2 double-byte character set, the storage requirement is 10 bytes: The length of 'abcd' is eight bytes and the column requires two bytes to store lengths because the maximum length is greater than 255 (up to 510 bytes). [...]

Reference: 11.7 Data Type Storage Requirements (Oracle | Docs)

I just don't understand why for 'abcd' with charset 'ucs2' it requires 2 byte length prefix even after it being less than 255.

Was it helpful?

Solution

abcd is all English, so: 9 bytes for ucs2; 5 bytes for utf8. But...

UCS2 uses 2 byte per character (at least for 'western' letters). I think it takes 4 bytes for most Asian characters.

utf8 (or utf8mb4) is a variable-length encoding. English letters take 1 byte each; most European text takes 1 or 2 bytes per character. Asian takes 3, sometimes 4 bytes per character.

latin1 has only 1-byte characters, so it is limited to English, plus some accented European letters.

But, but...

VARCHAR(40) stores up to 40 characters (not bytes) in whatever encoding is being used. The disk space be 1 byte for length plus up to 4*40 bytes for text. A long example would be 40 Emoji, taking 161 bytes.

<opinion> There is virtually no reason to use the CHARACTER SET ucs2 (or ucs4). In particular, anything involving unicode or utf8 should use utf8mb4. </opinion>

But But But...

Re your comment about 10 bytes -- InnoDB will sometimes use 1 byte for the length, sometimes 2. But the decision is based on all the columns in the table. Your "40" does not force a 2-byte length, but some other column might.

The documentation says "For example, a VARCHAR(255) column ... (up to 510 bytes)". So it needs 2 bytes since the max byte length is >255. Your example is with "VARCHAR(40)", which is "up to 80 bytes" (<=255).

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top