Having charset “utf8mb4” - does it mean every character takes 4 bytes, or only those that need 4 bytes?
-
15-03-2021 - |
Question
I have a VARCHAR(80) utf8mb4
column, and I'm testing adding some ASCII & Emoji characters, and use LENGTH()
and CHAR_LENGTH()
to understand the differences.
From reading in different places, my understanding is that each character on a utf8mb4
column would take 4 bytes. However, it appears that it may not be the case, if I understand LENGTH()
as giving me the actual size that the specific contents are taking on that field.
Is it correct that a row with "aaaa" contents would take less storage than a row with "😁😁😁😁"?
No correct solution
OTHER TIPS
Actually, my answer is in the Documentation: https://mariadb.com/kb/en/unicode/
UTF-8 encoding using one to three bytes per character. Basic Latin letters, numbers and punctuation use one byte. European and Middle East letters mostly fit into 2 bytes. Korean, Chinese, and Japanese ideographs use 3-bytes. No supplementary characters are stored.
utf8mb4 -- Same as utf8, but stores supplementary characters in four bytes.
Also, as per https://mariadb.com/kb/en/data-type-storage-requirements/
String Data Types
In the descriptions below, M is the declared column length (in characters or in bytes), while len is the actual length in bytes of the value.
VARCHAR(M) -- len + 1 bytes if column is 0 – 255 bytes, len + 2 bytes if column may require more than 255 bytes
So, for VARCHAR(80) utf8mb4
,
80x4 > 255, so a field with "aaaa" will take 6 bytes (len + 2)
When mysql/mariadb creates columns for UTF8 data they are created big enough to fit any string having given encoding.
VARCHAR(80) COLLATE 'utf8_general_ci'
column will be exactly 80x3=240 bytes long.
VARCHAR(80) COLLATE 'utf8mb4_general_ci'
column will be exactly 80x4=320 bytes long.
P.S.
InnoDB engine has so called "overflow pages" for table storage where all long data resides. The root page stores only a prefix for indexing and a pointer to the following page. The same is for TEXT
and BLOB
types.
That is why "row size" for table returned by SHOW TABLE STATUS
can be less than max data size table is able to store.