How large space will be occupied by mysql for a varchar utf8 column?
-
30-09-2020 - |
Question
Tip: To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.
From:http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html
Does this mean if I use a varchar utf8 column other than char utf8 column in mysql, mysql will physically store characters using 1 byte for latin and 3 bytes for CJK(Chinese、japanese、Korean) ?
Solution
CHAR(...) CHARACTER SET utf8
always takes 3 bytes per character -- CHAR(100)
occupies 300 bytes (no length needed).
VARCHAR
occupies 1-2 bytes for a length, plus only as many bytes as needed. So VARCHAR(100)
with hello
will occupy 7 (2+5) bytes in any character set.
Señor
, in CHARACTER SET latin1
, take 5 bytes (plus length). In utf8, it takes 6 bytes (plus length). This is because ñ
is the 1-byte hex F1
in latin1 or the 2-byte C3B1
for utf8.
Some Chinese characters and some Emoji, need 4 bytes, so utf8mb4
is a better choice for them.
There is very little use for ever using CHAR
instead of VARCHAR
. They are usually for consistent-length fields such as postal code, country code, md5, uuid, etc. And all of those may as well be CHARACTER SET ascii
or possibly latin1
, but not utf8
.
Back to your question... With CHARACTER SET utf8
(or utf8mb4):
- English letters take 1 byte.
- Western European accented letters take 2 bytes.
- Greek, Hebrew, and other languages in that general area, take 2 bytes.
- Japanese and Korean take 3 bytes.
- Most Chinese characters are 3 bytes; some are 4 and need utf8mb4.
- A longer list
"mysql will physically store characters using 1 byte for latin and 3 bytes for CJK" is not phrased correctly. Change 'latin' to 'English' to make it correct. ('Latin' tends to include various accented characters.)