MySQL UTF/Unicode migration tips
09-06-2019
Question
Does anyone have any tips or gotcha moments to look out for when trying to migrate MySQL tables from the default case-insensitive Swedish (latin1_swedish_ci) or ASCII charsets to UTF-8? Some of the projects that I'm involved in are striving for better internationalization, and the database is going to be a significant part of this change.
Before we look to alter the database, we are going to convert each site to use UTF-8 character encoding (from least critical to most) to help ensure all input/output is using the same character set.
Thanks for any help
Solution
Some hints:
- Your CHAR and VARCHAR columns will use up to 3 times more disk space. (You probably won't see much disk space growth for Swedish words.)
- Use SET NAMES utf8 before reading or writing to the database. If you don't do this, you will get partially garbled characters.
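To see why the garbling from a missing SET NAMES utf8 is only partial, here is a minimal Python sketch of the mismatch. It assumes the application sends UTF-8 bytes while the connection treats them as latin1. The sample word is made up for illustration, and Python's latin-1 codec stands in for MySQL's latin1, which is close but not identical for every byte.

```python
# Simulate an application sending UTF-8 bytes over a connection that the
# server believes is latin1 (that is, SET NAMES utf8 was never issued).
original = "smörgåsbord"  # hypothetical sample value

utf8_bytes = original.encode("utf-8")    # what the application actually sends
misread = utf8_bytes.decode("latin-1")   # how a latin1 connection interprets it

# ASCII letters survive untouched; each non-ASCII letter becomes two
# wrong characters, which is why the result looks "partially garbled".
print(misread)

# If nothing else has altered the data, the damage can be reversed by
# undoing the wrong decode step.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)
```

The same round trip is roughly what repair scripts for already-damaged data do, but it only works if every row was garbled the same way.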
OTHER TIPS
I am going to be going over the following sites/articles to help find an answer.
Hanselminutes episode "Sorting out Internationalization with Michael Kaplan"
I also just found a very on-topic post by Derek Sivers on the O'Reilly ONLamp Blog as I was writing this: "Turning MySQL data in latin1 to utf8 utf-8".
Beware index length limitations. If a table is structured, say:
a varchar(255)
b varchar(255)
key (a, b)
You're going to go past the 1000-byte limit on key lengths (this is the MyISAM limit; InnoDB's limits differ). 255 + 255 = 510 bytes is okay, but 255*3 + 255*3 = 1530 bytes isn't going to work.
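The arithmetic can be checked with a short Python sketch. It assumes the 1000-byte limit quoted above and a worst case of 3 bytes per character for utf8. The helper names are hypothetical, and any per-column overhead the storage engine may add is ignored.

```python
KEY_LIMIT_BYTES = 1000  # the key-length limit quoted in the text above


def key_bytes(column_chars, bytes_per_char):
    """Worst-case bytes needed to index the given columns in full."""
    return sum(chars * bytes_per_char for chars in column_chars)


def max_prefix_chars(n_columns, bytes_per_char, limit=KEY_LIMIT_BYTES):
    """Largest equal per-column prefix (in characters) that fits the limit."""
    return limit // (bytes_per_char * n_columns)


columns = [255, 255]  # a varchar(255), b varchar(255)

latin1_total = key_bytes(columns, 1)        # 1 byte per character
utf8_total = key_bytes(columns, 3)          # up to 3 bytes per character
prefix = max_prefix_chars(len(columns), 3)  # equal prefix that still fits

print(latin1_total, latin1_total <= KEY_LIMIT_BYTES)
print(utf8_total, utf8_total <= KEY_LIMIT_BYTES)
print(prefix)
```

Under these assumptions, one way out is to shorten the columns or to index only a prefix of each, for example key (a(166), b(166)). Whether a prefix index suits your queries is something to check case by case.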
"Your CHAR and VARCHAR columns will use up to 3 times more disk space."
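Only if they're stuffed full of characters with ordinals above 127. Latin-1 characters in that range take 2 bytes each in UTF-8 and plain ASCII stays at 1 byte, so for mostly-ASCII text the increased space use of UTF-8 is minimal.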
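A quick way to get a feel for the real growth is to measure encoded lengths directly. This Python sketch uses made-up sample strings. It measures raw encoding size only and says nothing about how a particular storage engine pads or reserves space for CHAR columns.

```python
def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# One character from each size class that MySQL's utf8 charset can store.
ascii_char = utf8_len("a")    # plain ASCII
latin1_char = utf8_len("å")   # Latin-1 letter above ordinal 127
bmp_char = utf8_len("€")      # character outside Latin-1

word = "Räksmörgås"  # hypothetical Swedish sample: 10 characters, 3 non-ASCII
latin1_size = len(word.encode("latin-1"))
utf8_size = utf8_len(word)

print(ascii_char, latin1_char, bmp_char)
print(latin1_size, utf8_size, round(utf8_size / latin1_size, 2))
```

For text like the sample word, the growth is far below the 3x worst case, which matches the remark that Swedish words won't grow much.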
The collations are not always favorable. You'll get umlauted characters collating as equal to their non-umlauted versions, which is not always correct. You might want to go with utf8_bin, but then everything is case-sensitive as well.
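The trade-off can be illustrated outside the database. The Python sketch below is only an approximation of the two behaviours, not MySQL's actual collation algorithm. The function names are hypothetical: `fold` strips accents and case to mimic an accent- and case-insensitive collation, and `equal_bin` compares raw bytes the way a binary collation would.

```python
import unicodedata


def fold(text):
    """Strip combining marks and case: a rough stand-in for a *_ci collation."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()


def equal_ci(a, b):
    """Accent- and case-insensitive equality (approximation)."""
    return fold(a) == fold(b)


def equal_bin(a, b):
    """Byte-for-byte equality, as a binary collation would compare."""
    return a.encode("utf-8") == b.encode("utf-8")


print(equal_ci("Müller", "muller"))    # umlaut and case both ignored
print(equal_bin("Müller", "Muller"))   # umlaut now matters
print(equal_bin("Muller", "muller"))   # but so does case
```

If neither behaviour fits, it may be worth checking whether your MySQL version ships a language-specific collation that treats umlauts the way your users expect.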