Convert Character Encoding, and Check Encoding for Issues
-
06-01-2021 - |
سؤال
I'm planning to convert a table from utf8
(_general_ci
) to utf8mb4
(utf8mb4_unicode_ci
). I want to make sure I don't have any malformed data after the conversion though so I was planning to duplicate the data, run a join on the two tables, and see if there are any differences.
Is this a good way to go about it?
1.
Create table Email_utf8mb4 like Email;
2.
insert into Email_utf8mb4 select * from Email;
3.
ALTER TABLE Email_utf8mb4
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
4.
select * from Email as old
join Email_utf8mb4 as new
on old.notificationid =new.notificationid
and (old.subject <> new.subject OR old.content <>
new.content)
4.5. Assuming no rows are returned...
5.
drop Email;
6.
RENAME TABLE Email_utf8mb4 to Email;
subject
and content
are the only alpha columns in my table.
المحلول
That won't fix or recognize any malformed data already in the table.
Yes, that is a good technique. ALTER..CONVERT TO..
does the bulk of the work. And, since utf8 is a subset of utf8mb4, there should be no differences discovered in step 4.
However, there is still a possibility of step 4 showing something. This is because the definition of "equal" (hence <>
) is different for _general_ci
versus _unicode_ci
.
For example, in German, ss
and ß
are unequal in _general_ci (either utf8 or utf8mb4), but equal in virtually all other collations. If, for example, you currently have a UNIQUE
(or PRIMARY KEY
) with a value that differs in just ss
vs ß
, the conversion will have a problem with "duplicate key" error.
Another problematic pair: ae
vs æ
A different issue - not equality, but ordering - Ð
< E
for _unicode_ci, but not for some other collations.
Meanwhile, as long as you are changing the COLLATION
, you may as well go to the newer Unicode Algorithm in utf8mb4_unicode_520_ci
.