Convert Character Encoding, and Check Encoding for Issues

https://dba.stackexchange.com/questions/211939

06-01-2021

Question

I'm planning to convert a table from utf8 (utf8_general_ci) to utf8mb4 (utf8mb4_unicode_ci). I want to make sure I don't have any malformed data after the conversion, so I was planning to duplicate the data, run a join on the two tables, and see if there are any differences.

Is this a good way to go about it?

1. CREATE TABLE Email_utf8mb4 LIKE Email;

2. INSERT INTO Email_utf8mb4 SELECT * FROM Email;

3. ALTER TABLE Email_utf8mb4
   CONVERT TO CHARACTER SET utf8mb4
   COLLATE utf8mb4_unicode_ci;

4. SELECT * FROM Email AS old
   JOIN Email_utf8mb4 AS new
   ON old.notificationid = new.notificationid
   AND (old.subject <> new.subject OR old.content <> new.content);

4.5. Assuming no rows are returned...

5. DROP TABLE Email;

6. RENAME TABLE Email_utf8mb4 TO Email;

subject and content are the only alpha columns in my table.

Solution

One caveat first: this procedure won't fix or even detect any malformed data already in the table.

Otherwise, yes, that is a good technique. ALTER TABLE ... CONVERT TO ... does the bulk of the work. And, since utf8 is a subset of utf8mb4, there should be no differences discovered in step 4.

However, there is still a possibility of step 4 showing something. This is because the definition of "equal" (hence <>) is different for _general_ci versus _unicode_ci.
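If the goal of step 4 is only to confirm that the stored text survived the conversion, one option is to compare bytes rather than rely on either collation's notion of equality. The following is a minimal sketch, not part of the original procedure. It reuses the table and column names from the question, and the aliases `o` and `n` are arbitrary. Since utf8 and utf8mb4 encode the same characters with the same bytes, no rows are expected.

```sql
-- Byte-exact variant of the step 4 check (a sketch; names come from the question).
-- HEX() exposes the stored bytes, so no collation rule is involved.
-- <=> is MySQL's NULL-safe equality operator, so NULL columns compare cleanly.
SELECT o.notificationid,
       HEX(o.subject) AS old_subject_hex,
       HEX(n.subject) AS new_subject_hex
FROM Email AS o
JOIN Email_utf8mb4 AS n
  ON o.notificationid = n.notificationid
WHERE NOT (HEX(o.subject) <=> HEX(n.subject))
   OR NOT (HEX(o.content) <=> HEX(n.content));
```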

For example, in German, ss and ß are unequal in _general_ci (either utf8 or utf8mb4), but equal in virtually all other collations. If, for example, you currently have a UNIQUE (or PRIMARY KEY) index containing two values that differ only in ss versus ß, the conversion will fail with a "duplicate key" error.
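A possible pre-check for that duplicate-key risk is to group the existing values under the target collation before converting. This is a sketch under an assumption the question does not state: that `subject` is covered by a UNIQUE index. Substitute whichever column or columns actually carry the unique key.

```sql
-- Hypothetical pre-check: assumes `subject` has a UNIQUE index (not stated in the question).
-- Values are grouped using the target collation's rules; any group with more
-- than one row would collide once the column is converted.
SELECT CONVERT(subject USING utf8mb4) COLLATE utf8mb4_unicode_ci AS subject_key,
       COUNT(*) AS colliding_rows
FROM Email
GROUP BY subject_key
HAVING COUNT(*) > 1;
```

An empty result suggests the unique key would survive the collation change. Any rows returned need to be resolved before running the ALTER.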

Another problematic pair: ae vs æ
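Both pairs can be tried directly on string literals to see how a given server treats them. This is only an illustrative sketch. It assumes the client really sends UTF-8, which is why the connection character set is set first. If the description above holds for your server version, the _general_ci columns should come back as 0 (unequal) and the _unicode_ci columns as 1 (equal).

```sql
-- Make sure the literals below are interpreted as utf8mb4.
SET NAMES utf8mb4;

-- COLLATE binds more tightly than =, so each comparison uses the named collation.
SELECT 'ss' = 'ß' COLLATE utf8mb4_general_ci AS ss_general,
       'ss' = 'ß' COLLATE utf8mb4_unicode_ci AS ss_unicode,
       'ae' = 'æ' COLLATE utf8mb4_general_ci AS ae_general,
       'ae' = 'æ' COLLATE utf8mb4_unicode_ci AS ae_unicode;
```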

A different issue - not equality, but ordering - Ð < E for _unicode_ci, but not for some other collations.

Meanwhile, as long as you are changing the COLLATION, you may as well go to the newer version of the Unicode Collation Algorithm (5.2.0) implemented by utf8mb4_unicode_520_ci.
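With that collation, step 3 of the question might look like the sketch below. It assumes the server actually offers utf8mb4_unicode_520_ci, which the first statement checks. The final query is just one way to confirm what each text column ended up with.

```sql
-- Confirm the collation is available on this server (returns one row if so).
SHOW COLLATION LIKE 'utf8mb4_unicode_520_ci';

-- Same conversion as step 3, with the newer collation swapped in.
ALTER TABLE Email_utf8mb4
  CONVERT TO CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_520_ci;

-- List the character set and collation of every text column after the change.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Email_utf8mb4'
  AND CHARACTER_SET_NAME IS NOT NULL;
```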

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange