Is there a MySQL character set and encoding that will allow for both emojis and accents?
14-02-2021
Question
I've got a database of terms that get added to by one group of users, and queried against by another.
I was running into problems when people would query for an emoji in the database and my React app would throw an error (oddly a CORS error, but that's a different issue). I eventually solved this by changing my table's character set to utf8mb4 with utf8mb4_unicode_ci collation.
Now I'm getting errors when adding new terms saying, for example, that a duplicate row for "beyoncé" already exists. However, when I query the db for "beyoncé", it returns the row with "beyonce" in it. Is there a combination of charset and collation that can handle this properly?
My DB is MySQL 5.6.40 running on Amazon RDS.
Solution
I was running into problems when people would query for an emoji in the database and my React app would throw an error
What was the exact error message? What were the character set and collation of the column before you changed it to utf8mb4 and utf8mb4_unicode_ci? In MySQL, collation can be set at many levels, including the client connection.
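To see which level is responsible, it can help to inspect the settings directly. A minimal sketch, assuming a hypothetical table named terms (the real table name is not given in the question):

```sql
-- Character set and collation in effect for the connection, database, and server.
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Per-column collation of the (hypothetical) terms table.
SHOW FULL COLUMNS FROM terms;

-- Table default character set and collation, plus any column-level overrides.
SHOW CREATE TABLE terms;
```

If the connection values are not utf8mb4 while the column is, the mismatch may be where the original emoji errors came from.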
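That said, Unicode supports all characters, and utf8mb4 is MySQL's complete UTF-8 encoding (the older utf8 character set, an alias for utf8mb3, stores at most three bytes per character and therefore cannot hold emoji). If your character set truly is utf8mb4, then there is no need to change that.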
You said:
I'm getting errors when adding new terms saying for example that a duplicate row for "beyoncé" already exists, however when I query the db for "beyoncé", it returns the row with "beyonce" in it.
The MySQL documentation states:
For nonbinary collation names that do not specify accent sensitivity, it is determined by case sensitivity. If a collation name does not contain _ai or _as, _ci in the name implies _ai and _cs in the name implies _as.
So, since your collation is utf8mb4_unicode_ci, it is both case-insensitive and accent-insensitive. And this is why "beyoncé" matches "beyonce".
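The difference can be checked without touching any table. A minimal sketch, assuming the client session can be switched to utf8mb4 so that the string literals are compatible with the named collations:

```sql
-- Make string literals on this connection utf8mb4.
SET NAMES utf8mb4;

-- The first column should be 1 (accent ignored), the second 0 (compared byte by byte).
SELECT 'beyoncé' = 'beyonce' COLLATE utf8mb4_unicode_ci AS accent_insensitive_match,
       'beyoncé' = 'beyonce' COLLATE utf8mb4_bin        AS binary_match;
```

A unique index on the column applies the same comparison, which would explain why inserting "beyoncé" is rejected as a duplicate of "beyonce".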
If you need "beyoncé" and "beyonce" to be considered different, then ideally you would use a case-sensitive (and either explicitly stated or implied accent-sensitive) collation. However, it looks like this is not available in MySQL 5.6 (or even 5.7), while MySQL 8.0 does have utf8mb4_0900_as_cs, or even utf8mb4_0900_as_ci if you only want the accent to distinguish between the values while allowing "beyonce" and "Beyonce" to match.
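For reference, here is how the accent-sensitive, case-insensitive collation would behave after an upgrade. This sketch only runs on MySQL 8.0 or later, so it does not apply to the 5.6.40 instance in the question:

```sql
-- Requires MySQL 8.0+; the utf8mb4_0900_* collations do not exist in 5.6 or 5.7.
SET NAMES utf8mb4;

-- accent_match should be 0 (the accent distinguishes the values),
-- case_match should be 1 (letter case is still ignored).
SELECT 'beyoncé' = 'beyonce' COLLATE utf8mb4_0900_as_ci AS accent_match,
       'Beyonce' = 'beyonce' COLLATE utf8mb4_0900_as_ci AS case_match;
```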
So for now, it looks like you might need to use a binary collation, utf8mb4_bin, either by changing the collation of the column, or by adding COLLATE utf8mb4_bin to one or more queries.
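Both options might look roughly like this. The table name terms, the column name term, and the VARCHAR(191) NOT NULL definition are placeholders, since the actual schema is not shown. MODIFY must restate the column's full real definition.

```sql
SET NAMES utf8mb4;

-- Option 1: make the column itself binary-collated. Any index on the column
-- is rebuilt with the new collation, so 'beyoncé' and 'beyonce' become distinct.
ALTER TABLE terms
  MODIFY term VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;

-- Option 2: keep the column as it is and override the collation for one query.
SELECT * FROM terms WHERE term = 'beyoncé' COLLATE utf8mb4_bin;
```

Two caveats. utf8mb4_bin is also case-sensitive, so "Beyonce" and "beyonce" will no longer match. And if the duplicate error comes from a unique index, only the first option removes it, because the index always uses the column's own collation.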