Is there a MySQL character set and encoding that will allow for both emojis and accents?

https://dba.stackexchange.com/questions/249668

14-02-2021

Question

I've got a database of terms that get added to by one group of users, and queried against by another.

I was running into problems when people would query for an emoji in the database and my React app would throw an error (oddly a CORS error, but that's a different issue). I eventually solved this by changing my table's character set to utf8mb4 with utf8mb4_unicode_ci collation.

Now I'm getting errors when adding new terms saying, for example, that a duplicate row for "beyoncé" already exists. However, when I query the db for "beyoncé", it returns the row with "beyonce" in it. Is there a combination of charset and collation that can handle this properly?

My DB is MySQL 5.6.40 running on Amazon RDS.

Solution

"I was running into problems when people would query for an emoji in the database and my React app would throw an error"

What was the exact error message? What were the character set and collation of the column before you changed them to utf8mb4 and utf8mb4_unicode_ci? In MySQL, collation can be set at many levels, including the client connection.
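To see which character set and collation are in effect at each level, something like the following could be run. This is only a sketch: the schema name `mydb` and table name `terms` are placeholders, not names taken from the question.

```sql
-- Server, database, and connection-level character sets and collations
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_%';

-- Table-level default collation (`mydb` and `terms` are hypothetical names)
SELECT table_name, table_collation
FROM information_schema.tables
WHERE table_schema = 'mydb' AND table_name = 'terms';

-- Column-level character set and collation
SELECT column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'mydb' AND table_name = 'terms';
```

If the connection variables (character_set_client, character_set_connection, character_set_results) are not utf8mb4, emoji can be mangled in transit even when the column itself is utf8mb4.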

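That said, Unicode supports all characters, and utf8mb4 is MySQL's encoding for the full Unicode range (MySQL's older utf8, an alias for utf8mb3, stores at most three bytes per character and so cannot hold emoji). If your character set truly is utf8mb4, then there is no need to change it.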

You said:

I'm getting errors when adding new terms saying for example that a duplicate row for "beyoncé" already exists, however when I query the db for "beyoncé", it returns the row with "beyonce" in it.

The MySQL documentation states:

For nonbinary collation names that do not specify accent sensitivity, it is determined by case sensitivity. If a collation name does not contain _ai or _as, _ci in the name implies _ai and _cs in the name implies _as.

Since your collation is utf8mb4_unicode_ci, it is both case-insensitive and accent-insensitive, which is why "beyoncé" matches "beyonce".
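The effect of the collation can be checked directly with string literals, without touching any table. This is a minimal sketch; it assumes the connection character set is utf8mb4, which the first statement sets.

```sql
-- Make the string literals below utf8mb4 so the COLLATE clauses are valid
SET NAMES utf8mb4;

SELECT
  -- accent- and case-insensitive: the two strings compare as equal (1)
  'beyoncé' = 'beyonce' COLLATE utf8mb4_unicode_ci AS unicode_ci_match,
  -- binary: compared by code point, so the two strings differ (0)
  'beyoncé' = 'beyonce' COLLATE utf8mb4_bin AS bin_match;
```

The same rule applies to a UNIQUE index on a utf8mb4_unicode_ci column: two values that compare as equal count as duplicates, which is consistent with the "duplicate row" error described in the question.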

If you need "beyoncé" and "beyonce" to be treated as different values, then ideally you would use an accent-sensitive collation (either explicitly _as, or implied by _cs). However, it looks like no such utf8mb4 collation is available in MySQL 5.6 (or even 5.7). MySQL 8.0 does have utf8mb4_0900_as_cs, and also utf8mb4_0900_as_ci if you only want the accent to distinguish between the values while still allowing "beyonce" and "Beyonce" to match.
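For reference, this is how the accent-sensitive, case-insensitive collation would behave after an upgrade. The sketch applies to MySQL 8.0 only; on 5.6 or 5.7 the statement fails because the collation does not exist there.

```sql
-- MySQL 8.0 only
SET NAMES utf8mb4;

SELECT
  -- accents are significant: the strings differ (0)
  'beyoncé' = 'beyonce' COLLATE utf8mb4_0900_as_ci AS accent_match,
  -- case is not significant: the strings compare as equal (1)
  'Beyonce' = 'beyonce' COLLATE utf8mb4_0900_as_ci AS case_match;
```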

So for now, it looks like you might need to use a binary collation, utf8mb4_bin, either by changing the collation of the column, or by adding COLLATE utf8mb4_bin to one or more queries.
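Both options might look roughly like this. The table name `terms`, the column name `term`, and the column definition VARCHAR(191) NOT NULL are assumptions for illustration; MODIFY replaces the whole column definition, so the real type, length, and nullability have to be restated.

```sql
-- Option 1: change the column's collation.
-- `terms`, `term`, and VARCHAR(191) NOT NULL are hypothetical; use the real definition.
ALTER TABLE terms
  MODIFY term VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;

-- Option 2: leave the column alone and override the collation for one query.
SELECT *
FROM terms
WHERE term = 'beyoncé' COLLATE utf8mb4_bin;
```

Only the first option changes how a UNIQUE index on the column compares values, so it is the one that addresses the duplicate-row error on insert. The per-query override affects just that comparison and may keep MySQL from using an index on the column. Note also that utf8mb4_bin is case-sensitive, so "Beyonce" and "beyonce" would become distinct values as well.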

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange