Change default collation for character set utf8mb4 to utf8mb4_unicode_ci
05-02-2021
Question
When creating a database without specifying a character set or collation, the server's defaults are used (as expected).
MariaDB [(none)]> SHOW VARIABLES LIKE '%_server' ;
+----------------------+--------------------+
| Variable_name | Value |
+----------------------+--------------------+
| character_set_server | utf8mb4 |
| collation_server | utf8mb4_unicode_ci |
+----------------------+--------------------+
MariaDB [(none)]> CREATE DATABASE `test-without-charset` ;
MariaDB [(none)]> SELECT `DEFAULT_COLLATION_NAME` FROM `information_schema`.`SCHEMATA` WHERE `SCHEMA_NAME` LIKE 'test-without-charset';
+------------------------+
| DEFAULT_COLLATION_NAME |
+------------------------+
| utf8mb4_unicode_ci |
+------------------------+
However, when specifying the character set within the CREATE DATABASE query, the default collation changes to utf8mb4_general_ci.
MariaDB [(none)]> CREATE DATABASE `test-with-charset` CHARACTER SET utf8mb4 ;
MariaDB [(none)]> SELECT `DEFAULT_COLLATION_NAME` FROM `information_schema`.`SCHEMATA` WHERE `SCHEMA_NAME` LIKE 'test-with-charset';
+------------------------+
| DEFAULT_COLLATION_NAME |
+------------------------+
| utf8mb4_general_ci |
+------------------------+
I already found out that, according to the MySQL manual:
If CHARACTER SET charset_name is specified without COLLATE, character set charset_name and its default collation are used. To see the default collation for each character set, use the SHOW CHARACTER SET statement or query the INFORMATION_SCHEMA CHARACTER_SETS table.
And indeed it shows utf8mb4_general_ci, so it is following the rules:
MariaDB [(none)]> SHOW CHARACTER SET LIKE 'utf8mb4';
+---------+---------------+--------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+---------------+--------------------+--------+
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 |
+---------+---------------+--------------------+--------+
So my question is: how do I change this default collation for the character set utf8mb4? Is there some configuration file I can change to alter this behaviour? I'd really like those two to be consistent.
Of course I tried Google to find anything relevant, but all I can find is changing the collation_server setting.
Server version: 10.3.15-MariaDB-log MariaDB Server
Solution
I don't think there is a way to change that DEFAULT.
Anyway, it would be better to use utf8mb4_unicode_520_ci, which is based on a later Unicode standard.
Just get into the habit of specifying CHARACTER SET and COLLATION on all connections and CREATE TABLEs. MySQL and MariaDB are gradually changing from latin1_swedish_ci to utf8mb4_0900_ai_ci. MariaDB is not there yet, but I expect them to move soon. And "900" is probably not the last Unicode standard.
By explicitly specifying the charset and collation, you maintain control and consistency, even if it is an outdated pair.
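As a minimal sketch of that habit, the statements below set the pair explicitly on a connection and on a table. The table and column names (`example_table`, `id`, `name`) are hypothetical, and utf8mb4_unicode_ci is used only because it is the collation the question asks for; a database is assumed to be selected already.

```sql
-- Per connection: state both the character set and the collation,
-- instead of relying on the character set's default collation.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Per table: `example_table` and its columns are hypothetical names.
CREATE TABLE example_table (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```

Because COLLATE is always given alongside CHARACTER SET, the default collation reported by SHOW CHARACTER SET never comes into play.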
A compromise...
Put the charset and collation on CREATE DATABASE. Then any tables built without specific settings will inherit those settings. And columns within that table will inherit from the table's settings.
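The inheritance described above could be checked along these lines. This is only a sketch: the database name `test-explicit` and the table `t1` are made up for illustration, following the naming style used in the question.

```sql
-- Hypothetical database: both CHARACTER SET and COLLATE are given,
-- so the character set's default collation is not used.
CREATE DATABASE `test-explicit` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Hypothetical table created without any charset or collation settings.
CREATE TABLE `test-explicit`.`t1` (
    name VARCHAR(100)
);

-- The table should report the collation inherited from the database...
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'test-explicit';

-- ...and the column the collation inherited from the table.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'test-explicit' AND TABLE_NAME = 't1';
```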
OTHER TIPS
Option 1
- IF you are using MySQL 8.0.11 or newer (not sure how that equates to MariaDB 10.3.15), and
- IF you are ok using utf8mb4_0900_ai_ci instead of utf8mb4_unicode_ci
then it seems that a server system variable — @@default_collation_for_utf8mb4 — was added in 8.0.11, but the only valid values are:
utf8mb4_general_ci
utf8mb4_0900_ai_ci
However, if you are seeing a default collation of utf8mb4_general_ci for utf8mb4 instead of utf8mb4_0900_ai_ci, then I am guessing that you don't have this new system variable.
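One way to confirm that guess, sketched here without having been run against the asker's server, is to ask the server directly whether the variable exists:

```sql
-- Which server and version is this?
SELECT VERSION();

-- An empty result set here means the server has no such variable,
-- in which case Option 1 does not apply.
SHOW VARIABLES LIKE 'default_collation_for_utf8mb4';

-- The default collation currently attached to the character set.
SHOW CHARACTER SET LIKE 'utf8mb4';
```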
Option 2
The documentation does show a mechanism for defining your own UCA collation, though it is unclear if this can be used to override a default. I can't test it, but it's worth looking into:
- Adding a Character Set (Item 1 shows basic syntax, config file location, and rules)
- Adding a UCA Collation to a Unicode Character Set
- Adding collation to utf8mb4 charset (MySQL forum question with someone trying to add a collation to utf8mb4, even if not the default). The answer by Xing Zhang resolves the issue, and the only problem appears to have been the collation ID.
Putting that all together, the following might work (but again, I have no way to test):
<charset name="utf8mb4">
  <family>Unicode</family>
  <description>UTF-8 MB4 Unicode</description>
  <collation name="utf8mb4_unicode_ci" id="224">
    <flag>primary</flag>
    <flag>compiled</flag>
  </collation>
  <collation name="utf8mb4_general_ci" id="45">
    <flag>compiled</flag>
  </collation>
  <collation name="utf8mb4_bin" id="46">
    <flag>binary</flag>
    <flag>compiled</flag>
  </collation>
  <collation name="utf8mb4_unicode_520_ci" id="246">
    <flag>compiled</flag>
  </collation>
</charset>
Now, the documentation does state:
You must assign a unique ID number to each collation. The range of IDs from 1024 to 2047 is reserved for user-defined collations. To find the maximum of the currently used collation IDs, use this query:
SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
However, I used the actual IDs with the idea being that we are merely changing the default, not starting with a base collation and adding new rules. I found the IDs here;
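The IDs can also be read from the running server, which is a reasonable sanity check before editing any configuration file. The query below is a sketch that assumes the INFORMATION_SCHEMA.COLLATIONS table exposes the ID and IS_DEFAULT columns, as it does on the MySQL and MariaDB versions I am aware of:

```sql
-- List the collations used in the XML above, with their IDs and
-- a flag showing which one is the character set's current default.
SELECT COLLATION_NAME, ID, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4'
  AND COLLATION_NAME IN (
      'utf8mb4_unicode_ci',
      'utf8mb4_general_ci',
      'utf8mb4_bin',
      'utf8mb4_unicode_520_ci'
  )
ORDER BY ID;
```

If the IDs returned differ from the ones in the XML, the server's values should be trusted over the ones written here.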
Option 3
If all else fails, I would post this question to the following MySQL forum as it looks like you will get rather authoritative answers (based on who is answering some of those questions):