Question

When creating a database without specifying a character set or collation, the server's defaults are used (as expected).

MariaDB [(none)]> SHOW VARIABLES LIKE '%_server' ;
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| character_set_server | utf8mb4            |
| collation_server     | utf8mb4_unicode_ci |
+----------------------+--------------------+

MariaDB [(none)]> CREATE DATABASE `test-without-charset` ;
MariaDB [(none)]> SELECT `DEFAULT_COLLATION_NAME` FROM `information_schema`.`SCHEMATA` WHERE `SCHEMA_NAME` LIKE 'test-without-charset';
+------------------------+
| DEFAULT_COLLATION_NAME |
+------------------------+
| utf8mb4_unicode_ci     |
+------------------------+

However, when the character set is specified in the CREATE DATABASE statement, the default collation changes to utf8mb4_general_ci.

MariaDB [(none)]> CREATE DATABASE `test-with-charset` CHARACTER SET utf8mb4 ;
MariaDB [(none)]> SELECT `DEFAULT_COLLATION_NAME` FROM `information_schema`.`SCHEMATA` WHERE `SCHEMA_NAME` LIKE 'test-with-charset';
+------------------------+
| DEFAULT_COLLATION_NAME |
+------------------------+
| utf8mb4_general_ci     |
+------------------------+

I already found this in the MySQL manual:

If CHARACTER SET charset_name is specified without COLLATE, character set charset_name and its default collation are used. To see the default collation for each character set, use the SHOW CHARACTER SET statement or query the INFORMATION_SCHEMA CHARACTER_SETS table.

And indeed it shows utf8mb4_general_ci, so it is following the rules:

MariaDB [(none)]> SHOW CHARACTER SET LIKE 'utf8mb4';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
+---------+---------------+--------------------+--------+
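The manual excerpt above also mentions the INFORMATION_SCHEMA route. A minimal sketch of that query, assuming the standard CHARACTER_SETS column names (CHARACTER_SET_NAME, DEFAULT_COLLATE_NAME):

```sql
-- Same information as SHOW CHARACTER SET, read from INFORMATION_SCHEMA.
SELECT `CHARACTER_SET_NAME`, `DEFAULT_COLLATE_NAME`
FROM `information_schema`.`CHARACTER_SETS`
WHERE `CHARACTER_SET_NAME` = 'utf8mb4';
```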

So my question is: how do I change this default collation for the character set utf8mb4? Is there some configuration file I can change to alter this behaviour? I'd really like those two to be consistent.

Of course I tried Google to find anything relevant, but all I can find is how to change the collation_server setting.

Server version: 10.3.15-MariaDB-log MariaDB Server


Solution

I don't think there is a way to change that DEFAULT.

Anyway, it would be better to use utf8mb4_unicode_520_ci, which is based on a later version of the Unicode Collation Algorithm (5.2.0 rather than 4.0.0).

Just get into the habit of specifying CHARACTER SET and COLLATE on all connections and CREATE TABLEs. MySQL moved its default from latin1_swedish_ci to utf8mb4_0900_ai_ci in 8.0. MariaDB is not there yet, but I expect them to move soon. And "0900" (Unicode 9.0.0) is probably not the last Unicode version they will adopt.

By explicitly specifying the charset and collation, you maintain control and consistency, even if it is an out-dated pair.
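For the connection side, a minimal sketch of what "explicitly specifying" could look like. The collation here is the asker's utf8mb4_unicode_ci; substitute whichever pair you standardize on:

```sql
-- Pin the connection character set and collation instead of relying on defaults.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check what the session is now using.
SHOW VARIABLES LIKE 'collation_connection';
```

Most client libraries also let you set this in the connection options, which avoids issuing SET NAMES by hand on every connect.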

A compromise...

Put the charset and collation on CREATE DATABASE. Then any tables built without specific settings will inherit those settings, and columns within each table will inherit from the table's settings.
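A sketch of that compromise, using a hypothetical database name (`test-explicit`) and table (`t`) that are not from the original question:

```sql
-- Name both the character set and the collation, so the character set's
-- default collation (utf8mb4_general_ci here) never comes into play.
CREATE DATABASE `test-explicit`
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- No charset or collation given: the table inherits from the database,
-- and the column inherits from the table.
CREATE TABLE `test-explicit`.`t` (
  `name` VARCHAR(50)
);

-- Confirm what the column ended up with.
SELECT `COLUMN_NAME`, `COLLATION_NAME`
FROM `information_schema`.`COLUMNS`
WHERE `TABLE_SCHEMA` = 'test-explicit'
  AND `TABLE_NAME` = 't';
```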

OTHER TIPS

Option 1

  1. IF you are using MySQL 8.0.11 or newer (not sure how that equates to MariaDB 10.3.15), and
  2. IF you are ok using utf8mb4_0900_ai_ci instead of utf8mb4_unicode_ci

then it seems that a server system variable — @@default_collation_for_utf8mb4 — was added in 8.0.11, but the only valid values are:

  • utf8mb4_general_ci
  • utf8mb4_0900_ai_ci

However, if you are seeing a default collation of utf8mb4_general_ci for utf8mb4 instead of utf8mb4_0900_ai_ci, then I am guessing that you don't have this new system variable.
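Rather than guessing, you can ask the server directly. This is only a sketch; an empty result would mean the variable does not exist on that server:

```sql
-- Which server and version are we actually talking to?
SELECT VERSION();

-- Returns one row if the variable exists, an empty set otherwise.
SHOW VARIABLES LIKE 'default_collation_for_utf8mb4';
```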

Option 2

The documentation does show a mechanism for defining your own UCA collation, though it is unclear whether this can be used to override a default. I can't test it, but it's worth looking into.

Putting that all together, the following might work (but again, I have no way to test):

<charset name="utf8mb4">
  <family>Unicode</family>
  <description>UTF-8 MB4 Unicode</description>
  <collation name="utf8mb4_unicode_ci" id="224">
    <flag>primary</flag>
    <flag>compiled</flag>
  </collation>
  <collation name="utf8mb4_general_ci" id="45">
    <flag>compiled</flag>
  </collation>
  <collation name="utf8mb4_bin"     id="46">
    <flag>binary</flag>
    <flag>compiled</flag>
  </collation>
  <collation name="utf8mb4_unicode_520_ci"     id="246">
    <flag>compiled</flag>
  </collation>
</charset>

Now, the documentation does state:

You must assign a unique ID number to each collation. The range of IDs from 1024 to 2047 is reserved for user-defined collations. To find the maximum of the currently used collation IDs, use this query:

     SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;

However, I used the actual IDs, the idea being that we are merely changing the default, not starting with a base collation and adding new rules. I found the IDs here:

https://github.com/mysql/mysql-server/blob/8.0/mysql-test/suite/engines/funcs/r/db_alter_collate_ascii.result
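Since that file comes from the MySQL source tree, it may be worth confirming the IDs on the server you are actually running before reusing them. A sketch, assuming the INFORMATION_SCHEMA.COLLATIONS layout of MariaDB 10.3 (columns COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT):

```sql
-- List the IDs and the current default flag for the collations used in the XML above.
SELECT `COLLATION_NAME`, `ID`, `IS_DEFAULT`
FROM `information_schema`.`COLLATIONS`
WHERE `CHARACTER_SET_NAME` = 'utf8mb4'
  AND `COLLATION_NAME` IN (
    'utf8mb4_general_ci',
    'utf8mb4_bin',
    'utf8mb4_unicode_ci',
    'utf8mb4_unicode_520_ci'
  )
ORDER BY `ID`;
```

If the IDs differ from those in the XML, use the values your server reports.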

Option 3

If all else fails, I would post this question to the following MySQL forum as it looks like you will get rather authoritative answers (based on who is answering some of those questions):

MySQL Forums: Character Sets, Collation, Unicode

Licensed under: CC-BY-SA with attribution