Why doesn't MySQL coerce the collation to the column-specified, when comparing to a literal?
25-01-2021
Question
I have a table with a collation defined on a column:
CREATE TEMPORARY TABLE test_table (
utf8_col VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin
);
INSERT INTO test_table VALUES('ã');
According to the MySQL coercibility rules, when a value from that column is compared to a literal:
SELECT utf8_col < _utf8mb4'ñ' FROM test_table;
- the column collation coercibility value is 2;
- the literal CCV is 4
This should cause the comparison to fail, because the column CCV has higher priority, so the comparison should use the utf8_bin collation, which can't handle the utf8mb4 character set.
The comparison is successful, however. Why is that?
Solution
Why does the above comparison succeed?
Because MySQL is being nice and auto-correcting a user error, but only because it can.
You are not wrong that:
the column CCV has higher priority
and:
the utf8_bin collation can't handle utf8mb4 character sets.
However, you are not seeing what you think you are seeing in your test, because you are not testing with data that requires the "mb4" version of the "utf8" collations. You are testing with a "safe" value that fits into the utf8 charset, which cannot hold supplementary characters. In this situation, MySQL simply converts the literal to the charset of the column, utf8, and the utf8_bin collation is just fine.
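One way to check the priorities described above is to ask MySQL for them directly. The following is only a sketch: the table name coercibility_demo is a hypothetical scratch table mirroring the one in the question, and it assumes a server where the utf8 charset name is still accepted. COERCIBILITY() and COLLATION() are built-in MySQL functions; the values returned for the column and the literal should match the 2 and 4 quoted in the question.

```sql
-- Hypothetical scratch table mirroring the one in the question.
CREATE TEMPORARY TABLE coercibility_demo (
    utf8_col VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin
);
INSERT INTO coercibility_demo VALUES ('ã');

-- COERCIBILITY() reports the collation coercibility value of each operand;
-- the operand with the lower value decides which collation is used.
SELECT COERCIBILITY(utf8_col)    AS column_ccv,
       COERCIBILITY(_utf8mb4'ñ') AS literal_ccv,
       COLLATION(utf8_col)       AS column_collation,
       COLLATION(_utf8mb4'ñ')    AS literal_collation
FROM coercibility_demo;
```

Newer MySQL versions may report the column's charset and collation under the utf8mb3 name rather than utf8.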
The world looks a little different when you use a 4-byte character (i.e. a supplementary character). In this case, the character cannot exist in the utf8 charset, because the utf8 charset can only handle BMP characters (the first 65,536 code points, U+0000 through U+FFFF, which are 1 to 3 bytes each). And so, since the charset of the string literal cannot be changed to the one that works with the column's collation, utf8_bin, you get an error stating:
Illegal mix of collations (utf8_bin,IMPLICIT) and (utf8mb4_0900_ai_ci,COERCIBLE) for operation '<'
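The difference between the two literals can be made visible by looking at their byte widths. This is a minimal sketch, assuming the client connection sends the statement as UTF-8 (for example after SET NAMES utf8mb4); the column aliases are illustrative only. LENGTH() counts bytes rather than characters, and HEX() shows the encoded bytes.

```sql
-- LENGTH() returns the length in bytes; HEX() shows the raw encoding.
-- 'ñ' (U+00F1) is a BMP character; '🍩' (U+1F369) is a supplementary character.
SELECT LENGTH(_utf8mb4'ñ')  AS bmp_bytes,
       HEX(_utf8mb4'ñ')     AS bmp_hex,
       LENGTH(_utf8mb4'🍩') AS supplementary_bytes,
       HEX(_utf8mb4'🍩')    AS supplementary_hex;
```

Under that assumption, 'ñ' should come back as 2 bytes (C3B1) and '🍩' as 4 bytes (F09F8DA9). Only the former fits within the 3-byte limit of the utf8 charset.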
I tested with code point U+1F369 (i.e. the donut emoji: "🍩"):
CREATE TABLE test_table (
utf8_col VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin
);
INSERT INTO test_table VALUES('ã'); # any utf8 character from U+0000 to U+FFFF
# Test against any utf8 character from U+0000 to U+FFFF in _utf8mb4
SELECT utf8_col,
CHARSET(utf8_col) AS 'charset',
COLLATION(utf8_col) AS 'collation',
CONCAT(utf8_col, _utf8mb4'ñ') AS 'concat1',
CHARSET(CONCAT(utf8_col, _utf8mb4'ñ')) AS 'charset_concat1'
FROM test_table;
# Test against any supplementary character (U+10000 to U+10FFFF) in _utf8mb4
SELECT utf8_col,
CHARSET(utf8_col) AS 'charset',
COLLATION(utf8_col) AS 'collation',
CONCAT(utf8_col, _utf8mb4'🍩') AS 'concat2',
CHARSET(CONCAT(utf8_col, _utf8mb4'🍩')) AS 'charset_concat2'
FROM test_table;
# FAIL
SELECT utf8_col < _utf8mb4'ñ' FROM test_table; # Success
SELECT utf8_col < _utf8mb4'🍩' FROM test_table; # FAIL
See the above example code in action on dbfiddle.uk
The result set from the first SELECT is:
utf8_col charset collation concat1 charset_concat1
ã utf8 utf8_bin ãñ utf8
which shows that the utf8mb4 string was coerced into utf8, and everyone was happy. That is, until Mr Donut showed up in the second SELECT statement (you would think that donuts would make everything better, though, wouldn't you? I generally do 😸).