Why doesn't MySQL coerce the collation to the column-specified, when comparing to a literal?

https://dba.stackexchange.com/questions/232834

25-01-2021

Question

I have a table with a collation defined on a column:

CREATE TEMPORARY TABLE test_table (
  utf8_col VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin
);
INSERT INTO test_table VALUES('ã');

According to the MySQL coercibility rules, when a value from that column is compared to a literal:

SELECT utf8_col < _utf8mb4'ñ' FROM test_table;

the column's collation coercibility value (CCV) is 2;
the literal's CCV is 4.

This should cause the comparison to fail, because the column CCV has higher priority, and the comparison should therefore use the utf8_bin collation, which can't handle utf8mb4 character sets.

The comparison is successful, however. Why is that?

Solution

Why does the above comparison succeed?

Because MySQL is being nice and auto-correcting a user error, but only because it can.

You are not wrong that:

the column CCV has higher priority

and:

the utf8_bin collation can't handle utf8mb4 character sets.

However, your test does not show what you think it shows, because the data does not require the "mb4" version of the "utf8" collations. You are testing with a "safe" value that fits into the utf8 charset, which cannot hold supplementary characters. In this situation, MySQL simply converts the literal to the charset of the column (utf8), and the utf8_bin collation works just fine.
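The coercibility values cited in the question can be checked directly with MySQL's COERCIBILITY() function. A minimal sketch, assuming the test_table from the question exists in the current session and the client connection uses utf8mb4 (the column aliases are purely illustrative):

```sql
-- Assumption: test_table from the question exists and holds the row 'ã'.
SELECT COERCIBILITY(utf8_col)    AS col_ccv,         -- column value: implicit collation
       COERCIBILITY(_utf8mb4'ñ') AS literal_ccv,     -- string literal: coercible
       CHARSET(_utf8mb4'ñ')      AS literal_charset  -- charset of the literal itself
FROM test_table;
```

Per the documented rules, this should report 2 for the column and 4 for the literal. The lower value wins, which is why the column's charset and collation are the ones MySQL tries to use for the comparison.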

The world looks a little different when you use a 4-byte character (i.e. a supplementary character). Such a character cannot exist in the utf8 charset, which can only handle BMP characters: the first 65,536 code points, U+0000 through U+FFFF, which take 1 to 3 bytes each. Since the string literal cannot be converted to a charset that works with the column's collation (utf8_bin), you get an error stating:

Illegal mix of collations (utf8_bin,IMPLICIT) and (utf8mb4_0900_ai_ci,COERCIBLE) for operation '<'
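The byte widths behind this error can be inspected with LENGTH() and CHAR_LENGTH(), which report the byte count and the character count of a string. A small sketch, using hex literals with the _utf8mb4 introducer so that the result does not depend on the client's connection charset (the aliases are illustrative):

```sql
-- U+00F1 (ñ) encodes to the 2 bytes C3 B1 in UTF-8, so it also fits in utf8.
-- U+1F369 (🍩) encodes to the 4 bytes F0 9F 8D A9, which only utf8mb4 can store.
SELECT LENGTH(_utf8mb4 X'C3B1')          AS n_tilde_bytes,
       LENGTH(_utf8mb4 X'F09F8DA9')      AS donut_bytes,
       CHAR_LENGTH(_utf8mb4 X'F09F8DA9') AS donut_chars;
```

This should return 2, 4, and 1: the donut is a single character, but its 4-byte encoding lies outside what a 3-byte-per-character charset can represent.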

I tested with code point U+1F369 (i.e. the donut emoji: "🍩"):

CREATE TABLE test_table (
  utf8_col VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin
);

INSERT INTO test_table VALUES('ã'); # any utf8 character from U+0000 to U+FFFF

# Test against any utf8 character from U+0000 to U+FFFF in _utf8mb4
SELECT utf8_col,
       CHARSET(utf8_col) AS 'charset',
       COLLATION(utf8_col) AS 'collation',
       CONCAT(utf8_col, _utf8mb4'ñ') AS 'concat1',
       CHARSET(CONCAT(utf8_col, _utf8mb4'ñ')) AS 'charset_concat1'
FROM test_table;

# Test against any supplementary character from U+10000 to U+10FFFF in _utf8mb4
SELECT utf8_col,
       CHARSET(utf8_col) AS 'charset',
       COLLATION(utf8_col) AS 'collation',
       CONCAT(utf8_col, _utf8mb4'🍩') AS 'concat2',
       CHARSET(CONCAT(utf8_col, _utf8mb4'🍩')) AS 'charset_concat2'
FROM test_table;
# FAIL

SELECT utf8_col < _utf8mb4'ñ' FROM test_table; # Success


SELECT utf8_col < _utf8mb4'🍩' FROM test_table; # FAIL

See the above example code in action on dbfiddle.uk

The result set from the first SELECT is:

utf8_col    charset    collation    concat1    charset_concat1
ã           utf8       utf8_bin     ãñ         utf8

which shows that the utf8mb4 string was coerced into utf8, and everyone was happy. Was happy, until Mr Donut showed up in the second SELECT statement (you would think that donuts would make everything better, though, wouldn't you? I generally do 😸).

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange