Question

I have the following query in MySQL:

SELECT id FROM unicode WHERE `character` = 'a'

The table unicode contains each Unicode character along with an ID (its integer encoding value). Since the collation of the table is set to utf8_unicode_ci, I expected the above query to return only 97 (the letter 'a'). Instead, it returns 119 rows containing the IDs of many 'a'-like letters:

a A Ã ...

It seems to be ignoring both case and the multi-byte nature of the characters.

Any ideas?


Solution

As documented under Unicode Character Sets:

MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.

The full collation chart makes clear that, in this collation, most variations of a base letter are equivalent irrespective of their lettercase or accent/decoration.
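
One way to see this equivalence directly is to compare literals under each collation. This is a minimal sketch, assuming a server where the utf8 character set and these collation names are available. The hex literal X'C383' is the UTF-8 encoding of 'Ã', used so the result does not depend on the client's encoding.

```sql
-- Compare 'a' with 'A' and with 'Ã' under two collations of the utf8 character set.
SELECT
  _utf8'a' COLLATE utf8_unicode_ci = _utf8'A'      AS ci_a_vs_upper,  -- case differs
  _utf8'a' COLLATE utf8_unicode_ci = _utf8 X'C383' AS ci_a_vs_tilde,  -- case and accent differ
  _utf8'a' COLLATE utf8_bin        = _utf8'A'      AS bin_a_vs_upper,
  _utf8'a' COLLATE utf8_bin        = _utf8 X'C383' AS bin_a_vs_tilde;
```

Given the behaviour described in the question, the two utf8_unicode_ci columns should come back as 1 and the two utf8_bin columns as 0.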

If you want to only match exact letters, you should use a binary collation such as utf8_bin.
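
As a rough sketch of both ways to do that, assuming the column uses the utf8 character set: the first statement overrides the collation for a single query, and the second changes the column itself. The definition CHAR(1) NOT NULL is a guess, since the question does not show the table's schema; substitute the real definition.

```sql
-- Option 1: override the collation for this comparison only.
SELECT id FROM unicode WHERE `character` COLLATE utf8_bin = 'a';

-- Option 2: make the column binary-collated permanently.
-- CHAR(1) NOT NULL is an assumed definition, not taken from the question.
ALTER TABLE unicode
  MODIFY `character` CHAR(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
```

The per-query override leaves the schema untouched, but applying COLLATE to the column in the WHERE clause may prevent an index on that column from being used, so the permanent change is usually the better fit if exact matching is always what you want.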

OTHER TIPS

The collation of the table is part of the issue; MySQL with a _ci collation is treating all of those 'a's as variants of the same character.
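
Note that the comparison is governed by the collation of the column itself, which can differ from the table default. A quick way to check what is actually in effect, using the table name from the question:

```sql
-- The Collation column of the output shows the collation used by each column.
SHOW FULL COLUMNS FROM unicode;

-- The table's default character set and collation appear at the end of the statement.
SHOW CREATE TABLE unicode;
```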

Switching to a _cs collation will force the engine to distinguish 'a' from 'A', and 'á' from 'Á', but it may still treat 'a' and 'á' as the same character.
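
Whether a _cs collation exists at all depends on the character set and the server version, so it is worth listing what the server offers before relying on one. A small check, where the pattern is only an example:

```sql
-- List collations whose names start with utf8 and end in _cs.
-- An empty result means no such collation is available on this server.
SHOW COLLATION LIKE 'utf8%\_cs';
```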

If you need exact comparison semantics, completely disregarding the equivalency of similar characters, you can use the BINARY operator:

SELECT id FROM unicode WHERE BINARY `character` = 'a'

The ci in the collation means case-insensitive. Switch to a case-sensitive collation (cs) to get the results you're looking for.

Licensed under: CC-BY-SA with attribution