Question

I have a table of all defined Unicode characters (the character column) and their associated Unicode code points (the id column). I have the following query:

SELECT id FROM unicode WHERE `character` IN ('A', 'B', 'C')

While this query should return only 3 rows (id = 65, 66, 67), it instead returns 129 rows including the following IDs:

65 66 67 97 98 99 129 141 143 144 157 160 193 205 207 208 221 224 257 269 271 272 285 288 321 333 335 336 349 352 449 461 463 464 477 480 2049 2061 2063 2064 2077 2080 4161 4173 4175 4176 4189 4192 4929 4941 4943 4944 4957 4960 5057 5069 5071 5072 5085 5088 5121 5133 5135 5136 5149 5152 5953 5965 5967 5968 5984 6145 6157 6160 6176 8257 8269 8271 8272 8285 8288 9025 9037 9039 9040 9053 9056 9153 9165 9167 9168 9181 9184 9217 9229 9231 9232 9245 9248 10049 10061 10063 10064 10077 10080 10241 10253 10255 10256 10269 10272 12353 12365 12367 12368 12381 12384 13121 13133 13135 13136 13149 13152 13249 13261 13263 13264 13277 13280

I'm sure this must have something to do with multi-byte characters but I'm not sure how to fix it. Any ideas what's going on here?


Solution

String equality and ordering are governed by a collation. By default the collation is taken from the column, but you can override it per query with the COLLATE clause. For example, if your column is declared with charset utf8, you can use utf8_bin, a binary collation that treats A and à (and likewise A and a) as different:

SELECT id FROM unicode WHERE `character` COLLATE utf8_bin IN ('A', 'B', 'C')
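
To see which collation the column currently uses, and to check what a given collation does to a single comparison, something like the following may help. It is only a sketch: it assumes a MySQL server where the utf8 character set and its utf8_general_ci and utf8_bin collations are available, and it reuses the table name from the question.

```sql
-- List the columns of the table from the question; the Collation column
-- of the output shows which collation each string column is declared with.
SHOW FULL COLUMNS FROM unicode;

-- Compare the same two literals under two explicit collations.
-- The _utf8 introducer makes the literals independent of the connection charset.
SELECT
    _utf8'A' = _utf8'a' COLLATE utf8_general_ci AS equal_under_general_ci,
    _utf8'A' = _utf8'a' COLLATE utf8_bin        AS equal_under_bin;
```

Under the case-insensitive collation the first comparison should come back as 1, while under the binary collation the second should come back as 0, which mirrors the extra rows seen in the question.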

Alternatively, you could use the BINARY operator to convert character into a "binary string", which forces a byte-by-byte binary comparison. This is almost, but not quite, the same as using a binary collation:

SELECT id FROM unicode WHERE BINARY `character` IN ('A', 'B', 'C')

Update: I thought that the following would be equivalent, but it is not, because the column has a lower coercibility value than the constants, so the column's collation takes precedence. The binary string constants are converted into non-binary strings and then compared.

SELECT id FROM unicode WHERE `character` IN (_binary'A', _binary'B', _binary'C')
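
The coercibility values involved can be inspected with MySQL's COERCIBILITY() function. The query below is a hedged sketch that again assumes the table from the question and a utf8 column; the values in the comments are the ones given in the MySQL documentation, not output captured from this table.

```sql
-- COERCIBILITY() reports the value MySQL uses when two operands disagree
-- about collation; the operand with the lower value decides.
SELECT
    COERCIBILITY(`character`)                  AS column_value,     -- documented as 2 (implicit)
    COERCIBILITY('A')                          AS literal_value,    -- documented as 4 (coercible)
    COERCIBILITY(`character` COLLATE utf8_bin) AS explicit_collate  -- documented as 0 (explicit)
FROM unicode
LIMIT 1;
```

This is why an explicit COLLATE clause on either side overrides the column's declared collation, while a plain literal does not.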

OTHER TIPS

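You can try the following. Note that the column name has to be quoted with backticks; in single quotes it would be the string literal 'character' rather than the column:

SELECT id FROM unicode WHERE `character` IN (_utf8'A',_utf8'B',_utf8'C')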
Licensed under: CC-BY-SA with attribution