INSTR(str,substr) does not work when str contains 'é' or 'ë' and substr only 'e'

StackOverflow https://stackoverflow.com/questions/20923186

  •  24-09-2022

Question

In another post on stackoverflow, I read that INSTR could be used to order results by relevance.

My understanding is that col LIKE '%str%' and INSTR(col, 'str') behave the same. However, there seems to be a difference in how collations are handled.

CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

INSERT INTO users (name)
VALUES ('Joël'), ('René');

SELECT * FROM users WHERE name LIKE '%joel%'; -- 1 record returned
SELECT * FROM users WHERE name LIKE '%rene%'; -- 1 record returned
SELECT * FROM users WHERE INSTR(name, 'joel') > 0; -- 0 records returned
SELECT * FROM users WHERE INSTR(name, 'rene') > 0; -- 0 records returned
SELECT * FROM users WHERE INSTR(name, 'joël') > 0; -- 1 record returned
SELECT * FROM users WHERE INSTR(name, 'rené') > 0; -- 1 record returned

INSTR does seem to apply some of the collation rules, though: it finds ë in é, but not e in é.

SELECT INSTR('é', 'ë'), INSTR('é', 'e'), INSTR('e', 'ë');
-- returns 1, 0, 0

Am I missing something?

http://sqlfiddle.com/#!2/9bf21/6 (using mysql-version: 5.5.22)


Answer

This is due to bug 70767 on LOCATE() and INSTR(), which has been verified.

Though the INSTR() documentation states that the function is multi-byte safe, it doesn't seem to work, as you note, with collations like utf8_general_ci, which should be case- and accent-insensitive:

This function is multi-byte safe, and is case sensitive only if at least one argument is a binary string.

The bug report explains that MySQL does respect the collation here, but only when the matching substring also has the same length in bytes as the search string:

However, you can easily observe that they do not (completely) respect collations when looking for one string inside another one. It seems that what's happening is that MySQL looks for a substring which is collation-equal to the target which has exactly the same length in bytes as the target. This is only rarely true.
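The byte lengths involved can be checked directly, which may help explain why INSTR('é', 'ë') matched in the question while INSTR('é', 'e') did not. This is a minimal sketch; it assumes a utf8 connection character set and that the accented literals are stored as precomposed characters (not as a base letter plus a combining mark). The column aliases are illustrative only.

```sql
-- Assumption: connection character set is utf8 and the literals are precomposed.
SELECT LENGTH('e')      AS e_bytes,            -- LENGTH() counts bytes
       LENGTH('é')      AS e_acute_bytes,
       LENGTH('ë')      AS e_diaeresis_bytes,
       CHAR_LENGTH('é') AS e_acute_chars;      -- CHAR_LENGTH() counts characters
-- Under those assumptions this returns 1, 2, 2, 1.
```

'é' and 'ë' are both two bytes long, so a same-byte-length comparison can succeed between them, whereas 'e' is a single byte and never lines up with either.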

To adapt the report's example, if you create the following table:

create table t ( needle varchar(10), haystack varchar(10)
                  ) COLLATE=utf8_general_ci;
insert into t values ("A", "a"), ("A", "XaX");
insert into t values ("A", "á"), ("A", "XáX");
insert into t values ("Á", "a"), ("Á", "XaX");
insert into t values ("Å", "á"), ("Å", "XáX");

then run this query, you can see the same behaviour demonstrated:

select needle
     , haystack
     , needle=haystack as `=`
     , haystack LIKE CONCAT('%',needle,'%') as `like`
     , instr(haystack, needle) as `instr` -- INSTR(str, substr): the string to search comes first
  from t;
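Until the bug is fixed, one possible workaround for the original goal (ordering results by relevance) is to rely on LIKE alone, since LIKE returned the expected rows in the question. The following is only a sketch under that assumption: it uses the users table from the question and ranks prefix matches ahead of other matches instead of using the exact match position, so it is coarser than an INSTR-based ordering.

```sql
-- Assumption: the `users` table from the question, with name in utf8_unicode_ci.
SELECT id, name
  FROM users
 WHERE name LIKE '%joel%'            -- collation-aware filter, as in the question
 ORDER BY (name LIKE 'joel%') DESC,  -- evaluates to 1 for a match at the start, else 0
          CHAR_LENGTH(name),         -- then prefer shorter names
          name;
```

In MySQL a comparison such as name LIKE 'joel%' evaluates to 1 or 0, which is why it can be used directly as a sort key.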


Licensed under: CC-BY-SA with attribution