German umlauts and UTF8 collations, revisited

https://stackoverflow.com/questions/20953587

24-09-2022
|

Domanda

As I am sure a lot of people here are aware, having to deal with German umlauts and UTF8 collations can be problematic to say the least. Stuff like a = ä, o = ö, u = ü is not only capable of affecting the sort order of the results but the actual results as well. Here is an example which clearly demonstrates how things can go wrong by simply trying to make a distinction between a singular and plural version of a noun (Bademantel - singular, Bademäntel - plural).

CREATE TABLE keywords (
    id INT (11) PRIMARY KEY AUTO_INCREMENT,
    keyword VARCHAR (255) NOT NULL
) ENGINE = MyISAM DEFAULT CHARACTER
SET = utf8 COLLATE = utf8_unicode_ci;

INSERT INTO keywords (keyword) VALUES ('Bademantel'), ('Bademäntel');

SELECT * FROM keywords WHERE keyword LIKE ('%Bademäntel%');

Results should be

+----+------------+
| id | keyword    |
+----+------------+
|  1 | Bademäntel |
+----+------------+

yet with utf8_unicode_ci the output is

+----+------------+
| id | keyword    |
+----+------------+
|  1 | Bademantel |
|  2 | Bademäntel |
+----+------------+

which is clearly not the required result.

The actual problem is tied for my current project. It involves writing a keyword parser which is basically supposed to replace every occurrence of a keyword on the website with a link to the appropriate product page. In order to avoid unnecessary waste of resources only distinct keywords are fetched but using either

SELECT keyword FROM keywords GROUP BY keyword ORDER BY LENGTH(keyword) DESC

SELECT DISTINCT keyword FROM keywords ORDER BY LENGTH(keyword) DESC

will result in failing to process (link) all the non-umlaut versions of the words simply because they are not fetched during the query (i.e. all the keywords containing Bademäntel will be fetched but Bademantel will be omitted).

Now I realize that I have a couple of options to resolve this problem.

1) Use utf8_swedish_ci for the keywords table or during the queries which would effectively save me from having to modify a lot of existing code.

SELECT DISTINCT keyword COLLATE utf8_swedish_ci AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;

Unfortunately I am not that reluctant to abandon utf8_unicode_ci because a) it offers a really nice feature of sorting "Eszett" (ss and ß are considered the same), b) somehow it simply feels wrong to use a Swedish collation to handle German related stuff.

2) Modify the existing code to make use of utf8_bin.

SELECT DISTINCT keyword COLLATE utf8_bin AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;

This works as intended but it has a nasty drawback that all comparison is case-sensitive which means that if I decided to rely on utf8_bin as a solution for the problem I would have a hard time doing case-insensitive queries like LIKE('%Mäntel%') which would most definitely omit records like Bademäntel.

I know that this question pops every now and then on SO but some of the answers are now pretty old and I just want to know if there is some other solution that might have emerged in the meantime. I mean, I really can't get around the thought that a simple collation is allowed to completely change the results of a query. Sorting order yes, but the results itself?

Sorry for a bit longer post and thanks in advance for any kind of advice or comment.

Soluzione

For anyone else encountering this problem it's worth noting that since MySQL 5.6 there is official support for utf8_german2_ci collation which solves all of the above problems. Better late, than never I guess.

Altri suggerimenti

You could use a binary check using the keyword WHERE BINARY keyword = 'Bademantel'. The result would be the expected one.

Check out this sqlfiddle, which shows this:

SELECT * FROM stackoverflow WHERE BINARY keyword = 'Bademantel';

| id |    keyword |
|----|------------|
|  1 | Bademantel |

SELECT * FROM stackoverflow WHERE keyword = 'Bademantel';

| id |    keyword |
|----|------------|
|  1 | Bademantel |
|  2 | Bademäntel |

More about this behavior here: What effects does using a binary collation have? and here: What is the best MySQL collation for German language

So for applications with german umlauts or french grave accent or special chars in czech/polish language you have to decide which behavior is the best for your application.

Most cases will be ok with utf8_general_ci but sometimes you have to use utf8_bin for cases like your Bademantel.

The string comparison isn't bad at all, utf8_general_ci will help you sometimes. If you have saved a string like Straße - and you can search for Strasse which will also return Straße.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow