MySQL GROUP BY not working correctly with umlaut characters?

https://stackoverflow.com/questions/16716722

30-05-2022

Question

I noticed a problem with GROUP BY in a query I am currently trying to debug. I have a DB table with the following structure (simplified from the real schema):

CREATE TABLE IF NOT EXISTS `product_variants` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `product_id` int(11) unsigned NOT NULL DEFAULT '0',
  `pid_merchant` varchar(50) NOT NULL,
  `checksum` char(32) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `checksum` (`checksum`),
  KEY `product_id` (`product_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

In this table, I have the following 2 rows (among many other millions):

INSERT INTO `product_variants` (`id`, `product_id`, `pid_merchant`, `checksum`) VALUES
(525555236, 628702710, 'ARTüöäß111', 'af5334b1193bf171580c70813ac83327'),
(525555241, 628702710, 'ARTüöäß222', 'cfe50fd9c3ca29fd957b839892313f82');

The query I'm currently debugging attempts to find duplicate entries in this table based on pid_merchant, in a very simple manner:

SELECT count(*), pv.* FROM product_variants pv WHERE pv.pid_merchant != '' GROUP BY pv.pid_merchant HAVING count(*) > 1

My problem is that these two rows are grouped together, even though the actual pid_merchant values are different: one ends in 111, the other in 222. Does anyone know how to approach this issue? I already tried grouping by MD5() and HEX(), changing the collation to latin1_german2_ci, forcing binary or utf8 conversion, and many other things, pretty much all I could think of.

Another weird thing is that it seems to confuse the values of Y and Ü (capital U with umlaut) while grouping (e.g. ABC-Y and ABC-Ü are considered identical when grouping).

The server is running MySQL 5.5 on Ubuntu x64:

mysqld  Ver 5.5.29-0ubuntu0.12.04.2-log for debian-linux-gnu on x86_64 ((Ubuntu))

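Answer

This is not an umlaut (or accents generally) problem.

The cause is how MySQL evaluates GROUP BY: it accepts a non-standard form in which selected columns need not appear in the GROUP BY clause, and the values returned for those columns come from an arbitrary row of each group. The standard SQL form is: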

SELECT count(*), pv.product_id, pv.pid_merchant
FROM product_variants pv
WHERE pv.pid_merchant != ''
GROUP BY pv.product_id, pv.pid_merchant
HAVING count(*) > 1

All non-aggregated columns in the SELECT list should appear in the GROUP BY clause.
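If the goal is to see every full row that belongs to a duplicate group (which is what selecting pv.* suggests), one possible approach is to compute the duplicated pid_merchant values in a derived table and join back to the base table. This is only a sketch: it assumes duplicates are defined by pid_merchant alone, and the alias names d and cnt are made up for illustration.

```sql
-- Inner query: standard-conforming GROUP BY that only selects the
-- grouped column and an aggregate.
-- Outer query: fetch all columns of every row in each duplicate group.
SELECT pv.*, d.cnt
FROM product_variants pv
JOIN (
    SELECT pid_merchant, COUNT(*) AS cnt
    FROM product_variants
    WHERE pid_merchant != ''
    GROUP BY pid_merchant
    HAVING COUNT(*) > 1
) d ON d.pid_merchant = pv.pid_merchant
ORDER BY pv.pid_merchant, pv.id;
```

Because every row of a group is listed, you can compare the actual pid_merchant values side by side instead of relying on whichever row MySQL happens to return for the group. Note that the join compares pid_merchant under the same column collation that GROUP BY uses.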

MySQL has "useful" extensions that remove this strict requirement, and they often lead to confusing results like this one.
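To have MySQL reject the non-standard form instead of silently accepting it, the ONLY_FULL_GROUP_BY SQL mode can be enabled. The sketch below turns it on for the current session only, so other connections are unaffected; whether to enable it server-wide is a separate decision.

```sql
-- Append ONLY_FULL_GROUP_BY to whatever modes the session already has.
SET SESSION sql_mode = CONCAT(@@SESSION.sql_mode, ',ONLY_FULL_GROUP_BY');

-- Check the resulting mode list.
SELECT @@SESSION.sql_mode;

-- With the mode active, the original query should now fail with an error,
-- because pv.* includes columns that are neither aggregated nor grouped.
SELECT count(*), pv.*
FROM product_variants pv
WHERE pv.pid_merchant != ''
GROUP BY pv.pid_merchant
HAVING count(*) > 1;
```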

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow