Find groups with exactly one value: COUNT(DISTINCT x) = 1 vs MIN(x) = MAX(x)

https://dba.stackexchange.com/questions/224998

18-01-2021
|

Вопрос

Given this data:

gid | val
1   | a
1   | a
1   | a
2   | b
3   | x
3   | y
3   | z

the following queries return groups (gid) that contain exactly one distinct value (val):

SELECT gid FROM t GROUP BY gid HAVING MIN(val) = MAX(val)
SELECT gid FROM t GROUP BY gid HAVING COUNT(DISTINCT val) = 1

People seem to suggest that the first variant would be faster (if assuming appropriate indexes exist then looking up MIN and MAX would be faster than counting all values). Is that a fact or a myth.

Решение

The short version is that you should expect MIN(val) = MAX(val) to be better for rowstore queries in all cases and COUNT(DISTINCT val) = 1 to be better for columnstore queries when val is a string column.

I put 6.4 million rows into a table to test with. The data has a roughly similar data distribution to your sample data:

DROP TABLE IF EXISTS dbo.t224998_2;

CREATE TABLE dbo.t224998_2 (
gid INT,
val VARCHAR(20)
);

INSERT INTO dbo.t224998_2 WITH (TABLOCK)
SELECT 
CASE WHEN q.RN % 7 <= 2 THEN q.RN - q.RN % 7
WHEN q.RN % 7 >= 4 THEN q.RN - q.RN % 7 + 4
ELSE q.RN END
, CHAR(65 + q.RN % 7)
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 RN 
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q;

Here are the first 14 rows in the table:

Here are the query plans when running your pair of queries without any indexes:

The query with MIN and MAX only has a single hash aggregate operator. The COUNT(DISTINCT) query has two. For the second query, the rightmost operator only keeps the distinct rows and the leftmost operator performs the count. Not surprisingly, the DISTINCT query is about twice as slow.

Creating the following index makes the two queries more competitive:

CREATE INDEX IX1 ON dbo.t224998_2 (gid, val);

Now the plans look like this:

Now the distinct query is about 25% slower. It should be stressed here that for none of these plans are the MIN and MAX values "looked up". You're querying the table without a filter. SQL Server is going to scan all of the rows of the index or table. The index is useful because it can be scanned in key order and that allows the aggregate to be calculated more efficiently. For the MIN(val) = MAX(val) query, the stream aggregate reads the ordered rows and keeps track of the min and max value seen for each unique value of gid. It passes the row along to the next operator when it finds a new value for gid. At no point is an index seek performed to get the minimum or maximum value. You can write a query to do that but it's somewhat convoluted.

The COUNT(DISTINCT) query again splits the work up into two aggregates. Both aggregates take advantage of the ordering of the index. The rightmost one removes duplicate rows and the leftmost one performs the count.

If I change the table to a columnstore one without any nonclustered indexes the second query becomes the winner. Here are the plans:

The MIN(val) = MAX(val) performs all of the aggregate work in row mode. Aggregates that returns string columns are not supported by batch mode. This restriction is documented by Microsoft. COUNT(DISTINCT val) is supported by batch mode and as a result all of the aggregate work is performed in batch mode. That query is over twice as fast as the MIN(val) = MAX(val) option.

Лицензировано под: CC-BY-SA с атрибуция

Не связан с dba.stackexchange