Why is an aggregate query significantly faster with a GROUP BY clause than without one?

https://dba.stackexchange.com/questions/15295

16-10-2019
|

Pergunta

I'm just curious why an aggregate query runs so much faster with a GROUP BY clause than without one.

For example, this query takes almost 10 seconds to run

SELECT MIN(CreatedDate)
FROM MyTable
WHERE SomeIndexedValue = 1

While this one takes less than a second

SELECT MIN(CreatedDate)
FROM MyTable
WHERE SomeIndexedValue = 1
GROUP BY CreatedDate

There is only one CreatedDate in this case, so the grouped query returns the same results as the ungrouped one.

I noticed the execution plans for the two queries are different - The second query uses Parallelism while the first query does not.

Query1 Execution Plan Query2 Execution Plan

Is it normal for SQL server to evaluate an aggregate query differently if it doesn't have a GROUP BY clause? And is there something I can do to improve the performance of the 1st query without using a GROUP BY clause?

Edit

I just learned I can use OPTION(querytraceon 8649) to set the cost overhead of parallelism to 0, which makes makes the query use some parallelism and reduces the runtime to 2 seconds, although I don't know if there's any downsides to using this query hint.

SELECT MIN(CreatedDate)
FROM MyTable
WHERE SomeIndexedValue = 1
OPTION(querytraceon 8649)

enter image description here

I'd still prefer a shorter runtime since the query is meant to populate a value upon user selection, so should ideally be instantaneous like the grouped query is. Right now I'm just wrapping my query, but I know that's not really an ideal solution.

SELECT Min(CreatedDate)
FROM
(
    SELECT Min(CreatedDate) as CreatedDate
    FROM MyTable WITH (NOLOCK) 
    WHERE SomeIndexedValue = 1
    GROUP BY CreatedDate
) as T

Edit #2

In response to Martin's request for more info:

Both CreatedDate and SomeIndexedValue have a separate non-unique, non-clustered index on them. SomeIndexedValue is actually a varchar(7) field, even though it stores a numeric value that points to the PK (int) of another table. The relationship between the two tables is not defined in the database. I am not supposed to change the database at all, and can only write queries that query data.

MyTable contains over 3 million records, and each record is assigned a group it belongs to (SomeIndexedValue). The groups can be anywhere from 1 to 200,000 records

Solução

It looks like it is probably following an index on CreatedDate in order from lowest to highest and doing lookups to evaluate the SomeIndexedValue = 1 predicate.

When it finds the first matching row it is done, but it may well be doing many more lookups than it expects before it finds such a row (it assumes the rows matching the predicate are randomly distributed according to date.)

See my answer here for a similar issue

The ideal index for this query would be one on SomeIndexedValue, CreatedDate. Assuming that you can't add that or at least make your existing index on SomeIndexedValue cover CreatedDate as an included column then you could try rewriting the query as follows

SELECT MIN(DATEADD(DAY, 0, CreatedDate)) AS CreatedDate
FROM MyTable
WHERE SomeIndexedValue = 1

to prevent it from using that particular plan.

Outras dicas

Can we control for MAXDOP and choose a known table, e.g., AdventureWorks.Production.TransactionHistory?

When I repeat your setup using

--#1
SELECT MIN(TransactionDate) 
FROM AdventureWorks.Production.TransactionHistory
WHERE TransactionID = 100001 
OPTION( MAXDOP 1) ;

--#2
SELECT MIN(TransactionDate) 
FROM AdventureWorks.Production.TransactionHistory
WHERE TransactionID = 100001 
GROUP BY TransactionDate
OPTION( MAXDOP 1) ;
GO

the costs are identical.

As an aside, I would expect (make it happen) an index seek on your indexed value; otherwise, you are likely going to see hash matches instead of stream aggregates. You can improve performance with non-clustered indexes that include the values that you are aggregating and or create an indexed view that defines your aggregates as columns. Then you would be hitting a clustered index, which contains your aggregations, by an Indexed Id. In SQL Standard, you can just create the view and use the WITH (NOEXPAND) hint.

An example (I do not use MIN, since it does not work in indexed views):

USE AdventureWorks ;
GO

-- Covering Index with Include
CREATE INDEX IX_CoverAndInclude
ON Production.TransactionHistory(TransactionDate) 
INCLUDE (Quantity) ;
GO

-- Indexed View
CREATE VIEW dbo.SumofQtyByTransDate
    WITH SCHEMABINDING
AS
SELECT 
      TransactionDate 
    , COUNT_BIG(*) AS NumberOfTransactions
    , SUM(Quantity) AS TotalTransactions
FROM Production.TransactionHistory
GROUP BY TransactionDate ;
GO

CREATE UNIQUE CLUSTERED INDEX SumofAllChargesIndex 
    ON dbo.SumofQtyByTransDate (TransactionDate) ;  
GO


--#1
SELECT SUM(Quantity) 
FROM AdventureWorks.Production.TransactionHistory 
WITH (INDEX(0))
WHERE TransactionID = 100001 
OPTION( MAXDOP 1) ;

--#2
SELECT SUM(Quantity)  
FROM AdventureWorks.Production.TransactionHistory 
WITH (INDEX(IX_CoverAndInclude))
WHERE TransactionID = 100001 
GROUP BY TransactionDate
OPTION( MAXDOP 1) ;
GO 

--#3
SELECT SUM(Quantity)  
FROM AdventureWorks.Production.TransactionHistory
WHERE TransactionID = 100001 
GROUP BY TransactionDate
OPTION( MAXDOP 1) ;
GO

In my opinion the reason for the problem is that the sql server optimizer is not looking for the BEST plan rather it looks for a good plan, as is evident from the fact that after forcing parallelism the query executed much faster, something that the optimizer had not done on it's own.

I have also seen many situations where rewriting the query in a different format was the difference between parallelizing (for example although most articles on SQL recommend parameterizing I have found it to cause sometimes noy to parallelize even when the parameters sniffed were the same as a non- parallelized one, or combining two queries with UNION ALL can sometimes eliminate parallelization).

As such the correct solution might be by trying different ways of writing the query, such as trying temp tables, table variables, cte, derived tables, parameterizing, and so on, and also playing with the indexes, indexed views, or filtered indexes in order to get the best plan.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a dba.stackexchange