Understanding when to remove Order By or Sort operator from Query/plan

https://dba.stackexchange.com/questions/250602

15-02-2021

Question

From what I have read, my understanding is that one should avoid an unnecessary ORDER BY in SQL queries when the index supporting the query has its key columns sorted the same way.

Consider the test schema below:

CREATE PARTITION FUNCTION DemoPartitionFunction (datetime)
AS RANGE RIGHT
FOR VALUES (DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), -7),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), -6),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), -5),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), -4),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), -3),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), -2),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), -1),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), 0),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), 1),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), 2),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), 3),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), 4),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), 5),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), 6),
            DATEADD(dd, DATEDIFF(dd, 0, GETUTCDATE()), 7));

CREATE PARTITION SCHEME DemoPartitionScheme
AS PARTITION DemoPartitionFunction
ALL TO ([DEFAULT]);

CREATE TABLE [dbo].[DemoPartitionedTable](
    [DemoID] [int] IDENTITY(1,1) NOT NULL,
    [SomeData] [sysname] NOT NULL,
    [Lastseen] [datetime] NULL,
    [DataKey1] [char] NOT NULL,
    [DataKey2] [char] NOT NULL,
    [RandomColumn] [nvarchar] NOT NULL,
    [CaptureDate] [datetime] NULL,
    CONSTRAINT [PK_DemoPartitionedTable] UNIQUE NONCLUSTERED 
    (
        [DemoID] ASC,
        [CaptureDate] ASC
    )
    ON DemoPartitionScheme(CaptureDate)
) ON DemoPartitionScheme(CaptureDate);

If I run the query with ORDER BY:

SELECT [DemoID], [CaptureDate]
FROM [dbo].[DemoPartitionedTable]
WHERE CaptureDate>=CONVERT(datetime, '20190912', 112) AND
CaptureDate < CONVERT(datetime, '20191013', 112)
ORDER BY 
DemoID,
CaptureDate

Below are the STATISTICS IO and STATISTICS TIME results:

(95703 rows affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 117, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'DemoPartitionedTable'. Scan count 16, logical reads 330, physical reads 9, read-ahead reads 262, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

SQL Server Execution Times: CPU time = 297 ms, elapsed time = 750 ms.

As per my understanding, in the above scenario we do not need the ORDER BY, as the index already has those columns sorted. So why do I see a bad plan for the query below?

If I run the query with the sorting removed:

SELECT [DemoID], [CaptureDate]
FROM 
[dbo].[DemoPartitionedTable]
WHERE CaptureDate>=CONVERT(datetime, '20190912', 112) AND
CaptureDate < CONVERT(datetime, '20191013', 112)
--ORDER BY 
--DemoID,
--CaptureDate

(95703 rows affected)
Table 'DemoPartitionedTable'. Scan count 16, logical reads 330, physical reads 134, read-ahead reads 269, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

SQL Server Execution Times: CPU time = 78 ms, elapsed time = 2791 ms.
SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.

The plan does not have an expensive sort, but the stats got worse. Why?

Solution

"As per my understanding, in the above scenario we do not need the ORDER BY, as the index already has those columns sorted."

That understanding is incorrect, and that plan shows one reason why. A parallel index scan doesn't output rows in index order, as each thread reads at a different location in the sort order. You can't expect rows in any particular order without an ORDER BY clause.
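One way to take parallelism out of the picture is to rerun the question's query with a serial-plan hint. This is only an experiment sketch: it assumes the DemoPartitionedTable from the question exists and is populated, and the hint is added here for illustration, not something the original poster used.

```sql
-- Experiment sketch: MAXDOP 1 forces a serial plan for this statement only.
-- Assumes the DemoPartitionedTable from the question exists and holds data.
SELECT [DemoID], [CaptureDate]
FROM [dbo].[DemoPartitionedTable]
WHERE CaptureDate >= CONVERT(datetime, '20190912', 112)
  AND CaptureDate <  CONVERT(datetime, '20191013', 112)
ORDER BY DemoID, CaptureDate   -- the ORDER BY is the only guarantee of result order
OPTION (MAXDOP 1);             -- restrict this query to a single thread
```

Comparing the actual execution plans with and without the hint shows whether a Sort operator is still present once the scan is serial. Either way, only the ORDER BY clause guarantees the order of the result set.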

The index also does not have the rows already sorted in the requested order. The ORDER BY is on (DemoID, CaptureDate), but the index is partitioned by CaptureDate. Within each partition the rows are ordered by DemoID, so if the query crosses a partition boundary, the DemoID values start over.

For example, these rows in (DemoID, CaptureDate) order:

(1,'20190912'),(2,'20190913'),(3,'20190912'),(4,'20190913')

may be stored over more than one partition:

partition N
--------------
(1,'20190912')
(3,'20190912')

partition N+1
--------------
(2,'20190913')
(4,'20190913')

So even if the plan used a single thread to scan the index, a downstream sort would be required.
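The effect can be reproduced without the real table. The sketch below uses the four sample rows from the example above and a made-up PartitionNo column (one partition per CaptureDate day, mirroring the daily partition function); the CTE name and column names are illustrative assumptions.

```sql
-- The four sample rows, tagged with a hypothetical partition number
-- (one partition per day, as with the daily boundaries in the question).
WITH SampleRows AS (
    SELECT DemoID,
           CaptureDate,
           DENSE_RANK() OVER (ORDER BY CaptureDate) AS PartitionNo
    FROM (VALUES (1, CONVERT(datetime, '20190912', 112)),
                 (2, CONVERT(datetime, '20190913', 112)),
                 (3, CONVERT(datetime, '20190912', 112)),
                 (4, CONVERT(datetime, '20190913', 112))
         ) AS v (DemoID, CaptureDate)
)
SELECT PartitionNo, DemoID, CaptureDate
FROM SampleRows
-- Simulates reading partition by partition, each in index key order.
ORDER BY PartitionNo, DemoID, CaptureDate;
```

Read this way, the DemoID values come back as 1, 3, 2, 4 rather than 1, 2, 3, 4, which is why a sort is still needed to satisfy ORDER BY DemoID, CaptureDate. Against the real table, $PARTITION.DemoPartitionFunction(CaptureDate) returns the actual partition number of each row.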

"The plan does not have an expensive sort, but the stats got worse. Why?"

No, the stats got better. The second query has the same 330 logical reads and used only 78 ms of CPU time, versus 297 ms for the first query. The longer elapsed time is related to parallelism and physical I/O (134 physical reads versus 9), which is not an attribute of the query plan. Rather, it depends on the state of the page cache when you run the query.
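To compare elapsed times fairly, both queries need to start from the same cache state. A minimal sketch for a cold-cache run follows; it assumes a disposable test server, since DBCC DROPCLEANBUFFERS empties the buffer pool for the whole instance and requires elevated permissions.

```sql
-- Run only on a test server: this empties the buffer pool for the whole instance.
CHECKPOINT;              -- write dirty pages of the current database to disk
DBCC DROPCLEANBUFFERS;   -- drop clean pages so the next query starts with a cold cache

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Query under test (repeat the whole script for the ORDER BY variant).
SELECT [DemoID], [CaptureDate]
FROM [dbo].[DemoPartitionedTable]
WHERE CaptureDate >= CONVERT(datetime, '20190912', 112)
  AND CaptureDate <  CONVERT(datetime, '20191013', 112);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

A gentler alternative is to run each query twice and compare only the second runs, so that both read from a warm cache and the physical reads drop out of the comparison.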

Licensed under: CC-BY-SA with attribution
