Using TOP and getting different result sets

https://dba.stackexchange.com/questions/76220

09-12-2020
|

Domanda

I am trying to write an efficient query for deleting chunks of data. To this end I hoped to avoid an index scan by using the primary key to get the oldest records. However I'm seeing some unexpected results returned.

I hoped this

SELECT TOP 15 OrderID FROM [Order]

Would give me the oldest 15 records because I could rely on the primary key incrementing and therefore the order of storage would be low to high in the table.

However, this returns a different result set

SELECT TOP 15 OrderID FROM [Order] ORDER BY DateCreated ASC

Which seems to be a more accurate but more expensive way of getting the result I need.

Confusingly, this

SELECT TOP 15 * FROM [Order]

Gives a different set of OrderID s (PK) to this

SELECT TOP 15 OrderID FROM [Order]

I understand that http://msdn.microsoft.com/en-gb/library/ms189463.aspx explains that order cannot be guaranteed without an ORDER BY clause but expected the PK to order for me and can't explain the differences between the final two select clauses.

Soluzione

Take a look at the plans. When you use SELECT * it probably uses the clustered index, and when you only want one column, maybe there is a skinnier index to use.

Don't "expect" a certain order. If you don't tell SQL Server how to order, then it will use the most efficient way possible, and this can change due to probably more than 20 factors.

If you want a certain order, SAY SO. Please read #3 here:

T-SQL Tuesday #56 : SQL Server Assumptions

Also, this post by Michael Swart may be an interesting read:

Without ORDER BY, You Can’t Depend On the Order of Results

If you want your second query to be more efficient, you can consider creating an index on DateCreated (you may want to include OrderID - not sure of the present index structure).

For your actual goal of deleting n rows at a time, oldest first, and assuming OrderID is an IDENTITY column (so order create date should roughly align with that), why not use this approach (based on this great blog post, also by Michael Swart):

-- pick a datetime for the newest row you want to delete
-- let's say you want to delete all orders before Jan 1 2014:

SELECT @MaxOrderID = MAX(OrderID)
  FROM dbo.[Order] -- terrible table name, also always use dbo prefix
  WHERE DateCreated < '20140101';

DECLARE @BatchSize INT = 1000,
        @LargestOrderProcessed INT = -1,
        @NextBatchMax INT,
        @RC INT = 1;

WHILE (@RC > 0)
BEGIN
  SELECT TOP (@BatchSize) @NextBatchMax = OrderID
    FROM dbo.[Order]
    WHERE OrderID > @LargestOrderProcessed
    AND OrderID <= @MaxOrderID
    ORDER BY OrderID;

  DELETE dbo.[Order]
    WHERE OrderID > @LargestOrderProcessed
    AND OrderID <= @NextBatchMax;

  SET @RC = @@ROWCOUNT;
  SET @LargestOrderProcessed = @NextBatchMax;
END

In order to minimize the impact on the log, you may want to add some additional logic in there, from my blog post Break large delete operations into chunks. As for the dbo prefix, see Bad habits to kick : Avoiding the schema prefix.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a dba.stackexchange