Using TOP and getting different result sets
-
09-12-2020 - |
質問
I am trying to write an efficient query for deleting chunks of data. To this end I hoped to avoid an index scan by using the primary key to get the oldest records. However I'm seeing some unexpected results returned.
I hoped this
SELECT TOP 15 OrderID FROM [Order]
Would give me the oldest 15 records because I could rely on the primary key incrementing and therefore the order of storage would be low to high in the table.
However, this returns a different result set
SELECT TOP 15 OrderID FROM [Order] ORDER BY DateCreated ASC
Which seems to be a more accurate but more expensive way of getting the result I need.
Confusingly, this
SELECT TOP 15 * FROM [Order]
Gives a different set of OrderID s (PK) to this
SELECT TOP 15 OrderID FROM [Order]
I understand that http://msdn.microsoft.com/en-gb/library/ms189463.aspx explains that order cannot be guaranteed without an ORDER BY clause but expected the PK to order for me and can't explain the differences between the final two select clauses.
解決
Take a look at the plans. When you use SELECT *
it probably uses the clustered index, and when you only want one column, maybe there is a skinnier index to use.
Don't "expect" a certain order. If you don't tell SQL Server how to order, then it will use the most efficient way possible, and this can change due to probably more than 20 factors.
If you want a certain order, SAY SO. Please read #3 here:
Also, this post by Michael Swart may be an interesting read:
If you want your second query to be more efficient, you can consider creating an index on DateCreated
(you may want to include OrderID
- not sure of the present index structure).
For your actual goal of deleting n
rows at a time, oldest first, and assuming OrderID
is an IDENTITY
column (so order create date should roughly align with that), why not use this approach (based on this great blog post, also by Michael Swart):
-- pick a datetime for the newest row you want to delete
-- let's say you want to delete all orders before Jan 1 2014:
SELECT @MaxOrderID = MAX(OrderID)
FROM dbo.[Order] -- terrible table name, also always use dbo prefix
WHERE DateCreated < '20140101';
DECLARE @BatchSize INT = 1000,
@LargestOrderProcessed INT = -1,
@NextBatchMax INT,
@RC INT = 1;
WHILE (@RC > 0)
BEGIN
SELECT TOP (@BatchSize) @NextBatchMax = OrderID
FROM dbo.[Order]
WHERE OrderID > @LargestOrderProcessed
AND OrderID <= @MaxOrderID
ORDER BY OrderID;
DELETE dbo.[Order]
WHERE OrderID > @LargestOrderProcessed
AND OrderID <= @NextBatchMax;
SET @RC = @@ROWCOUNT;
SET @LargestOrderProcessed = @NextBatchMax;
END
In order to minimize the impact on the log, you may want to add some additional logic in there, from my blog post Break large delete operations into chunks. As for the dbo
prefix, see Bad habits to kick : Avoiding the schema prefix.