Why does a many to one merge join change the sort order of the data set?
Domanda
I have a query that calculates row_Number()
. The table has a clustered index on the same columns (and order) as the partitioning and ordering for the row number.
When using a merge join (many to one), a sort is required, even though the clustered index is in the correct order. Removing the join also removes the sort operation.
The clustered index that should power the row_number calculation:
create clustered index [ClusteredIndex_e060df3fbf464a8eb9b6ea5d46a9a5f5] on [dbo].[log1]
(
[client] asc,
[orderId] asc,
[campaign] asc,
[id] asc,
[DateStamp] asc
)
create clustered index [ClusteredIndex_dd0ee53e050d436cba2cab7c678a39e5] on [dbo].[LiveReference]
(
[client] asc,
[orderId] asc,
[campaign] asc
)
The query:
with cr as
(
select distinct client, orderId,campaign
from LiveReference
)
select e.[DateStamp]
,e.[campaign]
,e.[client]
,e.[orderId]
,e.[ad]
,e.[id]
,e.[source]
,row_number() over (partition by e.[client] ,
e.[orderId] ,
e.[campaign] ,
e.[id]
order by e.[DateStamp]) as num
from [dbo].[log1] e
inner join cr on
e.client = cr.client
and e.campaign = cr.campaign
and e.orderId = cr.orderId
Removing the join also removes the sort:
select e.[DateStamp]
,e.[campaign]
,e.[client]
,e.[orderId]
,e.[ad]
,e.[id]
,e.[source]
,row_number() over (partition by e.[client] ,
e.[orderId] ,
e.[campaign] ,
e.[id]
order by e.[DateStamp]) as num
from [dbo].[log1] e
(I know this also removes the filtering performed by the join, but this doesn't explain why excluding those rows changes the order)
Why would the results of a sorted join then not be in the correct order?
Soluzione
Generally speaking, merge join (including merge join concatenation) only preserves the sort order of the join keys.
The merge join keys are client, campaign, orderId
. The required input sort order for the window function is client, orderId, campaign, id , datestamp
.
The merge join cannot therefore provide the sort order required by your window function. You could avoid the sort with a nested loops join (e.g. using a hint).
I wrote about the details in Avoiding Sorts with Merge Join Concatenation.
Altri suggerimenti
The query can be rewritten as,
select e.[DateStamp]
,e.[campaign]
,e.[client]
,e.[orderId]
,e.[ad]
,e.[id]
,e.[source]
,row_number() over (partition by e.[client] ,
e.[orderId] ,
e.[campaign] ,
e.[id]
order by e.[DateStamp]) as num
from [dbo].[log1] e
where exists(select 1 from LiveReference
cr where
e.client = cr.client
and e.campaign = cr.campaign
and e.orderId = cr.orderId
)
So no need of CTE
and distinct