Why does a many to one merge join change the sort order of the data set?

https://dba.stackexchange.com/questions/203226

29-12-2020
|

Question

I have a query that calculates row_Number(). The table has a clustered index on the same columns (and order) as the partitioning and ordering for the row number.

When using a merge join (many to one), a sort is required, even though the clustered index is in the correct order. Removing the join also removes the sort operation.

The clustered index that should power the row_number calculation:

create clustered index [ClusteredIndex_e060df3fbf464a8eb9b6ea5d46a9a5f5] on [dbo].[log1]
(
    [client] asc,
    [orderId] asc,
    [campaign] asc,
    [id] asc,
    [DateStamp] asc
)

create clustered index [ClusteredIndex_dd0ee53e050d436cba2cab7c678a39e5] on [dbo].[LiveReference]
(
    [client] asc,
    [orderId] asc,
    [campaign] asc
)

The query:

    with cr as 
(
    select distinct client, orderId,campaign
    from LiveReference
)

select e.[DateStamp]
  ,e.[campaign]
  ,e.[client]
  ,e.[orderId]
  ,e.[ad]
  ,e.[id]
  ,e.[source]
,row_number() over (partition by e.[client] ,
                        e.[orderId] ,
                        e.[campaign] ,
                        e.[id] 
                    order by e.[DateStamp]) as num
from [dbo].[log1] e 
inner join cr on 
                        e.client = cr.client
                        and e.campaign = cr.campaign
                        and e.orderId =  cr.orderId

which gives the below plan:

Removing the join also removes the sort:

select e.[DateStamp]
  ,e.[campaign]
  ,e.[client]
  ,e.[orderId]
  ,e.[ad]
  ,e.[id]
  ,e.[source]
,row_number() over (partition by e.[client] ,
                        e.[orderId] ,
                        e.[campaign] ,
                        e.[id] 
                    order by e.[DateStamp]) as num
from [dbo].[log1] e

(I know this also removes the filtering performed by the join, but this doesn't explain why excluding those rows changes the order)

Why would the results of a sorted join then not be in the correct order?

Solution

Generally speaking, merge join (including merge join concatenation) only preserves the sort order of the join keys.

The merge join keys are client, campaign, orderId. The required input sort order for the window function is client, orderId, campaign, id , datestamp.

The merge join cannot therefore provide the sort order required by your window function. You could avoid the sort with a nested loops join (e.g. using a hint).

I wrote about the details in Avoiding Sorts with Merge Join Concatenation.

OTHER TIPS

The query can be rewritten as,

select e.[DateStamp]
  ,e.[campaign]
  ,e.[client]
  ,e.[orderId]
  ,e.[ad]
  ,e.[id]
  ,e.[source]
,row_number() over (partition by e.[client] ,
                        e.[orderId] ,
                        e.[campaign] ,
                        e.[id] 
                    order by e.[DateStamp]) as num
from [dbo].[log1] e 
where exists(select 1 from LiveReference
 cr where
                        e.client = cr.client
                        and e.campaign = cr.campaign
                        and e.orderId =  cr.orderId
)

So no need of CTE and distinct

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange