Question

I have a query that calculates row_Number(). The table has a clustered index on the same columns (and order) as the partitioning and ordering for the row number.

When using a merge join (many to one), a sort is required, even though the clustered index is in the correct order. Removing the join also removes the sort operation.

The clustered index that should power the row_number calculation:

create clustered index [ClusteredIndex_e060df3fbf464a8eb9b6ea5d46a9a5f5] on [dbo].[log1]
(
    [client] asc,
    [orderId] asc,
    [campaign] asc,
    [id] asc,
    [DateStamp] asc
)

create clustered index [ClusteredIndex_dd0ee53e050d436cba2cab7c678a39e5] on [dbo].[LiveReference]
(
    [client] asc,
    [orderId] asc,
    [campaign] asc
)

The query:

    with cr as 
(
    select distinct client, orderId,campaign
    from LiveReference
)

select e.[DateStamp]
  ,e.[campaign]
  ,e.[client]
  ,e.[orderId]
  ,e.[ad]
  ,e.[id]
  ,e.[source]
,row_number() over (partition by e.[client] ,
                        e.[orderId] ,
                        e.[campaign] ,
                        e.[id] 
                    order by e.[DateStamp]) as num
from [dbo].[log1] e 
inner join cr on 
                        e.client = cr.client
                        and e.campaign = cr.campaign
                        and e.orderId =  cr.orderId

which gives the below plan: enter image description here

Removing the join also removes the sort:

select e.[DateStamp]
  ,e.[campaign]
  ,e.[client]
  ,e.[orderId]
  ,e.[ad]
  ,e.[id]
  ,e.[source]
,row_number() over (partition by e.[client] ,
                        e.[orderId] ,
                        e.[campaign] ,
                        e.[id] 
                    order by e.[DateStamp]) as num
from [dbo].[log1] e 

(I know this also removes the filtering performed by the join, but this doesn't explain why excluding those rows changes the order)

enter image description here

Why would the results of a sorted join then not be in the correct order?

Was it helpful?

Solution

Generally speaking, merge join (including merge join concatenation) only preserves the sort order of the join keys.

The merge join keys are client, campaign, orderId. The required input sort order for the window function is client, orderId, campaign, id , datestamp.

The merge join cannot therefore provide the sort order required by your window function. You could avoid the sort with a nested loops join (e.g. using a hint).

nested loops plan

I wrote about the details in Avoiding Sorts with Merge Join Concatenation.

OTHER TIPS

The query can be rewritten as,

select e.[DateStamp]
  ,e.[campaign]
  ,e.[client]
  ,e.[orderId]
  ,e.[ad]
  ,e.[id]
  ,e.[source]
,row_number() over (partition by e.[client] ,
                        e.[orderId] ,
                        e.[campaign] ,
                        e.[id] 
                    order by e.[DateStamp]) as num
from [dbo].[log1] e 
where exists(select 1 from LiveReference
 cr where
                        e.client = cr.client
                        and e.campaign = cr.campaign
                        and e.orderId =  cr.orderId
)

So no need of CTE and distinct

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top