Simple Self Join Query Bad Performance
-
01-07-2021 - |
Question
Could anyone advice on how do I improve the performance of the following query. Note, the problem seems to be caused by where
clause.
Data (table contains a huge set of rows - 500K+, the set of parameters it's called with assums the return of 2-5K records per query, which takes 8-10 minutes currently):
USE [SomeDb]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Data](
[x] [money] NOT NULL,
[y] [money] NOT NULL,
CONSTRAINT [PK_Data] PRIMARY KEY CLUSTERED
(
[x] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
The Query
select top 10000
s.x as sx,
e.x as ex,
s.y as sy,
e.y as ey,
e.y - s.y as y_delta,
e.x - s.x as x_delta
from Data s
inner join Data e
on e.x > s.x and e.x - s.x between xFrom and xTo
--where e.y - s.y > @yDelta -- when uncommented causes a huge delay
Update 1 - Execution Plan
<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.2" Build="11.0.2100.60" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
<BatchSequence>
<Batch>
<Statements>
<StmtSimple StatementCompId="1" StatementEstRows="100" StatementId="1" StatementOptmLevel="FULL" StatementOptmEarlyAbortReason="GoodEnoughPlanFound" StatementSubTreeCost="0.0263655" StatementText="select top 100
s.x as sx,
e.x as ex,
s.y as sy,
e.y as ey,
e.y - s.y as y_delta,
e.x - s.x as x_delta
from Data s 
 inner join Data e
 on e.x > s.x and e.x - s.x between 100 and 105
where e.y - s.y > 0.01
" StatementType="SELECT" QueryHash="0xAAAC02AC2D78CB56" QueryPlanHash="0x747994153CB2D637" RetrievedFromCache="true">
<StatementSetOptions ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="true" />
<QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="NoParallelPlansInDesktopOrExpressEdition" CachedPlanSize="24" CompileTime="13" CompileCPU="13" CompileMemory="424">
<MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" />
<OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="52199" EstimatedPagesCached="14561" EstimatedAvailableDegreeOfParallelism="4" />
<RelOp AvgRowSize="55" EstimateCPU="1E-05" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="100" LogicalOp="Compute Scalar" NodeId="0" Parallel="false" PhysicalOp="Compute Scalar" EstimatedTotalSubtreeCost="0.0263655">
<OutputList>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="y" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="y" />
<ColumnReference Column="Expr1004" />
<ColumnReference Column="Expr1005" />
</OutputList>
<ComputeScalar>
<DefinedValues>
<DefinedValue>
<ColumnReference Column="Expr1004" />
<ScalarOperator ScalarString="[SomeDb].[dbo].[Data].[y] as [e].[y]-[SomeDb].[dbo].[Data].[y] as [s].[y]">
<Arithmetic Operation="SUB">
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="y" />
</Identifier>
</ScalarOperator>
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="y" />
</Identifier>
</ScalarOperator>
</Arithmetic>
</ScalarOperator>
</DefinedValue>
<DefinedValue>
<ColumnReference Column="Expr1005" />
<ScalarOperator ScalarString="[SomeDb].[dbo].[Data].[x] as [e].[x]-[SomeDb].[dbo].[Data].[x] as [s].[x]">
<Arithmetic Operation="SUB">
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
</Identifier>
</ScalarOperator>
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
</Identifier>
</ScalarOperator>
</Arithmetic>
</ScalarOperator>
</DefinedValue>
</DefinedValues>
<RelOp AvgRowSize="39" EstimateCPU="1E-05" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="100" LogicalOp="Top" NodeId="1" Parallel="false" PhysicalOp="Top" EstimatedTotalSubtreeCost="0.0263555">
<OutputList>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="y" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="y" />
</OutputList>
<RunTimeInformation>
<RunTimeCountersPerThread Thread="0" ActualRows="100" ActualEndOfScans="1" ActualExecutions="1" />
</RunTimeInformation>
<Top RowCount="false" IsPercent="false" WithTies="false">
<TopExpression>
<ScalarOperator ScalarString="(100)">
<Const ConstValue="(100)" />
</ScalarOperator>
</TopExpression>
<RelOp AvgRowSize="39" EstimateCPU="151828" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="100" LogicalOp="Inner Join" NodeId="2" Parallel="false" PhysicalOp="Nested Loops" EstimatedTotalSubtreeCost="0.0263455">
<OutputList>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="y" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="y" />
</OutputList>
<RunTimeInformation>
<RunTimeCountersPerThread Thread="0" ActualRows="100" ActualEndOfScans="0" ActualExecutions="1" />
</RunTimeInformation>
<NestedLoops Optimized="false">
<OuterReferences>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="y" />
</OuterReferences>
<RelOp AvgRowSize="23" EstimateCPU="1.80448" EstimateIO="3.76461" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="1" LogicalOp="Clustered Index Scan" NodeId="3" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.0032831" TableCardinality="1640290">
<OutputList>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="y" />
</OutputList>
<RunTimeInformation>
<RunTimeCountersPerThread Thread="0" ActualRows="15225" ActualEndOfScans="0" ActualExecutions="1" />
</RunTimeInformation>
<IndexScan Ordered="false" ForcedIndex="false" ForceScan="false" NoExpandHint="false">
<DefinedValues>
<DefinedValue>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
</DefinedValue>
<DefinedValue>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="y" />
</DefinedValue>
</DefinedValues>
<Object Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Index="[PK_Data]" Alias="[e]" IndexKind="Clustered" />
</IndexScan>
</RelOp>
<RelOp AvgRowSize="23" EstimateCPU="0.902317" EstimateIO="1.88387" EstimateRebinds="1" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="100" LogicalOp="Clustered Index Seek" NodeId="4" Parallel="false" PhysicalOp="Clustered Index Seek" EstimatedTotalSubtreeCost="0.0263655" TableCardinality="1640290">
<OutputList>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="y" />
</OutputList>
<RunTimeInformation>
<RunTimeCountersPerThread Thread="0" ActualRows="100" ActualEndOfScans="15224" ActualExecutions="15225" />
</RunTimeInformation>
<IndexScan Ordered="true" ScanDirection="FORWARD" ForcedIndex="false" ForceSeek="false" ForceScan="false" NoExpandHint="false" Storage="RowStore">
<DefinedValues>
<DefinedValue>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
</DefinedValue>
<DefinedValue>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="y" />
</DefinedValue>
</DefinedValues>
<Object Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Index="[PK_Data]" Alias="[s]" IndexKind="Clustered" />
<SeekPredicates>
<SeekPredicateNew>
<SeekKeys>
<EndRange ScanType="LT">
<RangeColumns>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
</RangeColumns>
<RangeExpressions>
<ScalarOperator ScalarString="[SomeDb].[dbo].[Data].[x] as [e].[x]">
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
</Identifier>
</ScalarOperator>
</RangeExpressions>
</EndRange>
</SeekKeys>
</SeekPredicateNew>
</SeekPredicates>
<Predicate>
<ScalarOperator ScalarString="([SomeDb].[dbo].[Data].[x] as [e].[x]-[SomeDb].[dbo].[Data].[x] as [s].[x])>=($100.0000) AND ([SomeDb].[dbo].[Data].[x] as [e].[x]-[SomeDb].[dbo].[Data].[x] as [s].[x])<=($105.0000) AND ([SomeDb].[dbo].[Data].[y] as [e].[y]-[SomeDb].[dbo].[Data].[y] as [s].[y])>(0.01)">
<Logical Operation="AND">
<ScalarOperator>
<Compare CompareOp="GE">
<ScalarOperator>
<Arithmetic Operation="SUB">
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
</Identifier>
</ScalarOperator>
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
</Identifier>
</ScalarOperator>
</Arithmetic>
</ScalarOperator>
<ScalarOperator>
<Const ConstValue="($100.0000)" />
</ScalarOperator>
</Compare>
</ScalarOperator>
<ScalarOperator>
<Compare CompareOp="LE">
<ScalarOperator>
<Arithmetic Operation="SUB">
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="x" />
</Identifier>
</ScalarOperator>
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="x" />
</Identifier>
</ScalarOperator>
</Arithmetic>
</ScalarOperator>
<ScalarOperator>
<Const ConstValue="($105.0000)" />
</ScalarOperator>
</Compare>
</ScalarOperator>
<ScalarOperator>
<Compare CompareOp="GT">
<ScalarOperator>
<Arithmetic Operation="SUB">
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[e]" Column="y" />
</Identifier>
</ScalarOperator>
<ScalarOperator>
<Identifier>
<ColumnReference Database="[SomeDb]" Schema="[dbo]" Table="[Data]" Alias="[s]" Column="y" />
</Identifier>
</ScalarOperator>
</Arithmetic>
</ScalarOperator>
<ScalarOperator>
<Const ConstValue="(0.01)" />
</ScalarOperator>
</Compare>
</ScalarOperator>
</Logical>
</ScalarOperator>
</Predicate>
</IndexScan>
</RelOp>
</NestedLoops>
</RelOp>
</Top>
</RelOp>
</ComputeScalar>
</RelOp>
</QueryPlan>
</StmtSimple>
</Statements>
</Batch>
</BatchSequence>
</ShowPlanXML>
Solution
I have often seen big performance gains by inserting the results of the first query (in your case without the where clause) into a TEMP table or a table variable, and selecting from this afterwards (which basically helps the query optimiser to select an appropriate execution plan).
Also just noticed you don't have an INDEX on column Y, which may speed up a bit.
EDIT Also, try the following (gives me slightly better performance):
SELECT * FROM
(SELECT
s.x as sx,
e.x as ex,
s.y as sy,
e.y as ey,
e.y - s.y as y_delta,
e.x - s.x as x_delta
FROM Data s
JOIN Data e
ON e.x > s.x
) data
WHERE data.y_delta > @yDelta AND data.x_delta BETWEEN @xFrom AND @xTo
OTHER TIPS
The main issues are that the where clause gives a cross join (not what I would call a simple join) (giving many rows is that the join compares many rows in e for a row in s) and also the .y comparison has to use a table scan (over at least temporary data if not the whole table).
If the query is common then there is a possible fix in doing the join once and copying the data into a pre calculated table and indexing the differences.
I can think of an odd strategy to improve performance. This involves adding an auto-incrementing id into the table, then finding the bounds for Xfrom and Xto in terms of the id, and then looking for anything in the range.
The following query suggests what I mean:
with e1 as (
select e.*,
(select min(id) from data s where e.x between s.x + @xfrom and s.x + @xto) as idstart,
(select max(id) from data s where e.x between s.x + @xfrom and s.x + @xto) as idend
from data e
)
select <whatever>
from e1 join
e1 s
on e1.idstart = s.id
where e1.idstart <= e1.idend union all
select <whatever>
from e1 join
e1 s
on e1.idstart+1 = s.id
where e1.idstart+1 <= e1.idend
. . .
I have had good luck with the correlated subquery on an indexed field returning the next value. After that, the joins are equijoins, which should be very fast. In the end, you would want to modify the query so it can do the comparisons using a table of enumerated values.