Query tuning update with hash flow distinct operator
-
06-03-2021 - |
Question
Need some help in understanding the slowness in one of the UPDATE
statements like below:-
UPDATE TOP (100) xyz
SET xyz.flag = 1
OUTPUT inserted.Rcode, inserted.EDR, inserted.id, abc.EID,abc.CID,abc.ENID,abc.Cdate
FROM dbo.table1 xyz WITH (UPDLOCK, READPAST)
INNER JOIN dbo.table2 abc WITH (NOLOCK)
on xyz.id=abc.id
WHERE xyz.flag = 0
Table1 has approx. 0.5 million rows and Table 2 has approx. 5 million rows
Slow Plan
Hash Match distinct flow operator shows a yellow alarm and message is:
Operator used Tempdb to spill data for executed with spill level 4 with 1 spilled thread"
Build residual:
database.dbo.table2.id as abc.id = database.dbo.table2.id as abc.id
I took a screenshot. Unfortunately due to security reasons I can't provide more than that, not even an anonymized plan. From my working station I cannot access the internet so there is no way I can get plan explorer to run there.
Generally for a smaller subset of rows it's under sec like when we have just matching 10K rows or something. But with higher amount of data this seems to be tipping point and app cannot afford 1 min of run time. From SSMS I get 30 secs but from app we have avg. 50 secs approx. RCSI is in testing phase.
My good plan does not have that Hash Match Flow Distinct operator visible as shown in my screenshot, while rest of plan remains same. Good one completes under 3 secs or so. As seen nearly 16 seconds are spent on that operator. Can we eliminate it via proper indexing or query re-write?
Table schema
CREATE TABLE dbo.table1
(
Recid VARCHAR(128) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
Cdate DATETIME NULL,
flag BIT NULL DEFAULT (0),
Rcode INT NULL,
EDR VARCHAR(255) COLLATE SQL_Latin1_general_CP1_CI_AS NULL,
id BIGINT NULL
);
CREATE TABLE dbo.table2
(
ENID BIGINT IDENTITY(1,1) NOT NULL,
EID VARCHAR(50) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
CID VARCHAR(350) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
CDate DATETIME NOT NULL DEFAULT(getdate()),
id BIGINT NOT NULL,
CONSTRAINT PK_ENID PRIMARY KEY (ENID ASC, EID ASC),
);
-- table1
CREATE INDEX ix_Cdate on dbo.table1 (Cdate) WITH (FILLFACTOR=100);
CREATE CLUSTERED INDEX ix_Recid on dbo.table1 (Recid) WITH (FILLFACTOR=80);
-- table2
CREATE INDEX ix_ENID_id on dbo.table2 (ENID,id) WITH (FILLFACTOR=100);
Changes
Changes I made and some numbers:
Added hint
OPTION (QUERYTRACEON 4138)
- avg. execution 7 secs down from original 50secs, but app team seems not have access to perform this in code. Need to check further on this.OPTION (ORDER GROUP)
gave same results of avg. 50secs so no improvement there.Added index as suggested:
CREATE INDEX i ON dbo.table2 (id) INCLUDE (CID, CDate);
Not much improvements there. Avg. 45 secs and plan was similar to one attached in this question (top plan).
Before and after each test I made sure plan was not generated from previous cached plan.
Fast plan
Attaching the plan which is faster and without any change in data or query is still fast for same amount of rows in both tables. App team continuously submit above query throughout the day to finish the batch by completing those TOP 100. There is a plan change based on some tipping number and below is how the good plan looks:
Edit:- With everything unchanged, no code change or any index being added, as suggested when i am trying to add hint (FORCESEEK) its giving me below error
Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.
Solution
You have three main problems:
- There is no useful index to support the join on
id
. - The
TOP (100)
introduces a row goal, so estimations may be too low. - The
UPDATE
is non-deterministic.
Multiple rows from table2 could match on id
, so it is not clear which matching row from table2 should be used to provide values for the OUTPUT
clause. The aggregate is there to group on table2 id
and choose ANY
matching values for the other columns. The aggregate is a Flow Distinct because of the row goal.
One needs to be very careful with ANY
aggregates in non-deterministic UPDATE
statements because you may get incorrect results.
There is not enough detail in the question to make high quality recommendations, but:
- Add an index like
CREATE INDEX i ON dbo.table2 (id) INCLUDE (CID, CDate);
- Use
OPTION (QUERYTRACEON 4138)
to disable the row goal, orOPTION (ORDER GROUP)
to use a Stream Aggregate instead of Hash. - How you fix the non-deterministic
UPDATE
depends on the data relationships. The key point is to identify at most one row from the source that matches each target row. Typically, this will involve a unique index or constraint, or usingROW_NUMBER
orTOP (1)
.
Step 2 may or may not be necessary. I add it for completeness.
You may find it easier to visualize the issues and tune the query by writing it in this form:
UPDATE TOP (100)
xyz WITH (UPDLOCK, READPAST)
SET xyz.flag = 1
OUTPUT
inserted.Rcode, inserted.EDR, inserted.id,
abc.EID, abc.CID, abc.ENID, abc.Cdate
FROM dbo.table1 AS xyz
CROSS APPLY
(
-- At most one source row per target row
SELECT TOP (1)
abc.*
FROM dbo.table2 AS abc
WHERE
abc.id = xyz.id
-- ORDER BY something to choose the one row
) AS abc
WHERE
xyz.flag = 0;
Execution plan:
I probably wouldn't bother with a filtered index on table1, but if you did want to try it, this appears to be suitable:
CREATE INDEX i
ON dbo.table1 (Recid)
INCLUDE (id, flag)
WHERE flag = 0;
If you want to continue with the update syntax given in the question without addressing all the underlying issues properly, you may find this is faster:
UPDATE TOP (100) xyz
SET xyz.flag = 1
OUTPUT inserted.Rcode, inserted.EDR, inserted.id, abc.EID,abc.CID,abc.ENID,abc.Cdate
FROM dbo.table1 xyz WITH (UPDLOCK, READPAST)
INNER JOIN dbo.table2 abc WITH (NOLOCK, FORCESEEK)
on xyz.id=abc.id
WHERE xyz.flag = 0;