WHERE Clause Compare VarChar(n) Column = VarChar(MAX) Variable Results in Inequality Comparison and Poor Query Performance

https://dba.stackexchange.com/questions/281057

11-03-2021
|

Question

Earlier today I was working on a Reporting Stored Procedure which had poor performance. I was able to fix the issue by changing a static variable which held a string of ten characters or less from VARCHAR(MAX) to VARCHAR(10). I am looking to understand why that would have made a difference.

Additional Details If Needed

The Original query is much bigger than what needs to be looked at but I saw the below things happening, when joining the foundational table to a Key-Value Pair table we were returning a ton more rows than was actually needed (in some instances over 14 billion rows for a final output of ~60,000 rows) and we had a few extra operations (particularly the Lazy Table Spool). Image of that part of the Query Plan is below:

I could see a weird part of the execution plan was how the @ProgramID variable was being handled. This variable is used as [Column] = @ProgramID and the Column is part of the tables Non-Clustered Index and is a VarChar(10). But we are still doing an Inequality Search instead of an Equality Search.

After modifying this parameter/variable from VarChar(Max) to VarChar(10). The execution time dropped from several hours to about 30 seconds and the plan had a very different structure. Looking at a similar snippit from the execution plan, we were not returning nearly as many rows, we don't have the extra operations and we were doing an equality search on [Column] = @ProgramID (again Column is part of the tables Non-Clustered Index and is a VarChar(10)).

Below is a simplified query which shows some of this behavior (we don't get all the extra steps, but the Seek Predicate difference is still there): (Paste The Plan)

DECLARE @CompanyID VarChar(10) = 'RxCRoads'
        ,@ClientID VarCHar(10) = 'Amgen'
        ,@ProgramID VarChar(MAX) = 'Foundation'
        ,@StartDate DATETIME = '2020-01-01'
        ,@EndDate DATETIME = '2020-12-07'
        ,@RxOnly INT = 9

SELECT PC.CompanyID,
PC.ClientID,
PC.ProgramID,
PC.PatientID,
PC.CaseID,
RxOnly.[Value] AS RxOnly
FROM PATIENTCASES PC
    LEFT OUTER JOIN PatientCaseDetail RxOnly
        ON RxOnly.CompanyId = PC.CompanyID
        AND RxOnly.ClientId = PC.ClientID
        AND RxOnly.ProgramId = PC.ProgramID
        AND RxOnly.PatientID = PC.PatientID
        AND RxOnly.CaseID = PC.CaseID
        AND RxOnly.PatientCaseAdditionalElementId = @RxOnly
WHERE PC.CompanyID = @CompanyID
AND PC.ClientID = @ClientID
AND PC.ProgramID = @ProgramID
AND PC.CaseCreateDateTime >= @StartDate
AND PC.CaseCreateDateTime < @EndDate

Just changing the @ProgramID VarChar(MAX) = 'Foundation') to @ProgramID VarChar(10) = 'Foundation' returns a more streamline plan and better performance (even though not really noticeable in this particular query). (Paste The Plan)

Is there any particular reason why this change in the size of the data type would result in the change in the execution plan? Running the same query with @ProgarmId VarChar(8000) instead of @Program VarChar(10) has no noticeable difference from @Program VarChar(10). (Paste The Plan)

My guess is that VarChar(MAX) has different comparison rules due to its size, and this is just a state of what VarChar(MAX) is. But I didn't know if anyone had a better/more technical answer than I did so I can better apply this lesson in the future.

Solution

The Problems

You have two things working against you:

Local variables often have weird side effects
MAX datatypes can't be pushed, so you end up with late filters in your query plans

Some further reading:

Yet Another Post About Local Variables
MAX Data Types Do WHAT?
Why You Shouldn’t Ignore Filter Operators In Query Plans Part 1 | Part 2
Why does SQL Server use a better execution plan when I inline the variable?

In Particular

In your worse plan, there are two filter operators:

Each one is there to deal with the MAX type on @ProgramId. The length of the parameter doesn't matter much until you hit the max type, so long as it's the correct type, e.g. varchar = varchar or nvarchar = nvarchar

You may also see GetRangeThroughMismatchedType or GetRangeThroughConvertthere to deal with a date \ datetime mismatch:

Cast to date is sargable but is it a good idea?

You may find some overall relief by adding a recompile hint, but it won't fix the issue you're having around the MAX variable.

OTHER TIPS

Length is one of the things used by the optimizer to determine the optimal execution plan.

When using MAX, If SQL Server wants to use an index, it has to convert equal your variable to the column.

We can see Compute Scalar operator on your execution plan. It is a lightweight operator but using Compute Scalar with Nested Loop will be worse performance. Because Nested Loop goes to Compute Scalar operation for every row come to.

<ComputeScalar>
  <DefinedValues>
    <DefinedValue>
      <ValueVector>
        <ColumnReference Column="Expr1003"/>
        <ColumnReference Column="Expr1004"/>
        <ColumnReference Column="Expr1002"/>
      </ValueVector>
      <ScalarOperator ScalarString="GetRangeWithMismatchedTypes([@ProgramID],[@ProgramID],(62))">
        <Intrinsic FunctionName="GetRangeWithMismatchedTypes">
          <ScalarOperator>
            <Identifier>
              <ColumnReference Column="@ProgramID"/>
            </Identifier>
          </ScalarOperator>
          <ScalarOperator>
            <Identifier>
              <ColumnReference Column="@ProgramID"/>
            </Identifier>
          </ScalarOperator>
          <ScalarOperator>
            <Const ConstValue="(62)"/>
          </ScalarOperator>
        </Intrinsic>
      </ScalarOperator>
    </DefinedValue>
  </DefinedValues>
  <RelOp AvgRowSize="0" EstimateCPU="0" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="1" LogicalOp="Constant Scan" NodeId="4" Parallel="false" PhysicalOp="Constant Scan" EstimatedTotalSubtreeCost="0">
    <OutputList/>
    <RunTimeInformation>
      <RunTimeCountersPerThread Thread="0" ActualRows="1" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" ActualElapsedms="0" ActualCPUms="0"/>
    </RunTimeInformation>
    <ConstantScan/>
  </RelOp>
</ComputeScalar>

VARCHAR(N) and VARCHAR(MAX) kinda two different data types like DateTime and Date. VARCHAR(MAX) is BLOB and VARCHAR(n) In-Row data. So, if your column type does not VARCHAR(MAX), SQL Server needs to convert your variable for using the index.

There is a deep dive blog post from Paul White. But, doesn't include your example.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange