Why does using Format vs Right to apply padding cause estimated number of rows to dramatically change?
Question
I was working on a query at work, that had a left join like
cast(cola as varchar) + '-' + right('000' + cast(colb as varchar), 3) = x
The actual execution plan for this query was fairly close: an estimate of 269 rows vs. an actual 475.
Changing the RIGHT + padding to use format(colb, '000') results in a huge mis-estimate of the number of rows, off by at least 4 million, which causes the query to take 10 to 15 times as long.
I understand why a mis-estimate would cause a problem, but I don't understand why using Format would cause a less accurate estimate.
Solution
There are a few things going on here:
- Decreased performance (due to greatly increased logical reads)
- Decreased accuracy of estimated number of rows
The factors involved are:
- Indexed VARCHAR column, with a SQL Server Collation, compared to NVARCHAR data (Please note: this scenario is Collation-type specific: if the Collation were a Windows Collation then there would be no discernible degradation of performance. For details, please see "Impact on Indexes When Mixing VARCHAR and NVARCHAR Types")
- Non-deterministic built-in or SQLCLR function, OR T-SQL UDF (whether marked as deterministic or not)
- Fragmented index (I had thought that it was merely stale stats that were the issue, but updating stats alone, even using WITH FULLSCAN, has no effect)
The three factors noted above have been confirmed via testing (see below). Two of the three factors are easy enough to correct for:
- Convert NVARCHAR values to VARCHAR if wanting to make use of an indexed VARCHAR column, OR change the column's Collation to be a Windows Collation.
- Do a full REBUILD of the index(es). Doing an ALTER INDEX ... REORGANIZE; or UPDATE STATISTICS ... WITH FULLSCAN; by themselves does not seem to help (at least in terms of estimated row counts).
- (optionally) Consider whether a deterministic alternative is available (e.g. if CASE / CONVERT + RIGHT is more efficient than FORMAT, AND produces the same result, then by all means use CASE / CONVERT + RIGHT; FORMAT can do some nifty things, but is unnecessary for left-padding).
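As a sketch of that deterministic alternative, using the padding expression from the question (the table name and column sizes here are assumptions for illustration):

```sql
-- Hypothetical illustration: left-pad colb to 3 digits without FORMAT.
-- FORMAT(colb, '000') returns NVARCHAR and is non-deterministic; the
-- RIGHT-based expression is deterministic and stays VARCHAR throughout.
SELECT CAST(t.cola AS VARCHAR(10)) + '-'
           + RIGHT('000' + CAST(t.colb AS VARCHAR(10)), 3) AS [JoinKey]
FROM   dbo.SomeTable t; -- table name is an assumption for illustration
```

Both expressions produce '007' for a colb value of 7, but only the RIGHT-based version keeps the comparison deterministic and in VARCHAR, which matters for the row estimates discussed above.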
Also keep in mind priorities. While having accurate estimated row counts is ideal, if they are close enough you will be fine. Meaning, don't feel the need to do extra work to get super-accurate estimated row counts if doing so won't give any real performance gain (especially since, depending on the level of fragmentation, the non-deterministic function sometimes has a more accurate row estimate!). On the other hand, changing the datatype (of the value being compared) or the Collation is worth the effort, as that will have a noticeable positive impact. Then, doing a REBUILD of the index will get you close enough on the estimated row counts.
Testing Method
I tested this by populating a local temporary table with 5 million rows of the "name" column from sys.all_objects (using a Collation of SQL_Latin1_General_CP1_CI_AS), then creating a non-clustered index on the string column, and then adding another 100k rows to fragment the index.
I filtered on a VARCHAR literal, and then on the same string literal prefixed with an upper-case "N" to make it NVARCHAR. This isolated the issue of the comparison value's datatype.
I then filtered on the same literal value, but wrapped in a call to FORMAT. This isolated the issue of non-deterministic functions.
To confirm the behavioral effect of function determinism, I created two SQLCLR functions that do nothing more than return the passed-in value, but one is deterministic and the other is not. This makes it clear that the issue is determinism and not anything else happening with the function. I used SQLCLR because there does not seem to be an equivalent way of doing this in T-SQL: even if the function is marked in the system as being deterministic (by creating the UDF using WITH SCHEMABINDING), the behavior will mirror that of non-deterministic functions (I did test this but did not include it below).
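For reference, a schemabound pass-through T-SQL UDF of the sort described (this is my sketch, not the exact code used in testing) looks like the following. OBJECTPROPERTY reports it as deterministic, yet the row estimates still behave as in the non-deterministic case:

```sql
-- T-SQL scalar UDF that simply returns its input. WITH SCHEMABINDING
-- allows SQL Server to flag the function as deterministic:
CREATE FUNCTION dbo.PassThrough_TSQL (@TheString NVARCHAR(4000))
RETURNS NVARCHAR(4000)
WITH SCHEMABINDING
AS
BEGIN
    RETURN (@TheString);
END;
GO
-- Returns 1 (deterministic), yet estimates mirror the non-deterministic case:
SELECT OBJECTPROPERTY(OBJECT_ID(N'dbo.PassThrough_TSQL'), 'IsDeterministic');
```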
I used SET STATISTICS IO, TIME ON; and checked the "Include Actual Execution Plan" option in SSMS.
After running the first set of tests, I executed:
EXEC (N'USE [tempdb]; UPDATE STATISTICS #Objects [IX_#Objects_Name] WITH FULLSCAN;');
and re-ran the tests. Minimal improvement on logical reads, and no change to estimated number of rows.
I then executed:
ALTER INDEX ALL ON #Objects REORGANIZE;
and re-ran the tests. No change to estimated number of rows.
I then executed:
ALTER INDEX ALL ON #Objects REBUILD;
and finally saw an improvement on both logical reads and estimated number of rows.
Then, I dropped the table, recreated it using Latin1_General_100_CI_AS_SC as the Collation, and re-ran the tests as described above.
Test Code
SQLCLR Code
The following code was used to create two scalar functions that do exactly the same thing: simply return the value passed in. The only difference between the two functions is that one is marked as IsDeterministic = true and the other as IsDeterministic = false.
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public class ScalarFunctions
{
[return: SqlFacet(MaxSize = 4000)]
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlString PassThrough_Deterministic(
[SqlFacet(MaxSize = 4000)] SqlString TheString)
{
return TheString;
}
[return: SqlFacet(MaxSize = 4000)]
[SqlFunction(IsDeterministic = false, IsPrecise = true)]
public static SqlString PassThrough_NonDeterministic(
[SqlFacet(MaxSize = 4000)] SqlString TheString)
{
return TheString;
}
}
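Assuming the assembly is cataloged under a name such as [PassThroughFunctions] (the assembly name is an assumption; the class and method names match the C# above), the T-SQL wrapper functions would be created along these lines:

```sql
-- Hypothetical T-SQL bindings for the SQLCLR methods above:
CREATE FUNCTION dbo.PassThrough_Deterministic (@TheString NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS EXTERNAL NAME [PassThroughFunctions].[ScalarFunctions].[PassThrough_Deterministic];
GO
CREATE FUNCTION dbo.PassThrough_NonDeterministic (@TheString NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS EXTERNAL NAME [PassThroughFunctions].[ScalarFunctions].[PassThrough_NonDeterministic];
```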
Test Setup
-- DROP TABLE #Objects;
CREATE TABLE #Objects
(
[ObjectID] INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
[Name] VARCHAR(128) COLLATE SQL_Latin1_General_CP1_CI_AS
-- Latin1_General_100_CI_AS_SC
);
-- Insert 5 million rows:
INSERT INTO #Objects ([Name])
SELECT TOP (5000000) ao.[name]
FROM [master].[sys].[all_objects] ao
CROSS JOIN [master].[sys].[all_columns] ac;
-- Create the index:
CREATE INDEX [IX_#Objects_Name] ON #Objects ([Name]) WITH (FILLFACTOR = 100);
-- Insert another 100k rows to fragment index and reduce accuracy of the statistics:
INSERT INTO #Objects ([Name])
SELECT TOP (100000) ao.[name]
FROM master.sys.all_objects ao
CROSS JOIN master.sys.all_columns ac;
The Tests (and results)
Results key:
- Set "(A)" = SQL Server Collation (SQL_Latin1_General_CP1_CI_AS)
- Set "(B)" = Windows Collation (Latin1_General_100_CI_AS_SC)
- Each results comment: { before REBUILD } / { after REBUILD }
- "CS + CS" = Compute Scalar + Constant Scan
SET STATISTICS IO, TIME ON;
-- Total rows matching filter criteria: 2203
SELECT [ObjectID] FROM #Objects WHERE [Name] = 'objects';
-- (A) logical reads 13 (est. rows: 2125.67) / 9 (2203.15) Index Seek
-- (B) logical reads 13 (est. rows: 2019.74) / 9 (2203.25) Index Seek
SELECT [ObjectID] FROM #Objects WHERE [Name] = N'objects';
-- (A) logical reads 25159 (est. rows: 2125.67) / 23158 (2203.15) Index SCAN
-- (B) logical reads 13 (est. rows: 2019.74) / 9 (2203.25) Index Seek + CS + CS
SELECT [ObjectID] FROM #Objects WHERE [Name] = FORMAT(0, N'objects');
-- (A) logical reads 25159 (est. rows: 2433.23) / 23158 (2406.8) Index SCAN
-- (B) logical reads 13 (est. rows: 2307.69) / 9 (2208.75) Index Seek + CS + CS
SELECT [ObjectID] FROM #Objects WHERE [Name] =
dbo.PassThrough_Deterministic(N'objects');
-- (A) logical reads 25159 (est. rows: 2125.67) / 23158 (2203.15) Index SCAN
-- (B) logical reads 13 (est. rows: 2019.74) / 9 (2203.25) Index Seek + CS + CS
SELECT [ObjectID] FROM #Objects WHERE [Name] =
dbo.PassThrough_NonDeterministic(N'objects');
-- (A) logical reads 25159 (est. rows: 2433.23) / 23158 (2406.8) Index SCAN
-- (B) logical reads 13 (est. rows: 2307.69) / 9 (2208.75) Index Seek + CS + CS
SET STATISTICS IO, TIME OFF;
EXEC (N'USE [tempdb]; UPDATE STATISTICS #Objects [IX_#Objects_Name] WITH FULLSCAN;');
-- re-run tests
ALTER INDEX ALL ON #Objects REORGANIZE;
-- re-run tests
ALTER INDEX ALL ON #Objects REBUILD;
-- re-run tests
Second Variation
- DROP table
- re-create table using Windows Collation
- re-run all tests in "The Tests" section above
OTHER TIPS
FORMAT returns nvarchar, which has a higher data type precedence than the compared varchar column. In addition to the imprecise row count estimate, the implicit conversion of the compared varchar column to nvarchar will prevent indexes on that column from being used efficiently.
Try casting the FORMAT result to varchar.
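A minimal sketch of that fix (the values are illustrative):

```sql
-- FORMAT returns NVARCHAR; casting the result back to VARCHAR keeps the
-- comparison in the indexed column's datatype. (The cast does not change
-- FORMAT's non-determinism, only the datatype of the comparison value.)
DECLARE @colb INT = 7;
SELECT FORMAT(@colb, '000')                     AS [AsNvarchar], -- N'007'
       CAST(FORMAT(@colb, '000') AS VARCHAR(3)) AS [AsVarchar];  --  '007'
```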
The remarks section for FORMAT (Transact-SQL) says:
The FORMAT function is nondeterministic.
Therefore, the query optimizer cannot predict what to expect as the result of this function. The non-deterministic behavior may even keep it from applying some optimizations, like caching intermediate results.