Why does using Format vs Right to apply padding cause estimated number of rows to dramatically change?
Question
I was working on a query at work, that had a left join like
cast(cola as varchar) + '-' + right('000' + cast(colb as varchar), 3) = x
The actual execution plan for this query was fairly close: an estimate of 269 rows vs. an actual 475.
Changing the RIGHT + padding to use format(colb, '000') results in a huge mis-estimate of the number of rows, off by at least 4 million, which causes the query to take 10 to 15 times as long.
I understand why a mis-estimate would cause a problem, but I don't understand why using Format would cause a less accurate estimate.
Solution
There are a few things going on here:
- Decreased performance (due to greatly increased logical reads)
- Decreased accuracy of estimated number of rows
The factors involved are:
- Indexed VARCHAR column, with a SQL Server Collation, compared to NVARCHAR data (Please note: this scenario is Collation-type specific: if the Collation were a Windows Collation then there would be no discernible degradation of performance. For details, please see "Impact on Indexes When Mixing VARCHAR and NVARCHAR Types")
- Non-deterministic built-in or SQLCLR function, OR T-SQL UDF (whether marked as deterministic or not)
- Fragmented index (I had thought that it was merely stale stats that were the issue, but updating stats alone, even using WITH FULLSCAN, has no effect)
The three factors noted above have been confirmed via testing (see below). Two of the three factors are easy enough to correct for:
- Convert NVARCHAR values to VARCHAR if wanting to make use of an indexed VARCHAR column, OR change the column's Collation to be a Windows Collation.
- Do a full REBUILD of the index(es). Doing an ALTER INDEX ... REORGANIZE; or UPDATE STATISTICS ... WITH FULLSCAN; by themselves does not seem to help (at least in terms of estimated row counts).
- (optionally) Consider whether a deterministic alternative is available (e.g. if CASE / CONVERT + RIGHT is more efficient than FORMAT, AND produces the same result, then by all means use CASE / CONVERT + RIGHT; FORMAT can do some nifty things, but is unnecessary for left-padding).
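As a sketch of that deterministic alternative, using the padding expression from the question (the table name and column sizes here are assumptions for illustration):

```sql
-- Hypothetical illustration: left-pad colb to 3 digits without FORMAT.
-- FORMAT(colb, '000') returns NVARCHAR and is non-deterministic; the
-- RIGHT-based expression is deterministic and stays VARCHAR throughout.
SELECT CAST(t.cola AS VARCHAR(10)) + '-'
           + RIGHT('000' + CAST(t.colb AS VARCHAR(10)), 3) AS [JoinKey]
FROM   dbo.SomeTable t; -- table name is an assumption for illustration
```

Both expressions produce '007' for a colb value of 7, but only the RIGHT-based version keeps the comparison deterministic and in VARCHAR, which matters for the row estimates discussed above.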
Also keep in mind priorities. While having accurate estimated row counts is ideal, if they are close enough you will be fine. Meaning, don't feel the need to do extra work to get super-accurate estimated row counts if doing so won't give any real performance gain (especially since, depending on the level of fragmentation, the non-deterministic function sometimes has a more accurate row estimate!). On the other hand, changing the datatype (of the value being compared) or the Collation is worth the effort, as that will have a noticeable positive impact. Then, doing a REBUILD of the index will get you close enough on the estimated row counts.
Testing Method
I tested this by populating a local temporary table with 5 million rows of the "name" column from sys.all_objects (using a Collation of SQL_Latin1_General_CP1_CI_AS), then creating a non-clustered index on the string column, and then adding another 100k rows to fragment the index.
I filtered on a VARCHAR literal, and then on the same string literal prefixed with an upper-case "N" to make it NVARCHAR. This isolated the issue of the comparison value's datatype.
I then filtered on the same literal value, but wrapped in a call to FORMAT. This isolated the issue of non-deterministic functions.
To confirm the behavioral effect of function determinism, I created two SQLCLR functions that do nothing more than return the passed-in value, but one is deterministic and the other is not. This makes it clear that the issue is determinism and not anything else happening with the function. I used SQLCLR because there does not seem to be an equivalent way of doing this in T-SQL: even if the function is marked in the system as being deterministic (by creating the UDF using WITH SCHEMABINDING), the behavior will mirror that of non-deterministic functions (I did test this but did not include it below).
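For reference, a schemabound pass-through T-SQL UDF of the sort described (this is my sketch, not the exact code used in testing) looks like the following. OBJECTPROPERTY reports it as deterministic, yet the row estimates still behave as in the non-deterministic case:

```sql
-- T-SQL scalar UDF that simply returns its input. WITH SCHEMABINDING
-- allows SQL Server to flag the function as deterministic:
CREATE FUNCTION dbo.PassThrough_TSQL (@TheString NVARCHAR(4000))
RETURNS NVARCHAR(4000)
WITH SCHEMABINDING
AS
BEGIN
    RETURN (@TheString);
END;
GO
-- Returns 1 (deterministic), yet estimates mirror the non-deterministic case:
SELECT OBJECTPROPERTY(OBJECT_ID(N'dbo.PassThrough_TSQL'), 'IsDeterministic');
```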
I used SET STATISTICS IO, TIME ON; and checked the "Include Actual Execution Plan" option in SSMS.
After running the first set of tests, I executed:
EXEC (N'USE [tempdb]; UPDATE STATISTICS #Objects [IX_#Objects_Name] WITH FULLSCAN;');
and re-ran the tests. Minimal improvement on logical reads, and no change to estimated number of rows.
I then executed:
ALTER INDEX ALL ON #Objects REORGANIZE;
and re-ran the tests. No change to estimated number of rows.
I then executed:
ALTER INDEX ALL ON #Objects REBUILD;
and finally saw an improvement on both logical reads and estimated number of rows.
Then, I dropped the table, recreated it using Latin1_General_100_CI_AS_SC as the Collation, and re-ran the tests as described above.
Test Code
SQLCLR Code
The following code was used to create two scalar functions that do exactly the same thing: simply return the value passed in. The only difference between the two functions is that one is marked as IsDeterministic = true and the other as IsDeterministic = false.
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public class ScalarFunctions
{
[return: SqlFacet(MaxSize = 4000)]
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlString PassThrough_Deterministic(
[SqlFacet(MaxSize = 4000)] SqlString TheString)
{
return TheString;
}
[return: SqlFacet(MaxSize = 4000)]
[SqlFunction(IsDeterministic = false, IsPrecise = true)]
public static SqlString PassThrough_NonDeterministic(
[SqlFacet(MaxSize = 4000)] SqlString TheString)
{
return TheString;
}
}
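Assuming the assembly is cataloged under a name such as [PassThroughFunctions] (the assembly name is an assumption; the class and method names match the C# above), the T-SQL wrapper functions would be created along these lines:

```sql
-- Hypothetical T-SQL bindings for the SQLCLR methods above:
CREATE FUNCTION dbo.PassThrough_Deterministic (@TheString NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS EXTERNAL NAME [PassThroughFunctions].[ScalarFunctions].[PassThrough_Deterministic];
GO
CREATE FUNCTION dbo.PassThrough_NonDeterministic (@TheString NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS EXTERNAL NAME [PassThroughFunctions].[ScalarFunctions].[PassThrough_NonDeterministic];
```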
Test Setup
-- DROP TABLE #Objects;
CREATE TABLE #Objects
(
[ObjectID] INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
[Name] VARCHAR(128) COLLATE SQL_Latin1_General_CP1_CI_AS
-- Latin1_General_100_CI_AS_SC
);
-- Insert 5 million rows:
INSERT INTO #Objects ([Name])
SELECT TOP (5000000) ao.[name]
FROM [master].[sys].[all_objects] ao
CROSS JOIN [master].[sys].[all_columns] ac;
-- Create the index:
CREATE INDEX [IX_#Objects_Name] ON #Objects ([Name]) WITH (FILLFACTOR = 100);
-- Insert another 100k rows to fragment index and reduce accuracy of the statistics:
INSERT INTO #Objects ([Name])
SELECT TOP (100000) ao.[name]
FROM master.sys.all_objects ao
CROSS JOIN master.sys.all_columns ac;
The Tests (and results)
Results key:
- Set "(A)" = SQL Server Collation (SQL_Latin1_General_CP1_CI_AS)
- Set "(B)" = Windows Collation (Latin1_General_100_CI_AS_SC)
- Each results comment: { before REBUILD } / { after REBUILD }
- "CS + CS" = Compute Scalar + Constant Scan
SET STATISTICS IO, TIME ON;
-- Total rows matching filter criteria: 2203
SELECT [ObjectID] FROM #Objects WHERE [Name] = 'objects';
-- (A) logical reads 13 (est. rows: 2125.67) / 9 (2203.15) Index Seek
-- (B) logical reads 13 (est. rows: 2019.74) / 9 (2203.25) Index Seek
SELECT [ObjectID] FROM #Objects WHERE [Name] = N'objects';
-- (A) logical reads 25159 (est. rows: 2125.67) / 23158 (2203.15) Index SCAN
-- (B) logical reads 13 (est. rows: 2019.74) / 9 (2203.25) Index Seek + CS + CS
SELECT [ObjectID] FROM #Objects WHERE [Name] = FORMAT(0, N'objects');
-- (A) logical reads 25159 (est. rows: 2433.23) / 23158 (2406.8) Index SCAN
-- (B) logical reads 13 (est. rows: 2307.69) / 9 (2208.75) Index Seek + CS + CS
SELECT [ObjectID] FROM #Objects WHERE [Name] =
dbo.PassThrough_Deterministic(N'objects');
-- (A) logical reads 25159 (est. rows: 2125.67) / 23158 (2203.15) Index SCAN
-- (B) logical reads 13 (est. rows: 2019.74) / 9 (2203.25) Index Seek + CS + CS
SELECT [ObjectID] FROM #Objects WHERE [Name] =
dbo.PassThrough_NonDeterministic(N'objects');
-- (A) logical reads 25159 (est. rows: 2433.23) / 23158 (2406.8) Index SCAN
-- (B) logical reads 13 (est. rows: 2307.69) / 9 (2208.75) Index Seek + CS + CS
SET STATISTICS IO, TIME OFF;
EXEC (N'USE [tempdb]; UPDATE STATISTICS #Objects [IX_#Objects_Name] WITH FULLSCAN;');
-- re-run tests
ALTER INDEX ALL ON #Objects REORGANIZE;
-- re-run tests
ALTER INDEX ALL ON #Objects REBUILD;
-- re-run tests
Second Variation
- DROP table
- re-create table using Windows Collation
- re-run all tests in "The Tests" section above
OTHER TIPS
FORMAT returns nvarchar, which has a higher data type precedence than the compared varchar column. In addition to the imprecise row count estimate, the implicit conversion of the compared varchar column to nvarchar will prevent indexes on that column from being used efficiently.
Try casting the FORMAT result to varchar.
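A minimal sketch of that fix (the values are illustrative):

```sql
-- FORMAT returns NVARCHAR; casting the result back to VARCHAR keeps the
-- comparison in the indexed column's datatype. (The cast does not change
-- FORMAT's non-determinism, only the datatype of the comparison value.)
DECLARE @colb INT = 7;
SELECT FORMAT(@colb, '000')                     AS [AsNvarchar], -- N'007'
       CAST(FORMAT(@colb, '000') AS VARCHAR(3)) AS [AsVarchar];  --  '007'
```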
The remarks section for FORMAT (Transact-SQL) says:
The FORMAT function is nondeterministic.
Therefore, the query optimizer cannot predict what to expect as the result of this function. The non-deterministic behavior may even keep it from applying some optimizations, like caching intermediate results.