Question

Currently, I'm trying to run the following example query:

SELECT [DATA1], [DATA2] FROM TABLE WHERE
    [DIMENSION0] IN (1, 5, ... (possibly 10s of numbers)) AND
    [DIMENSION1] IN (5) AND
    [DIMENSION2] IN (10) AND
    [DIMENSION3] IN (48) AND
    [DIMENSION4] IN (1) AND
    [DIMENSION5] IN (1) AND
    [DIMENSION6] IN (8) AND
    [DIMENSION7] IN (1) AND
    [DIMENSION8] IN (52) AND
    [DIMENSION9] IN (1, 10, ... (possibly 100s of numbers)) AND
    [DIMENSION10] IN (1, 235, ... (possibly 1000s of numbers)) AND
    [DIMENSION11] IN (1)

The table looks like the following:

D = Dimension

[D0] [D1] [D2] [D3] [D4] [D5] [D6] [D7] [D8] [D9] [D10] [D11] [DATA1] [DATA2]

The table has a clustered index across all of the dimension columns and can contain millions of records.

When I run this query through SSMS, I get the following query plan:

[query plan screenshot]

Here, it wildly overestimates the number of records it is looking for, which I believe is the reason it runs so slowly.

I've updated the statistics, but that didn't fix it, so I'm left thinking it is an issue with the query itself.

I've also been able to improve the speed of the query by forcing SQL Server to use a parallel plan with:

OPTION(QUERYTRACEON 8649)
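The hint just goes at the end of the statement; trimmed down to two of the predicates, the query above looks like this with it applied:

SELECT [DATA1], [DATA2] FROM [dbo].[TABLE] WHERE
    [DIMENSION1] IN (5) AND
    -- ... the remaining predicates from above ...
    [DIMENSION11] IN (1)
OPTION (QUERYTRACEON 8649)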

This produces the following execution plan:

[parallel query plan screenshot]

This is faster, but it is still overestimating the number of rows.

I was hoping that someone might be able to help me understand why this estimation is so high, and how I could possibly reduce it.

Clustered index definition:

/****** Object:  Index [ClusteredIndex-20180726-083210]    Script Date: 26/07/2018 09:47:58 ******/
CREATE UNIQUE CLUSTERED INDEX [ClusteredIndex-20180726-083210] ON [dbo].[TABLE]
(
[DIMENSION0] ASC,
[DIMENSION1] ASC,
[DIMENSION2] ASC,
[DIMENSION3] ASC,
[DIMENSION4] ASC,
[DIMENSION5] ASC,
[DIMENSION6] ASC,
[DIMENSION7] ASC,
[DIMENSION8] ASC,
[DIMENSION9] ASC,
[DIMENSION10] ASC,
[DIMENSION11] ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

I perform insert, updates and deletes to the table.

I can't use a clustered columnstore index, as the [DATA2] column is varbinary(max). I tried a non-clustered columnstore index instead, but the query plan just used the clustered index.

I did update statistics with FULLSCAN. There are some Dimensions that are hit more than others. I have previously experimented with the order of the Dimensions in the Clustered Index, but it still overestimates the number of rows.
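For reference, the statistics update was along these lines, refreshing every statistics object on the table with a full scan:

UPDATE STATISTICS [dbo].[TABLE] WITH FULLSCAN;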

Full query with plan: https://www.brentozar.com/pastetheplan/?id=HyLHtXDVX


Solution

The way I see it, the problem is bad SQL. Terribly bad.

It starts with:

[DIMENSION3] IN (48) AND

If you have only one element, generate an equality predicate, not an IN.
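In other words, the single-value filters would look like this (abbreviated to a few of the columns from the question):

SELECT [DATA1], [DATA2]
FROM [dbo].[TABLE]
WHERE [DIMENSION3] = 48   -- equality instead of [DIMENSION3] IN (48)
  AND [DIMENSION4] = 1
  AND [DIMENSION5] = 1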

But worse:

[DIMENSION10] IN (1, 235, ... (possibly 1000s of numbers)) AND

There are no statistics behind an IN list, so it works best with smallish selections. In this case it is better to make a temp table (which does get statistics), load the values into it, and then replace the IN with a join or some sort of subselect. That way the query optimizer actually has an idea of what it is facing (in terms of selectivity) and may decide to approach the query in a different way.
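As a rough sketch of that idea (the temp table and column names here are made up, and the values would be loaded however the IN list is built today):

-- Load the huge IN list into a temp table; temp tables get real statistics
-- (unlike an inline IN list), so the optimizer can estimate selectivity.
CREATE TABLE #Dimension10Values (DimValue int NOT NULL PRIMARY KEY);

INSERT INTO #Dimension10Values (DimValue)
VALUES (1), (235);   -- ... plus the rest of the (possibly 1000s of) values

SELECT t.[DATA1], t.[DATA2]
FROM [dbo].[TABLE] AS t
JOIN #Dimension10Values AS d10
  ON d10.DimValue = t.[DIMENSION10]
WHERE t.[DIMENSION1] = 5
  AND t.[DIMENSION2] = 10
  AND t.[DIMENSION3] = 48
  -- ... remaining predicates; the other large lists ([DIMENSION0],
  -- [DIMENSION9]) can be handled the same way with their own temp tables ...
  AND t.[DIMENSION11] = 1

The temp table's statistics, auto-created the first time the optimizer needs them, give it an actual row count and value distribution to plan against.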

Otherwise, there are two things for you ;)

  • Get decent hardware ;) and then possibly look at in-memory tables.
  • Realize there is a reason Analysis Services exists.

Queries like that really stretch SQL Server. While it has made progress, this is exactly what Analysis Services cubes are made for.

OTHER TIPS

The issue may not be with improving statistics, but rather a problematic indexing strategy.

Indexing all (or many) columns is usually not effective. Indexes work best when built from a small number of selective, high-cardinality columns with small data types (ints, bigints, small varchars).
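If you go down that route, a purely illustrative narrower index could look like the following; the choice and order of dimensions here is an assumption and should be driven by whichever columns are actually selective in your data:

-- Hypothetical narrower nonclustered index on the most selective dimensions,
-- carrying DATA1 as an included column. DATA2 (varbinary(max)) is left out
-- to keep the index small, at the cost of a lookup when it is needed.
CREATE NONCLUSTERED INDEX IX_TABLE_SelectiveDims
ON [dbo].[TABLE] ([DIMENSION10], [DIMENSION9], [DIMENSION0])
INCLUDE ([DATA1]);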

You may consider hashing any huge columns like the varbinary(max) as part of the insert or ETL process and then indexing them:

How to handle uniqueness of many large columns in SQL Server?
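A minimal sketch of that pattern (the column and index names are made up; note that before SQL Server 2016, HASHBYTES is limited to 8000 bytes of input):

-- Persisted computed column holding a hash of the large varbinary(max) value;
-- HASHBYTES is deterministic, so the column can be persisted and indexed.
ALTER TABLE [dbo].[TABLE]
    ADD [DATA2_Hash] AS CAST(HASHBYTES('SHA2_256', [DATA2]) AS binary(32)) PERSISTED;

CREATE NONCLUSTERED INDEX IX_TABLE_Data2Hash
    ON [dbo].[TABLE] ([DATA2_Hash]);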

How many rows in your main table? What percent of those rows will be returned by your query?

I ask because if your query returns around 20% of the data, nonclustered indexes will probably be eliminated as a retrieval path: fetching that many rows through a nonclustered index means hitting the same data page at different times during the scan to retrieve the multiple rows that live on it, which quickly becomes more expensive than simply scanning the clustered index.
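A quick way to sanity-check that percentage, substituting your real predicate list for the placeholder filters:

SELECT
    COUNT_BIG(*) AS RowsMatched,
    COUNT_BIG(*) * 100.0 / (SELECT COUNT_BIG(*) FROM [dbo].[TABLE]) AS PctOfTable
FROM [dbo].[TABLE]
WHERE [DIMENSION1] = 5      -- replace with the query's full set of predicates
  AND [DIMENSION2] = 10;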

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange