Question

I have a table with a few million rows. It contains logs from an external service, so I decided not to index it (lots of inserts, sparse reads).

When I run a query that reads from the table without an index, it (unsurprisingly) takes a very long time.

However, when I create an index, run the query, and then drop the index, the whole operation is considerably faster (even including the time to create and drop the index).

Why is it faster to create the index ad hoc instead of letting SQL Server do its thing? It seems unintuitive (why wouldn't SQL Server create the index itself?). Are there any downsides to this approach?


The query in question looks something like this, but I do not think it is particularly relevant, as I have seen similar behavior elsewhere as well.

    UPDATE Device
    SET Col1 = l.Col1
        ,Col2 = l.Col2
        ,Col3 = l.Col3
    FROM dbo.Device
        OUTER APPLY (
            SELECT MAX(Id) AS [Id]
            FROM dbo.Logs 
            WHERE Logs.Device_FK = Device.Id
            GROUP BY Logs.Device_FK
        ) lastLog
        OUTER APPLY (
            SELECT Col1, Col2, FORMAT(Col3, 'G') AS [Col3] -- FORMAT requires a format string
            FROM dbo.Logs
            WHERE Logs.Id = lastLog.Id
        ) l

Solution

There aren't necessarily any downsides to your approach. It depends entirely on how frequently you write to the table versus how often you update and read from it, which is why SQL Server lets the developer choose what and when to index instead of trying to guess. (It doesn't know your future intentions for a particular table any better than you do.)

I'm sure you understand how indexing works, so I won't go into too much detail, but generally speaking, an index stores the data sorted in a B-Tree data structure, which makes it very efficient to look up a specific set of data covered by that index.

Because of how B-Trees work and the algorithm used to build one from the indexed data, it's generally fastest to have all the data up front and then build the B-Tree over it. When an index (B-Tree) is already in place and data is added or deleted, that can trigger additional "shuffling" to reorganize the B-Tree, which is less efficient. (The "shuffling" reminds me of the MergeSort algorithm, by the way. That helps me visualize the difference between having all the data up front and adding new data after it has already been sorted.)

Of course, tables don't usually stay the same size with the exact same records, which is why more often than not the recommendation is to create your indexes on the table up front and let SQL Server do its best to efficiently update the underlying B-Tree as the table changes (and it does a great job of it, up to a point).

In certain cases though (yours may be one), if the table has a very high frequency of INSERTs and DELETEs and a low frequency of UPDATEs and SELECTs against it, then creating and dropping the indexes ad hoc (just before and after reading from the table) can make sense.

At the end of the day you'd have to test both ways to see what works best for your environment. The size of the table doesn't matter much when choosing when to index it; it makes no difference to SQL Server whether you store a billion records or ten in a single index. What matters is how frequently you insert into and delete from that table versus how often you update and select from it. (E.g. if you insert 100,000 records every minute into the table but only SELECT from it once a day, it may well be better to create your index ad hoc.)
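The ad-hoc pattern could be sketched like this in T-SQL. The index name and key/include columns here are my assumptions, inferred from the UPDATE query above, not something stated in the original post:

```sql
-- Hypothetical: build the index only for the duration of the batch read.
-- Key and INCLUDE columns are guesses based on the query shown above.
CREATE NONCLUSTERED INDEX IX_Logs_DeviceFK_Id
    ON dbo.Logs (Device_FK, Id)
    INCLUDE (Col1, Col2, Col3);

-- ... run the expensive UPDATE / reporting query here ...

DROP INDEX IX_Logs_DeviceFK_Id ON dbo.Logs;
```

The INCLUDE columns make the index covering for the second APPLY, so the read never has to touch the base table at all; the trade-off is a larger index that takes longer to build.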

OTHER TIPS

The reason is that you repeatedly scan the Logs table, and you even have two OUTER APPLYs in that query. Repeatedly scanning this table is evidently more expensive than building the index and then using it.

Nothing strange or unexpected here.

SQL Server could possibly do an index spool so it can reuse that for each visit to the Logs table. Perhaps the optimizer evaluated that strategy and discarded it because its estimates showed it wouldn't be beneficial (incorrectly, perhaps). A good first step would be to study the execution plan, compare the estimates to the actual values, and take it from there.
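One way to gather those numbers, as a starting point, is to turn on per-statement I/O and timing statistics before running the query (these are standard session options, not anything specific to the original post):

```sql
-- Diagnostic session settings: report logical reads, CPU and elapsed
-- time for each statement that follows.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the UPDATE from the question here, once without the index and
-- once with it, and compare the logical reads on dbo.Logs.
-- In SSMS, also enable "Include Actual Execution Plan" and check
-- whether an Index Spool appears (or fails to appear) over dbo.Logs,
-- and how far the estimated row counts diverge from the actuals.
```

A large gap between estimated and actual rows on the Logs operators would support the theory that the optimizer discarded the spool based on bad estimates.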

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange