What does it mean that a “clustered index does not store pointers to the data”? [duplicate]

https://dba.stackexchange.com/questions/275721

08-03-2021
|

Question

I am a beginner in SQL and Database Administration and I am trying to undestand the difference between clustered and non-clustered indexes. More or less I understood what the different tasks are but I have some issues when it comes to the question of "pointers". I don´t come from CS so I have some issue to imagine what happens at the "leaf level" with the pointers and in which way they connect the index to the data. So, what does it mean that a clustered index does not store pointers to the data, while, the non-clustered one, does it?

Sorry in advance if these questions sound stupid but at this point I am pretty confused!

Solution

No, that's not a bad question.

In MS SQL Server (and others that work similarly), a clustered index doesn't store pointers to the data because it doesn't need to, a clustered index actually contains the table data itself.

A clustered index is the table, just ordered in a particular way.

If you create and populate a brand new table with no clustered index (called a "heap"), then add a non-clustered index, you're creating a new (ordered) object that still has to point to the rows in the actual table.

But if you take that heap and add a clustered index, you are literally re-ordering the table itself.

OTHER TIPS

A heap is just a bunch of pages with rows on it, there isn't really any information about what is where aper form SQL Server keeping track of "that table is represented by this list of pages". To find anything, it has to scan the whole lot (unless you use TOP 1 or TOP <n> or OFFSET/FETCH in which case it might get away with scanning until enough rows is found).

An index is a tree structure that makes finding things a lot faster than scanning everything. I'll not going into detail on index/tree structures (see blow for a useful reference though) but using an index means you can find a specific thing in just a few page reads instead of many thousands. Trees have a root node at the top, leaf nodes at the bottom layer, and branch nodes in between.

A non-clustered index on a heap in SQL Server has in its leaf nodes references to where the data is in the DB: the file, page, and row number within that page. So once you get to the bottom of the tree you still have on page read to do, potentially one per found row if there are many matching.

A clustered index in SQL Server (the term means something a little different in postgres and elsewhere) contains the actual row data in the leaf nodes: no extra lookups required. This is why clustered indexes are particularly useful for range queries: you could find the data for several, maybe a few hundred of, rows that match the filter all on the same page so no further lookups to do.

A non-clustered index on a table with a clustered index is very similar to one on a heap. The difference is that instead of the leaf nodes containing pointers to where the data is in the heap they contain the clustering key for the row so the extra data can be found via the clustered index (unless all you need is contained in the indexed values, the INCLUDED values, and the clustering key, in which case no further lookups are needed).

It may seem like a non-clustered index on a clustered table could be less efficient than on a non-clustered one, but the search in the clustered index is only going to be a page or few and there are other factors that can make the heap lookup less efficient than reading the one page that is pointed to by the index, and heaps can be inefficient in other ways too.

See the official docs at https://docs.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide#Clustered for more detail and a couple of diagrams. Or if you have a little time to learn with play, try Brent Ozar's How To Think Like SQL Server.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange