If a rowstore index is stored in a b-tree data structure, what kind of data structure is used for a columnstore index? [duplicate]

https://dba.stackexchange.com/questions/268839

03-03-2021
|

Question

My understanding of how regular rowstore indexes are stored is in a type of b-tree data structure, but I was wondering because of the different nature of columnstore indexes, what type of underlying data structure is used?

Solution

We usually think of an index as a way of rapidly finding all references to a given piece of information. In a textbook given a topic we can find all pages that mention that topic. In a database given a column value we can find all rows that have that value.

Calling a columnstore an "index" in this sense is a bit of a misnomer. It is not intended to provide fast lookup for a specific column value's rows. Rather it is intended to quickly provide results of aggregates over large datasets. As such it is a data format in itself. The columnar data store, sometimes called the decompositional storage model (DSM), is well-known and venerable. Many proprietary and open-source DBMS offer it.

The basic idea of a column store is that values for a column are stored contiguously on disk. Then an aggregate on that column can efficiently pull that column, and only that column, from disk reducing the number of disk blocks that must be read. Further, since all the data values in a block are from the same domain compression on that block is likely to be very efficient, further reducing the number of blocks required. Compression can be much better than a generic ZIP - dictionaries, run-length and delta encodings can be used.

For various reasons it is better to break up rows into large groups before compressing and storing them. SQL Server calls these groups "row groups", which are made up of "column segments". Each row group holds just over one million rows. When the segments are constructed the server extracts some metadata such as min, max and (perhaps) sum for that column. This metadata is sometimes called zonemaps. At run time these zonemaps can be compared to the query predicates and, if the predicate does not lie between the segment's min and max values, processing can entirely avoid reading those blocks from disk. This is called segment elimination and is analogous to partition elimination. Moreover some queries can be answered directly from the zonemaps e.g. "select min(column) from table" can be answered by comparing the "min" segment summary values from the zonemaps.

I would note that B-Trees are not the only structure for indexing rowstores, it's just it is the only one available to database developers using SQL Server.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange