Question

I know majority of databases uses B-Trees, and I can see how using a balanced binary tree will give fast sort times, for ordering by ID or whatever else the primary key is; but how are databases able to ORDER_BY different fields like Name or Age, does it just perform an efficient sorting algorithm like merge sort or quick sort on the data or does it store sorted data in B-Trees for all fields (which seems really inefficient for storage). Because ID ordering and Name ordering would be different unless it stores all fields sorted in a B-Tree, it must perform some other sorting algorithm.

TLDR: How are databases able to perform fast sorting on non-primary key fields, if the data stored in B-Tree is based on the primary key.

Was it helpful?

Solution

The data in a table can be ordered (this is called a clustered index) but obviously you can only have one sort order. Further indices will not reorder the data, but also they do not contain the data, only the order of rows. If the database does not have an index for the desired ORDER BY clause, it will have to build one on the fly.

OTHER TIPS

In a typical RDBMS the row data lives in a separate area of storage called the heap. The B-tree is just an index pointing into the heap. It is straightforward to construct multiple indexes pointing into the same heap, on an arbitrarily determined key (you can even combine fields or apply functions to them to generate a key). There is a special-case where the row data is stored inside the B-tree, called a clustered index, but even there you can have a second B-tree pointing into the primary index.

In order to access the rows quickly multiple strategies are used. Some columns can be kept inside the index, which is very fast to query, and for the others a page cache can be used to keep the most commonly accessed parts of the heap in RAM. If you have enough RAM to keep the working set entirely in memory the database will perform well even without indexes. For larger working sets an SSD with low latency seeks can help a lot with performance.

The challenge that databases have is to keep all these structures in sync even across crashes. The typical approach is to use a write-ahead log (WAL). Writes are first appended to the log, and a memory cache ensures that reads are serviced from the log instead of the out-of-date B-trees. After the log is flushed to disk the B-trees are also updated. If the database crashes before it has finished updating the B-trees, on next startup it will recover from the WAL and finish the update.

If you want to learn more, you can follow along with the notes from this slide deck.

Noone here can tell you if any database works like this, but I am pretty sure in most cases this is quite simple:

  • For any field or combination of fields where an index was created beforehand, this index is utilized whenever the ORDER_BY clause contains the related fields. That index could be implemented as a B-tree.

  • For any other field (or combination of fields), a sorting algorithm is picked and the sorting is done on-the-fly. Which sorting algorithm is used depends most probably on the specific DBMS and how sophisticated it is implemented, but I can imagine systems which decide on the algorithm depending on some heuristics like the number of records to sort, size of the key fields and if it likely they fit into main memory or not.

Normally it will create an index (for example a B tree) for the primary key automatically and allow the database administrator to create aditional indices for other columns, combinations of columns or expressions over column values. As you point out this trades some space (and insertion speed) for retrival speed. So to get the best performance add enough indices but not more than needed. If no index is present the database will use some sorting algorithm to sort the data on the fly.

Typically, a database keeps the row data separate from the index.

Rows are stored in pages (see e.g. the details of how it's done in PostgreSQL), and for each indexed field (or field combination), there is a B-Tree index which does not contain the actual row data, it only contains a pointer to the page and an index within the page. There's actually nothing special about the primary key on the technical level, it's just another index.

Note that in this design, if you only have operations involving indexed fields (such as WHERE clauses, ORDER BY and joins), you don't even have to look at the actual rows until you have to output the result. The (page, index) tuple uniquely identifies the row, and you can perform all operations using only that, and then in the end you fetch all the row data in one go.

Licensed under: CC-BY-SA with attribution
scroll top