Question

I often hear that partitioning huge tables should significantly increase query speed and insert/update speed, because indexes are proportionally smaller.

I'm struggling to understand why that should be the case.

In my understanding operations on an index should be of log(N) complexity, so even if we partition a huge table into 100 smaller items then we just partitioned a single index into 100 smaller ones.

If queries frequently access items from the entire dataset, then these 100 indexes will compete for processor cache anyway, so I don't see why database will be hitting the disk less often.

If we are usually only requesting some portion of the data - for example if we have a table with timestamped data and are more often interested in latest data - then the B-tree node count that we have to traverse will be 1 or 2 less. But in un-partitioned table scenario top-most nodes of B-tree will be cached anyway, so the fact that B-tree is slightly higher shouldn't have significant impact on performance.

Variation on previous scenario: if we are commonly accessing ranges of data that are on the same "axis" as the partition key - again, if we have a table paritioned by timestamp and run reports on date ranges - then partitions will provide natural clustering for the data, thus reducing disk accesses and improving query performance. But I don't see how this will improve insert/update performance? And why should it improve performance on queries that do not filter by partition key?

I realize that partitions are useful in some other cases (dropping old data, lock contention with many users, backups, aforementioned clustering, etc) - but I commonly hear that query or update performance should increase as well, hence this question.

I'm most familiar with PostgreSQL, but I suspect the underlying concepts are similar between many relational databases.

Was it helpful?

Solution

Partitioning (at least from the perspective of SQL Server, but probably applies to most RDBMS) is not meant to be a tool for improving DQL query performance. E.g. your bread and butter SELECT statements won't see any impactful performance gains from partitioning, as you mentioned because of the Big O of B-Tree indexing. Rather it's meant to improve the performance (mostly be reducing contention) of DDL and some DML queries, for better database management.

I highly recommend this Brent Ozar article on partitioning (by Kendra Little) where she talks more theoretical implementation (as opposed to technical and specific SQL Server implementation).

One of the key points it mentions is partition switching which in 5 seconds is a methodology of partitioning on a key where your data is rolling (such as on a date field) and allows you to manage old partitions without causing locks or contention on the entire Table. By breaking up your Table into multiple partitions, you can archive your oldest partition without causing downtime on the rest of the Table that is active.

The same idea applies to other operations such as as index rebuilds / reorgs, and other index and table maintenance, by allowing you to only affect certain partitions at a time and reduce the overall contention and locking on the Table at any given time. You can more smartly manage your Tables and their indexes accordingly.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top