Question

I have a table for users and a table for documents. Documents have exactly one user as an owner, and for the application I'm building, I know that I will typically be accessing a group of documents associated with a single given user.

Let's say the average user has K documents, and certain common queries fetch all of the documents for a given user. I don't want my database (PostgreSQL) to have to do K disk seeks (on average) to fetch all the documents for a user. Ideally, the documents would be stored in contiguous blocks so that fetches would only require a few seeks.
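For concreteness, the schema looks roughly like this (table and column names are illustrative):

    CREATE TABLE users (
        id   bigserial PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE documents (
        id       bigserial PRIMARY KEY,
        owner_id bigint NOT NULL REFERENCES users (id),
        body     text
    );

    -- The common query: fetch every document belonging to one user.
    SELECT * FROM documents WHERE owner_id = $1;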

Is it possible (and reasonable) to organize the document table schema to create such locality? I know that NoSQL implementations do this all the time. E.g. the BigTable paper talks about how row keys for web tables are assigned by URL, except that the URL is reversed, e.g. com.cnn.www, so that all the pages for CNN are located near each other in the data store. It doesn't appear possible to do something similar in Postgres because its tables cannot be index-organized, although it might be possible in MySQL with InnoDB. This post comes to a similar conclusion.


Solution

The command you're looking for is CLUSTER, but it has drawbacks. It completely rewrites the table when you run it, and it holds an exclusive lock on the table for the duration, so you may only want to do this when traffic is low. Also, Postgres will do nothing to keep rows in that order during INSERTs and UPDATEs, so your data will tend to fragment as the table is written to, and you may have to recluster it regularly.
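For example, assuming the documents table from the question has an owner_id column, clustering on it would look something like this (index name is hypothetical):

    -- Index on the column you want physical locality for.
    CREATE INDEX documents_owner_idx ON documents (owner_id);

    -- Rewrite the table so rows are physically ordered by owner_id.
    -- This holds an exclusive lock for the duration of the rewrite.
    CLUSTER documents USING documents_owner_idx;

    -- Later runs can simply repeat the last clustering:
    CLUSTER documents;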

You can also set a low fillfactor on the table, so that UPDATEs are more likely to keep a given row on the same page. That should prevent some fragmentation, which just leaves INSERTs; with a low fillfactor, INSERTs will tend to land on newer pages, and those pages will probably be accessed often enough to stay in RAM anyway. I'm making assumptions about your usage patterns that may be wrong, but regardless, your best course of action is probably just to recluster whenever you see I/O start to become a problem.
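A sketch of the fillfactor approach (70 is just an illustrative value; tune it to your row sizes and update rate):

    -- Leave ~30% of each page free so updated rows can stay on their page.
    ALTER TABLE documents SET (fillfactor = 70);

    -- The setting only affects pages written afterwards; a CLUSTER (or
    -- VACUUM FULL) rewrites existing pages honoring the new fillfactor.
    CLUSTER documents;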

Finally, there's also a tool called pg_repack that can cluster a table without taking such a heavy lock, in a similar manner to how CREATE INDEX CONCURRENTLY works. It's a third-party tool, though, so you'll want to experiment with it before running it in production.
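If you go that route, a typical invocation might look like the following (same table and column names as above, "mydb" is a placeholder database; check the options supported by your pg_repack version):

    # Rewrite documents ordered by owner_id, holding only brief locks,
    # similar in spirit to an online CLUSTER.
    pg_repack --table=documents --order-by=owner_id mydb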

Licensed under: CC-BY-SA with attribution