Question

I've got an older DB (Postgres 10.15) that hasn't yet been upgraded. One problematic table had a few large indexes on it, some of which were corrupt and needed reindexing. Since it's not on version 12+, I can't reindex the table concurrently, which means I need to do it non-concurrently, and that requires a table write lock. So I wanted to know how I could do some rough calculations on how long the reindex would take, so I can plan some maintenance. Most of my research ends in "just use pg_stat_progress_create_index!" (which isn't available in 10), or in people simply saying to use CONCURRENTLY.

The table is ~200GB, and there are 7 indexes of ~14GB each (as per pg_relation_size). I can get a constant ~900MB/s read rate on the DB for this task. Is there a simple metric I can use to determine how much data will need to be read to reindex fully?

Solution

You could just create a new index under a different name:

create index concurrently index_new on ...

Then drop the corrupted index:

drop index concurrently index_old;

Then rename the new index to the old name:

alter index index_new rename to index_old;

The rename does require a lock, but only for a few milliseconds of runtime once the lock is acquired. So you do not need downtime due to a write lock.
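
For illustration, here is the whole swap as one sequence; the column and index names are hypothetical, so substitute your actual index definition:

create index concurrently index_new on tablename (created_at);
drop index concurrently index_old;
alter index index_new rename to index_old;

Note that neither create index concurrently nor drop index concurrently can run inside a transaction block, so issue each statement on its own.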

The definition of each index can be obtained with the command pg_dump -s -t tablename --no-acl.
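
If you prefer to stay inside psql, the pg_indexes system view exposes the same definitions (tablename is a placeholder here):

select indexname, indexdef from pg_indexes where tablename = 'tablename';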

This is essentially the same procedure that reindex concurrently performs under the hood, but reindex concurrently is a bit cheaper since it does not need a lock for the index rename phase.


The widely known pg_repack also has a feature to reindex a table via its --only-indexes option, which is implemented as create + drop index concurrently.
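
A sketch of such an invocation, assuming a database named mydb (adjust the connection options to your setup):

pg_repack --dbname=mydb --table=tablename --only-indexes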


Is there a simple metric I can use to determine how much data will be required to be read to reindex fully?

Well, any index creation without concurrently will read the entire table sequentially (concurrently will read the table twice). Everything else depends on the access method. Btree will sort all live tuples; this is the most time-consuming part of create index, and for large indexes the sorting is done in temporary files (remember to increase maintenance_work_mem). This part also depends on the data types and values: a low-cardinality text column (e.g. some status field) will be noticeably slower to build than an integer sequence.
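
Since the sort dominates, it is worth raising maintenance_work_mem for the session that builds the index; the 4GB below is purely illustrative (size it to your available RAM), and the index definition is hypothetical:

set maintenance_work_mem = '4GB';
create index concurrently index_new on tablename (created_at);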

I have no way to estimate this, except for one approach: measure the creation time of an index on a data sample:

-- build a sample table from a recent slice of the data
create table estimate_table as (
  select * from tablename
  where created_at > '2020-01-01'
);
-- check the sample's size so the result can be scaled up
\dt+ estimate_table
-- time the index build on the sample
\timing on
create index on estimate_table ...
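
To scale the measurement up (rough arithmetic of my own, not part of the original answer): if the sample is s GB and its index builds in t seconds, a first-order estimate for the full 200GB table is t × (200 / s) per index. The btree sort is O(n log n), so real times grow slightly faster than linearly; treat the ratio as a lower bound and add some margin.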

Reindex is just a special form of index creation. One important point: reindex table is no different from several reindex index commands in terms of resource usage, because reindex table is implemented by calling reindex_index for each individual index on the table. So a table with 5 indexes will be scanned 5 times.
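
Since reindex table just loops over the indexes, you can also split the work across several maintenance windows by reindexing one index at a time (the index names below are hypothetical):

reindex index tablename_created_at_idx;
-- next maintenance window:
reindex index tablename_status_idx;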

OTHER TIPS

The only reliable estimate of how long it will take can come from restoring a physical backup to an identical machine and testing it there.

There are too many factors going into this to come up with a good estimate otherwise.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange