Question

We have a cluster with several dozen databases. It ran great on 9.3 for a year (growing over time) until we upgraded to 9.6 last month. Since then, the database freezes on updates of large tables (tables with 100k+ rows), for anywhere from a minute up to 30-60 minutes.

Specifically, one of our processes updates several columns one at a time in such a table, and in the code log I can see where it pauses and hangs, for up to 30 minutes, then picks up again as if nothing happened and updates many consecutive columns in the same table with no such hang, within seconds.

Increasing logging in the database showed nothing at first, and analysis of memory usage on the server and in Postgres showed nothing unusual. Eventually, with debug-level logging, I was able to see that PostgreSQL would cycle through vacuuming all the databases, and the query in question would un-stick after the autovacuum worker hit that database - though not on its first pass over all the databases, but after several passes. Since then, I have been experimenting with the autovacuum settings. I tried turning autovacuum off, as it was off on 9.3 and somehow we survived, but that did not help. So I turned it back on and have been changing various settings as I learn about them. I got it to the point where the production server no longer hangs in this way, through frequent autovacuuming, adding 16 workers, etc.
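(For context, an easier way to watch this than debug-level logging is to query pg_stat_user_tables for dead-tuple counts and the last autovacuum time per table; this is only a rough sketch of such a query, not my exact monitoring query:)

    SELECT relname,
           n_live_tup,
           n_dead_tup,          -- dead row versions waiting to be reclaimed
           last_autovacuum,     -- when autovacuum last processed this table
           last_autoanalyze
    FROM   pg_stat_user_tables
    ORDER  BY n_dead_tup DESC
    LIMIT  20;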

I am pasting the general settings for our db server below - the memory settings were generated with pgtune; the autovacuum settings I have essentially been tinkering with.

I am learning, but am very much shooting in the dark - what I am after is guidance on how best to analyze (through logs, etc.) and optimize autovacuuming on this 9.6 instance.

Various queries from the official site and from here that are designed to show locks don't turn up any waiting processes at all - but while the column-update query is running, I do see at least one entry each for a process holding an 'AccessShareLock' and a 'RowExclusiveLock' in the pgAdmin 4 dashboard, under Locks.
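(For reference, the kind of lock check I mean looks roughly like the following - a sketch along the lines of the usual examples, not necessarily the exact queries I ran; pg_blocking_pids() exists as of 9.6:)

    SELECT a.pid,
           pg_blocking_pids(a.pid) AS blocked_by,   -- pids blocking this session
           l.mode,                                   -- e.g. AccessShareLock, RowExclusiveLock
           l.granted,
           a.wait_event_type,
           left(a.query, 60) AS query
    FROM   pg_locks l
    JOIN   pg_stat_activity a ON a.pid = l.pid
    WHERE  NOT l.granted
       OR  cardinality(pg_blocking_pids(a.pid)) > 0;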

Again, what is frustrating is that I don't see what is so complicated about updating one column at a time over a few hundred thousand records - it should be easy to see what is causing the hang (something to do with autovacuuming) and how to fix it. While the settings below seem to be working in production, I am not clear on why, I don't know how to improve them, and I don't know how to build a working configuration for a dev server. Your help is appreciated.

autovacuum = on
log_autovacuum_min_duration = 200
autovacuum_max_workers = 16
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 5000

#autovacuum_analyze_threshold = 500
#autovacuum_vacuum_scale_factor = 0.2
#autovacuum_analyze_scale_factor = 0.1
#autovacuum_freeze_max_age = 200000000
#autovacuum_multixact_freeze_max_age =

autovacuum_vacuum_cost_delay = 20ms
autovacuum_vacuum_cost_limit = 2000
default_statistics_target = 100
maintenance_work_mem = 2920MB
checkpoint_completion_target = 0.8
effective_cache_size = 22GB
work_mem = 260MB
wal_buffers = 16MB
shared_buffers = 7680MB
min_wal_size = 80MB
max_wal_size = 2GB

Solution

I'd suggest you change your strategy.

Specifically, one of our processes updates several columns one at a time in such a table, and in the code log I can see where it pauses and hangs, for up to 30 minutes, then picks up again as if nothing happened and updates many consecutive columns in the same table with no such hang, within seconds.

One such update can easily double the size of your table, since each UPDATE is basically equivalent to one DELETE plus one INSERT, which leaves each old row version occupying useless space until the table is VACUUMed and that space can be reused by the next round of UPDATEs.
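You can watch this bloat happen directly. A rough sketch (my_big_table and some_column are just placeholder names, and I'm assuming a text column for the example expression):

    -- size before the update
    SELECT pg_size_pretty(pg_table_size('my_big_table'));

    UPDATE my_big_table SET some_column = upper(some_column);  -- rewrites every row

    -- size after: roughly twice as large; the old row versions are now dead
    SELECT pg_size_pretty(pg_table_size('my_big_table'));

    VACUUM my_big_table;  -- marks the dead space reusable (does not shrink the file)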

You can do:

  1. Change the strategy and update as many columns as you can in a single UPDATE. Your process will be much quicker because the data will be accessed far fewer times.

  2. Issue a VACUUM from your code after each UPDATE round, so that the unused space is available for the next UPDATE round.

  3. Under some circumstances, doing the UPDATEs in batches of (let's say) 1,000 rows and committing after each batch might ease the situation.

Any combination of these three techniques will probably ease the updates; a rough sketch combining them follows.
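Here is what the three points could look like put together - the table name, key ranges and SET expressions are placeholders, and the commit between batches would normally be issued by your application code:

    -- One pass sets all the columns at once, in keyed batches.
    UPDATE my_big_table
    SET    col_a = upper(col_a),       -- placeholder expressions: set every
           col_b = col_b * 1.1,        -- column you need in one statement
           col_c = now()
    WHERE  id BETWEEN 1 AND 1000;      -- batch 1; commit before the next batch

    UPDATE my_big_table
    SET    col_a = upper(col_a),
           col_b = col_b * 1.1,
           col_c = now()
    WHERE  id BETWEEN 1001 AND 2000;   -- batch 2, and so on

    -- After a full round over the table, make the dead row versions reusable
    -- before the next round (plain VACUUM, not VACUUM FULL).
    VACUUM my_big_table;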

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange