Question

I am trying to help a team of junior, senior, principal and chief (mostly JEE) developers to become more data-centric and data-aware. In some cases we look into the data-processing costs, the complexity of the algorithms, the predictability of the results and the statistical robustness of the estimates behind query plans. In other cases we blindly believe that using indexes is always great and scanning tables is always bad. Sometimes we just opportunistically throw gazillions of insert, update and delete queries at the DB and hope for the best. When we run load tests afterwards, we notice that our tables and indexes are bloated beyond imagination, the tables have become pretty much unmanageable in size, and chaos rules the area.

A good way to proceed is to train: learn the complexity classes, understand the costs, and cultivate the right attitude. This change is very fruitful, but hard and slow. As long as I am breathing, I'll continue this journey.

For now we are trying to understand why autovacuum kicks in so seldom for some tables. We've got a Postgres server (v9.5, I believe) running in the Azure cloud (test environment). We pay for 10K IOPS, and we use them fully (we write to the DB like hell). In the last 24 hours I see that autovacuum ran only twice, for two large tables, according to

select * from pg_stat_all_tables order by last_autovacuum desc
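
A slightly richer variant of that query, also showing the dead-tuple counters that feed the autovacuum decision (columns as in pg_stat_all_tables), would be:

select schemaname,
       relname,
       n_live_tup,
       n_dead_tup,          -- dead rows waiting to be vacuumed away
       last_autovacuum,
       autovacuum_count     -- how often autovacuum has processed the table
from pg_stat_all_tables
order by n_dead_tup desc;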

In order to trigger an autovacuum, I created:

create table a(a int)
 
ALTER TABLE a SET (autovacuum_vacuum_scale_factor  = 0.0 );
ALTER TABLE a SET (autovacuum_vacuum_threshold     = 10  );
ALTER TABLE a SET (autovacuum_analyze_scale_factor = 0.0 );
ALTER TABLE a SET (autovacuum_analyze_threshold    = 10  );

and ran the following two statements multiple times:

delete from a;
insert into a (a) select generate_series(1,10);

This should have triggered an autovacuum on the table, but pg_stat_all_tables still shows NULL in the last_autovacuum column for table a.

We also set log_autovacuum_min_duration to a very low value (like 250ms or even 0), but the only two entries in the logs are:

postgresql-2021-02-18_010000.log:2021-02-18 01:56:29 UTC-602a2e9c.284-LOG:  automatic vacuum of table "asc_rs.pg_toast.pg_toast_3760410": index scans: 1
postgresql-2021-02-18_060000.log:2021-02-18 06:35:47 UTC-602a2e9c.284-LOG:  automatic vacuum of table "asc_rs.pg_toast.pg_toast_3112937": index scans: 1
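
To see whether an autovacuum is currently running and how far it has got, pg_stat_progress_vacuum (available from 9.6 on, so not if the server really is still 9.5) can be queried, for example:

select p.pid,
       a.query,              -- shows which table the worker is processing
       p.phase,              -- e.g. "scanning heap", "vacuuming indexes"
       p.heap_blks_scanned,
       p.heap_blks_total
from pg_stat_progress_vacuum p
join pg_stat_activity a using (pid);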

Our settings are:

[screenshot: settings for autovacuum]

We have a feeling that autovacuum is killed on large tables because of row locks. Can we log this information in any way? Can we also log (failing) autovacuum attempts? How does Postgres decide to start an autovacuum job (or, more generally, how does it trade off regular DB changes against maintenance jobs) on a very-high-load system? If the parameters for kicking off autovacuum are met, will it definitely be started, or will it wait until the I/O load decreases?
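
For reference, the documented rule is that autovacuum vacuums a table once n_dead_tup exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, with per-table storage parameters (like the ones set above) overriding the global values. A rough way to eyeball that trigger point, ignoring per-table overrides:

select s.relname,
       s.n_dead_tup,                                           -- dead rows accumulated so far
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples
         as vacuum_trigger_point                                -- autovacuum should start once n_dead_tup exceeds this
from pg_stat_user_tables s
join pg_class c on c.oid = s.relid
order by s.n_dead_tup desc;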

EDIT:

We do not see any errors or aborted autovacuums in the logs (thanks, Laurenz Albe), i.e. no entries like:

DETAIL:  while scanning block 5756 of relation "laurenz.vacme"


OTHER TIPS

First figure out what PostgreSQL version you have. If it is truly 9.5, upgrade before you do anything else. I mean it. The rest of my answer deals with recent PostgreSQL versions.

A row lock won't block autovacuum progress (unless it is an anti-wraparound vacuum), but a table lock can. These are logged if you set log_autovacuum_min_duration to something other than -1:

LOG:  skipping vacuum of "vacme" --- lock not available

or

ERROR:  canceling autovacuum task
DETAIL:  while scanning block 5756 of relation "laurenz.vacme"

Keep all autovacuum settings at default values, except autovacuum_vacuum_cost_delay, which you set to 2ms.
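
If you manage the server yourself (on a managed offering such as Azure Database for PostgreSQL you would change these as server parameters in the portal instead), a minimal sketch of applying both settings is:

ALTER SYSTEM SET log_autovacuum_min_duration = 0;       -- log every autovacuum run, and runs skipped due to lock conflicts
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';  -- the recommendation above; the default was 20ms before v12
SELECT pg_reload_conf();                                -- both settings take effect on reload, no restart needed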

Now why is your autovacuum not running? Some ideas:

  • You didn't commit the transaction that deleted the rows.

  • The autovacuum workers are doing something else. Check pg_stat_activity (see the query sketch after this list).

  • Check for oddities in the log file.
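
For the pg_stat_activity check above, one possible query (on version 10 or later, where backend_type exists) is:

select pid,
       backend_type,         -- 'autovacuum worker' for autovacuum processes
       state,
       wait_event_type,      -- non-null when the worker is waiting, e.g. on a lock
       wait_event,
       query                 -- e.g. "autovacuum: VACUUM public.some_table"
from pg_stat_activity
where backend_type = 'autovacuum worker';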
