Dedicated database server heavy iowait spikes

https://stackoverflow.com/questions/11448682

20-06-2021

Question

We have a dedicated database server running PostgreSQL 8.3 on Debian Linux. The database is regularly queried for large amounts of data, and updates and inserts also happen frequently. Periodically the database stops responding for a short time (around 10 seconds) and then resumes normal operation.

What I noticed in top is that there is an iowait spike during that time, lasting for as long as the database does not respond. At the same time pdflush becomes active. My theory is that pdflush is writing data from the page cache back to disk, triggered by the dirty page ratio and the dirty background ratio. The rest of the time, when PostgreSQL works normally, there is no iowait because pdflush is not active. My vm settings are the following:

 dirty_background_ratio = 5
 dirty_ratio = 10
 dirty_expire_centisecs = 3000

My meminfo:

MemTotal:     12403212 kB
MemFree:       1779684 kB
Buffers:        253284 kB
Cached:        9076132 kB
SwapCached:          0 kB
Active:        7298316 kB
Inactive:      2555240 kB
SwapTotal:     7815544 kB
SwapFree:      7814884 kB
Dirty:            1804 kB
Writeback:           0 kB
AnonPages:      495028 kB
Mapped:        3142164 kB
Slab:           280588 kB
SReclaimable:   265284 kB
SUnreclaim:      15304 kB
PageTables:     422980 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
WritebackTmp:        0 kB
CommitLimit:  14017148 kB
Committed_AS:  3890832 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    304188 kB
VmallocChunk: 34359433983 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize:     2048 kB
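To put the ratios above into absolute numbers, here is a rough sketch of the arithmetic. It assumes that MemFree + Cached is a fair stand-in for the memory the kernel counts as dirtyable; the kernel's real accounting differs in detail, so treat the results as ballpark figures only. The function name is made up for illustration.

```python
# Values copied from the meminfo dump above (kB).
MEM_FREE_KB = 1779684
CACHED_KB = 9076132

# vm settings quoted above (percent).
DIRTY_BACKGROUND_RATIO = 5
DIRTY_RATIO = 10


def dirty_threshold_kb(dirtyable_kb: int, ratio_percent: int) -> int:
    """Return ratio_percent of dirtyable_kb, rounded down to whole kB."""
    return dirtyable_kb * ratio_percent // 100


# Assumption: MemFree + Cached approximates the kernel's "dirtyable" memory.
dirtyable_kb = MEM_FREE_KB + CACHED_KB
background_kb = dirty_threshold_kb(dirtyable_kb, DIRTY_BACKGROUND_RATIO)
blocking_kb = dirty_threshold_kb(dirtyable_kb, DIRTY_RATIO)

print(f"background writeback starts near {background_kb / 1024:.0f} MB dirty")
print(f"writers start to block near {blocking_kb / 1024:.0f} MB dirty")
```

Under that assumption the thresholds come out at roughly 530 MB and 1060 MB, while the Dirty value in the dump above is only 1804 kB. A single idle snapshot says nothing about how much dirty data piles up just before a stall, though.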

I am thinking of tweaking how long a dirty page may stay in memory (dirty_expire_centisecs) to spread the iowait spikes more evenly over time, that is, have pdflush run more often and write smaller chunks of data to disk. Are there any other proposed solutions?
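Before changing any vm setting, it may be worth watching how Dirty and Writeback evolve around a stall. The following is a minimal sketch that polls /proc/meminfo; the function names, the one-second interval and the sample count are arbitrary choices for illustration, not anything prescribed by the kernel or PostgreSQL.

```python
import time


def read_dirty_kb(meminfo_text: str) -> dict:
    """Extract the Dirty and Writeback counters (in kB) from meminfo text."""
    wanted = {"Dirty", "Writeback"}
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:  # exact match, so WritebackTmp is skipped
            values[name] = int(rest.split()[0])
    return values


def sample(interval_s: float = 1.0, count: int = 60) -> None:
    """Print one timestamped Dirty/Writeback reading per interval."""
    for _ in range(count):
        with open("/proc/meminfo") as f:
            reading = read_dirty_kb(f.read())
        print(time.strftime("%H:%M:%S"), reading)
        time.sleep(interval_s)
```

Calling sample() on the server while a stall occurs should show whether Dirty climbs toward the thresholds and then drains while Writeback rises, which is what the pdflush theory would predict.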

Solution

I/O spikes are likely to happen when PostgreSQL is checkpointing. You can verify this by logging checkpoints (log_checkpoints) and checking whether they coincide with the periods when the server does not respond.
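As a sketch of how that comparison could be automated: the code below assumes log_checkpoints = on and log_line_prefix = '%t ', so that each log line starts with a timestamp followed by a time zone abbreviation, and that the checkpoint messages begin with "checkpoint starting" and "checkpoint complete". The function names are hypothetical, and the stall times would have to come from your own monitoring.

```python
import re
from datetime import datetime

# Assumed line shape: "<YYYY-MM-DD HH:MM:SS> <TZ> LOG:  checkpoint starting: ..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ LOG:\s+"
    r"checkpoint (starting|complete)"
)


def checkpoint_windows(log_lines):
    """Return (start, end) datetime pairs, one per completed checkpoint."""
    windows, start = [], None
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if m.group(2) == "starting":
            start = ts
        elif start is not None:
            windows.append((start, ts))
            start = None
    return windows


def stalls_during_checkpoints(stalls, windows):
    """Return the stall times that fall inside any checkpoint window."""
    return [s for s in stalls if any(a <= s <= b for a, b in windows)]
```

If most recorded stalls land inside a checkpoint window, checkpoints are the likely cause. If few do, the pdflush settings deserve a closer look instead.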

If that's the case, tuning checkpoint_segments and checkpoint_completion_target is likely to help. See the PostgreSQL wiki's advice on this and the documentation on WAL configuration.
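Raising checkpoint_segments costs disk space in pg_xlog, so it helps to estimate that first. The sketch below uses the rule of thumb from the 8.3 WAL configuration docs, that there will normally be no more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 segment files of 16 MB each. The candidate values other than the defaults (3 and 0.5) are purely illustrative, not recommendations.

```python
import math

WAL_SEGMENT_MB = 16  # default WAL segment size


def max_wal_files(checkpoint_segments: int, completion_target: float) -> int:
    """Upper estimate of WAL segment files normally kept in pg_xlog."""
    return math.ceil((2 + completion_target) * checkpoint_segments + 1)


# First pair is the 8.3 default; the others are illustrative candidates.
for segments, target in [(3, 0.5), (16, 0.9), (32, 0.9)]:
    files = max_wal_files(segments, target)
    print(
        f"checkpoint_segments={segments}, "
        f"checkpoint_completion_target={target}: "
        f"up to about {files} files, {files * WAL_SEGMENT_MB} MB"
    )
```

More segments mean fewer checkpoints, and a higher completion target spreads each checkpoint's writes over a longer period. Both should reduce the burst of I/O that each checkpoint produces.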

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow