Question

I have a MongoDB cluster with 2 data nodes and 1 arbiter. I run MongoDB 4.0.7 on all my CentOS VMs.

A few days ago one of my servers (let's call it data-2) had a fatal crash and required a complete resync of its data. After restarting mongod on data-2 the resync started. However, immediately after that the RAM usage on data-1 (the primary) started to skyrocket:

data-1 memory consumption rises continuously

However on data-2 the memory consumption was near constant:

data-2 memory consumption remains the same

During normal usage, the memory consumption on data-1 stays close to what data-2 shows during the resync.

After a few hours the worst case scenario happened and the last remaining data holder (data-1) got OOM-killed by the kernel after using the entire RAM + swap (~50 GB). I could restore data-1 as primary with little effort, but every time I try to start the resync the same thing happens again.

This behavior seems to be unrelated to actual database usage during the resync. (Taking a production DB down for a resync is definitely a no-go.)

Total data size ~500GB, biggest database 480GB with ~300 collections.

Now my questions:

  • What uses that much memory during the resync?
  • How do I effectively analyze the memory consumption?
  • How do I prevent the primary from going down because of the resync?

Solution

What uses that much memory during the resync?

Many things use memory during the resync. As SpiXel has clearly explained here, the WiredTiger cache setting only controls the size of memory directly used by the WiredTiger storage engine (not the total memory used by mongod). Many other things potentially take memory in a MongoDB/WiredTiger configuration, such as the following:

  • WiredTiger compresses disk storage, but the data in memory are uncompressed.
  • WiredTiger by default does not fsync the data on each commit
  • WiredTiger keeps multiple versions of records in its cache
  • WiredTiger keeps checksums of the data in cache
  • MongoDB itself consumes memory to handle open connections, aggregations, server-side code, etc.
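To see how much of mongod's resident memory is attributable to the WiredTiger cache versus everything else (connections, in-flight sync batches, allocator overhead), serverStatus already exposes the relevant counters. The following is a minimal sketch for the mongo shell on MongoDB 4.0, meant to be run on data-1 while the resync is in progress; the field names exist in 4.0, but the "memory outside WT cache" figure is only an approximation:

var ss = db.serverStatus();
var residentMB   = ss.mem.resident;  // resident set size of mongod, in MB
var wtCacheMB    = ss.wiredTiger.cache["bytes currently in the cache"] / 1024 / 1024;
var wtCacheMaxMB = ss.wiredTiger.cache["maximum bytes configured"] / 1024 / 1024;

print("resident (MB):            " + residentMB);
print("WT cache used/max (MB):   " + wtCacheMB.toFixed(0) + " / " + wtCacheMaxMB.toFixed(0));
print("memory outside WT cache:  " + (residentMB - wtCacheMB).toFixed(0) + " MB");
print("open connections:         " + ss.connections.current);
if (ss.tcmalloc) print("tcmalloc heap (MB):       " + (ss.tcmalloc.generic.heap_size / 1024 / 1024).toFixed(0));

If the "memory outside WT cache" number is what keeps growing during the resync, capping the WiredTiger cache alone will not save data-1.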

The storage.wiredTigerCacheSizeGB setting defines the maximum size of the internal cache that WiredTiger will use for all data.

Changed in version 3.4: Values can range from 256MB to 10TB and can be a float. In addition, the default value has also changed.

Starting in 3.4, the WiredTiger internal cache, by default, will use the larger of either:

50% of (RAM - 1 GB), or
256 MB.

For example, on a system with a total of 4GB of RAM the WiredTiger cache will use 1.5GB of RAM (0.5 * (4 GB - 1 GB) = 1.5 GB). Conversely, a system with a total of 1.25 GB of RAM will allocate 256 MB to the WiredTiger cache because that is more than half of the total RAM minus one gigabyte (0.5 * (1.25 GB - 1 GB) = 128 MB < 256 MB).
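As a quick sanity check you can reproduce that formula in the shell and compare it with what the running instance actually configured. A small sketch (the function is only an illustration of the rule quoted above; the serverStatus value is authoritative):

// MongoDB >= 3.4 default: the larger of 50% of (RAM - 1 GB) or 256 MB
function defaultWiredTigerCacheGB(totalRamGB) {
    return Math.max(0.5 * (totalRamGB - 1), 0.256);
}

defaultWiredTigerCacheGB(4);     // 1.5   (matches the 4 GB example)
defaultWiredTigerCacheGB(1.25);  // 0.256 (the 256 MB floor wins)

// What the running instance actually uses as its limit:
var maxBytes = db.serverStatus().wiredTiger.cache["maximum bytes configured"];
print("configured cache limit: " + (maxBytes / 1024 / 1024 / 1024).toFixed(2) + " GB");

If the default is too aggressive for a box that also needs headroom for connections and the resync itself, the limit can be set explicitly via storage.wiredTigerCacheSizeGB in mongod.conf, or adjusted at runtime through the wiredTigerEngineRuntimeConfig server parameter.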

How do I effectively analyze the memory consumption?

As per the MongoDB performance blog post here by Casey Dunham, MongoDB performance is a huge topic encompassing many areas of system activity. By default, MongoDB will reserve 50 percent of the available memory for the WiredTiger data cache. The size of this cache is important to ensure WiredTiger performs adequately. It’s worth taking a look to see if you should alter it from the default. A good rule of thumb is that the size of the cache should be big enough to hold the entire application working set.

How do we know whether to alter it? Look at the cache usage statistics:

> db.serverStatus().wiredTiger.cache
{
    "tracked dirty bytes in the cache" : <num>,
    "tracked bytes belonging to internal pages in the cache" : <num>,
    "bytes currently in the cache" : <num>,
    "tracked bytes belonging to leaf pages in the cache" : <num>,
    "maximum bytes configured" : <num>,
    "tracked bytes belonging to overflow pages in the cache" : <num>,
    "bytes read into cache" : <num>,
    "bytes written from cache" : <num>,
    "pages evicted by application threads" : <num>,
    "checkpoint blocked page eviction" : <num>,
    "unmodified pages evicted" : <num>,
    "page split during eviction deepened the tree" : <num>,
    "modified pages evicted" : <num>,
    "pages selected for eviction unable to be evicted" : <num>,
    "pages evicted because they exceeded the in-memory maximum" : <num>,
    "pages evicted because they had chains of deleted items" : <num>,
    "failed eviction of pages that exceeded the in-memory maximum" : <num>,
    "hazard pointer blocked page eviction" : <num>,
    "internal pages evicted" : <num>,
    "maximum page size at eviction" : <num>,
    "eviction server candidate queue empty when topping up" : <num>,
    "eviction server candidate queue not empty when topping up" : <num>,
    "eviction server evicting pages" : <num>,
    "eviction server populating queue, but not evicting pages" : <num>,
    "eviction server unable to reach eviction goal" : <num>,
    "internal pages split during eviction" : <num>,
    "leaf pages split during eviction" : <num>,
    "pages walked for eviction" : <num>,
    "eviction worker thread evicting pages" : <num>,
    "in-memory page splits" : <num>,
    "in-memory page passed criteria to be split" : <num>,
    "lookaside table insert calls" : <num>,
    "lookaside table remove calls" : <num>,
    "percentage overhead" : <num>,
    "tracked dirty pages in the cache" : <num>,
    "pages currently held in the cache" : <num>,
    "pages read into cache" : <num>,
    "pages read into cache requiring lookaside entries" : <num>,
    "pages written from cache" : <num>,
    "page written requiring lookaside records" : <num>,
    "pages written requiring in-memory restoration" : <num>
}

There’s a lot of data here, but we can focus on the following fields:

  • wiredTiger.cache.maximum bytes configured – This is the maximum cache size.
  • wiredTiger.cache.bytes currently in the cache – This is the size of the data currently in the cache. This should not be greater than the maximum bytes configured.
  • wiredTiger.cache.tracked dirty bytes in the cache – This is the size of the dirty data in the cache. This value should be less than the bytes currently in the cache value.

Looking at these values, we can determine if we need to up the size of the cache for our instance. Additionally, we can look at the wiredTiger.cache.bytes read into cache value for read-heavy applications. If this value is consistently high, increasing the cache size may improve overall read performance.
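A small helper along these lines makes that comparison easier than eyeballing raw byte counts (just a sketch; the field names come from the serverStatus output above, and the thresholds in the comment are rules of thumb, not hard limits):

var c = db.serverStatus().wiredTiger.cache;
var maxBytes   = c["maximum bytes configured"];
var usedBytes  = c["bytes currently in the cache"];
var dirtyBytes = c["tracked dirty bytes in the cache"];

print("cache used:  " + (100 * usedBytes  / maxBytes).toFixed(1) + " % of " + (maxBytes / 1024 / 1024 / 1024).toFixed(1) + " GB");
print("cache dirty: " + (100 * dirtyBytes / maxBytes).toFixed(1) + " %");
// Sustained usage near 95 %, or dirty pages above roughly 20 %, usually means eviction cannot keep up.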

How do I prevent the primary from going down because of the resync?

As per the MongoDB documentation here, there might be a situation where the primary becomes inaccessible. When a primary does not communicate with the other members of the set for more than 10 seconds, an eligible secondary will hold an election to elect itself the new primary. The first secondary to hold an election and receive a majority of the members’ votes becomes primary. Although the timing varies, the failover process generally completes within a minute. For instance, it may take 10-30 seconds for the members of a replica set to declare a primary inaccessible. One of the remaining secondaries then holds an election to elect itself as the new primary. The election itself may take another 10-30 seconds. While an election is in progress, the replica set has no primary, cannot accept writes, and all remaining members become read-only.
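To see the thresholds and member states that drive this failover behaviour on your own replica set, you can inspect the configuration and status from the shell (a sketch; settings.electionTimeoutMillis defaults to 10000 ms):

var conf = rs.conf();
print("electionTimeoutMillis: " + conf.settings.electionTimeoutMillis);

rs.status().members.forEach(function (m) {
    // stateStr is PRIMARY, SECONDARY, STARTUP2 (initial sync), ARBITER, ...
    print(m.name + "  state=" + m.stateStr + "  health=" + m.health);
});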

For further reference, see here, here and here.

OTHER TIPS

So whenever a secondary member becomes stale, i.e. it falls so far behind the primary's oplog that the entries it still needs have been overwritten, mongod will completely resync the stale member by removing its data and performing an initial sync.
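Before kicking off yet another full resync it is worth checking how much oplog headroom the primary actually has and, if the window is too small for a ~500 GB copy, growing the oplog so data-2 does not go stale again mid-sync. A sketch for MongoDB 4.0 (replSetResizeOplog takes the new size in MB; 51200 is only an example value):

// On the primary: how many hours of writes does the oplog currently cover?
var info = db.getReplicationInfo();
print("oplog size (MB):  " + info.logSizeMB);
print("oplog window (h): " + (info.timeDiff / 3600).toFixed(1));

// If the window is shorter than the expected initial sync duration, the oplog
// can be grown online on WiredTiger (size is in MB):
// db.adminCommand({ replSetResizeOplog: 1, size: 51200 })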

As per MongoDB documentation,

When you perform an initial sync, MongoDB:

Clones all databases except the local database. To clone, the mongod scans every collection in each source database and inserts all data into its own copies of these collections.

Changed in version 3.4: Initial sync builds all collection indexes as the documents are copied for each collection. In earlier versions of MongoDB, only the _id indexes are built during this stage.

Changed in version 3.4: Initial sync pulls newly added oplog records during the data copy. Ensure that the target member has enough disk space in the local database to temporarily store these oplog records for the duration of this data copy stage.

Applies all changes to the data set. Using the oplog from the source, the mongod updates its data set to reflect the current state of the replica set.

When the initial sync finishes, the member transitions from STARTUP2 to SECONDARY.

In your case, the primary needs to clone the ~300 collections, which is why it is so heavily loaded. MongoDB also suggests the following options for performing the resync.

MongoDB provides two options for performing an initial sync:

Restart the mongod with an empty data directory and let MongoDB’s normal initial syncing feature restore the data. This is the simpler option but may take longer to replace the data.

See Automatically Sync a Member.

Restart the machine with a copy of a recent data directory from another member in the replica set. This procedure can replace the data more quickly but requires more manual steps.

See Sync by Copying Data Files from Another Member.
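Whichever option you choose, it helps to watch both sides while the sync runs so you can abort before data-1 gets anywhere near the OOM killer again. A minimal monitoring loop for the mongo shell (the 30-second interval and the 90 % threshold are arbitrary assumptions, not recommendations):

// Run against the primary (data-1) during the initial sync of data-2.
while (true) {
    var ss  = db.serverStatus();
    var c   = ss.wiredTiger.cache;
    var pct = 100 * c["bytes currently in the cache"] / c["maximum bytes configured"];

    print(new Date().toISOString() +
          "  resident=" + ss.mem.resident + " MB" +
          "  wtCache=" + pct.toFixed(1) + " %");

    if (pct > 90) print("WARNING: WiredTiger cache above 90 %, consider pausing the resync");
    sleep(30 * 1000);  // mongo shell helper, argument in milliseconds
}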

Licensed under: CC-BY-SA with attribution