MySQL 8.0.20 - Master Replica scheme, increasing delay between Source and Replica

https://dba.stackexchange.com/questions/285442

16-03-2021

Question

We have a Source database of about 1 TB and a Replica on another server (similar architecture). We have set up Master/Slave replication, and the delay between Master and Slave has been growing ever since. The Master server is heavily used by cron jobs that insert/update a lot of data, but after some research, I have to admit I'm a bit stuck.

Could you help me diagnose the cause of the problem? Many thanks for any hints.

Master / Slave server specs: 24 cores / 256 GB DDR4 / NVMe

htop says CPU usage: 6% / RAM usage: ~50% / IO: up: ~0,23 Mb/s, down: ~3,5 Mb/s

MySQL variables : https://justpaste.it/mysql_variables

Mysql > show replica status \G result:

             Replica_IO_State: Waiting for master to send event  <- most of the time
                  Source_Host: XX.XX.XX.XX (Source private IP)
                  Source_User: repl
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: binlog.001720
          Read_Source_Log_Pos: 1024908194                        <- increasing normally
               Relay_Log_File: Linux2-relay-bin.000296
                Relay_Log_Pos: 915186810                         <- increasing normally
        Relay_Source_Log_File: binlog.001676
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 915186601                         <- increasing normally
              Relay_Log_Space: 48272449293                       <- increasing normally
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File:
           Source_SSL_CA_Path:
              Source_SSL_Cert:
            Source_SSL_Cipher:
               Source_SSL_Key:
        Seconds_Behind_Source: 56693                             <- increasing slowly
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Source_Server_Id: 1
                  Source_UUID: ece58e3c-5ac0-11eb-ab4b-00505601ddac
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: waiting for handler commit        <- most of the time
           Source_Retry_Count: 86400
                  Source_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Source_SSL_Crl:
           Source_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name:
           Source_TLS_Version:
       Source_public_key_path:
        Get_Source_public_key: 0
            Network_Namespace:

Solution

Please remember that MySQL Replication applies changes on the Slave with a single thread by default.

It has two threads:

I/O Thread: transmits binlog events from the Master into the Slave's relay logs
SQL Thread: processes binlog events collected in the Slave's relay logs, in the order received
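
To see which of the two threads is falling behind, compare the binlog file the I/O thread is reading (Source_Log_File) with the one the SQL thread is executing (Relay_Source_Log_File). Below is a minimal Python sketch using the values from the status output in the question. The helper names binlog_index and files_behind are made up for illustration, and the sketch assumes binlog names end in a numeric suffix.

```python
def binlog_index(name: str) -> int:
    """Return the numeric suffix of a binlog file name, e.g. 'binlog.001720' -> 1720."""
    return int(name.rsplit(".", 1)[1])


def files_behind(status: dict) -> int:
    """Binlog files between what the I/O thread has read and what the SQL thread is applying."""
    return binlog_index(status["Source_Log_File"]) - binlog_index(status["Relay_Source_Log_File"])


# Values copied from the SHOW REPLICA STATUS output in the question.
status = {
    "Source_Log_File": "binlog.001720",
    "Relay_Source_Log_File": "binlog.001676",
    "Relay_Log_Space": 48272449293,
}

lag_files = files_behind(status)
# Relay_Log_Space is the total size of all relay logs, so this is only a rough backlog figure.
backlog_gib = status["Relay_Log_Space"] / 2**30

print(f"SQL thread is {lag_files} binlog files behind the I/O thread")
print(f"Relay logs on disk: {backlog_gib:.1f} GiB")
```

With the question's values the SQL thread is 44 binlog files behind while the I/O thread keeps up, which points at the applier on the Slave rather than at the network.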

From the looks of the variables, I would recommend just two temporary changes:

SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

This will loosen up the writes done by the SQL thread because you have binary logging on the Slave. Every transaction written on the Slave has to be logged and the binlogs flushed. Setting sync_binlog to 0 lets the OS cache and flush the binlogs rather than mysqld.

You also want to loosen the strictness of ACID compliance on the Slave by setting innodb_flush_log_at_trx_commit to 2. This causes the InnoDB storage engine to write the redo log at each commit but flush it to disk about once per second, rather than flushing on each transaction commit.

You should also set innodb_flush_method to O_DIRECT so that flushes of data pages are handled by mysqld rather than going through the OS cache.

I have mentioned these things before:

Dec 07, 2012 : Dynamic change to innodb_flush_log_at_trx_commit
Feb 10, 2012 : Is it safe to use innodb_flush_log_at_trx_commit = 2
May 04, 2011 : Clarification on MySQL innodb_flush_method variable

Since innodb_flush_method is not dynamic, you must restart mysqld.

Please add the following variable to your my.cnf

[mysqld]
innodb_flush_method = O_DIRECT

Then restart mysqld. After that, log in to mysql and run:

SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
START SLAVE; /* If replication does not start automatically */

Replication should start catching up. Once caught up, change the variables back with

SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
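
The apply-then-revert procedure above can be scripted so that the strict settings are always restored. What follows is a hedged Python sketch, not a definitive tool: relaxed_durability is a hypothetical helper, cursor is assumed to be any DB-API style cursor with an execute method whose user is allowed to run SET GLOBAL, and RecordingCursor is a stand-in so the sketch runs without a server.

```python
from contextlib import contextmanager

RELAXED = [
    "SET GLOBAL sync_binlog = 0",
    "SET GLOBAL innodb_flush_log_at_trx_commit = 2",
]
STRICT = [
    "SET GLOBAL sync_binlog = 1",
    "SET GLOBAL innodb_flush_log_at_trx_commit = 1",
]


@contextmanager
def relaxed_durability(cursor):
    """Apply the relaxed settings, then restore the strict ones on exit."""
    for stmt in RELAXED:
        cursor.execute(stmt)
    try:
        yield cursor
    finally:
        # Runs even if the code inside the with-block raises.
        for stmt in STRICT:
            cursor.execute(stmt)


class RecordingCursor:
    """Stand-in for a real cursor: records statements instead of sending them."""

    def __init__(self):
        self.statements = []

    def execute(self, stmt):
        self.statements.append(stmt)


cursor = RecordingCursor()
with relaxed_durability(cursor):
    pass  # here you would wait until the Slave has caught up

print("\n".join(cursor.statements))
```

Note that this sketch restores the values 1 and 1 as in the statements above, not whatever the server had before; adapt it if your baseline differs.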

UPDATE 2021-02-16 10:12 EST

Why would one think of using these variables on a temporary basis?

Over the years, I have had clients that would have one of the following setups:

sync_binlog=0 and innodb_flush_log_at_trx_commit=2 : This would be on read-only Slaves.
sync_binlog=1 and innodb_flush_log_at_trx_commit=1 : This would be for Slaves you would want to become a Master should you wish to fail over.

SCENARIO : Suppose you had a Master and 3 Slaves. You have the Master with sync_binlog=1 and innodb_flush_log_at_trx_commit=1. The Slaves would have sync_binlog=0 and innodb_flush_log_at_trx_commit=2.

If you wish to fail over, you would take one of the Slaves that is fully caught up (Seconds_Behind_Master=0) and set it to sync_binlog=1 and innodb_flush_log_at_trx_commit=1. Then point applications to that Slave, and set up the other Slaves to replicate from the newly promoted Master.
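
The "fully caught up" check in this scenario can be sketched as a small filter. This is only an illustration in Python: promotion_candidates and the Slave names are hypothetical, and the dictionaries stand in for values you would read from each Slave's status output.

```python
def promotion_candidates(replicas: dict) -> list:
    """Return the names of Slaves that report Seconds_Behind_Master = 0."""
    return sorted(
        name
        for name, status in replicas.items()
        # None models NULL, which is reported when replication is not running.
        if status.get("Seconds_Behind_Master") == 0
    )


# Hypothetical status values for three Slaves.
replicas = {
    "slave1": {"Seconds_Behind_Master": 0},
    "slave2": {"Seconds_Behind_Master": 56693},
    "slave3": {"Seconds_Behind_Master": None},
}

print(promotion_candidates(replicas))
```

Only slave1 qualifies here: slave2 is still lagging, and slave3 reports NULL, which must not be mistaken for zero lag.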

You can use things like ProxySQL / Orchestrator to set up such steps for you.

This is just one example of why these variables would be changed on a temporary basis.

CAPTAIN'S LOG : SUPPLEMENTAL

If you have the redo logs (ib_logfile0, ib_logfile1), binary logs, slow logs, or general logs stored in the same data volume as the data, they can slow down writes to the database. How so?

At the disk level:

All Logs are written sequentially
All data is usually written in random order

Storing logs on a separate disk volume can also speed up write performance. I learned this from a Facebook Engineer's Blog. (See also How do I determine how much data is being written per day through insert, update and delete operations? and MySQL on SSD - what are the disadvantages?)

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange