MySQL 8.0.20 - Master Replica scheme, increasing delay between Source and Replica
16-03-2021
Question
We have a Source database of about 1 TB and a Replica on another server (similar architecture). We have set up Master/Slave replication, and the delay between Master and Slave has been growing ever since. The Master server is heavily used by cron jobs that insert/update a lot of data, but after some research, I have to admit I'm a bit stuck.
Could you help me diagnose the cause of the problem? Many thanks for any hint
Master / Slave server specs: 24 cores / 256 GB DDR4 / NVMe
htop says CPU usage: 6% / RAM usage: ~50% / IO: up: ~0,23 Mb/s, down: ~3,5 Mb/s
MySQL variables : https://justpaste.it/mysql_variables
Mysql > show replica status \G
result:
Replica_IO_State: Waiting for master to send event <- most of the times
Source_Host: XX.XX.XX.XX (Source private IP)
Source_User: repl
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: binlog.001720
Read_Source_Log_Pos: 1024908194 <- increasing normally
Relay_Log_File: Linux2-relay-bin.000296
Relay_Log_Pos: 915186810 <- increasing normally
Relay_Source_Log_File: binlog.001676
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Source_Log_Pos: 915186601 <- increasing normally
Relay_Log_Space: 48272449293 <- increasing normally
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Source_SSL_Allowed: No
Source_SSL_CA_File:
Source_SSL_CA_Path:
Source_SSL_Cert:
Source_SSL_Cipher:
Source_SSL_Key:
Seconds_Behind_Source: 56693 <- increasing slowly
Source_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Source_Server_Id: 1
Source_UUID: ece58e3c-5ac0-11eb-ab4b-00505601ddac
Source_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Replica_SQL_Running_State: waiting for handler commit <- most of the times
Source_Retry_Count: 86400
Source_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Source_SSL_Crl:
Source_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Source_TLS_Version:
Source_public_key_path:
Get_Source_public_key: 0
Network_Namespace:
Solution
Please remember that MySQL Replication is single-threaded by nature when it comes to applying changes.
It has two threads:
- I/O Thread: transmits binlog events from the Master into the Slave's relay logs
- SQL Thread: processes binlog events collected in the Slave's relay logs in the order received
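To see these two threads on a running Slave, queries along the following lines may help. This is a minimal sketch to be run on the Slave. It assumes performance_schema is enabled, and the thread-name prefixes in the first query are an assumption that depends on the server version (older 8.0 releases use "slave", newer ones use "replica").

```sql
-- Replication threads as the server sees them, with their current state.
SELECT THREAD_ID, NAME, PROCESSLIST_STATE
FROM performance_schema.threads
WHERE NAME LIKE 'thread/sql/slave%'
   OR NAME LIKE 'thread/sql/replica%';

-- State of the I/O (connection) thread, per channel.
SELECT CHANNEL_NAME, SERVICE_STATE
FROM performance_schema.replication_connection_status;

-- State of the SQL (applier) thread; one row per worker if
-- multi-threaded apply is configured.
SELECT CHANNEL_NAME, WORKER_ID, SERVICE_STATE, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;
```

If the I/O thread keeps up while the applier state is mostly busy, as in the status output in the question, the bottleneck is on the apply side rather than the network.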
From the looks of the variables, I would recommend just two temporary changes:
SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
This will loosen up the writes done by the SQL threads because you have binary logging on the Slave. Every transaction written on the Slave has to be logged and the binlogs flushed. Setting sync_binlog to 0 lets the OS cache/flush rather than mysqld.
You also want to loosen the strictness of ACID compliance on the Slave by setting innodb_flush_log_at_trx_commit to 2. This will cause the InnoDB storage engine to write the redo log at each commit but flush it to disk about once per second, rather than flushing on each transaction commit.
You will also have to set innodb_flush_method to O_DIRECT to further assign flushes of pages to mysqld rather than the OS.
I have mentioned these things before:
- Dec 07, 2012: Dynamic change to innodb_flush_log_at_trx_commit
- Feb 10, 2012: Is it safe to use innodb_flush_log_at_trx_commit = 2
- May 04, 2011: Clarification on MySQL innodb_flush_method variable
Since innodb_flush_method is not dynamic, you must restart mysqld.
Please add the following variable to your my.cnf
[mysqld]
innodb_flush_method = O_DIRECT
Then, restart mysqld. After that login to mysql and run
SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
START SLAVE; /* If replication does not start automatically */
Replication should start catching up. Once caught up, change the variables back with
SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
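Since these changes are meant to be temporary, it can be worth checking the values actually in effect before and after each step. A minimal sketch, run on the Slave:

```sql
-- Current durability-related settings on this server.
SELECT @@GLOBAL.sync_binlog                    AS sync_binlog,
       @@GLOBAL.innodb_flush_log_at_trx_commit AS flush_log_at_trx_commit,
       @@GLOBAL.innodb_flush_method            AS flush_method,
       @@GLOBAL.log_bin                        AS binary_logging_enabled;
```

Values changed with SET GLOBAL do not survive a restart, so after a restart of mysqld the settings fall back to whatever is in my.cnf.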
UPDATE 2021-02-16 10:12 EST
Why would one think of using these variables on a temporary basis ?
Over the years, I have had clients that would have one of the following setups:
- sync_binlog=0 and innodb_flush_log_at_trx_commit=2: This would be on read-only slaves.
- sync_binlog=1 and innodb_flush_log_at_trx_commit=1: This would be for slaves you would want to become a Master should you wish to fail over.
SCENARIO : Suppose you had a Master and 3 Slaves. You have the Master with sync_binlog=1 and innodb_flush_log_at_trx_commit=1. The Slaves would have sync_binlog=0 and innodb_flush_log_at_trx_commit=2.
If you wish to have a failover, you would set up one of the Slaves that is fully caught up (Seconds_Behind_Master=0) to have sync_binlog=1 and innodb_flush_log_at_trx_commit=1. Then point applications to that newly promoted Slave and set up the other Slaves to replicate from it as the new Master.
You can use things like ProxySQL / Orchestrator to set up such steps for you.
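Done by hand, the scenario above might look roughly like the following. This is only a sketch under several assumptions: file/position-based replication without GTIDs (as in the question), writes to the old Master already stopped, and every Slave having applied the same events. The host name, user, log file and position are hypothetical placeholders; the real coordinates must come from SHOW MASTER STATUS on the promoted server.

```sql
-- On the Slave being promoted (fully caught up):
STOP SLAVE;
SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL read_only = OFF;      -- only relevant if it was running read-only
SHOW MASTER STATUS;              -- note File and Position for the other Slaves

-- On each remaining Slave (all values below are placeholders):
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST     = 'new-master.example',  -- hypothetical host
  MASTER_USER     = 'repl',
  MASTER_LOG_FILE = 'binlog.000001',       -- File from SHOW MASTER STATUS
  MASTER_LOG_POS  = 4;                     -- Position from SHOW MASTER STATUS
START SLAVE;
```

The replication password is left out on purpose; supply it the same way it was supplied when the Slaves were first set up.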
This is just one example of why these variables would be changed on a temporary basis.
CAPTAIN'S LOG : SUPPLEMENTAL
If you have the redo logs (ib_logfile0, ib_logfile1), binary logs, slow logs, or general logs stored in the same data volume as the data, they can slow down writes to the database. How so?
At the disk level
- All Logs are written sequentially
- All data is usually written in random order
Storing logs on a separate disk volume can also speed up write performance. I learned this from a Facebook Engineer's Blog. (See also How do I determine how much data is being written per day through insert, update and delete operations? and MySQL on SSD - what are the disadvantages?)
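To check whether the logs currently share a volume with the data, the configured locations can be listed first. A minimal sketch; relative paths such as ./ are resolved against datadir, and mapping each path to a physical volume still has to be done at the OS level:

```sql
-- Where the data and the various logs live on this server.
SELECT @@datadir                   AS data_dir,
       @@innodb_log_group_home_dir AS redo_log_dir,
       @@log_bin_basename          AS binary_log_base,
       @@relay_log_basename        AS relay_log_base,
       @@slow_query_log_file       AS slow_log_file,
       @@general_log_file          AS general_log_file;
```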