MySQL master-master: single table out of sync

https://dba.stackexchange.com/questions/215900

08-01-2021
Question

I have a MySQL 5.7 master-master setup. Everything works fine except for one error that arises from time to time.

If I run SHOW SLAVE STATUS\G on server2, I see an error on a table:

...
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table my_database_name.my_table_name; Can't find record in 'my_table_name', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000120, end_log_pos 83145706
...

I can temporarily skip the error by running:

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

or by setting

slave-skip-errors = 1032
skip-slave-start

in server2's my.cnf and then restarting MySQL.

However, I would like to fix it permanently so I checked in server1's binary logs and I found this:

### UPDATE `my_database_name`.`my_table_name`
### WHERE
###   @1=140
###   @2='2014:02:02'
###   @3=2878
###   @4=3253
###   @5=''
###   @6=0.00
###   @7=0
###   @8=35.75
###   @9=0.00
###   @10=0
###   @11=0
### SET
###   @1=140
###   @2='2014:02:02'
###   @3=2878
###   @4=3254
###   @5=''
###   @6=0.00
###   @7=0
###   @8=35.75
###   @9=0.00
###   @10=0
###   @11=0
# at 83145706

I see that it tries to set the 4th column to 3254, expecting the 4th column to currently hold 3253. However, when I look at that row on the two servers, each one holds a different value:

server1:

mysql> select * from my_table_name where my_table_name.id = 140;
+-----+------------+----------+----------+---------+-----------+---------+---------+-----------+---------------------+---------------------+
| id  | col2       | col3     | col4     | col5    | col6      | col7    | col8    | col9      | col10               | col11               |
+-----+------------+----------+----------+---------+-----------+---------+---------+-----------+---------------------+---------------------+
| 140 | 2014-02-02 |     2878 |     3254 |         |      0.00 |       0 |   35.75 |      0.00 | 0000-00-00 00:00:00 | 0000-00-00 00:00:00 |
+-----+------------+----------+----------+---------+-----------+---------+---------+-----------+---------------------+---------------------+

server2:

mysql> select * from my_table_name where my_table_name.id = 140;
+-----+------------+----------+----------+---------+-----------+---------+---------+-----------+---------------------+---------------------+
| id  | col2       | col3     | col4     | col5    | col6      | col7    | col8    | col9      | col10               | col11               |
+-----+------------+----------+----------+---------+-----------+---------+---------+-----------+---------------------+---------------------+
| 140 | 2014-02-02 |     2878 |     3257 |         |      0.00 |       0 |   35.75 |      0.00 | 0000-00-00 00:00:00 | 0000-00-00 00:00:00 |
+-----+------------+----------+----------+---------+-----------+---------+---------+-----------+---------------------+---------------------+
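
As a cross-check, the binlog excerpt above can be diffed mechanically against the values on each server. The following is a minimal Python sketch, not part of the original post: the helper names `parse_update_event`, `changed_columns` and `classify` are hypothetical, and the event text and the col4 values are copied from the output shown above.

```python
import re

# Row event in the "###" pseudo-SQL form, copied from the question.
BINLOG_EVENT = """\
### UPDATE `my_database_name`.`my_table_name`
### WHERE
###   @1=140
###   @2='2014:02:02'
###   @3=2878
###   @4=3253
###   @5=''
###   @6=0.00
###   @7=0
###   @8=35.75
###   @9=0.00
###   @10=0
###   @11=0
### SET
###   @1=140
###   @2='2014:02:02'
###   @3=2878
###   @4=3254
###   @5=''
###   @6=0.00
###   @7=0
###   @8=35.75
###   @9=0.00
###   @10=0
###   @11=0
"""


def parse_update_event(text):
    """Return (before, after): dicts mapping column position to its raw value."""
    before, after = {}, {}
    target = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "### WHERE":
            target = before  # the WHERE part is the before image
        elif line == "### SET":
            target = after   # the SET part is the after image
        else:
            match = re.fullmatch(r"###\s+@(\d+)=(.*)", line)
            if match and target is not None:
                target[int(match.group(1))] = match.group(2)
    return before, after


def changed_columns(before, after):
    """Columns whose value differs between the before and after image."""
    return {pos: (before[pos], after[pos])
            for pos in before if before[pos] != after.get(pos)}


def classify(value, column, before, after):
    """Say which image, if any, a server's current value corresponds to."""
    if value == after[column]:
        return "after image"
    if value == before[column]:
        return "before image"
    return "neither image"


before, after = parse_update_event(BINLOG_EVENT)
print(changed_columns(before, after))

# Current col4 values, taken from the two SELECT outputs above.
current_col4 = {"server1": "3254", "server2": "3257"}
for server, value in current_col4.items():
    print(server, "->", classify(value, 4, before, after))
```

Only column 4 differs between the two images. server1 already holds the after image, while server2 holds neither image, so on this reading server2 has no row matching what the event expects to update, which is consistent with error 1032.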

Considering it's master-master, how can I align the tables? I think I can't simply set the value 3253 hoping it's going to be updated by the binary log, can I?

If possible, I want to avoid re-syncing the whole database because it's really huge.

Thank you!

Solution 3

I think I've found a solution.

Let's say the correct value resides on server2. All I have to do is read it (here, col4 has the value 3257 on server2) and update it on server1 only.

To do this, I can use SET sql_log_bin = 0; which keeps the change out of the binary log for the current session, so it is applied on the local server only and is not replicated to the other one:

SET sql_log_bin = 0;
UPDATE my_table_name SET col4 = 3257 WHERE my_table_name.id = 140;
SET sql_log_bin = 1;

The tables are now aligned and the issue shouldn't happen anymore.

OTHER TIPS

You could use pt-table-checksum and pt-table-sync to perform an optimized synchronisation while the servers stay online; the tools back off to avoid causing replication lag.

Assuming you set replication up correctly at the start, this kind of error suggests that you are using master/master in an unintended/unsupported way. Master/master does not support concurrent updates of the same row by clients on both servers, because it (intentionally) does not resolve conflicts, and that is a conflict.

Query on A changes value from 2 to 3
Query on B changes the same value from 2 to 4
A sees replication event from B to change 2 to 4, impossible because value is already 3, not 2
B sees replication event from A to change 2 to 3, also impossible because value is already 4, not 2
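
The four steps above can be modelled in a few lines. This is a toy sketch in Python, not MySQL's actual applier: the `Server` class and its methods are hypothetical, and it assumes the applier must find a row equal to the event's full before image, which is the behaviour the error in the question suggests.

```python
class Server:
    """Toy co-master: a list of rows plus a log of (before, after) row images."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = [dict(row) for row in rows]
        self.binlog = []

    def _apply(self, before, after):
        # Find a row identical to the before image and replace it.
        for index, row in enumerate(self.rows):
            if row == before:
                self.rows[index] = dict(after)
                return
        raise LookupError(f"{self.name}: can't find record {before}")

    def client_update(self, before, after):
        """A local client write: apply it and record it for the other server."""
        self._apply(before, after)
        self.binlog.append((dict(before), dict(after)))

    def apply_events_from(self, other):
        """Replay the other server's events; collect failures instead of stopping."""
        errors = []
        for before, after in other.binlog:
            try:
                self._apply(before, after)
            except LookupError as exc:
                errors.append(str(exc))
        return errors


start = [{"id": 1, "value": 2}]
a = Server("A", start)
b = Server("B", start)

# Both clients write before either change has been replicated.
a.client_update({"id": 1, "value": 2}, {"id": 1, "value": 3})
b.client_update({"id": 1, "value": 2}, {"id": 1, "value": 4})

errors_on_a = a.apply_events_from(b)
errors_on_b = b.apply_events_from(a)

print(errors_on_a)
print(errors_on_b)
print(a.rows, b.rows)
```

Each server rejects the other's event because the row it expects no longer exists, and the two copies stay different (3 on A, 4 on B). Skipping the failed event on both sides would leave exactly this divergence in place.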

Skipping errors just ignores the problem.

This is a documented limitation.

What issues should I be aware of when setting up two-way replication?

MySQL replication currently does not support any locking protocol between master and slave to guarantee the atomicity of a distributed (cross-server) update. In other words, it is possible for client A to make an update to co-master 1, and in the meantime, before it propagates to co-master 2, client B could make an update to co-master 2 that makes the update of client A work differently than it did on co-master 1. Thus, when the update of client A makes it to co-master 2, it produces tables that are different from what you have on co-master 1, even after all the updates from co-master 2 have also propagated. This means that you should not chain two servers together in a two-way replication relationship unless you are sure that your updates can safely happen in any order, or unless you take care of mis-ordered updates somehow in the client code.

https://dev.mysql.com/doc/refman/5.7/en/faqs-replication.html#faq-replication-how-two-way-problems

There is no performance advantage to writing to two masters at the same time, since both masters must still handle all writes. Your solution may therefore involve sending all writes at any given time to a single master (both machines are still masters, but your application treats only one as such at any point in time). Alternatively, you may need to reconsider why you are using circular/ring replication and whether a Galera cluster would be a better option.
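
To illustrate the "only one master is written to at a time" idea, here is a minimal application-side sketch in Python. The `SingleWriterRouter` class is hypothetical and only picks a host name; it does not open connections or check replication state.

```python
class SingleWriterRouter:
    """Send every write to one active co-master; the other acts as a standby."""

    def __init__(self, masters, active):
        if active not in masters:
            raise ValueError(f"unknown master: {active}")
        self.masters = list(masters)
        self.active = active

    def target_for_write(self):
        # All writes go to the same server, so the same row is never
        # updated concurrently on both sides.
        return self.active

    def switch_active(self, new_active):
        # Assumption: the caller has first stopped writes and waited until
        # new_active has applied everything from the old active server.
        if new_active not in self.masters:
            raise ValueError(f"unknown master: {new_active}")
        self.active = new_active


router = SingleWriterRouter(["server1", "server2"], active="server1")
targets = [router.target_for_write() for _ in range(3)]

router.switch_active("server2")
targets.append(router.target_for_write())
print(targets)
```

Both servers remain configured as masters, so switching is only a change of target in the application, but at any moment all writes land on a single server.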

This issue can't be "fixed" if you are using the servers in an unintended way, because the behavior is expected.

Licensed under: CC-BY-SA with attribution
