MySQL replication slave hangs after encountering SET @@SESSION.GTID_NEXT= 'ANONYMOUS';

dba.stackexchange https://dba.stackexchange.com/questions/156496

  •  04-10-2020
  •  | 
  •  

Pergunta

I recently installed two identical default installations of MySQL 5.7 under Ubuntu Server 16.04, and configured them to do a binary log replication. Until now this has been working fine, but suddenly the replication stops continuing and the slave query thread runs at 100% CPU without doing any work.

After some searching I found that the slave status tells that it is way behind master. Using mysqlbinlog on the binlog file indicated by Relay_Master_Log_File and position Exec_Master_Log_Pos, I found out that the statement executed at this position is:

SET @@SESSION.GTID_NEXT= 'ANONYMOUS';

Somehow the slave hangs when trying to execute this statement, sending the CPU load to 100% (which is how I discovered the situation in the first place).

Except by having the slave skip that statement using SET GLOBAL sql_slave_skip_counter=1 it is unclear to me what is the actual cause of this issue, and how I should solve this.

Any help would be really appreciated!

Foi útil?

Solução

TL;DR: This is probably caused by poor table design combined with ROW-based replication.

I just ran into this problem. I was asked to move an old database to a new server and set up replication.

I found that it's not actually the statement in the subject that causes the slave to hang (SET @@SESSION.GTID_NEXT= 'ANONYMOUS'). This statement is issued at the beginning of a transaction.

mysql> SHOW BINLOG EVENTS IN 'mysql-bin.000196' FROM 96754384 LIMIT 5000;
+------------------+-----------+----------------+-----------+-------------+----------------------------------------------+
| Log_name         | Pos       | Event_type     | Server_id | End_log_pos | Info                                         |
+------------------+-----------+----------------+-----------+-------------+----------------------------------------------+
| mysql-bin.000196 |  96754384 | Anonymous_Gtid |         1 |    96754449 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS'         |
| mysql-bin.000196 |  96754449 | Query          |         1 |    96754537 | BEGIN                                        |
| mysql-bin.000196 |  96754537 | Table_map      |         1 |    96754608 | table_id: 241 (db.bad_table)                 |
| mysql-bin.000196 |  96754608 | Delete_rows    |         1 |    96762805 | table_id: 241                                |
| mysql-bin.000196 |  96762805 | Delete_rows    |         1 |    96771002 | table_id: 241                                |
...
| mysql-bin.000196 | 106681175 | Delete_rows    |         1 |   106689372 | table_id: 241                                |
| mysql-bin.000196 | 106689372 | Delete_rows    |         1 |   106697569 | table_id: 241                                |
| mysql-bin.000196 | 106697569 | Delete_rows    |         1 |   106697626 | table_id: 241 flags: STMT_END_F              |
| mysql-bin.000196 | 106697626 | Xid            |         1 |   106697657 | COMMIT /* xid=28382600 */                    |
| mysql-bin.000196 | 106697657 | Rotate         |         1 |   106697704 | mysql-bin.000197;pos=4                       |
+------------------+-----------+----------------+-----------+-------------+----------------------------------------------+
1219 rows in set (0.02 sec)

This table has 66 million rows. I found that it has no primary key or unique key. The query responsible for this uses an index scan on the master.

mysql> SHOW VARIABLES LIKE '%binlog_format%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW   |
+---------------+-------+
1 row in set (0.00 sec)

For the slave to replicate this with ROW-based replication, it needs to perform approximately 1200 full table scans on the slave. 1200 is probably a fairly small number here. It could be in the hundreds of thousands. The replication does actually work, but with this design, 'seconds_behind_master' will grow indefinitely.

I will add a primary key and partitioning to this table. I will also ask my colleagues to rewrite their code so bulk deletes are no longer necessary. This probably requires adding an additional column.

EDIT: I don't have enough points to comment on other posts, so I will add my comments here for now. I believe that issuing 'SET GLOBAL sql_slave_skip_counter = 1', as mentioned by others, will skip the entire transaction and lead to data inconsistencies. Correct me if I'm wrong.

A quick fix would be to change the binlog format to QUERY or MIXED. These formats can also lead to data inconsistencies, so I would recommend finding and fixing the root cause instead of changing the binlog format.

Outras dicas

I'm currently running into the same issue. RHEL6, Percona 5.7.15-9. It's a recurring problem, not a one-time thing, so skipping the statement is just a temporary fix. It'll be broken again soon in my case.

I did a bit more digging:

ps -eLo %cpu,pid,spid,cmd | grep <mysqld pid>

This got me the PID (really, thread ID) of all mysql threads, with the CPU usage of each. On my system, only one of them was relatively high (the slave SQL thread, I presume).

From that:

strace -s 128 -p <thread-pid>

This got me a continuous loop of read, pread, lseek calls. Each one has a file descriptor as its first argument... always the same one. So:

ls -l /proc/<thread-pid>/fd/<id-number-from-strace>

It's a symlink to the file being read. In my case, it was a random MyISAM .MYD file in a database. The contents of that table match the output of the strings from strace, though I couldn't tell this right away (read on).

I was unable to do anything with this table- "waiting for table level lock". I skipped the statement to get replication back up to date, and magically the table was usable again. I did a REPAIR, OPTIMIZE, and ANALYZE TABLE on it. None of them reported anything, but I'm hopeful that perhaps this will help anyway.

If not, I'll convert it to InnoDB.

I had same issue. Just skipped it, and all became good.

stop slave; set global sql_slave_skip_counter = 1; start slave;
Licenciado em: CC-BY-SA com atribuição
Não afiliado a dba.stackexchange
scroll top