Question

I am running a Master-Slave MySQL replication setup. As part of cleaning up old data, I ran a DELETE query that removes a huge number of records from the database. It ran fine on the Master server, but on the Slave it gives me the following error:

Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

The Slave machine is less powerful than the Master. How can I get past this?

The query is a single-statement DELETE. I am running MySQL 5.6.


Solution

You need to stop replication, make the Slave have the same specs as the Master, then start replication.

Make sure the Slave has no incoming connections. Otherwise, the SQL thread on the Slave will compete with incoming connections running SELECT queries against the same table you are running the DELETE on.

If you cannot reroute the incoming connections, you will have to rerun the DELETE in chunks (perhaps 5000 rows at a time) on the Slave locally.
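A minimal sketch of one such chunk (the table name `old_data` and the `created_at` cutoff are hypothetical placeholders; substitute your own schema and an indexed column for the WHERE clause):

```sql
-- Delete at most 5000 matching rows per run.
-- Rerun this statement until it reports 0 rows affected.
DELETE FROM old_data
WHERE  created_at < '2014-01-01'
LIMIT  5000;
```

Keeping each batch small means each one holds its locks only briefly, so the Slave's SQL thread is far less likely to hit the retry limit.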

As a last resort, rebuild the Slave (after scaling up the Slave's hardware and configs).

Other tips

There are multiple questions; I'll try to cover most of them.

Replication, until recently, was single-threaded. (Even now, multi-threaded replication has limitations.) So it was easy for a Master to do lots of things in parallel, but once those changes were sent to the Slave and replayed serially, the Slave would get behind. This is even worse if the Slave has slower hardware than the Master.

A single statement that deletes, say, a million rows might be replicated via SBR or RBR, depending on your binlog_format. The details are noticeably different:

SBR (Statement Based Replication): After the Master finishes the delete, the statement itself is quickly replicated. Replication (assuming single-threaded) hangs until the Slave can delete all million rows. This takes time. All subsequent replication events will sit and wait; the Slave gets "behind".

RBR (Row Based Replication): After the Master finishes the delete, a million one-row events are pumped through the network to the Slave. This adds overhead. But the Slave can (probably) perform the deletes faster because of the simplicity of the stream. Still, replication will be tied up for a non-trivial amount of time, during which the Slave will be "behind".
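You can see which format your Master is configured to use with:

```sql
-- Shows STATEMENT, ROW, or MIXED
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
```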

No amount of hardware can prevent getting "behind".

Meanwhile, any SELECTs on the Slave that are hitting that table are somewhat impacted, and vice versa. That is, the SELECTs may slow down the delete.

Big Deletes are a common problem. My blog describes several solutions. It includes details on how to do the "chunking" that Rolando suggested.

If you "chunk" and make each chunk its own transaction, there is less impact on both the Master and the Slave. The drawback is that a crash could leave the table with some chunks deleted, some not. (Rerunning the chunking delete will probably be simple and safe.)
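One way to get chunk-per-transaction behavior is a stored procedure that loops until nothing is left to delete; with autocommit on (the default), each DELETE commits on its own. This is only a sketch, with the table `old_data` and the date cutoff as hypothetical placeholders:

```sql
DELIMITER //
CREATE PROCEDURE purge_old_data()
BEGIN
  REPEAT
    -- Each iteration is its own transaction under autocommit,
    -- so locks are released between chunks.
    DELETE FROM old_data
    WHERE  created_at < '2014-01-01'
    LIMIT  5000;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END //
DELIMITER ;

CALL purge_old_data();
```

Because the WHERE clause always matches the not-yet-deleted rows, rerunning the procedure after a crash simply picks up where it left off.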

I assume the size (hence, length of time) of the delete led to the error message.
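The error's own suggestion, raising slave_transaction_retries (a dynamic global variable in MySQL 5.6, default 10), may let the SQL thread survive the contention, but it is a workaround rather than a fix:

```sql
-- Buys more retries; does not remove the underlying lock contention.
SET GLOBAL slave_transaction_retries = 128;
```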

Note that one of the suggestions in my blog is... If "most" of the table is being deleted, instead copy over the rows to keep, then do a rename to swap tables. A lot faster, etc.
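A sketch of that copy-and-swap technique (table names and the cutoff are placeholders; note that writes arriving during the copy need separate handling, as the blog discusses):

```sql
-- Build an empty copy with the same structure.
CREATE TABLE old_data_new LIKE old_data;

-- Copy only the rows to keep.
INSERT INTO old_data_new
SELECT * FROM old_data
WHERE  created_at >= '2014-01-01';

-- Atomically swap the tables, then drop the original.
RENAME TABLE old_data     TO old_data_gone,
             old_data_new TO old_data;
DROP TABLE old_data_gone;
```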

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange