Should I increase MaxBufferedEpochs in MySQL Cluster - (Disconnecting node because it has exceeded MaxBufferedEpochs)

https://stackoverflow.com/questions/22634601

20-06-2023
|

Question

I am using MySQL cluster to handle the parallel processing of a large data stream which occasionally requires big joins (millions of rows) and update/deletes/inserts of multiple (1000s) of rows at a time. It appears to be handling this quite comfortably most of the time.

However I occasionally run into a problem which seems to be caused by maxing out the allowed number of buffered epochs on the SQL nodes causing the node to disconnect. I cannot seem to find a pattern connecting this error with the type of queries being executed at the time of error.

I then get the following exception from mysql

SQLSTATE[HY000]: General error: 1297 Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER

and then hundreds of these errors from subsequent queries

SQLSTATE[HY000]: General error: 1296 Got error 157 'Unknown error code' from NDBCLUSTER

The MySQL Cluster log reports the following

INFO     -- Node 7: Disconnecting lagging nodes '0000000000000000000000000000000000000000000000000000000001000000',
WARNING  -- Node 7: Disconnecting node 24 because it has exceeded MaxBufferedEpochs (150 > 150), epoch 1758675/1
INFO     -- Node 7: Communication to Node 24 closed
INFO     -- Node 4: Communication to Node 24 closed
ALERT    -- Node 7: Node 24 Disconnected
...

According to the MySQL docs "The default value of 100 is sufficient for most normal operations". I increased the value of MaxBufferedEpochs to 150 but this has not helped.

I have seen people using configurations with very high numbers for MaxBufferedEpochs (10000) but this seems very high.

Can anybody explain why I might be getting this error and whether or not it is acceptable to increase this value by so many orders of magnitude? Also, is there anything else I could/should change in order to stop this from happening?

Solution

Have you increased your lock wait timeout value? If you have long running inserts/updates/selects and you have increased the lock wait timeout, the epoch buffers can fill up while waiting for these to exit. You should decrease the value and try to optimise your queries to not exceed the lock wait timeout.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow