Question

We have a HA RabbitMQ cluster (v3.2.x) with two nodes that sits behind a load-balancer. Our clients are configured to use a 300s heartbeat. Everything works as expected most of the time.

However, if the client's connection drops (say the client's NIC is disconnected), we have noticed (via TCPDump/wireshark) that the RabbitMQ node will attempt 3 heartbeat messages (in our case nearly 15 mins) before it closes the connection. Why? Why not close it after one failure?

Is there some means to change this behavior on the RabbitMQ server? Or do we have to shorten our heartbeat to something much smaller like 5s or 10s in order to get the connection to close sooner, thoughts?

Related issue...

Looking at the TCPDump (captured on load-balancer), I wonder why the LB doesn't close the connection when it doesn't receive the TCP-ACK from the dead client in response to the proxied RabbitMQ server heartbeat request? In fact, the LB will attempt to send the request several times (never receiving a response, of course). Wouldn't it make sense for the LB to make the assumption the connection has been dropped and close the entire session (including the connection to RabbitMQ node)?

Was it helpful?

Solution

It appears as though RabbitMQ is configured to tolerate two missed heartbeats before it terminates the connection. However, it waits until the next heartbeat would need to be sent before it drops the connection, that's what gives it the appearance of requiring 3 missed heartbeats.

Heartbeat1 (no response) wait Heartbeat2 (no response) wait Heartbeat3 terminate

There is a slight bug in MQ (it sends a 3rd heartbeat but immediately terminates the connection) but it isn't really affecting anything.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top