Rebus stops retrieving messages from RabbitMQ

https://stackoverflow.com/questions/20967437

25-09-2022

Question

We have an issue in our Rebus/RabbitMQ setup where Rebus suddenly stops retrieving/handling messages from RabbitMQ. This has happened two times in the last month and we're not really sure how to proceed.

Our RabbitMQ setup has two nodes on different servers, and the Rebus side is a Windows service.

We see no errors in Rebus or in the event log on the server where Rebus runs. ~~We also do not see errors on the RabbitMQ servers.~~

Rebus (and the Windows service) keeps running, as we still see other log messages, such as those from the DueTimeOutSchedular and timeout replies. However, it seems the worker thread stops running without any errors being logged.

The result is a RabbitMQ input queue that keeps growing :( We're adding logging to monitor this so that we get notified if it happens again.
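One way such monitoring could look is sketched below. It assumes the RabbitMQ management plugin is enabled and that the queue's JSON is fetched from its HTTP API (`GET /api/queues/{vhost}/{queue}`), whose `messages` field is the total number of ready and unacknowledged messages. The class name, the alert rule and the threshold are hypothetical and not part of Rebus or RabbitMQ; the HTTP call itself is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not a Rebus or RabbitMQ type.
public static class QueueDepthMonitor
{
    // Reads the "messages" field from the JSON document that the
    // management API returns for a single queue.
    public static int ParseMessageCount(string queueJson)
    {
        using (var doc = JsonDocument.Parse(queueJson))
        {
            return doc.RootElement.GetProperty("messages").GetInt32();
        }
    }

    // Assumed alert rule: the depth never dropped between two consecutive
    // samples and the latest sample is above the threshold, which suggests
    // that nothing is consuming from the queue.
    public static bool LooksStalled(IReadOnlyList<int> samples, int threshold)
    {
        if (samples.Count < 2) return false;

        for (var i = 1; i < samples.Count; i++)
        {
            if (samples[i] < samples[i - 1]) return false;
        }

        return samples[samples.Count - 1] > threshold;
    }
}
```

Sampling the depth every few seconds and raising an alert when `LooksStalled` returns true would catch a consumer that is still connected but no longer handling messages.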

But I'm looking for advice on how to continue the "investigation" and ideas on how to prevent this. Maybe some of you have experienced this before?

UPDATE: It seems that we actually did have a node crashing, at least the last time it happened. The master RabbitMQ node crashed (the server crashed) and the slave was promoted to master. As far as I can see from the RabbitMQ logs on the nodes, everything went according to plan. There are no other errors in the RabbitMQ logs.

At the time this happened, Rebus was configured to connect only to the node that was the slave (then promoted to master), so Rebus did not experience the RabbitMQ failure and thus logged no connection errors. However, it seems that Rebus stopped handling messages when the failure occurred.

We are actually experiencing this on a few queues, it seems, and some of them, but not all, seem to have ended up in an unsynchronized state.

UPDATE 2: I was able to reproduce the problem quite easily, so it might be a configuration issue in our setup. This is what we do to reproduce it:

Start two nodes in a cluster, e.g. rabbit1 (master) and rabbit2 (slave)
Rebus connects to rabbit2, the slave
Close rabbit1, the master. rabbit2 is promoted to master

The queues are mirrored

We have two small test apps to reproduce this: a "sender" that sends a message every second and a "consumer" that handles the messages.

When rabbit1 is closed, the "consumer" stops handling messages, but the "sender" keeps sending the messages and the queue keeps growing.

Start rabbit1 again, it joins as slave

This has no effect and the "consumer" still does not handle messages.

Restart the "consumer" app

When the "consumer" is restarted it retrieves all the messages and handles them.

I think I have followed the setup guides correctly, but it might be a configuration issue on our part. I can't seem to find anything that would suggest what we have done wrong.

Rebus is still connected to RabbitMQ; we can see that in the connections tab on the management site. The "consumer" connection's sent/received rate drops to about 2 B/s when it stops handling messages.

UPDATE 3: OK, so I downloaded the Rebus source and attached a debugger to our process so I could see what happens in the "RabbitMqMessageQueue" class when it stops. When "rabbit1" is closed, the "BasicDeliverEventArgs" is null. This is the code:

BasicDeliverEventArgs ea;
if (!threadBoundSubscription.Next((int)BackoffTime.TotalMilliseconds, out ea))
{
    return null;
}

// wtf??
if (ea == null)
{
    return null;
}

See: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L178

I like the "wtf ??" comment :)

Solution

That sounds very weird!

Whenever Rebus' RabbitMQ transport experiences an error on the connection, it will throw out the connection, wait a few seconds, and ensure that the connection is re-established again when it can.

You can see the relevant place in the source here: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L205
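The general shape of that "throw out the connection, wait, reconnect" behavior can be sketched roughly as follows. This is not the Rebus implementation: `IConnectionLike`, `DelegateConnection` and `ReconnectingReceiver` are hypothetical stand-ins that only illustrate the idea. Note that the recovery only kicks in when the underlying receive call actually throws.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a transport connection (not a Rebus or RabbitMQ type).
public interface IConnectionLike : IDisposable
{
    // Returns a message, or null when none is available; throws on connection failure.
    string Receive();
}

// Small adapter so that a connection can be built from a delegate.
public sealed class DelegateConnection : IConnectionLike
{
    private readonly Func<string> _receive;
    public DelegateConnection(Func<string> receive) { _receive = receive; }
    public string Receive() { return _receive(); }
    public void Dispose() { }
}

public sealed class ReconnectingReceiver
{
    private readonly Func<IConnectionLike> _connect;
    private readonly TimeSpan _backoff;
    private IConnectionLike _current;

    public ReconnectingReceiver(Func<IConnectionLike> connect, TimeSpan backoff)
    {
        _connect = connect;
        _backoff = backoff;
    }

    public string Receive()
    {
        try
        {
            // Lazily (re)establish the connection when there is none.
            if (_current == null) _current = _connect();
            return _current.Receive();
        }
        catch (Exception)
        {
            // Throw out the broken connection, wait a bit, and let the
            // next call establish a fresh one.
            try { if (_current != null) _current.Dispose(); } catch (Exception) { }
            _current = null;
            Thread.Sleep(_backoff);
            return null;
        }
    }
}
```

If the client library goes quiet without throwing, the `catch` block never runs and the receiver keeps polling a dead connection, which is the situation the question below is asking about.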

So I guess the question is whether the RabbitMQ client library can somehow enter a faulted state, silently, without throwing an exception when Rebus attempts to get the next message...?

When you experienced the error, did you check out the 'connections' tab in RabbitMQ management UI and see if the client was still connected?

Update:

Thanks for your thorough investigation :)

The "wtf??" is in there because I once experienced a hiccup when ea had apparently been null, which was unexpected at the time, thus causing a NullReferenceException later on and the vomiting of exceptions all over my logs.

According to the docs, Next will return true and set the result to null when it reaches "end-of-stream", which is apparently what happens when the underlying model is closed.

The correct behavior in that case for Rebus would be to throw a proper exception and let the connection be re-established - I'll implement that right away!

Sit tight, I'll have a fix ready for you in a few minutes!
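A minimal sketch of that idea, for illustration only (this is not the actual Rebus fix): the `TryNext<T>` delegate is a hypothetical stand-in with the same shape as the `Next(timeout, out result)` call shown in the question, and `EndOfStreamGuard` is an invented name. The point is to separate the two cases that both ended up as `return null`: a plain timeout, and a "successful" call that yields null because the stream has ended.

```csharp
using System;

// Hypothetical stand-in with the same shape as Next(timeout, out result).
public delegate bool TryNext<T>(int millisecondsTimeout, out T result) where T : class;

public static class EndOfStreamGuard
{
    // Returns the delivery, or null when the timeout elapsed without a message.
    // Throws when the call reports success but hands back null, which is the
    // end-of-stream case seen when the underlying model has been closed.
    public static T NextOrThrow<T>(TryNext<T> next, int millisecondsTimeout) where T : class
    {
        T delivery;
        if (!next(millisecondsTimeout, out delivery))
        {
            return null; // plain timeout: nothing was delivered
        }

        if (delivery == null)
        {
            throw new InvalidOperationException(
                "Subscription reached end-of-stream; the underlying model is probably closed.");
        }

        return delivery;
    }
}
```

Because the end-of-stream case now surfaces as an exception, whatever error handling already discards and re-establishes the connection on failure gets a chance to run.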

Licensed under: CC-BY-SA with attribution
