Question

Why Redis for queuing?

I'm under the impression that Redis can be a good candidate for implementing a queueing system. Up until this point we've been using our MySQL database with polling, or RabbitMQ. With RabbitMQ we've had many problems: the client libraries are poor and buggy and we'd rather not invest too many developer-hours into fixing them, plus a few problems with the server management console. And, for the time being at least, we're not chasing milliseconds or seriously pushing performance, so as long as a system has an architecture that supports a queue intelligently, we're probably in good shape.

Okay, so that's the background. Essentially I have a very classic, simple queue model - several producers producing work and several consumers consuming work, and both producers and consumers need to be able to scale intelligently. It turns out a naive PUBSUB doesn't work, since I don't want all subscribers to consume work, I just want one subscriber to receive the work. At first pass, it looks to me like BRPOPLPUSH is an intelligent design.

Can we use BRPOPLPUSH?

The basic design with BRPOPLPUSH is you have one work queue and a progress queue. When a consumer receives work it atomically pushes the item into the progress queue, and when it completes the work it LREM's it. This prevents blackholing of work if clients die and makes monitoring pretty effortless - for instance we can tell if there is a problem causing consumers to take a long time to perform tasks, in addition to telling if there is a large volume of tasks.

It ensures

  • work is delivered to exactly one consumer
  • work winds up in a progress queue, so it can't blackhole if a consumer dies
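A minimal in-memory sketch of this pattern's semantics (the arrays and function names here are illustrative stand-ins, not node-redis calls; a real implementation would issue the atomic BRPOPLPUSH and LREM commands against Redis lists):

```javascript
// In-memory model of the reliable-queue pattern. In production these
// arrays would be Redis lists, and the two moves below would be the
// atomic BRPOPLPUSH and LREM commands.
const workQueue = [];     // producers LPUSH new jobs here
const progressQueue = []; // jobs currently being worked on

// Producer side: models LPUSH work <job>
function produce(job) {
  workQueue.unshift(job);
}

// Consumer side: models BRPOPLPUSH work progress
// (pop from the tail of work, push onto progress, atomically).
function receive() {
  if (workQueue.length === 0) return null; // a real BRPOPLPUSH would block
  const job = workQueue.pop();
  progressQueue.unshift(job);
  return job;
}

// On completion: models LREM progress 1 <job>
function complete(job) {
  const i = progressQueue.indexOf(job);
  if (i !== -1) progressQueue.splice(i, 1);
}

produce('job-1');
produce('job-2');
const job = receive(); // 'job-1' moves into the progress queue
complete(job);         // removed from progress once the work is done
```

If the consumer dies between `receive` and `complete`, the job remains visible in the progress queue, which is what makes the monitoring described above effortless.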

The drawbacks

  • It seems rather strange to me that the best design I've found doesn't actually use PUBSUB, since this seems to be what most blog posts about queuing over Redis focus on. So I feel like I'm missing something obvious. The only way I see to use PUBSUB without consuming tasks twice is to simply publish a notification that work has arrived, which consumers can then drain with a non-blocking RPOPLPUSH.
  • It's impossible to request more than one work item at a time, which seems like a performance problem. Not a huge one for our situation, but it rather obviously suggests this operation was not designed for high throughput or for this situation.
  • In short: am I missing anything stupid?

I'm also adding the node.js tag, because that's the language I'm mostly dealing with. Node may offer some simplifications in implementing, given its single-threaded and non-blocking nature, but I'm also using the node-redis library, and solutions should be sensitive to its strengths and weaknesses as well.

Solution

If you want to use Redis for a message queue in Node.js and you don't mind using a module for that then you may try RSMQ - the Redis Simple Message Queue for Node. It was not available at the time this question was asked but today it is a viable option.

If you want to actually implement the queue yourself as you stated in your question then you may want to read the source of RSMQ because it's just 20 screens of code that does exactly what you are asking for.

OTHER TIPS

I've run into some difficulties thus far I'd like to document here.

How do you handle reconnect logic?

This is a hard problem, and an especially hard one in designing and implementing a message queue. Messages must be able to queue up somewhere while consumers are offline, so a simple pub-sub is not strong enough, and consumers need to reconnect into a listening state. A blocking pop is difficult state to maintain, because it is a non-idempotent listening state. Listening should be an idempotent operation, yet when a disconnect interrupts a blocking pop, you have the pleasure of thinking very hard about whether the disconnect happened just after the operation succeeded or just before it failed. This isn't insurmountable, but it's undesirable.

Furthermore, the listening operation should be as simple as possible. Ideally it should have these properties:

  • Listening is idempotent.
  • The consumer is always listening, and throttling logic is processed outside of the listening code. RabbitMQ encapsulates this by letting the consumer set a bound on the number of unacked messages it can have (the prefetch count).
    In particular, I went with a poor design in which re-entering a blocking pop was contingent on the success of previous operations, which was brittle and required thinking hard.

I'm now favoring a Redis PUBSUB + RPOPLPUSH solution. This decouples notification of work from consumption of work, which lets us factor out a clean listening solution. The PUBSUB is only responsible for notification of work. The atomic nature of RPOPLPUSH is responsible for consumption, and delegating work to exactly one consumer. At first this solution seemed needlessly complicated compared to a blocking pop, but now I see that the complication was not needless at all; it was solving a hard problem.

However this solution isn't quite trivial:

  • consumers should also check for work on reconnect.
  • consumers may want to poll for new work anyway, for redundancy. Should such a poll actually succeed, a warning should be emitted, since work should normally be claimed via the PUBSUB notification before the redundant RPOPLPUSH poll finds it. Many poll successes therefore indicate a broken subscription system.
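The decoupling can be sketched in memory as follows (again, the names and data structures are illustrative stand-ins for PUBLISH/SUBSCRIBE on a channel and a non-blocking RPOPLPUSH, not node-redis API calls):

```javascript
// In-memory model of the PUBSUB + RPOPLPUSH design. A subscriber only
// learns "work may exist"; actually claiming a job is the atomic
// RPOPLPUSH, so it is safe to deliver the notification to every consumer.
const workQueue = [];
const progressQueue = [];
const subscribers = [];

function publishNotify() {
  for (const fn of subscribers) fn(); // models PUBLISH work:new ""
}

function produce(job) {
  workQueue.unshift(job); // models LPUSH work <job>
  publishNotify();
}

// Models a non-blocking RPOPLPUSH work progress: returns null when the
// queue is empty, so a spurious or raced notification is harmless.
function tryClaim() {
  if (workQueue.length === 0) return null;
  const job = workQueue.pop();
  progressQueue.unshift(job);
  return job;
}

const consumed = [];
function drain() {
  let job;
  while ((job = tryClaim()) !== null) consumed.push(job);
}

subscribers.push(drain); // models SUBSCRIBE work:new
drain();                 // also check for work on (re)connect, per the caveat above

produce('job-1');        // notification fires; drain claims the job exactly once
```

Because `tryClaim` is non-blocking and atomic, notifications are idempotent: a consumer can drain on every notification, on reconnect, and on a redundant poll without ever double-consuming.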

Note that the PUBSUB/RPOPLPUSH design also has scaling problems. Every consumer receives a lightweight notification of every message, which is an unnecessary bottleneck. I suspect it's possible to use channels to shard the work, but this is probably a tricky design to work out well.
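One possible shape for that sharding is a deterministic hash from job id to channel name, so each consumer subscribes to only a subset of channels (the channel naming and hash function here are assumptions for illustration, not an established scheme):

```javascript
// Sketch: shard notifications across N channels so a consumer hears
// about only a fraction of messages. The channel names are hypothetical.
const N_SHARDS = 4;

function shardChannel(jobId) {
  // Simple deterministic string hash; a real system might prefer
  // something like the CRC16 key-slot hashing used by Redis Cluster.
  let h = 0;
  for (const c of jobId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return `work:new:${h % N_SHARDS}`;
}
// A producer would PUBLISH to shardChannel(jobId) and LPUSH onto a
// matching per-shard list; each consumer subscribes to its shards only.
```

The tricky part this sketch does not solve is rebalancing: when consumers join or leave, shard ownership has to move without stranding a channel nobody listens to.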

So the biggest reasons for choosing RabbitMQ over Redis are the failure scenarios and clustering.

This article really explains it best, so I'll just provide the link:

https://aphyr.com/posts/283-jepsen-redis

Redis Sentinel and, more recently, Redis Cluster are unable to handle a number of very basic failure scenarios, which makes Redis a bad choice for a queue.

RabbitMQ has its own set of issues; that said, it is incredibly solid in production and is a good message queue.

Here is the post for RabbitMQ:

https://aphyr.com/posts/315-jepsen-rabbitmq

When you look at the CAP theorem (consistency, availability, and partition tolerance), you can only choose two of the three. We are leveraging RabbitMQ for CP (consistency and partition tolerance) with our message load; if we are unavailable, it isn't the end of the world. To avoid losing messages, we use the ignore partition-handling mode. Duplicates can be handled, since the source manages the UUID.
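(The "ignore" referred to here is RabbitMQ's `cluster_partition_handling = ignore` setting in rabbitmq.conf.) The UUID-based duplicate handling can be sketched in a few lines; the in-memory `Set` below is a hypothetical stand-in for whatever persistent store actually tracks processed ids:

```javascript
// Consumer-side de-duplication under at-least-once delivery: the
// producer stamps each message with a UUID, so the consumer can drop
// any UUID it has already processed. `seen` stands in for a durable
// store (e.g. a database table or a Redis SET with a TTL).
const seen = new Set();
const processed = [];

function handle(msg) {
  if (seen.has(msg.uuid)) return false; // duplicate delivery, drop it
  seen.add(msg.uuid);
  processed.push(msg.body);             // the real work happens here
  return true;
}

handle({ uuid: 'a1', body: 'charge card' });
handle({ uuid: 'a1', body: 'charge card' }); // redelivered after a partition: ignored
```

This turns "at least once" delivery into effectively-once processing, provided the seen-set survives consumer restarts.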

Licensed under: CC-BY-SA with attribution