Question

Our current queues publish messages that are consumed by 3rd-party services with rate limits. Failed messages are currently retried with exponential back-off, but there could be cases where data comes in so fast that the retries never catch up.

Most of the 3rd party services offer alternative batch imports, and the solution I've come up with so far is to write the data to file(s) to be processed out of band.

Are there any design patterns for storing overflowing data?


Solution

The main problem, as described, is that the producers are faster than the consumers. This reminds me a lot of http://ferd.ca/queues-don-t-fix-overload.html. Reactive Streams is an initiative I noticed recently that aims to provide a solution for exactly this kind of setting.

You can have a look at queue-oriented software products such as:

  • Akka (letitcrash.com is their blog with some interesting general posts) or
  • ZeroMQ (Their guide offers some setups applicable to any Queue system)

to see how you can deal with producers that outpace their consumers.

However, the main question remains: how do you want to deal with it from a business point of view? Since your consumers (the 3rd parties) are limited in the number of messages they can handle, your approach of batch-importing buffered messages seems reasonable. Aggregating or even dropping messages might also be viable, depending on your scenario.

No matter how you want to react to overflow, your queue should be made aware of it (i.e. introduce back-pressure); you can then apply your own strategy, which will depend on your requirements.

In my last project I ended up having the consumers pull messages through the queue system:

  • The producers were not hogging resources that could be used to work through the existing message pile.
  • The workers processing the messages could fetch new messages when they were ready without the need for scheduled pushing of new messages onto them.
  • The producers had guaranteed downtime at specific intervals, which let me catch up, since I could not drop any messages.
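The pull-based setup described above can be sketched roughly as follows. This is a minimal illustration using Python's standard-library `queue`; the `handle` callable stands in for whatever actually forwards a message to the 3rd party, and the rate limit is an assumed parameter:

```python
import queue
import time

def worker(q, handle, rate_limit_per_sec=1000.0):
    """Pull-based worker: fetches a message only when it is ready,
    so the queue (not the producer) absorbs any overflow."""
    min_interval = 1.0 / rate_limit_per_sec
    while True:
        msg = q.get()              # blocks until a message is available
        if msg is None:            # sentinel: shut down cleanly
            q.task_done()
            break
        handle(msg)                # deliver to the 3rd party
        q.task_done()
        time.sleep(min_interval)   # stay under the 3rd party's rate limit
```

The key point is that the worker decides when to take the next message; producers never push work onto a consumer that is not ready for it.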

I hope this provides some handles that allow you to find your own solution.

OTHER TIPS

The ideal situation would be to have your queue store its messages indefinitely, provide you with message position as it delivers messages to you, and allow you to resume your subscription from a certain position (instead of only from now forward). That way, your subscriber can keep track of the last message it successfully processed and start from that position forward when it restarts. EventStore does this. It is an append-only database which also behaves as a queue that you can subscribe to.

Assuming your current queue only gives subscribers messages from now forward, you will have to store the messages as they come in (file or database or whatever). As you successfully send messages to 3rd parties, you will have to record the last successful one (per 3rd party) so you can resume from there if it gets too far behind or crashes. Periodically, I assume you will want to delete old messages which all 3rd parties have processed.
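A store with per-3rd-party checkpoints, as described above, might look like the following sketch. It is deliberately in-memory for illustration; in practice the log and checkpoints would live in a file or database, and the class and method names here are invented:

```python
class MessageBuffer:
    """Keeps incoming messages plus a per-3rd-party checkpoint so
    delivery can resume after a crash. (In-memory for illustration;
    a real implementation would persist to a file or database.)"""

    def __init__(self):
        self.messages = []        # append-only log of payloads
        self.checkpoints = {}     # party -> position of last delivered message

    def append(self, payload):
        self.messages.append(payload)
        return len(self.messages) - 1          # position of the new message

    def pending(self, party):
        """(position, payload) pairs the given party has not yet received."""
        start = self.checkpoints.get(party, -1) + 1
        return list(enumerate(self.messages))[start:]

    def mark_delivered(self, party, position):
        self.checkpoints[party] = position

    def prunable(self, parties):
        """Number of leading messages that every listed party has
        processed, and which could therefore be deleted."""
        if not parties:
            return 0
        return min(self.checkpoints.get(p, -1) for p in parties) + 1
```

A party that crashes simply calls `pending()` again on restart and resumes from its last recorded position.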

Checkpointing

You can either save the checkpoint (id of last successfully processed message) before or after delivering the message.

Saving the checkpoint before delivery is called At Most Once Delivery, because it's possible that the message never gets delivered. For example: the checkpoint gets saved, but the computer crashes before delivery. Upon reboot, the loaded checkpoint says that the message was already processed, so the next one is loaded and the original is never delivered.

Saving the checkpoint after delivery is called At Least Once Delivery, because it's possible to deliver the message multiple times. For example: the message is sent, but the computer crashes before the checkpoint is saved. Upon reboot, the checkpoint still points at the message that was already sent, so it gets resent.
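The whole distinction is the ordering of two operations, which a sketch makes explicit (the `send` and `save_checkpoint` callables here are placeholders for whatever your system uses):

```python
def deliver_at_most_once(msg, send, save_checkpoint):
    """Checkpoint first: if we crash after saving but before sending,
    the message is lost -- delivered at most once."""
    save_checkpoint(msg["id"])
    send(msg)

def deliver_at_least_once(msg, send, save_checkpoint):
    """Send first: if we crash after sending but before saving,
    the message is resent on restart -- delivered at least once."""
    send(msg)
    save_checkpoint(msg["id"])
```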

At Least Once Delivery (saving the checkpoint after successful delivery) is safer, so long as processing a message twice has the same effect as processing it once (i.e. messages are idempotent). For instance, if the message is CountIncreasedBy:5, it is not idempotent, because the new value depends on the old value, and processing this message more than once will skew the value on the receiver. However, if the message says NewCountSetTo:37, then you can process it any number of times in a row and it will always have the same effect.
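The contrast can be demonstrated in a few lines. This toy `apply` function uses the two message types from the examples above; replaying the relative update skews the count, while replaying the absolute one is harmless:

```python
def apply(state, msg):
    """Apply one message to a counter. Relative updates are not
    idempotent; absolute updates are safe to replay."""
    if msg["type"] == "CountIncreasedBy":      # depends on the old value
        return state + msg["value"]
    if msg["type"] == "NewCountSetTo":         # same result however often replayed
        return msg["value"]
    raise ValueError(f"unknown message type: {msg['type']}")
```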

However, with "at least once" you also have to watch for poison messages. These are messages whose delivery always fails, so you are never able to advance past them. They will be retried forever unless you put something in place to detect them.
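A common safeguard is to cap the retry count and park the offending message in a dead-letter store. A minimal sketch, where `send` and `dead_letter` are assumed hooks rather than any particular library's API:

```python
def deliver_with_dead_letter(msg, send, dead_letter, max_attempts=3):
    """Guard against poison messages: after max_attempts failed
    deliveries, park the message for manual inspection instead of
    retrying forever."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            send(msg)
            return True                 # delivered; checkpoint can advance
        except Exception as exc:
            last_error = exc            # remember why this attempt failed
    dead_letter(msg, last_error)        # give up; unblock the queue
    return False
```

Either way the outcome is terminal, so the checkpoint can move past the message and the rest of the queue keeps flowing.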

Also note that if there are no circumstances in which the 3rd parties can catch up (i.e. it's not just burst traffic, but average traffic that overwhelms the 3rd party), then you are going to run into the same overflow problem on your hard drive. At that point, you will have to use offline methods or just skip messages once processing gets too far behind.

As I understand it, there are three types of components in the system at issue.

                                         |
         ,----> [Message Translator] ----|---> [3rd Party Endpoint]
        /                                |
  [Queue] ----> [Message Translator] ----|---> [3rd Party Endpoint]
        \                                |
         `----> [Message Translator] ----|---> [3rd Party Endpoint]
                                         |

    [type] - component of type
    -----> - data flow
         | - internal/external component boundary

The first thing that comes to mind is replacing the message translators with aggregators that store messages and send them in batches. Batch data can be stored in a fast (NoSQL) database or even in a flat file, if your requirements allow it. If the 3rd-party endpoints require raw per-message data, not much more can be done. In general, check out the Enterprise Integration Patterns.
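An aggregator in this sense can be sketched as a small buffer that flushes once a batch fills up. The `send_batch` callable is a stand-in for whatever batch-import interface the 3rd party offers, and the threshold is an assumed tuning knob (a real one would likely also flush on a timer):

```python
class Aggregator:
    """Buffers messages and forwards them as one batch once the
    batch size is reached, instead of one call per message."""

    def __init__(self, batch_size, send_batch):
        self.batch_size = batch_size
        self.send_batch = send_batch   # e.g. the 3rd party's batch import
        self.buffer = []

    def add(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered; also called on shutdown or a timer."""
        if self.buffer:
            self.send_batch(list(self.buffer))
            self.buffer.clear()
```

This turns N rate-limited calls into N / batch_size batch calls, which is exactly the trade the batch-import endpoints are there for.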

Event Sourcing techniques (and all databases based on them) can be useful, especially if your problem can be described in terms of Domain Objects and Domain Events (Domain Driven Design).

Licensed under: CC-BY-SA with attribution