Question

When an application A communicates with an external/3rd party system B, there is always a chance that B is down.

Say that A raises an event that should be sent as a message to B via HTTP. What is the best way to guarantee that the message is delivered?

One possibility is of course to have some retry logic to resend the message a few times. But what should we do with the message if delivery fails too many times? Or if A crashes (maybe due to too many messages waiting to be sent)? Then we need a way to persist those messages to be able to resume message delivery after A has recovered.

My first idea was to store all events in a dedicated table in the database and mark them off when they are sent. Then a colleague argued that we can't always rely on the database and we should instead store the messages locally on the filesystem. But with the latter approach we'd be implementing a message queue ourselves, and we'd be better off with a real, full-fledged message queue (which we currently don't have for this application).

The same colleague then argued that even if we have a message queue, we can't be sure that the message is delivered to the queue, so we'd still need to implement a queue on the filesystem. That really seems like overkill to me: it would mean that, to be really sure, we'd need to implement a locally stored message queue for all communications, even between our own microservices.

For context, this is a low-volume system with few messages (at most in the hundreds) per day, but they have very high value (they are used for billing), so we don't want to miss any.

Any thoughts?


Solution

The database solution is definitely the best: transactional filesystems are not common, unless you assume that filesystems never fail (permission settings, disk full, ...).

I'll detail a more precise version of the scenario you suggest, using transactions, to make sure you don't lose an entry.

When you need to send something:

  1. Create a persistent entry with a "pending" status and a link to the data you need to send.
  2. Try to send it. If sending fails, leave the entry in the "pending" status; otherwise update the status to "finished".
  3. Commit the current transaction.
  4. Have a dedicated background worker check for pending entries and try to send them. Optionally, before trying to send them, the worker may check whether the 3rd party service is available. Open a transaction for each message so they are sent and updated independently, and make sure an error on one message doesn't prevent sending the others (catch exceptions). When a send succeeds, update the status to "finished". A minimal sketch of this workflow follows the list.
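
To make the workflow concrete, here is a minimal sketch in Python with SQLite. The `outbox` table, the `send_to_b` placeholder, and all names in it are my own illustrative assumptions, not something from the question; swap in your real schema and HTTP call.

```python
import sqlite3

# Hypothetical schema; table and column names are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending'  -- 'pending' or 'finished'
)
"""

def send_to_b(payload: str) -> bool:
    """Placeholder for the real HTTP call to system B.

    Returns True on success; returns False or raises on failure.
    """
    raise NotImplementedError

def raise_event(conn: sqlite3.Connection, payload: str) -> None:
    with conn:  # steps 1-3 happen in one transaction, committed on exit
        # Step 1: persist the entry with a 'pending' status.
        cur = conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                           (payload,))
        msg_id = cur.lastrowid
        # Step 2: try to send it; on failure the row stays 'pending'.
        try:
            if send_to_b(payload):
                conn.execute("UPDATE outbox SET status = 'finished' "
                             "WHERE id = ?", (msg_id,))
        except Exception:
            pass  # leave it 'pending'; the worker will retry later
    # Step 3: leaving the 'with' block commits the transaction.

def background_worker(conn: sqlite3.Connection) -> None:
    # Step 4: retry whatever is still pending, one transaction per
    # message, so one bad message cannot block the others.
    rows = conn.execute("SELECT id, payload FROM outbox "
                        "WHERE status = 'pending'").fetchall()
    for msg_id, payload in rows:
        try:
            sent = send_to_b(payload)
        except Exception:
            continue  # catch exceptions so the loop keeps going
        if sent:
            with conn:
                conn.execute("UPDATE outbox SET status = 'finished' "
                             "WHERE id = ?", (msg_id,))
```

The worker can run on a timer; because every update is its own transaction, a crash at any point only leaves rows in the "pending" state, never loses them.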

This also works for tasks like pushing a file to an FTP server.

If ordering is important, the workflow is simpler: never try to send a message immediately; just queue it by adding an entry to be picked up by the background worker. And if one message fails, don't try to send the others. Your table then acts as a blocking queue of tasks to do (see the ordered worker sketch below).
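
A hedged sketch of that ordered variant, reusing the hypothetical `outbox` table and `send_to_b` placeholder from the previous example:

```python
import sqlite3

def ordered_worker(conn: sqlite3.Connection) -> None:
    # Process pending messages strictly in insertion order and stop
    # at the first failure, so a later message never overtakes an
    # earlier one; the table acts as a blocking queue.
    rows = conn.execute("SELECT id, payload FROM outbox "
                        "WHERE status = 'pending' ORDER BY id").fetchall()
    for msg_id, payload in rows:
        try:
            if not send_to_b(payload):
                break  # stop; resume from this message on the next run
        except Exception:
            break
        with conn:
            conn.execute("UPDATE outbox SET status = 'finished' "
                         "WHERE id = ?", (msg_id,))
```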

OTHER TIPS

Your colleague is right. You can't eliminate all failure modes. The goal should be predictable failure modes, e.g. to meet a certain SLA perhaps you want 99.99% reliability and a response time of under 24 hours in case of total failure, that sort of thing.

To achieve a goal like that, many organizations will choose a proven platform (e.g. MSMQ, or even just SMTP). The advantage of a proven platform is that it has been thoroughly tested both in the lab and in the market and people pretty much know what it does when things go wrong. Generally they will come with your choice of persistent storage, plus important things you didn't think of like queue monitoring, performance counters, throttling, and email/SMS alerts for your operations team. Also, there will be a user community, and possibly technical articles on how to set it up with hot standby, disaster recovery, second-siting, etc. It may even be available as a service offering from your cloud provider.

Unless you work for an organization whose core competency is messaging protocols, you are probably better off investigating third party options than growing your own. That way you can spend more time on your core business.

IMHO, both the database and the filesystem approach to queuing messages are fine.

Both will save you in case of process restarts, crashes, or disconnections.

However, even if this covers most of the incidents, it doesn't shield against "disasters". By "disaster", I mean a non-recoverable machine for whatever reason. It's extremely rare, but it might happen. A good way to protect against this is having redundant nodes, so that when one is dead, the system as a whole continues working. However, this brings a lot of other challenges with it.

As other answers have mentioned, the local filesystem is not always available either. If you really need reliable delivery under all possible circumstances, you are screwed.

However, you don't actually need to guarantee delivery under all circumstances. You really only need to deliver an invoice if the order itself was successfully created. In fact, if the order creation failed, you probably don't want to send an invoice at all.

This leads to the consequence that whatever you do, you must persist the purchase order and the invoice delivery plan in a way that makes sure they are either successfully persisted together or fail together.

If you store your purchase order in a database, then you should use the same database transaction to atomically persist the invoice delivery plan.
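
As a minimal sketch of that idea, assuming hypothetical `orders` and `invoice_outbox` tables (the names and columns are my own illustration, not from the answer):

```python
import sqlite3

def create_order(conn: sqlite3.Connection,
                 order_data: str, invoice_payload: str) -> None:
    # Persist the order and its invoice delivery plan atomically:
    # after the commit either both rows exist, or neither does.
    with conn:  # a single transaction for both inserts
        cur = conn.execute("INSERT INTO orders (data) VALUES (?)",
                           (order_data,))
        order_id = cur.lastrowid
        conn.execute("INSERT INTO invoice_outbox "
                     "(order_id, payload, status) "
                     "VALUES (?, ?, 'pending')",
                     (order_id, invoice_payload))
```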

If you use a message queue, you'll need to queue both the order and the invoice delivery atomically; this would also mean that your message queue becomes the source of truth regarding successful orders.

The filesystem is a bit tricky. In fact, I don't think there's a single scenario where storing the messages in the filesystem would be the best solution. Turning the filesystem into a transactional store is not a trivial task on most filesystems.
