Question

Once you create separate components that need to communicate with each other, you enter the realm of systems programming, where you have to assume that errors can originate at any step in the process. You throw try-catch blocks out the window and have to develop robust error-handling alternatives yourself.

We have two systems, both with REST APIs. Both systems have GUIs that users can use to add/update information. When information is added to one system it must be propagated to the other. We have integration software (the middleman) that polls on a minute-by-minute basis, picks up adds/edits, and translates them from one system to the other. Each invocation records the timestamp of the last successful run; we keep one timestamp for each direction of communication. This way, if any part of the system fails, we can resume right where we left off once the issue is corrected.
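For illustration, a minimal sketch of such a poll loop in Python (the endpoint URLs, the `/changes` resource, and the `since` parameter are hypothetical stand-ins, not our actual API; the watermarks are persisted to disk in the real software):

```python
import time
from datetime import datetime, timezone

import requests

# Hypothetical endpoints; stand-ins for the two systems' real APIs.
SYSTEMS = {
    "A": "https://system-a.example.com/api",
    "B": "https://system-b.example.com/api",
}

# One watermark per direction, so a crash resumes where it left off.
last_success = {
    ("A", "B"): datetime(1970, 1, 1, tzinfo=timezone.utc),
    ("B", "A"): datetime(1970, 1, 1, tzinfo=timezone.utc),
}

def sync_direction(src, dst):
    since = last_success[(src, dst)]
    # Ask the source for everything changed since the watermark.
    resp = requests.get(f"{SYSTEMS[src]}/changes",
                        params={"since": since.isoformat()})
    resp.raise_for_status()
    for change in resp.json():
        # Translate and write into the destination; raises on failure,
        # so the watermark only advances after a fully successful run.
        requests.post(f"{SYSTEMS[dst]}/records", json=change).raise_for_status()
    last_success[(src, dst)] = datetime.now(timezone.utc)

while True:
    for src, dst in (("A", "B"), ("B", "A")):
        try:
            sync_direction(src, dst)
        except requests.RequestException:
            pass  # watermark untouched; the next poll retries from it
    time.sleep(60)  # minute-by-minute polling
```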

I have heard bad things about poll-based approaches: namely that polling runs regardless of whether there is actually any work to do. I have heard that push-based approaches are more efficient because they are triggered on demand.

I am trying to understand how a push-based approach might have worked. If either system attempts to push an add/edit, we have to assume that it could fail because the other system is down. It would seem to me that either system would need to maintain its own outgoing queue in order to resume once the issue with the other system is corrected.
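For example, a hypothetical per-system outbox might look like the following sketch (the peer URL is made up; a real implementation would persist the queue rather than hold it in memory):

```python
import collections

import requests

PEER_URL = "https://peer.example.com/api/records"  # hypothetical peer endpoint

# Durable storage (e.g. a database table) in a real system; in-memory here.
outbox = collections.deque()

def push(change):
    """Called after a local add/edit: try to deliver, queue on failure."""
    outbox.append(change)
    drain()

def drain():
    """Send queued changes in order; stop at the first failure."""
    while outbox:
        change = outbox[0]  # peek: only remove after confirmed delivery
        try:
            requests.post(PEER_URL, json=change, timeout=5).raise_for_status()
        except requests.RequestException:
            return  # peer is down; keep the change queued and retry later
        outbox.popleft()
```

A background timer would also call `drain()` periodically, so queued changes still go out when no new edits occur.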

It seems to me that using a push approach eliminates the middleman, but heaps more responsibility on each system to manage its messages to the other system. This does not seem like a clean separation of concerns: now both systems have to take on middleman responsibilities.

I don't see how you would redesign the middleman for a push-based architecture. You run the risk that messages are lost if the middleman itself fails.

Is there a fault-tolerant architecture that could be used to manage system interactions without the polling? I'm trying to understand if we missed a better alternative when we devised/implemented our poll-based middleman. The software does the job, but there's some latency.

Solution

From your question and the comments to other answers already given, I feel like you're working on two things: 1) eliminating the middleman, and 2) converting your poll mechanism to a push mechanism. I'd say those can be viewed separately. Let's start with the poll-to-push change.

In general, I'd say yes, pushing is superior to polling. If I understand your infrastructure correctly, you have a middleman that polls in both directions looking for stuff to sync between the systems involved. I'd imagine there's already some kind of queuing in place: if your middleman polled some events from A while B is down, it keeps those events until B is up again and then retransmits.

Assuming I got that right, switching to a push mechanism would be fairly simple and the benefit clear: instead of the middleman fetching information (which may or may not be there), A (and B, of course) would just push their events into the middleman. It would use its existing queues and either push to B immediately or whenever it sees fit.

Whether this is better for your situation depends on a lot of things, mainly the expected volume of events and the maximum delay allowed between syncs. Currently, any number of events will be synced with a delay of at most one minute. With pushing, the delay may decrease heavily, but on the other hand the load may increase if lots of events (that previously were handled in a batch) now result in many individual pushes. All of this can and must be handled by the middleman.

If you now try to eliminate the middleman, all that queuing must reside within systems A and B. Whether that's feasible depends highly on what A and B currently do; it may be overstepping their responsibilities. Also, the currently very constant overhead of syncing might vary as described above. The middleman may even be good for load reduction: its mechanisms for forwarding incoming events could be set up to transmit them in batches. What you then gain is systems A and B that can produce events, push them out, and immediately forget about them. That's nice and simple for them. The middleman could then, depending on various settings like maximum delay, maximum number of events, etc., sync those events into the respective other system.
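A minimal sketch of such batch handling, with `MAX_BATCH` and `MAX_DELAY` as illustrative knobs and `deliver` as a placeholder for the actual bulk transmission:

```python
import time

MAX_BATCH = 50   # flush once this many events have accumulated...
MAX_DELAY = 5.0  # ...or once the oldest buffered event is this many seconds old

buffer, oldest = [], None

def on_event(event):
    """Called for every pushed event; buffers it instead of forwarding singly."""
    global oldest
    if not buffer:
        oldest = time.monotonic()
    buffer.append(event)
    maybe_flush()

def maybe_flush():
    """Invoked after each event and by a periodic timer."""
    if buffer and (len(buffer) >= MAX_BATCH
                   or time.monotonic() - oldest >= MAX_DELAY):
        flush()

def flush():
    global buffer
    batch, buffer = buffer, []
    deliver(batch)

def deliver(batch):
    # Placeholder: in the real middleman this would be one bulk POST to the
    # destination system's API instead of len(batch) single calls.
    print(f"delivering {len(batch)} events")
```

A periodic timer would also call `maybe_flush()`, so a lone event doesn't sit in the buffer past the maximum delay.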

I'm sure you see what I'm getting at, and it may very well not be what you're looking for. But maybe it helps if you think about it this way. Let me know if I can clarify things.

Other tips

If the middleman is currently pulling data based on the time of the previous successful run, then, to cut out the middleman, each individual system could do the same in a push architecture. Yes, each system would need to keep track of this, but it's only a single timestamp, not a full-blown message queue.
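As a sketch, assuming a hypothetical peer endpoint and with `local_changes_since` as a placeholder for a query against the system's own data (the watermark would be persisted in practice):

```python
from datetime import datetime, timezone

import requests

PEER = "https://peer.example.com/api"  # hypothetical peer endpoint
last_push = datetime(1970, 1, 1, tzinfo=timezone.utc)  # persisted in practice

def local_changes_since(ts):
    # Placeholder for a query against this system's own database.
    return []

def on_local_change():
    """Called after any local add/edit: push everything since the watermark."""
    global last_push
    changes = local_changes_since(last_push)
    try:
        requests.post(f"{PEER}/records/bulk", json=changes,
                      timeout=5).raise_for_status()
        last_push = datetime.now(timezone.utc)  # advance only on success
    except requests.RequestException:
        pass  # peer down: watermark stays put; the next change (or a timer) retries
```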

It would seem to me that either system would need to maintain its own outgoing queue in order to resume once the issue with the other system is corrected.

Yes, queueing is a good solution for such a scenario. That's what queues are made for: queuing up messages for their consumers. But I do not quite get why a system has to maintain its own outgoing queue.

A simple setup would be that you have two services, A and B, both of which have an incoming-messages queue. Those queues act like mailboxes: A sends messages to B's inbox and vice versa. The queues do what they do best: queuing those messages.
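For instance, sending into such a mailbox could look like this, assuming a RabbitMQ broker and the pika client (the broker host and queue name are made up):

```python
import json

import pika  # assumption: a RabbitMQ broker plays the "mailbox" role

connection = pika.BlockingConnection(
    pika.ConnectionParameters("broker.example.com"))
channel = connection.channel()
channel.queue_declare(queue="inbox.B", durable=True)  # B's mailbox

def send_to_b(message):
    # Fire-and-forget: A hands the message to the broker and moves on,
    # regardless of whether B is currently alive.
    channel.basic_publish(
        exchange="",
        routing_key="inbox.B",
        body=json.dumps(message),
        properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
    )
```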

Messages arrive at the consumer of such a queue under two conditions: a) on start, the service registers as a consumer and is told that there are messages in the "mailbox", so it pops them one by one until the inbox is empty; and b) every time a new message arrives, the service gets a notification and pops unread messages until the inbox is empty.
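The receiving side under the same assumptions; registering the consumer drains the backlog first, and afterwards the broker pushes each new message as it arrives, covering both conditions a) and b):

```python
import json

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters("broker.example.com"))
channel = connection.channel()
channel.queue_declare(queue="inbox.B", durable=True)

def apply_change(message):
    ...  # placeholder for B's own update logic

def handle(ch, method, properties, body):
    apply_change(json.loads(body))
    # Acknowledge only after success, so a crash mid-update redelivers.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="inbox.B", on_message_callback=handle)
channel.start_consuming()
```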

You decouple your systems with two message queues. You send messages in a fire-and-forget manner: the question of whether the recipient is alive doesn't bother the sender. A queue would do what your current middleware does, but in an easier way.

The point of failure shifts from the direct recipient, e.g. service A, to the queuing infrastructure, which has to be built resilient and redundant, i.e. fault-tolerant.

Each invocation records the timestamp of the last successful run; we keep one timestamp for each direction of communication. This way, if any part of the system fails, we can resume right where we left off once the issue is corrected.

The queues in the above scenario act as FIFO buffers (first in, first out) and would preserve the order of your messages.

And since this is a distributed system, you have to deal with CAP; simply put: the problem that you have two temporarily out-of-sync services, both of which could get queried simultaneously.

Licensed under: CC-BY-SA with attribution