Message receipt verification in a cluster

https://cs.stackexchange.com/questions/2784

16-10-2019
|

Question

At my current project I had a network problem come up for which I could not find a solution. In a peer-to-peer network I needed to send an action to all peers, and each peer was to act on it only if it could verify that all other peers would also act on it.

That is, given a network of peers $P = { P_1, ..., P_n }$. We wish to send, from some source peer $P_s$ a message to all other peers. This message contains an action which must be performed. The peer should perform this action if and only if every other peer will perform the action. That is, it performs the action if it can verify that all other peers will also have receipt of the action and can perform the same verification.

The problem is subject to these conditions:

There is no implicit message delivery guarantee: if $P_x$ sends a message to $P_y$ there is no way for $P_x$ to know if $P_y$ gets the message. (Of course $P_y$ can send a receipt, but that receipt is subject to the same constraint)
Additional messages with any payload may be created.
There is no total ordering on the messages received by peers. Messages can arrive in a different time-order than which they were sent. This time-order may be unique per peer. Two messages sent in order from $P_x$ to $P_y$ are very unlikely to arrive out of order.
Messages can arrive at any point in the future (so not only are they not ordered, they can be indefintely delayed). A message cannot inherently be detected as lost. Most messages will be delivered quickly, or truly lost.
Each peer has a synchronized clock. It is accurate enough in the domain of scheduling an action and to approximately measure transmission delays. It is however not accurate enough to establish a total ordering on messages using timestamps.

I was not able to find a solution. I'm interested in a guarantee and not simply a high probability of being correct (which can be done simply be repeatedly sending confirmations from peer to peer and rejections upon any likely loss.) My stumbling block is the inability to verify that any particular message actually arrived. So even if $P_x$ determines there is an erorr, there is no guaranteed way to tell the other peers about it.

A negative confirmation is also acceptable. I have a suspicion that a guarantee cannot actually be achieved, only an arbitrarily high probability.

Solution

Your suspicion is correct: under these conditions, there is no way to guarantee a solution. Yours is a consensus problem, and there is a general result that states the impossibility of consensus in an asynchronous network with failures. This result is sometimes known as FLP after its authors (Fischer, Lynch and Paterson) [FLP].

The original consensus states roughly this (see [FLP] or [L96] or [T00]): suppose you have $N$ processes with $N \ge 3$. The processes can communicate by sending asynchronous messages: there is no notion of real time, messages can take arbitrarily long in transit. Furthermore, a process may fail (it stops emitting messages). Define a configuration as an assignment of boolean values (0 or 1) to each process, and consensus as a configuration in which all non-failed processes have the same value. Each process may change its assigned value based deterministically on the messages it has received; the collection of algorithms executing on each process defines a distributed algorithm. The FLP result states that there is no algorithm that can reach a consensus from an arbitrary initial configuration, such that there is always at least one process that keeps its initial value. (In other words, the only solutions are the trivial one where all processes pick the same value.)

In your case, the failures are in transmissions rather than in processes, but this only makes the situation more difficult: you can simulate a process failure by losing all the messages to and from that process.

The FLP result implies that there is no distributed algorithm that will always (given sufficiently many received messages) lead to either all processes deciding on taking the action, or no process taking the action, unless they are all primed to act or all primed not to act.

Having clocks does not help: they are part of the local algorithm. They would help if you could put a bound on the message travel times: timeouts in transmissions are the usual way of solving consensus in practice.

Bibliography

[FLP] Michael J. Fischer and Nancy A. Lynch and Michael S. Paterson. "Impossibility of distributed consensus with one faulty process". April 1985. Journal of the ACM 32 (2). DOI:10.1145/3149.214121

[L96] Nancy Lynch. Distributed Algorithms. Elsevier, 1996. 1-55860-348-4. http://groups.csail.mit.edu/tds/distalgs.html

[T00] Gerard Tel. Introduction to Distributed Algorithms. 2nd ed. Cambridge University Press, 2000. 0521794838.

OTHER TIPS

Informally, the problem is similar to Sicilian reasoning. Any given agent can only act if

it got the original message,
it knows that everybody else got the message,
it knows that everybody else knows it got the message,

and so on. In your model, there is no tool to break this dilemma: all communication is explicit and therefore prone to loss. In addition, agents have not way to decide whether another (known) agent is still there.

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange