Why is a threshold determined for Byzantine Fault Tolerance of an “Asynchronous” network? (where it cannot tolerate even one faulty node)

https://cs.stackexchange.com/questions/121187

29-09-2020
|

Question

In following answer (LINK: https://bitcoin.stackexchange.com/a/58908/41513), it has been shown that for Asynchronous Byzantine Agreement:

"we cannot tolerate 1/3 or more of the nodes being dishonest or we lose either safety or liveness."

For this proof, the following conditions/requirements has been considered:

Our system is asynchronous.

Some participants may be malicious.

We want safety.

We want liveness.

A fundamental question is that:

With considering the well-known paper titled: "Impossibility of Distributed Consensus with One Faulty Process" (LINK: https://apps.dtic.mil/dtic/tr/fulltext/u2/a132503.pdf)

showing that:

no completely asynchronous consensus protocol can tolerate even a single unannounced process death,

Can we still assume that the network is asynchronous ? As in that case the network cannot tolerate even one faulty node.

Solution

The answer has to do with tracking the precise assumptions that are made in these different results. In short, while both results assume asynchrony, the "impossibility of distributed consensus with one faulty process" requires a stronger form of liveness and determinism, and that makes consensus impossible.

1. Impossibility of Distributed Consensus with One Faulty Process

This seminal result (of Michael Fischer, Nancy Lynch, and Michael Paterson) is about distributed consensus where the system is not just asynchronous, but also satisfies:

Determinism. The consensus algorithm does not use any randomness.
Liveness under message delays. Not only may messages be delayed arbitrarily, but we also must guarantee liveness even in the presence of such continued delays.

Let us see why tehse properties are too strong, and make distributed consensus impossible.

Consider an example: Alice, Bob, and Charlie are friends and want to decide on where to meet for dinner. It is possible that one of them unexpectedly goes AWOL (or decides they are not interested in being friends anymore) and stops responding to messages. In this case, the other two can still decide on a place to meet. Now what should they do?

The obvious approach would be that:

Alice just decides where to go, and tells Bob and Charlie.

But this does not work because Alice may be the one who goes AWOL. So to fix this, the next most obvious approach might be:

Both Alice and Bob tell everyone else where to go. If everyone hears from Alice, then they will go where Alice says; otherwise, they will go where Bob says.

But this has a new problem. Suppose you are Charlie. If you hear from Alice, you know where to go. If you hear from neither of them, you wait to hear. The problem is when you have heard from Bob, but not Alice. Because there are arbitrary message delays, you cannot commit to go where Bob said: Alice might have said where to go, and you just have not received it yet! So you are completely stuck, and if it happens that Alice has gone AWOL, you will just keep waiting forever.

The problem here is that we have no way to abort the transaction; no way to say, "OK, this isn't working, it's been too long and I haven't received a delay -- let's try again." Real-world consensus algorithms (the best known being Paxos) have the possibility that rounds fail due to network delays, and in this case they just try again, hoping for shorter network delays. Additionally, it is possible to get around the problem by using randomized protocols which usually work, and only go on forever with small (or even zero) probability.

2. Asynchronous Byzantine Agreement where less than $1/3$ of the nodes fail

The bitcoinSE post you link glosses over the issue of liveness, saying only that it is "the ability to continue to make forward progress". In fact, the above result shows that the strongest form of this is impossible, so we have to relax our requirements / assumptions.

Let's consider two examples. In Miguel Castro and Barbara Liskov's "Practical Byzantine Fault Tolerance", they achieve practical liveness with less than a third of nodes being faulty by assuming that message delays do not continue to grow indefinitely. As the authors state:

We guarantee liveness, i.e., clients eventually receive replies to their requests, provided at most $\frac{n-1}{3}$ replicas are faulty and $delay(t)$ does not grow faster than $t$ indefinitely...This is a rather weak synchrony assumption that is likely to be true in any real system provided network faults are eventually repaired, yet it enables us to circumvent the impossibility result in [9].

Here [9] is the impossibility result discussed above. In plain terms for our example, they avoid the above problem with Charlie by requiring a weak form of synchrony: Charlie does not simply have to keep waiting forever, as we know that message delays can only grow termporarily, and not indefinitely. (Of course the actual algorithm gets a lot more complex, but that is partly the conceptual idea of why liveness is possible.)

In Ran Canetti and Tal Rabin's "Fast Asynchronous Byzantine Agreement with Optimal Resilience", they use randomness to get liveness with less than $n/3$ Byzantine node failures. From their paper:

In this setting, we describe an $(\lceil\frac{n}{3}\rceil- 1)$-resilient Byzantine Agreement protocol. With overwhelming probability all the non-faulty players complete the execution of the protocol. Conditioned on the event that all the non-faulty players have completed the execution of the protocol, they do so in constant expected time.

Here $(\lceil\frac{n}{3}\rceil- 1)$-resilient just means less than a third of nodes are Byzantine. Note the key words with overwhelming probability. So they have a probabilistic algorithm which has many possible "runs", and overwhelmingly most of them work. Note that the above impossibility result implies that there must always be some runs where liveness does not occur, i.e. there is no consensus:

Fischer, Lynch and Paterson’s [FLP] seminal impossibility result for deterministic protocols implies that any (randomized) protocol reaching BA must have nonterminating runs.

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange