Question

I have the following question about implementing Raft:

Consider the following scenario/implementation:

  1. The Raft leader receives a command entry and appends it to an in-memory array. It then sends the entry to the followers (with the heartbeat).
  2. The followers receive the entry, append it to their in-memory arrays, and then send a response acknowledging that they have received it.
  3. The leader then commits the entry by writing it to a durable store (a file). The leader sends the latest commit index in the next heartbeat.
  4. The followers then commit entries up to the leader's commit index by writing them to their own durable store (a file).
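The four steps above can be sketched as a minimal simulation. All class and method names here are hypothetical and are not taken from the linked implementation; lists stand in for the in-memory array and the durable file:

```python
# Sketch of the questioned flow: ack from memory, persist only on commit.

class Follower:
    def __init__(self):
        self.entries = []      # in-memory array (step 2)
        self.durable = []      # stands in for the durable store (file)

    def append_entries(self, entries, leader_commit):
        self.entries.extend(entries)              # step 2: memory only
        # step 4: persist everything up to the leader's commit index
        while len(self.durable) < min(leader_commit, len(self.entries)):
            self.durable.append(self.entries[len(self.durable)])
        return True                               # ack receipt

class Leader:
    def __init__(self, followers):
        self.entries = []
        self.durable = []
        self.commit_index = 0
        self.followers = followers

    def propose(self, cmd):
        self.entries.append(cmd)                  # step 1: memory only
        acks = sum(f.append_entries([cmd], self.commit_index)
                   for f in self.followers)
        if acks + 1 > (len(self.followers) + 1) // 2:
            self.commit_index = len(self.entries) # step 3: commit...
            self.durable.append(cmd)              # ...by persisting
        # the next heartbeat carries the new commit index (step 4)
        for f in self.followers:
            f.append_entries([], self.commit_index)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.propose("set x=1")
print(leader.commit_index)          # 1
print(followers[0].durable)         # ['set x=1']
```

Note that in this flow the followers acknowledge before anything reaches disk, which is exactly the property the answers below examine.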

One implementation of Raft (link: https://github.com/peterbourgon/raft/) seems to implement it this way. I wanted to confirm whether this is fine.

Is it OK if entries are maintained "in memory" by the leader and the followers until they are committed? In what circumstances might this scenario fail?


Solution 2

I found the answer by posting to the raft-dev Google group. I have added it here for reference.

Please reference: https://groups.google.com/forum/#!msg/raft-dev/_lav2NeiypQ/1QbUB52fkggJ

Quoting Diego's answer:

For safety even in the face of correlated power outages, a majority of servers needs to have persisted the log entry before its effects are externalized. Any less than a majority, and those servers could permanently fail, resulting in data loss/corruption.

Quoting from Ben Johnson's answer to my email regarding the same:

No, a server has to flush entries to disk before being considered part of the quorum.

For example, let's say you have a cluster of nodes called A, B, & C where A is the leader.

  1. Node A replicates an entry to Node B.

  2. Node B stores entry in memory and responds to Node A.

  3. Node A now has a quorum and commits the entry.

  4. Node A then gets partitioned away from Node B & C.

  5. Node B then dies and loses the in-memory copy of the entry.

  6. Node B comes back up.

  7. When Node B & C then go to elect a leader, the "committed" entry will not be in their log.

  8. When Node A rejoins the cluster, it will have an inconsistent log. The entry will have been committed and applied to the state machine so it can't be rolled back.
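Ben's eight steps can be sketched as a small simulation (the names and structure are hypothetical). The point is that an in-memory acknowledgement lets a "committed" entry vanish from every log in the surviving partition:

```python
# Sketch of Ben Johnson's scenario: entries are acknowledged from memory,
# and only the leader persists on commit.

class Node:
    def __init__(self, name):
        self.name = name
        self.memory_log = []   # volatile: lost on crash
        self.disk_log = []     # durable: survives crashes

    def append_in_memory(self, entry):
        self.memory_log.append(entry)
        return True            # ack without flushing to disk

    def crash_and_restart(self):
        self.memory_log = []   # volatile state is gone

a, b, c = Node("A"), Node("B"), Node("C")

# 1-2. Leader A replicates an entry to B, which acks from memory.
entry = "set x=1"
a.memory_log.append(entry)
acked = b.append_in_memory(entry)

# 3. A counts {A, B} as a quorum of three and commits (persists) the entry.
if acked:
    a.disk_log.append(entry)

# 4-6. A is partitioned away; B crashes and restarts, losing the entry.
b.crash_and_restart()

# 7. B and C hold an election; neither log contains the committed entry.
print(entry in b.memory_log + b.disk_log)   # False
print(entry in c.memory_log + c.disk_log)   # False
print(entry in a.disk_log)                  # True: committed, yet lost
```

Flushing the entry to disk in `append_in_memory` before returning the ack is what makes step 5 harmless: the entry would still be in B's log after the restart.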

Ben

Other tips

I disagree with the accepted answer.

  1. A disk isn't mystically durable. Assuming the disk is local to the server, it can permanently fail, so writing to disk clearly doesn't save you on its own. Replication is durability, provided that the replicas live in different failure domains, which they will if you are serious about durability. Of course there are many hazards a process faces that disks don't (the Linux OOM killer, memory exhaustion in general, power loss, etc.), but a dedicated process on a dedicated machine can do pretty well, especially if the log store is, say, ramfs, so a process restart isn't an issue.

  2. If log storage is lost, then host identity should be lost as well. A, B, and C identify logs: new log, new id. B "rejoining" after a (potential) loss of storage is simply a buggy implementation. The new process can't claim the identity of B because it can't be sure that it has all the information B had. The same holds even when always flushing to disk: if we replaced the disk of the machine hosting B, we couldn't just restart the process configured with B's identity. That would be nonsense. It should restart as D in both cases and then ask to join the cluster, at which point the problem of losing committed writes disappears in a puff of smoke.
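The "new log, new id" rule can be sketched as a startup check; the file layout and helper name below are purely illustrative:

```python
# Hypothetical startup check: a node whose durable log is missing must not
# reuse its old identity; it mints a fresh id and rejoins as a new member.

import os
import tempfile
import uuid

def node_identity(log_path):
    """Return the stored node id, or mint a fresh one if the log is gone."""
    id_path = log_path + ".id"
    if os.path.exists(log_path) and os.path.exists(id_path):
        with open(id_path) as f:
            return f.read().strip()     # storage intact: keep identity
    # storage lost (or first boot): new log, new id
    new_id = str(uuid.uuid4())
    with open(log_path, "w"):           # start an empty log
        pass
    with open(id_path, "w") as f:
        f.write(new_id)
    return new_id

log_path = os.path.join(tempfile.mkdtemp(), "raft.log")
first = node_identity(log_path)         # first boot mints an id
second = node_identity(log_path)        # restart with intact storage
print(first == second)                  # True: identity preserved
os.remove(log_path)                     # simulate losing the log
third = node_identity(log_path)
print(first == third)                   # False: lost log => new identity
```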

I think entries should be durable before committing.

Let's take Figure 8(e) of the extended Raft paper as an example. If entries are made durable only when committed, then:

  1. S1 replicates entry 4 to S2 and S3, then commits entries 2 and 4.
  2. All servers crash. Because S2 and S3 don't know that S1 has committed 2 and 4, they won't commit them. Therefore S1 has committed 1, 2, and 4, while S2, S3, S4, and S5 have committed only 1.
  3. All servers restart except S1.
  4. Because only committed entries are durable, S2, S3, S4, and S5 each have the same single entry: 1.
  5. S2 is elected leader.
  6. S2 replicates a new entry to all the other servers except the crashed S1.
  7. S1 restarts. Because S2's log is newer than S1's, S1's entries 2 and 4 are replaced by the new entry.

As a result, the committed entries 2 and 4 are lost. So I think un-committed entries should also be durable.
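The difference the example turns on can be sketched in a few lines (the structure is hypothetical; only what is on disk survives a crash):

```python
# Contrast durable-on-commit with durable-on-append under the same crash.

def survive_crash(log):
    return list(log["disk"])   # only durable entries survive a restart

# Durable only on commit: S2 appended entries 2 and 4 in memory but had
# not learned they were committed before the crash, so they never hit disk.
s2 = {"mem": [1, 2, 4], "disk": [1]}
print(survive_crash(s2))       # [1] -> committed entries 2 and 4 are gone

# Durable on append: every entry reaches disk before it is acknowledged.
s2 = {"mem": [1, 2, 4], "disk": [1, 2, 4]}
print(survive_crash(s2))       # [1, 2, 4] -> 2 and 4 survive the crash
```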

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow