I use etcd[1] which is an implementation of Raft for cluster configuration. I ha...

scottdw2 · on May 28, 2015

A few things:

1. You don't have to round trip to every member before acking a write request, only to a majority of them.

2. If you use the right data structures for your state machine (like persistent data structures), you don't have to fully commit a write before you start the next write. That is, as long as you can quickly revert to a previously committed state (which you can with persistent data structures, it just becomes a CAS on a pointer) you can pipeline requests and batch commits. That lets you start on write n+1 using the state machine results of write n, even though write n hasn't been committed yet. You can then commit a bunch of results (n, n-1, n-2, ...) as a single batch. That does create a dependency that some write j may be rolled-back because a write i < j failed to commit, but you would have had that dependency anyways (you wouldn't have even started write i unless write j had committed). So, although latency of a single request is bounded by the time to replicate a log entry to n/2 + 1 hosts, the throughput is not bounded by the latency of a single write.

The raft paper doesn't actually say you can do this (it says things like "don't apply state machines until a transaction has been committed), but that's because it assumes arbitrary side effecting state machines. However, with some restrictions (like using persistent data structures), you can relax that requirement. Relaxing that requirement can give quite good throughput.

3. You can horizontally scale raft the same way you horizontally scale any other distributed data store: using consistent hashing. You would use a hash ring of raft clusters.

4. You can loose up to n/2 hosts from a raft cluster and it will still work.

5. Raft is "multiple servers with failover and a version clock". In fact it's "multiple servers with failover, a version clock, and strong sequential consistency".

joshuak · on May 28, 2015

Thanks, what confuses me though is why you must retain n/2 hosts, and not be able to recover at least in many cases from a loss down to a single host. I understand that a known majority insures that a write will always persist/propagate even after a network segmentation and later re-join. But what if the network isn't segmented but the servers are simply down, or there wouldn't be a conflict regardless because the relevant clients couldn't have written to the segmented network anyway? When they rejoin they would all know they are behind the one that remained online (as one does with version clocks), so they could just get all missing events and be consistent again.

The majority requirement would, seems to me, reduce reliability over the ability to failover down to a single server.

Does anyone know of a list or test suite of all failures that a consistency algorithm is resistant too, then at least I could understand the reasoning. I get the impression that there are a lot of failure modes that are handled at some cost, but are not relevant to my use cases.

scottdw2 · on May 28, 2015

In a leader election, a node won't vote for a candidate that doesn't have a record it believes to be committed. Thus, writing to a majority ensures that when the leader dies, the new leader will have all the committed dats. Thus, up to n/2 - 1 nodes can be lost with a guarantee that no committed data is lost.

If you don't need strong consistency (after a write commits, all future reads will see the write), you can use simpler replication strategies.

Redis, for example, using async replication, that is not guaranteed to succeed. A redis master may ack a write, and a subsequent read may see it, but a loss of the master before replication occurs can result in data loss. The failover is not guaranteed to contain all writes. Some times weak consistency is ok. Sometimes it's not, but eventual consistency is ok. Other times strong consistency is needed.

Raft is useful when you need strong consistency.

Raft does support membership changes (adding and removing nodes from the cluster), so it can support losing more than n / 2 - 1 nodes, just not simultaneously.

joshuak · on May 28, 2015

(click) Now I finally get it! Thanks so much, very helpful.

justinsb · on May 28, 2015

How do you propose to differentiate between the 'network is segmented' vs 'servers are simply down' cases?

joshuak · on May 28, 2015

I meant to refer to the case in which there are potentially conflicting writes (segmented) vs no writes (offline).