The Problem We're Actually Trying to Solve
Imagine you have three database replicas. A client sends a write request. All three replicas need to agree on the order of writes so that any client reading from any replica sees the same data. Simple enough, right?
Now add failure. One of the replicas crashes halfway through. Does the write count? Did the other two record it? Does the crashed node have it when it comes back? Now add a network partition — replica 1 can talk to replica 2 but not replica 3, and replica 3 can't reach anyone. Who accepts the write?
This is the consensus problem: getting a group of nodes to agree on a single value, even when nodes crash and messages get delayed or lost. It sounds simple. It is not. And the reason it's not simple isn't just implementation difficulty — it's that the problem has fundamental mathematical limits.
The precise definition
An algorithm solves consensus if it satisfies these three properties at the same time:
- Termination: Every node that doesn't crash eventually makes a decision. The algorithm doesn't hang forever.
- Agreement: Every node that decides, decides the same value. No two nodes decide differently.
- Validity: The decided value must have been proposed by some node. You can't just pick a value out of thin air.
These look easy to satisfy separately. Satisfying all three together, in the presence of failures, is what makes consensus hard.
The most subtle property is termination. An algorithm that says "wait until you're certain" can satisfy agreement and validity easily — but it might wait forever. The challenge is designing an algorithm that terminates even when nodes crash and messages are delayed.
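To make the three properties concrete, here is a minimal Go sketch (every name here is invented for illustration, not taken from any real library) that checks a finished run of some hypothetical protocol against agreement and validity. Termination is the odd one out: it is a claim about every possible execution, so you can't verify it by inspecting one finite run, which is exactly why it's the hard property.

```go
package main

import "fmt"

// Run records one finished execution of a hypothetical consensus protocol.
type Run struct {
	Proposed []string          // values proposed by the nodes
	Decided  map[string]string // nodeID -> decided value (crashed nodes absent)
}

// agreement: every node that decided, decided the same value.
func agreement(r Run) bool {
	var first string
	seen := false
	for _, v := range r.Decided {
		if seen && v != first {
			return false
		}
		first, seen = v, true
	}
	return true
}

// validity: every decided value was actually proposed by some node.
func validity(r Run) bool {
	proposed := map[string]bool{}
	for _, v := range r.Proposed {
		proposed[v] = true
	}
	for _, v := range r.Decided {
		if !proposed[v] {
			return false
		}
	}
	return true
}

func main() {
	r := Run{
		Proposed: []string{"x", "y"},
		Decided:  map[string]string{"A": "x", "B": "x"}, // node C crashed
	}
	fmt.Println(agreement(r), validity(r)) // true true
}
```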
Why Consensus Is Hard: The FLP Result
In 1985, Fischer, Lynch, and Paterson proved something startling: in an asynchronous distributed system where even one node might fail, no deterministic consensus algorithm can guarantee termination. This is the FLP impossibility result, and it's one of the most important theoretical results in distributed systems.
Let's understand what "asynchronous" means here. In an asynchronous system, there are no guaranteed bounds on how long a message can take to arrive or how long a node can take to respond. You can't tell the difference between a node that crashed and a node that's just slow. You can't tell the difference between a message that was lost and a message that's still in transit.
The FLP result says: if you cannot bound these delays, then a single crashed node can make it impossible for the remaining nodes to reach a decision — because no one can tell if it crashed or is just slow, and waiting for it might mean waiting forever.
```
Scenario: 3 nodes, Node C is unresponsive

Node A ──── propose(1) ────► Node B
   │
Node A ─── ? ──────────────► Node C   ← crashed? or just slow?

If A waits for C       → might wait forever (breaks Termination)
If A decides without C → C might wake up (risks Agreement)
```
This is not a fixable implementation bug. It's a provable limit of what's achievable.
But we do have consensus algorithms — how?
Real systems work around FLP in a practical way: they give up on the purely asynchronous model and add timeouts. If you add timeouts, you're saying "if a node doesn't respond within X milliseconds, treat it as failed." This is a weak form of synchrony — you're adding a bound on message delays, even if that bound is imprecise.
With timeouts, you can build consensus algorithms that work well in practice, even though they can't formally guarantee termination in the face of arbitrary asynchrony. This is the pragmatic trade-off that Paxos, Raft, and all real consensus systems make.
FLP doesn't mean "consensus is impossible." It means "consensus is impossible if you assume zero guarantees about timing." Since real systems do have roughly-bounded network delays, practical consensus is achievable. FLP explains why your consensus algorithm will sometimes need to retry and why you should never design for zero latency overhead.
Paxos: The Original Algorithm
Paxos was invented by Leslie Lamport in the 1980s and published in 1998. For many years it was the only well-understood consensus algorithm. Almost every consensus system that predates 2014 is based on Paxos or a variant of it: Google Chubby, Apache ZooKeeper, and many others.
Paxos solves single-value consensus — getting a group of nodes to agree on a single value. Here's how it works.
The two phases
In Paxos, a node that wants to propose a value is called a proposer. The nodes that vote are called acceptors. There are usually more acceptors than proposers, and a value is chosen when a majority of acceptors agree on it.
Phase 1: Prepare / Promise
The proposer picks a proposal number n (which must be higher than any number it has used before) and sends a Prepare(n) message to a majority of acceptors.
Each acceptor, on receiving Prepare(n), does two things: it promises never to accept any proposal with a number less than n, and it reports back the highest-numbered proposal it has already accepted (if any).
```
Proposer                 Acceptor 1    Acceptor 2    Acceptor 3
   │──── Prepare(n=5) ──────►│              │             │
   │──── Prepare(n=5) ─────────────────────►│             │
   │──── Prepare(n=5) ───────────────────────────────────►│
   │                         │              │             │
   │◄─── Promise(n=5, ───────│              │             │
   │       last=none)        │              │             │
   │◄─── Promise(n=5, ──────────────────────│             │
   │       last=none)        │              │             │
   │                         │              │      ← slow, ignore
   │
Proposer has promises from a majority (2 of 3). Proceed.
```
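Here is roughly what Phase 1 looks like from the acceptor's side, as a Go sketch. The types and names are invented for illustration; a real acceptor must also persist its promise and its accepted proposal to stable storage before replying, which this sketch skips.

```go
// Proposal is a numbered proposal carrying a candidate value.
type Proposal struct {
	N     int
	Value string
}

// Acceptor holds the state Paxos requires an acceptor to remember.
type Acceptor struct {
	promisedN int       // highest proposal number this acceptor has promised
	accepted  *Proposal // highest-numbered proposal accepted so far, if any
}

type PromiseReply struct {
	OK       bool
	Accepted *Proposal // what we already accepted, so the proposer can see it
}

func (a *Acceptor) HandlePrepare(n int) PromiseReply {
	if n <= a.promisedN {
		return PromiseReply{OK: false} // already promised an equal or higher round
	}
	a.promisedN = n // promise: never accept any proposal numbered below n
	return PromiseReply{OK: true, Accepted: a.accepted}
}
```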
Phase 2: Accept / Accepted
Now the proposer looks at all the Promise responses. If any acceptor reported a previously accepted value, the proposer must adopt the value of the highest-numbered proposal among those responses (not its own value). This is the key safety rule — it ensures you don't accidentally overwrite something that was already chosen.
If no previously accepted value was reported, the proposer can use its own value. It then sends an Accept(n, value) message to a majority of acceptors.
Each acceptor that hasn't made a newer promise accepts the proposal and sends an Accepted message back. If a majority of acceptors send back Accepted, the value is chosen.
```
Proposer                 Acceptor 1    Acceptor 2    Acceptor 3
   │──── Accept(n=5, ───────►│              │             │
   │       value="x")        │              │             │
   │──── Accept(n=5, ──────────────────────►│             │
   │       value="x")        │              │             │
   │                         │              │             │
   │◄─── Accepted(n=5) ──────│              │             │
   │◄─── Accepted(n=5) ─────────────────────│             │
   │
Value "x" is chosen. Majority accepted it.
```
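Continuing the same sketch, Phase 2 has two halves: the proposer picking a value from the Promise replies (the safety rule in code), and the acceptor deciding whether to accept. As before, the names are illustrative and persistence is omitted.

```go
// chooseValue applies the safety rule: adopt the value of the
// highest-numbered proposal reported in the promises; only if none was
// reported may the proposer use its own value.
func chooseValue(replies []PromiseReply, myValue string) string {
	var highest *Proposal
	for _, r := range replies {
		if r.Accepted != nil && (highest == nil || r.Accepted.N > highest.N) {
			highest = r.Accepted
		}
	}
	if highest != nil {
		return highest.Value // re-propose the already-accepted value
	}
	return myValue
}

func (a *Acceptor) HandleAccept(p Proposal) bool {
	if p.N < a.promisedN {
		return false // we promised a newer proposer not to accept this
	}
	a.promisedN = p.N
	a.accepted = &p
	return true // reply "Accepted"
}
```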
Why the constraint on Phase 2 is so clever
The rule "if any acceptor reported a previously accepted value, use that" is the heart of Paxos's safety guarantee. It prevents a scenario where two proposers both think they have a majority and choose different values.
Suppose Proposer A chose value "x" in a previous round, and a majority of acceptors accepted it. Now Proposer B starts a new round with a higher proposal number. When B sends Prepare messages, at least one acceptor from the majority that accepted "x" will respond — because B needs a majority, and A's majority and B's majority must overlap by at least one node. That overlapping acceptor will tell B about "x". B is then forced to use "x" too. Agreement is preserved.
The magic of majority quorums is that any two majorities in a cluster of N nodes must share at least one node. A majority requires more than N/2 nodes, so two majorities require more than N nodes total — but there are only N nodes, so they must overlap. This overlap is how information flows between rounds and how consistency is maintained.
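The arithmetic is worth seeing once. A tiny, self-contained Go program (nothing here comes from a real library):

```go
package main

import "fmt"

// majority: the smallest group that is more than half of n.
func majority(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		q := majority(n)
		// Two quorums of size q drawn from n nodes must share at least
		// 2q - n nodes, and 2q - n >= 1 for every n.
		fmt.Printf("n=%d  majority=%d  guaranteed overlap=%d\n", n, q, 2*q-n)
	}
}
```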
The problem with Paxos in practice
Paxos solves single-value consensus, but real systems need to agree on a sequence of values — a replicated log. Extending Paxos to Multi-Paxos (as it's called) requires answering a bunch of questions that the original paper doesn't answer: How do you elect a stable leader? How do you handle gaps in the log? How do you recover from leader crashes? How do you add or remove nodes?
Each team that has implemented Paxos has had to invent their own answers to these questions. Google's Chubby team famously wrote that "there are significant gaps between the description of the Paxos algorithm and the needs of a real-world system." Implementations diverge in subtle ways, and the correctness of those divergences is difficult to verify.
Never implement Paxos yourself unless you are building a distributed systems infrastructure product and have months to spend on it. The gap between the algorithm and a correct, production-ready implementation is enormous. Use an existing system (etcd, ZooKeeper) or use Raft, which was explicitly designed to be implementable.
Raft: Consensus Made Readable
Raft was published in 2014 by Diego Ongaro and John Ousterhout. The explicit goal was understandability — they ran user studies to verify it was easier to understand than Paxos. Raft is now the basis for etcd (which backs Kubernetes), CockroachDB, TiKV, and many others.
Where Paxos tackles the problem as a whole, Raft decomposes it into three relatively independent sub-problems:
- Leader election — choosing one node to coordinate all writes
- Log replication — the leader accepts writes and replicates them to followers
- Safety — ensuring that committed entries are never lost, even when leaders change
The core idea: one leader at a time
Raft is a strongly leader-based algorithm. At any given time, there is exactly one leader (or no leader, briefly, during an election). All client writes go to the leader. The leader replicates writes to followers. This makes the system much easier to reason about — there's one place where all decisions are made.
Time in Raft is divided into terms. Each term starts with an election. Terms are numbered with increasing integers. The term number serves as a logical clock — if you receive a message from a higher term, you know your information is stale.
```
   Term 1          Term 2         Term 3         Term 4
──────────────┼──────────────┼──────────────┼──────────────
 [Leader: A]    [election]     [Leader: B]    [Leader: B]
  A crashed     split vote      B elected     B re-elected

Each term has at most one leader. Terms act as a logical clock.
```
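In code, the logical-clock rule is a few lines. The Go fragments in the rest of this chapter belong to one illustrative package; the names are invented for this sketch, not taken from any real Raft implementation.

```go
// State is the role a Raft node is currently playing.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Node struct {
	currentTerm int
	state       State
	votedFor    string // who we voted for in currentTerm ("" = nobody yet)
}

// observeTerm applies the logical-clock rule: any message from a higher term
// means our information is stale, so adopt the term and step down.
func (n *Node) observeTerm(msgTerm int) {
	if msgTerm > n.currentTerm {
		n.currentTerm = msgTerm
		n.votedFor = "" // new term, new vote
		n.state = Follower
	}
}
```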
Leader Election
Every node in Raft is in one of three states: leader, follower, or candidate. Nodes start as followers. Followers expect a heartbeat from the leader at regular intervals. If a follower doesn't hear from a leader within its election timeout (typically 150–300ms), it assumes the leader has crashed and starts an election.
To start an election, a node converts itself to a candidate, increments its term number, votes for itself, and sends RequestVote messages to all other nodes.
A node grants its vote to a candidate if two conditions are met: it hasn't already voted in this term, and the candidate's log is at least as up-to-date as the voter's log. The second condition is crucial for safety — it ensures a candidate that is missing committed entries can't become leader and overwrite those entries.
```
                  ┌──────────────────────────────┐
                  │ times out with no winner:    │
                  │ start a new election         │
                  ▼                              │
┌────────────┐  timeout (no heartbeat)   ┌───────┴────┐  receives votes   ┌────────┐
│  Follower  │──────────────────────────►│ Candidate  │─ from majority ──►│ Leader │
└────────────┘                           └───────┬────┘                   └────────┘
      ▲                                          │
      │          discovers current leader        │
      └───────────── or higher term ─────────────┘
```
If a candidate gets votes from a majority of nodes, it becomes the leader. It immediately sends heartbeat messages to all followers to assert its leadership and prevent new elections.
If two candidates start elections at the same time, it's possible no one gets a majority (split vote). Both candidates time out, increment their term, and try again. Raft uses randomized election timeouts — each node picks a random timeout between, say, 150ms and 300ms — so split votes are rare and resolve quickly.
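Here are the voting rule and the randomized timeout in the sketch, continuing the Node type from above (it needs math/rand and time). A real implementation must also persist currentTerm and votedFor before answering, and this fragment assumes Node additionally tracks lastTerm and lastIndex for its final log entry.

```go
type VoteRequest struct {
	Term         int
	CandidateID  string
	LastLogIndex int
	LastLogTerm  int
}

func (n *Node) handleRequestVote(req VoteRequest) bool {
	n.observeTerm(req.Term) // step down and reset our vote if the term is newer
	if req.Term < n.currentTerm {
		return false // stale candidate
	}
	if n.votedFor != "" && n.votedFor != req.CandidateID {
		return false // at most one vote per term
	}
	// logUpToDate is shown in the Safety section below.
	if !logUpToDate(req.LastLogTerm, req.LastLogIndex, n.lastTerm, n.lastIndex) {
		return false // candidate is missing entries we have
	}
	n.votedFor = req.CandidateID
	return true
}

// electionTimeout picks a random timeout in the 150-300ms band mentioned
// above, so two candidates rarely collide in consecutive elections.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}
```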
Log Replication
Once a leader is elected, it handles all client requests. When it receives a write, it doesn't apply it immediately. Instead, it appends it to its own log as a new entry and sends AppendEntries messages to all followers in parallel, asking them to do the same.
An entry is considered committed once the leader has received acknowledgment from a majority of nodes. Once committed, the leader applies the entry to its state machine and responds to the client. Future AppendEntries messages will tell followers that the entry is committed, and they'll apply it too.
```
Client writes "x=5"
        │
        ▼
Leader  [1:set_a] [2:set_b] [3:x=5]   ← append to own log
        │
        ├──── AppendEntries ────► Follower 1  [1][2][3]  ✓ ack
        ├──── AppendEntries ────► Follower 2  [1][2][3]  ✓ ack
        └──── AppendEntries ────► Follower 3  ← unreachable

Majority (Leader + 2 followers) have entry 3.
Entry 3 is now committed. Leader responds to client.
```
Notice that Follower 3 being down didn't stop the write from committing. The cluster needs a majority, not everyone. This is what gives Raft its fault tolerance — a cluster of 5 nodes can tolerate 2 failures; a cluster of 3 can tolerate 1.
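The commit rule itself is a counting exercise. A sketch of how a leader might advance its commit index, assuming matchIndex[i] tracks the highest entry known to be replicated on follower i. (One real-world subtlety is omitted: a Raft leader only commits entries from its own term by counting replicas, an edge case the Raft paper's Figure 8 explains.)

```go
// commitIndex returns the highest log index replicated on a majority of the
// full cluster, or 0 if no entry has majority replication yet.
func commitIndex(matchIndex []int, leaderLastIndex, clusterSize int) int {
	for n := leaderLastIndex; n > 0; n-- {
		count := 1 // the leader itself has every entry in its own log
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count >= clusterSize/2+1 {
			return n // entries 1..n are all committed
		}
	}
	return 0
}
```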
Raft clusters are almost always an odd number of nodes. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2. A 7-node cluster tolerates 3. You almost never need more than 5 nodes for a consensus cluster — beyond that, the coordination overhead grows faster than the fault tolerance benefit.
Log Consistency: the guarantee that makes it all work
Raft maintains a strong invariant called the Log Matching Property: if two logs contain an entry with the same index and term, then all entries up to that index are identical in both logs.
This is enforced by AppendEntries: along with the new entries, the leader includes the index and term of the entry immediately before the new entries. A follower accepts the new entries only if its own log has an entry matching that index and term. If it doesn't, it refuses and tells the leader. The leader then backs up and tries sending more of the log history until it finds where the follower's log diverges.
This consistency check means a new leader can always bring all followers in sync, even if they have different gaps in their logs.
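The consistency check is mechanical. A sketch of the follower's side, with invented message and entry types:

```go
type Entry struct {
	Term int
	Cmd  string
}

type AppendEntries struct {
	Term         int
	PrevLogIndex int // index of the entry immediately before Entries
	PrevLogTerm  int // term of that entry
	Entries      []Entry
	LeaderCommit int
}

// consistent reports whether our log matches the leader's at the point just
// before the new entries. The log is 1-indexed: log[i-1] is the entry at i.
func consistent(log []Entry, req AppendEntries) bool {
	if req.PrevLogIndex == 0 {
		return true // appending from the very start always matches
	}
	if req.PrevLogIndex > len(log) {
		return false // gap: leader must back up and resend earlier entries
	}
	return log[req.PrevLogIndex-1].Term == req.PrevLogTerm
}
```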
Safety: the rule that prevents data loss
Here's the critical safety property Raft must guarantee: if an entry has been committed, it must not be lost, even across leader elections. A new leader must always have all committed entries in its log.
Raft achieves this through the vote restriction: a node grants its vote only to candidates whose logs are at least as up-to-date as its own — so a candidate can only win an election if its log is at least as up-to-date as a majority of the cluster's. "Up-to-date" means: the candidate's last log entry has a higher term, or if terms are equal, the candidate's log is at least as long.
Since a committed entry was written to a majority of nodes, and a new leader needs votes from a majority of nodes, at least one node in the new leader's majority must have the committed entry. The vote restriction ensures that the new leader has that entry too.
Raft guarantees that any committed entry survives any number of leader elections. The combination of majority quorums and the vote restriction creates an unbroken chain: no new leader can be elected without having all the entries that a previous leader committed.
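The "at least as up-to-date" comparison is short enough to show whole; this is the logUpToDate check the election sketch referenced earlier:

```go
// logUpToDate: the candidate wins the comparison if its last entry's term is
// higher, or the terms are equal and its log is at least as long.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```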
The Split-Brain Problem
Split-brain is what happens when a network partition causes a cluster to split into two groups, and both groups believe they are the majority. If both groups elect a leader and start accepting writes, you now have two independent streams of commits diverging from each other. Reconciling them later is extremely painful — and in many cases, the inconsistency is simply unresolvable.
```
Network partition splits 5-node cluster into two groups

Group A (nodes 1,2)             Group B (nodes 3,4,5)
     Leader A                        Leader B
        │                               │
 writes: x=1, x=2, x=3           writes: x=10, x=11

Partition heals. Now what? x=3 or x=11?
Both were "committed" — but to different majorities.

THIS CANNOT HAPPEN IN RAFT.
```
Raft prevents split-brain through the majority quorum requirement. In a 5-node cluster, a majority is 3 nodes. Group A (2 nodes) cannot elect a leader — it can't get 3 votes. Group B (3 nodes) can elect a leader and can commit entries. When the partition heals, Group A's nodes rejoin and find they are behind; they simply catch up by copying the log from the Group B leader.
The key insight: you can't have two majorities in the same cluster at the same time. Any two groups that each claim majority must overlap — and if they overlap, the overlapping node won't vote for two different leaders in the same term.
The price of preventing split-brain is that the minority partition (Group A, 2 nodes) becomes unavailable for writes. This is CAP theorem in action — Raft chooses consistency over availability. During a partition, it's safer to refuse writes than to allow divergence. Systems that need 100% write availability cannot use strong consensus.
Membership Changes: Adding and Removing Nodes
Adding or removing nodes from a Raft cluster is surprisingly dangerous. The naive approach — just update the cluster config — creates a window where two different majorities can exist simultaneously.
Imagine you have a 3-node cluster (majority = 2) and you're adding 2 nodes to get to 5 (majority = 3). During the transition, old nodes think majority = 2 and new nodes think majority = 3. A leader election could succeed under both definitions, potentially electing two leaders.
Raft's solution is either joint consensus (the approach in the original paper) or the simpler single-server change approach: only add or remove one node at a time, and wait for that change to commit before making the next one. Changing membership by one node at a time never shifts the majority count enough to allow two simultaneous majorities, as the sketch below shows.
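A sketch of that arithmetic, with overlaps reporting whether an old-config majority and a new-config majority are guaranteed to share a node (when growing or shrinking one node at a time, one node set contains the other, so their union is the larger set):

```go
// overlaps: two quorums are guaranteed to intersect iff together they
// outnumber the larger of the two node sets.
func overlaps(oldN, newN int) bool {
	oldQ, newQ := oldN/2+1, newN/2+1
	total := newN
	if oldN > total {
		total = oldN
	}
	return oldQ+newQ > total
}

// overlaps(3, 4) == true  (2+3 > 4): safe single-node step
// overlaps(4, 5) == true  (3+3 > 5): safe single-node step
// overlaps(3, 5) == false (2+3 = 5): the unsafe jump described above
```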
This is why tools like etcd have explicit cluster resize procedures that must be followed carefully. Skipping steps or doing multiple changes at once can corrupt the cluster.
What You're Actually Buying: etcd, ZooKeeper, and Consul
Most engineers will never implement Raft themselves. Instead, they'll use a system that already implements it. Understanding what these systems actually provide will help you use them correctly.
etcd
etcd is a distributed key-value store that uses Raft. It's the backing store for Kubernetes — cluster configuration, service discovery data, and distributed locks all flow through etcd. Every write to etcd goes through Raft, which makes it linearizable: if a write succeeds, any subsequent linearizable read (etcd's default read mode) will see that write, no matter which node serves it.
etcd also supports watches — a client can watch a key and get notified the instant it changes. This makes it useful for leader election: multiple services compete to write a key; whoever succeeds is the leader; others watch the key and take over when it disappears (because the leader crashed and its TTL-based key expired).
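A sketch of that pattern using etcd's official Go client (go.etcd.io/etcd/client/v3) and its concurrency helper, which packages the lease-plus-watch dance into a single Campaign call. The endpoints, key prefix, node name, and TTL below are placeholder values:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session holds a lease; if this process dies, the lease expires
	// and the leadership key disappears automatically.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	e := concurrency.NewElection(sess, "/my-service/leader")
	// Campaign blocks until we become leader (or the context is cancelled).
	if err := e.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("we are the leader; do leader-only work here")
}
```

If the leader process dies, its lease expires and the next Campaign in line wins — exactly the TTL-based failover described above.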
ZooKeeper
ZooKeeper predates Raft and uses a Paxos-like protocol called ZAB (ZooKeeper Atomic Broadcast). The data model is a tree of nodes (znodes), each of which can hold data and have watchers attached to it. ZooKeeper has been the backbone of Kafka (for topic/partition metadata), HBase, and many other large-scale systems.
ZooKeeper's ephemeral nodes are one of its most powerful features: a znode that is automatically deleted when the session of the client that created it ends. This is the building block for distributed locks and leader election. If a leader crashes, its session times out, its ephemeral node disappears, and a watcher on that node triggers a new election.
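A sketch of that pattern with the community Go client (github.com/go-zookeeper/zk); the server address, path, and payload are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	conn, _, err := zk.Connect([]string{"localhost:2181"}, time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// FlagEphemeral: the znode is deleted automatically when this client's
	// session ends — the liveness signal behind locks and leader election.
	_, err = conn.Create("/myapp/leader", []byte("node-1"),
		zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
	switch err {
	case nil:
		log.Println("we are the leader")
	case zk.ErrNodeExists:
		// Someone else is leader: watch for the znode to disappear.
		if _, _, ch, werr := conn.ExistsW("/myapp/leader"); werr == nil {
			<-ch // fires when the leader's session ends
			log.Println("leader gone; campaign again")
		}
	default:
		log.Fatal(err)
	}
}
```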
Consul
Consul uses Raft and provides a higher-level API compared to etcd and ZooKeeper. It includes built-in support for service discovery, health checking, and key-value storage. The API is designed around common coordination patterns, so you don't have to build leader election or distributed lock primitives yourself — they're available directly.
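For comparison, a sketch of leader election with Consul's Go client (github.com/hashicorp/consul/api), whose LockKey helper wraps the session and KV machinery; the key is a placeholder:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// LockKey bundles session creation and KV acquire into a lock primitive.
	lock, err := client.LockKey("service/myapp/leader")
	if err != nil {
		log.Fatal(err)
	}

	// Lock blocks until acquired; the returned channel closes if we lose it.
	lostCh, err := lock.Lock(nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("we are the leader")
	<-lostCh // leadership lost (session invalidated, partition, etc.)
	log.Println("lost leadership; step down")
}
```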
| System | Protocol | Data Model | Best For | Consistency |
|---|---|---|---|---|
| etcd | Raft | Key-value | Kubernetes config, simple coordination | Linearizable |
| ZooKeeper | ZAB | Tree (znodes) | Kafka, HBase, complex watchers | Sequential |
| Consul | Raft | Key-value + Services | Service discovery, health checks | Linearizable (default) |
When You Actually Need Consensus
Consensus is expensive. Every write has to travel to a majority of nodes and wait for acknowledgment before it can be confirmed. This is inherently slower than writing to a single node. You should reach for consensus systems only when you genuinely need their guarantees.
Good use cases for consensus
- Leader election: Multiple replicas of a stateful service; only one should be the primary at any time. etcd or ZooKeeper is perfect for this.
- Distributed locks: A critical section that only one node across the entire cluster should execute at a time. Again, use an existing system's lock primitive.
- Cluster membership and configuration: Which nodes are in the cluster? What is the current config version? These are small, infrequently written, and must be consistent.
- Sequence numbers / unique IDs: Generating globally unique, monotonically increasing IDs across a distributed system.
- Atomic compare-and-swap: "Update this value only if it's still X" — the building block for optimistic concurrency control (see the sketch after this list).
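As an example of the last item, here is compare-and-swap as an etcd transaction, assuming cli is a connected clientv3.Client like the one in the etcd section above; the key and values are placeholders. The Then branch commits only if the comparison still holds — optimistic concurrency in one round trip.

```go
resp, err := cli.Txn(context.Background()).
	If(clientv3.Compare(clientv3.Value("config/mode"), "=", "blue")).
	Then(clientv3.OpPut("config/mode", "green")).
	Commit()
if err != nil {
	log.Fatal(err)
}
if !resp.Succeeded {
	log.Println("someone changed config/mode first; re-read and retry")
}
```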
Bad use cases for consensus
- Primary data storage: Don't store your application's main data in etcd or ZooKeeper. They're designed for small amounts of coordination data, not gigabytes of user data. etcd's recommended maximum database size is a few gigabytes.
- High-throughput writes: If you need to write thousands of records per second, consensus overhead will dominate. Use a database designed for that workload, with consensus only for metadata.
- Frequently changing data: If your data changes every millisecond, consensus is too slow. Cache it, use eventually consistent storage, and use consensus only to elect who manages the cache.
Think of consensus as the "control plane" for your system. Your application data is the "data plane." The control plane — who is the leader, what's the current config, who holds the lock — flows through consensus. The data plane — user records, events, messages — flows through whatever storage system is right for that data's access patterns.
The hidden cost: consensus in the critical path
The worst mistake you can make with consensus is putting it in the critical path of every user request. If every API call requires a write to etcd before it can proceed, you've made etcd's latency your API's latency floor — and etcd's leader your single point of failure.
The pattern to follow: use consensus to elect a leader, then let the leader make decisions independently without consulting consensus for each one. The leader might cache the config, hold the lock, or manage a local counter. Consensus is used to elect and coordinate — not to execute every action.
```
❌ Wrong: Consensus in every request

   Client → Service → etcd (consensus write) → respond
   Adds ~5-20ms to every request, etcd becomes bottleneck

──────────────────────────────────────────────────────

✓ Right: Consensus for leadership, not per-request

   Startup:  Service A → etcd (acquire leader lock) → A is leader
   Startup:  Service B → etcd (try lock, fail)      → B is standby
   Runtime:  Client → Service A (leader) → local decision → respond
             ~no consensus overhead per request
   Failover: A crashes → etcd TTL expires → B acquires lock → B is leader
```
What "Consensus" Gives You vs. What It Doesn't
A common confusion is conflating consensus with all forms of consistency. Consensus gives you a specific thing: agreement on a sequence of values across a cluster of nodes. But it doesn't automatically solve all distributed problems.
Specifically, consensus does not give you:
- Fast reads from followers: In a standard Raft setup, reads from followers can return stale data (the follower might not have applied the latest committed entries yet). Truly linearizable reads require either reading from the leader or using a ReadIndex mechanism.
- Partition tolerance beyond majority: If more than half your nodes fail or are cut off, the cluster is unavailable. Consensus does not give you availability under arbitrary failures.
- Multi-key atomicity beyond what you build: Raft commits log entries one at a time. Transactions across multiple keys require building a transaction layer on top of Raft (which is what databases like CockroachDB do).
- Performance for free: Raft commits require at least one network round trip to a majority. In a 3-node cluster in a single datacenter, that's fast (a few milliseconds). Across regions, it's slow (100ms+).
Putting it together: consensus is the tool you reach for when you genuinely need multiple nodes to agree on a value with no possibility of divergence. The right way to use it is narrowly — for control plane decisions, leadership, and short-lived coordination primitives — while keeping your hot data path away from its latency and throughput constraints.
Chapter Summary
The Key Principle
Consensus is the tool you reach for when a group of nodes must agree on a value and no amount of "we'll reconcile later" is acceptable. Use it for the narrow set of coordination problems where divergence is catastrophic — leader election, distributed locks, configuration — and keep it off the hot data path.
The Most Common Mistake
Putting consensus in the critical path of user requests — often by using etcd or ZooKeeper as a primary data store or requiring a consensus write before every API response. This creates a latency floor you can never escape and makes a 3-node coordination cluster the availability bottleneck for your entire service.
Three Questions for Your Next Design Review
- If I'm using a consensus system (etcd, ZooKeeper, Consul) here, am I using it for control-plane coordination or am I accidentally making it a data store? What's the max write rate and total data volume I'm expecting?
- Does my cluster have an odd number of nodes, and do I understand how many failures it can tolerate? What happens to write availability if exactly that many nodes fail simultaneously?
- Am I relying on follower reads anywhere? If so, do I understand the consistency model those reads provide, and is stale data acceptable in that context?