Chapter 7  ·  Part II: Scalability

Replication — The Source of Most Distributed System Complexity

Keeping a copy of the same data on multiple machines sounds simple. It is not. Almost every hard problem in distributed systems — consistency, availability, conflict resolution, failover — traces back to the challenges that replication creates.

What's Coming in This Chapter

We start with why replication exists and the three main approaches: single-leader, multi-leader, and leaderless. Then we dig into the most important and most misunderstood topic in replication: replication lag — what it means, why it happens, and the bugs it silently introduces. We cover the consistency guarantees your application actually needs (read-your-writes, monotonic reads, consistent prefix reads) and what happens when writes conflict in a multi-leader setup. By the end, you will understand why "just add a replica" is never as simple as it sounds.

Why Bother With Replication?

Before we look at how replication works, let's be clear about why we do it. There are exactly three reasons.

Fault tolerance. If you have only one copy of your data and that machine dies, you lose the data or go offline until it comes back. A replica means you have a spare.

Latency. If your users are in Tokyo and your database is in Virginia, every read crosses the Pacific. Put a replica in Tokyo and reads become fast, even if writes still travel to Virginia.

Read throughput. One machine can only handle so many reads per second. Spread reads across ten replicas and you can handle ten times as many queries.

These goals sound simple. The problem is that whenever data changes, you need to push that change to every replica. And in a distributed system, doing that reliably, in order, without data loss, while staying fast — that is where all the hard problems live.

The Core Tension

Replication forces you to choose between two things you want simultaneously: consistency (every replica shows the same data at the same time) and performance/availability (writes complete quickly even if some replicas are temporarily unreachable). You cannot fully have both. Every replication design is a specific point on that trade-off spectrum.

Single-Leader Replication

This is the most common approach, used by PostgreSQL, MySQL, MongoDB, Kafka, and many others. The idea is straightforward: one node is designated the leader (also called primary or master). All writes go to the leader. The leader writes the change to its own storage and then sends the change to all followers (also called replicas, secondaries, or slaves). Reads can go to either the leader or any follower.

              ┌─────────────┐
              │   LEADER    │ ← all writes go here
              └──────┬──────┘
                     │ replication stream
        ┌────────────┼────────────┐
        ▼            ▼            ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │Follower 1│ │Follower 2│ │Follower 3│
  └──────────┘ └──────────┘ └──────────┘
        ↑ reads can be served from any of these

The key property of this design is that there is never a write conflict. Because all writes go through the same node in a fixed order, there is no ambiguity about which change happened first.

Synchronous vs. Asynchronous Replication

Once the leader receives a write, it needs to forward it to followers. The question is: does the leader wait for the follower to confirm before telling the client "your write succeeded"?

Synchronous replication means yes — the leader waits. This is the safe choice. If the leader crashes right after the write, the follower already has a copy. But every write now pays an extra network round trip (leader → follower → acknowledgement) on top of the client's own round trip, and if the follower is slow, your write latency spikes. If the follower is unreachable, you either block or refuse the write entirely.

Asynchronous replication means no — the leader tells the client "done" as soon as it writes to its own storage, then forwards the change to followers whenever it can. Writes are fast and the leader never blocks on a slow follower. But if the leader crashes before a follower catches up, those unreplicated writes are gone. The client was told their write succeeded. It did not survive.

Client            Leader            Follower
  │                 │                  │
  │──── write ─────▶│                  │
  │                 │─── replicate ───▶│    SYNC: leader waits here
  │                 │◀────── ack ──────│
  │◀───── ok ───────│                  │
  │                 │                  │

Client            Leader            Follower
  │                 │                  │
  │──── write ─────▶│                  │
  │◀───── ok ───────│                  │    ASYNC: leader responds immediately
  │                 │─── replicate ───▶│    (happens later, maybe never)
  │                 │                  │

In practice, most systems use a semi-synchronous approach: one follower is designated synchronous (the leader waits for it) and the rest are asynchronous. This gives you one guaranteed durable copy without blocking on all replicas. PostgreSQL configures this through synchronous_standby_names. If your synchronous follower falls behind or dies, you can temporarily promote an asynchronous one to synchronous.
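To make the three modes concrete, here is a minimal sketch of the leader's acknowledgement logic. This is illustrative Python, not any particular database's implementation; the follower objects and their replicate() method are hypothetical.

    import concurrent.futures

    # A shared pool so async replication can keep running in the background
    # after the leader has already acknowledged the client.
    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def handle_write(record, local_log, followers, mode="semi-sync"):
        local_log.append(record)  # 1. write to the leader's own storage
        futures = [_pool.submit(f.replicate, record) for f in followers]
        if mode == "sync":
            # Wait for every follower before acknowledging: safe but slow.
            concurrent.futures.wait(futures)
        elif mode == "semi-sync":
            # Wait for the first follower to confirm; the rest catch up later.
            next(concurrent.futures.as_completed(futures))
        # mode == "async": wait for no one; replication may still fail silently.
        return "ok"  # 2. tell the client the write succeeded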

A Common Misconception

Many engineers assume that because their database claims to be "replicated," their data is safe from loss. Not necessarily. If replication is asynchronous (often the default, for performance reasons), a leader failure before the replicas catch up will lose recently acknowledged writes. Check your database's replication mode before assuming durability.

What Goes in the Replication Stream?

The leader needs to tell followers "here is what changed." There are three common approaches, and the choice has significant operational implications.

Statement-based replication sends the actual SQL statements. Simple, but fragile — NOW(), RAND(), auto-increment columns, and triggers can produce different results on the follower. MySQL used this historically and it caused subtle bugs.

Write-ahead log (WAL) shipping sends the low-level bytes that describe what changed on disk. PostgreSQL uses this. It is accurate, but it is tightly coupled to the storage engine's internal format. An upgrade that changes the storage format can break replication and makes zero-downtime upgrades harder.

Logical log replication (also called row-based) decouples from storage internals by describing changes at the row level: "row with id=5 in table users had its email field changed from X to Y." MySQL's binlog and PostgreSQL's logical replication use this. It is more portable, easier to use for external consumers like Kafka connectors, and survives storage engine upgrades.
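A logical change event is essentially a self-describing record of what changed. The exact wire formats (MySQL's binlog events, PostgreSQL's pgoutput messages) differ in detail; a hypothetical decoded event might look like this:

    # Hypothetical decoded logical replication event. Field names are
    # illustrative; real formats differ, but carry the same information.
    change_event = {
        "position": 12345,            # where this event sits in the log
        "action": "update",
        "table": "users",
        "key": {"id": 5},
        "before": {"email": "old@example.com"},
        "after":  {"email": "new@example.com"},
    }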

Handling Failures

Follower failure is easy. Each follower keeps a log of what it has processed. When it comes back, it asks the leader: "give me everything since position X." The leader sends the backlog and the follower catches up. This is called catch-up recovery.
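A minimal sketch of catch-up recovery, assuming a leader that can replay its log from an arbitrary position; every method name here is illustrative:

    # Illustrative catch-up recovery; leader.log_since() is a hypothetical API.
    def catch_up(follower, leader):
        last = follower.last_applied_position()   # "I have everything up to X"
        for position, record in leader.log_since(last):
            follower.apply(record)
            follower.save_position(position)      # persist progress as we go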

Leader failure is much harder. The process is called failover:

First, detect that the leader is dead. Most systems use a heartbeat timeout. But here is the problem: a slow leader looks the same as a dead one. Set the timeout too short and you'll declare a healthy-but-busy leader dead. Set it too long and you'll be unavailable for that entire duration after a real crash.

Second, elect a new leader. Usually the follower with the most up-to-date data is chosen. But "most up-to-date" requires agreement among the remaining nodes — and agreement in a distributed system requires consensus (Chapter 12).

Third, reconfigure clients to send writes to the new leader. And make sure the old leader — when it comes back — knows it is no longer the leader. A system where two nodes both believe they are the leader is called split-brain, and it is one of the most dangerous states a database can enter: both nodes accept writes, both diverge, and you lose the conflict guarantee that was the whole reason you used single-leader replication.
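A common defence against split-brain, used in various forms across many systems, is an epoch (or fencing token): every election increments a generation number, and replicas reject any request stamped with an older one. A minimal sketch, with all names invented for illustration:

    # Illustrative epoch-based fencing. A deposed leader still stamping its
    # requests with epoch 6 is rejected once an election has produced epoch 7.
    class Replica:
        def __init__(self):
            self.current_epoch = 0

        def accept_write(self, record, leader_epoch):
            if leader_epoch < self.current_epoch:
                raise PermissionError("stale leader: you were deposed")
            self.current_epoch = leader_epoch
            self.apply(record)

        def apply(self, record):
            ...  # write to local storage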

The Discarded Write Problem

When a new leader is elected, it may not have all the writes the old leader acknowledged. If the old leader's replication was asynchronous and it acknowledged 100 writes before crashing but only 95 reached the follower, those 5 writes are gone. The new leader has no record of them. If the old leader comes back online and is told "you are now a follower, sync from the new leader," it will discard those 5 writes. Clients who received "write succeeded" for those 5 operations will never know they were lost. This is how real systems, GitHub among them, have lost committed data during failover.

Replication Lag — The Hidden Complexity

In an asynchronous single-leader setup, followers will always be slightly behind the leader. Under normal conditions, this lag is measured in milliseconds. You might not even notice it. But this is where most replication bugs hide, because it leads to situations that look inconsistent to your application's users even though your database is technically working correctly.

The Problem: Reading Your Own Writes

Imagine a user updates their profile photo. The write goes to the leader. One second later, they reload the page. The read goes to a follower that has not yet received the update. They see their old photo. The user hits refresh again. Another follower, this time with the update. They see the new photo. Refresh again — old photo.

The data is not lost. It is just in transit. But to the user, it looks like the site is broken.

This is the read-your-writes problem, and it is one of the most common bugs in systems that use read replicas. The fix is to guarantee that after you write something, you always read it from a source that has that write. There are several ways to do this:

The simplest: always read your own recent writes from the leader. If a user just modified their profile, read their profile from the leader for the next 60 seconds; everyone else can use followers. This works, but it requires tracking which users have written recently, which is state your application now has to carry.

A cleaner approach: track the replication position. When you write to the leader, it tells you "this write is at log position 12,345." Store that position in the user's session. When reading from a follower, only use followers that have caught up to at least position 12,345. If no follower qualifies, wait briefly or fall back to the leader.
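A sketch of the position-tracking approach. It assumes the leader returns its log position with each acknowledgement and each replica can report how far it has applied; the method names are hypothetical:

    # Illustrative read-your-writes routing via replication positions.
    def write_profile(leader, session, user_id, data):
        position = leader.write(user_id, data)   # leader returns its log position
        session["min_read_position"] = position  # remember it for this user

    def read_profile(leader, replicas, session, user_id):
        needed = session.get("min_read_position", 0)
        for replica in replicas:
            if replica.applied_position() >= needed:
                return replica.read(user_id)     # this replica has our write
        return leader.read(user_id)              # fall back: no replica caught up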

Real-World Example

Cassandra lets callers specify a consistency level on every request. LOCAL_QUORUM means "read from enough replicas in this data centre to guarantee you see the latest acknowledged write"; ONE means "read from the nearest single replica, which may be stale." (Amazon's DynamoDB offers a similar per-request choice between eventually consistent and strongly consistent reads.) Most of the time, ONE is fine — you are reading public product listings where a 200ms lag is invisible. But for a user checking their own account balance, you want LOCAL_QUORUM.
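With the DataStax Python driver for Cassandra, the per-request choice looks roughly like this (the keyspace, tables, and address are made up):

    # Per-request consistency levels with the Cassandra Python driver.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement
    from cassandra import ConsistencyLevel

    session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical keyspace

    # Stale-tolerant read: the nearest replica is fine for a public listing.
    listing = session.execute(SimpleStatement(
        "SELECT * FROM products WHERE id = 42",
        consistency_level=ConsistencyLevel.ONE))

    # Fresh read: the user is looking at their own account balance.
    balance = session.execute(SimpleStatement(
        "SELECT balance FROM accounts WHERE user_id = 7",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM))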

Monotonic Reads

Here is a subtler problem. A user makes two reads in quick succession. The first read hits Follower A, which is 2 seconds behind the leader. The second read hits Follower B, which is 8 seconds behind the leader. The user sees data that goes backwards in time. A message they just saw disappears on the next refresh.

The guarantee that prevents this is called monotonic reads: once a user has seen data at time T, they will never see older data. You can implement this by routing each user's reads to the same replica consistently — using a hash of their user ID to pick a replica, for example. If that replica fails, you fall back to another, accepting a potential time jump, but at least the experience is predictable.
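A sketch of sticky routing, assuming a fixed replica list and a hypothetical health check; the hash keeps each user pinned to one replica, and the fallback is explicit:

    # Illustrative sticky routing for monotonic reads.
    import hashlib

    def pick_replica(user_id, replicas):
        h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        preferred = replicas[h % len(replicas)]    # same user -> same replica
        if preferred.is_healthy():                 # hypothetical health check
            return preferred
        # A fallback may jump forwards or backwards in time, but at least the
        # jump happens only on failure, not on every request.
        return next(r for r in replicas if r.is_healthy())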

Consistent Prefix Reads

This one is the most subtle. Imagine two writes happening in order: first "user A asks a question," then "user B answers that question." If you read these from a follower that received the answer before the question (because they came from different partitions with different lag), you see the answer before the question. The causal ordering is violated.

Consistent prefix reads guarantees that if a sequence of writes happens in a certain order, anyone reading those writes will see them in the same order. This is especially hard to achieve in partitioned systems where different parts of the data move at different speeds.
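The usual mitigation, summarized in the table below, is to make causally related writes share a partition so they flow through a single replication stream in order. A sketch, with the field names invented:

    # Illustrative: choose the partition by conversation, not by message id,
    # so a question and its answer replicate through the same stream, in order.
    import hashlib

    def partition_for(message, num_partitions):
        key = message["conversation_id"].encode()
        return int(hashlib.sha1(key).hexdigest(), 16) % num_partitions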

Guarantee                 The Problem It Solves                               Typical Fix
Read-your-writes          User's own writes appear missing on re-read        Route the user's reads to the leader for a time window, or track replication position
Monotonic reads           Data appears to go backwards in time across reads  Sticky routing — same user always reads from the same replica
Consistent prefix reads   Causally related events appear out of order       Write causally related data to the same partition

None of these guarantees are free. Each one adds routing complexity, or constrains how you partition data, or forces reads to the leader. The key insight is that you must explicitly decide which guarantees you need. If you leave it to chance, you will discover the missing guarantee in production when a user files a bug report.

Multi-Leader Replication

In single-leader replication, all writes flow through one node. What if that node is in Virginia and your users are in Tokyo? Every write still crosses the Pacific. Multi-leader replication solves this: each data centre has its own leader. Writes go to the local leader, which replicates asynchronously to the other data centres' leaders.

  DATA CENTRE: US-EAST                  DATA CENTRE: AP-TOKYO
    ┌──────────────┐                        ┌──────────────┐
    │  Leader US   │◀───── async ──────────▶│  Leader TK   │
    └──────┬───────┘                        └──────┬───────┘
           │                                       │
  ┌────────┼────────┐                    ┌─────────┼─────────┐
  ▼        ▼        ▼                    ▼         ▼         ▼
Follower Follower Follower            Follower  Follower  Follower
  US-1     US-2     US-3                TK-1      TK-2      TK-3

Writes to the US leader are fast for US users. Writes to the Tokyo leader are fast for Tokyo users. The inter-datacenter replication happens in the background. This is a significant availability and latency improvement.

The cost is the problem that single-leader replication was specifically designed to avoid: write conflicts.

The Write Conflict Problem

User A is in New York. User B is in Tokyo. Both edit the same document title at the same time — User A changes it to "Blue," User B changes it to "Red." Both writes succeed locally. Then the two leaders try to sync with each other. Now each leader has a conflicting version of the same record. What should the final title be?

This is not a hypothetical edge case. In any system with multiple write points and asynchronous replication, this will happen. You need a strategy for dealing with it.

Conflict Resolution Strategies

Last-write-wins (LWW) says: use timestamps to decide — whoever wrote last wins. This sounds reasonable. It is almost always wrong. Clocks in distributed systems are not perfectly synchronized. A write at "12:00:00.100" on machine A may have actually happened after a write at "12:00:00.120" on machine B if machine A's clock is slightly ahead. You will silently drop data, and users will have no idea their write was overwritten. LWW is acceptable only when losing writes is acceptable — like caching or session data.

Longest-value-wins and highest-version-wins strategies share the same flaw: they make a deterministic choice from ambiguous information and silently discard data.

Merge the values works for some data types. If two users add different items to a shopping cart simultaneously, you can merge the two versions into a union. Conflict-free replicated data types (CRDTs) are data structures specifically designed to merge in a mathematically sound way. They work well for sets, counters, and ordered sequences.
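The simplest CRDT is a grow-only set, which matches the shopping-cart example: merging two diverged replicas is just set union, and union is commutative, associative, and idempotent, so replicas converge regardless of sync order. A minimal sketch (removals would need a richer structure such as an OR-Set):

    # Minimal grow-only set (G-Set) CRDT sketch.
    class GSet:
        def __init__(self):
            self.items = set()

        def add(self, item):
            self.items.add(item)

        def merge(self, other):
            # Union is commutative, associative, and idempotent: safe to
            # apply in any order, any number of times.
            self.items |= other.items

    us, tokyo = GSet(), GSet()
    us.add("blue mug"); tokyo.add("red mug")   # concurrent writes
    us.merge(tokyo); tokyo.merge(us)           # both converge to the same cart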

Record the conflict and let the user or application resolve it is the honest approach when no automatic resolution is acceptable. This is what CouchDB does: it keeps the conflicting versions side by side and leaves resolution to the application, which can in turn ask the user which version to keep. It is more code, but it does not silently discard user data.

The Real Lesson About Conflicts

The best conflict resolution strategy is to avoid conflicts in the first place. If you can route all writes for a particular user, record, or tenant to the same leader consistently, you get the latency benefits of multi-leader without most of the conflict risk. This is sometimes called sticky writes or home region routing. It does not eliminate conflicts (failovers, network partitions still cause them), but it reduces their frequency from "constantly" to "rarely."

Where Multi-Leader Is Used

Multi-leader replication is used in three main scenarios: multi-datacenter deployments (as described above), offline-capable clients (a mobile app that works offline is essentially a local leader that syncs when it reconnects — CouchDB works this way), and collaborative editing (Google Docs, Notion, Figma — multiple users editing the same document with eventual consistency between their views).

Outside these specific scenarios, multi-leader replication is rarely worth the complexity it introduces. If you are not getting a concrete benefit from having multiple write points, use single-leader.

Leaderless Replication

Both single-leader and multi-leader replication have a leader: a special node that coordinates writes. Leaderless replication removes this concept entirely. Any replica can accept writes directly from clients.

This approach was popularized by the Dynamo paper Amazon published in 2007, describing an internal system, and it has since influenced DynamoDB, Cassandra, Riak, and Voldemort. It trades the simplicity of a single coordination point for higher write availability: you can still accept writes even when several nodes are down.

How Quorums Work

Say you have N = 5 replicas. For each write, you write to W replicas. For each read, you read from R replicas. As long as W + R > N, you are guaranteed that at least one node you read from has the latest write.

  N = 5 replicas    W = 3    R = 3    →    W + R = 6 > N = 5 ✓

  Write:                          Read:
  ┌───┐                           ┌───┐
  │ 1 │ ← written ✓               │ 1 │ ← read ✓ (has latest)
  │ 2 │ ← written ✓               │ 2 │ ← read ✓ (has latest)
  │ 3 │ ← written ✓               │ 3 │ ← read ✓ (has latest)
  │ 4 │   (not written)           │ 4 │   (not read)
  │ 5 │   (not written)           │ 5 │   (not read)
  └───┘                           └───┘

  Overlap: nodes 1, 2, 3 participated in both.
  At least one of the R reads will always see the latest write.
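In code, the quorum condition might look like the sketch below, using a toy per-key version number where real systems use timestamps or vector clocks; all APIs are invented for illustration:

    # Illustrative quorum read/write over N replicas.
    def quorum_write(replicas, key, value, version, w):
        acks = sum(1 for r in replicas if r.try_write(key, value, version))
        if acks < w:
            raise IOError("quorum not reached: write is not durable")

    def quorum_read(replicas, key, r):
        # Assume the first R replicas are reachable; real systems pick
        # dynamically and retry on failure.
        responses = [rep.try_read(key) for rep in replicas[:r]]
        # W + R > N guarantees at least one response carries the latest version.
        return max(responses, key=lambda resp: resp.version)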

Common configurations:

Config           W       R       Characteristic
Balanced quorum  3 of 5  3 of 5  Fresh reads guaranteed; moderate latency for both writes and reads.
Write-optimised  1 of 5  5 of 5  Fast writes, slow reads. Useful for write-heavy event streams, but a single ack means weak durability.
Read-optimised   5 of 5  1 of 5  Fast reads, slow writes. Useful for config data that rarely changes.

Sloppy Quorums and Hinted Handoff

What happens if some of the nodes you need for a quorum are unreachable? You have two options.

Refuse the write until enough nodes are available. This gives you strict consistency — you never acknowledge a write that did not reach a quorum — but it sacrifices availability.

Sloppy quorum: accept the write on whatever nodes are reachable, even if they are not the nodes that "own" that data. Once the unavailable node comes back, the node that temporarily held its data forwards the writes along — this is called hinted handoff. It keeps writes flowing during outages, but it means W + R > N no longer guarantees you will see the latest value during the outage window. Riak enables sloppy quorums by default; in Cassandra and Voldemort they are optional.
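A sketch of hinted handoff, with the hint store and node API invented for illustration:

    # Illustrative hinted handoff: a stand-in node stores the write plus a
    # "hint" naming the intended owner, and replays it when the owner returns.
    def write_with_handoff(record, owner, standby):
        if owner.is_up():
            owner.write(record)
        else:
            standby.write(record, hint=owner.node_id)  # temporary custody

    def on_node_recovered(owner, standby):
        for record in standby.hints_for(owner.node_id):
            owner.write(record)                        # hand the data back
            standby.clear_hint(owner.node_id, record)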

The Problem Quorums Don't Solve

Quorums sound like a clean mathematical guarantee. In practice, they have several failure modes that are easy to miss:

Two concurrent writes to the same key. Write A goes to nodes 1, 2, 3. Write B goes to nodes 3, 4, 5. Node 3 has both writes. Nodes 1 and 2 only have Write A. Nodes 4 and 5 only have Write B. The quorum condition is satisfied for both writes, but the replicas are now diverged. You have to resolve the conflict on reads.

A write and a read that overlap in time. Write A is in progress, reaching nodes 1, 2, and 3. A read starts, hitting nodes 1, 2, and 3. Depending on the exact timing, the read might see Write A on some nodes but not others, and has to decide what to return.

Restored nodes bringing back old data. A node fails, a write happens on the remaining nodes (satisfying the quorum), then the failed node restores from an old backup. It now has stale data. If it participates in a read quorum, it might drag the result back to an older value.

The Quorum Trap

Engineers often choose leaderless replication because "W + R > N means consistency" and then discover in production that their system behaves inconsistently. The math is sound, but the real world introduces timing, failure, and recovery scenarios the formula does not account for. If you need strict linearizability, quorums with typical Dynamo-style implementations do not give it to you. For that, you need consensus (Chapter 12), which is more expensive.

Putting It Together: Which Model Should You Choose?

There is no universally correct answer. Here is a framework for thinking about it.

Choose single-leader if you need strong consistency guarantees, your write volume fits on one machine, and you can tolerate brief write unavailability during leader failover. This is the right choice for most transactional workloads. PostgreSQL, MySQL, and MongoDB in replica set mode all use this. It is the most well-understood model with the richest tooling.

Choose multi-leader if you have users in multiple geographic regions and the latency of routing all writes to a single region is unacceptable, or if you need offline write capability. Accept that you will need a well-thought-out conflict resolution story before you start building, not after.

Choose leaderless if you need very high write availability (writes must continue even when multiple nodes are down) and your workload can tolerate eventual consistency. Cassandra is the most common choice here, often for time-series or event-logging workloads where losing or slightly reordering a small number of writes is acceptable.

Model          Write Conflicts     Consistency           Write Availability     Operational Complexity
Single-Leader  None                Strong (with sync)    Medium (leader SPOF)   Low
Multi-Leader   Yes, must resolve   Eventual across DCs   High                   High
Leaderless     Yes, must resolve   Tunable via W/R/N     Very High              Medium

The Thing Most Engineers Get Wrong

Replication is often treated as an infrastructure concern — something the database handles transparently. Your application just reads and writes; the database figures out the rest.

This is true at the storage level. It is false at the application level.

Replication lag, read-your-writes, monotonic reads, conflict resolution — these are not abstractions that disappear behind a database driver. They surface as application bugs. A user who does not see their own update. A comment thread that appears out of order. A balance that briefly shows a stale value. These bugs are caused by replication, but they manifest in your product experience.

The application that reads from replicas must understand the consistency model it is operating under. It must make explicit choices: which reads need to go to the leader, which can tolerate lag, what happens when the application sees conflicting versions.

The engineers who get this right are the ones who treat the consistency model as a first-class part of their application design, not a database detail to worry about later.

Chapter Summary

The Key Principle

Replication gives you fault tolerance, latency improvements, and read scalability — but the moment you have more than one copy of data, you must make explicit choices about what happens when those copies disagree. These choices are application concerns, not just database concerns.

The Most Common Mistake

Using read replicas for all reads without accounting for replication lag, and discovering months later why users sometimes see stale data. The second most common: choosing multi-leader replication for the latency benefits without designing a conflict resolution strategy first, and then treating last-write-wins as an acceptable answer when conflicts do occur.

Three Questions for Your Next Design Review
  1. If a user writes something and immediately reads it back, is there any code path where they might see the old value — and if so, is that acceptable for this feature?
  2. If our leader fails and failover takes 30 seconds, what is the worst-case data loss — and have we verified that the application handles that gracefully?
  3. If we are using multi-leader or leaderless replication, what exactly happens when two writes to the same record conflict — and have we tested that code path?