What's in this chapter
Before you can design a distributed system correctly, you need a vocabulary for talking about what "correct" even means. This chapter builds that vocabulary from the ground up, using four foundational ideas.
- The Two Generals Problem — a proof that certainty is impossible over unreliable channels, and what that means every time you write a distributed operation.
- The Happens-Before Relation — why physical clocks cannot tell you the order of events, and what to use instead.
- Linearizability vs. Serializability — two terms that are constantly confused, precisely explained. Knowing the difference changes how you evaluate databases.
- The Dual of a System — a mental model for making guarantees explicit: for every guarantee a system provides, name what it gives up.
Key Learnings — Read This First
You can never know for certain whether a message was delivered across an unreliable network. The Two Generals proof makes this mathematically precise.
Physical clocks lie. Two servers' clocks can disagree by tens or hundreds of milliseconds even after NTP sync, and by far more when sync breaks. Never use wall-clock time to decide the order of events in a distributed system.
Happens-Before is the only reliable ordering. Event A happens before B if A could have influenced B — same process, or a message passing between them.
Linearizability is about individual operations appearing instantaneous. Serializability is about transactions appearing to run one-at-a-time. They are completely different properties.
Concurrent events are normal. If neither A → B nor B → A, those events are concurrent — neither caused the other. Conflict resolution is your problem, not an edge case.
Every guarantee has a cost. Linearizability costs latency. Serializability costs throughput. Naming the cost upfront prevents you from adding guarantees you don't need.
Exactly-once is not a property of a single system. It is a contract between a sender and a receiver, built out of at-least-once delivery plus idempotent processing.
Most real applications don't need the strongest guarantees. They need weaker but carefully chosen ones. This chapter helps you identify which ones you actually need.
Why This Chapter Exists
Imagine you are building a payment system. A user clicks "Pay". Your service sends a request to the bank's API. The request goes over the network. Then — nothing. No response. The connection times out.
Did the payment go through or not? You genuinely do not know. The network could have dropped your request before it arrived. Or the bank could have processed it and the response got lost on the way back. Or the bank could be slow and the response is still coming. These three situations are indistinguishable from your service's point of view.
This is not a bug. This is the fundamental nature of distributed systems. The tools in this chapter help you reason precisely about situations like this, so you can design systems that behave correctly even when you cannot tell what happened.
Most bugs in distributed systems are not logic errors. They are reasoning errors — the programmer assumed something about ordering, timing, or delivery that isn't actually guaranteed. Precise vocabulary is the first line of defense.
The Two Generals Problem
This is one of the oldest and most important results in distributed computing. First described in 1975, it proves something that sounds surprising: it is impossible to achieve certainty about a shared action when you are communicating over an unreliable channel.
The Story
Two armies, commanded by General A and General B, have surrounded a city. They can only capture it if they attack at the same time. If only one attacks, they lose.
The two generals can only communicate by sending messengers through the city. But the city might capture the messenger. So any message might not arrive.
General A wants to agree on "attack at dawn tomorrow." So A sends a messenger: "Attack at dawn. Confirm you received this."
Now suppose the messenger gets through. General B receives it. B sends a messenger back: "Confirmed." But now B faces the same problem — will A receive the confirmation? If B's messenger is captured, then A didn't get the confirmation and might not attack. So B cannot be certain either.
General A General B
│ │
│ ── "Attack at dawn. Confirm." ──► │
│ │ B receives. But can B be sure
│ │ A will know B got it?
│ │
│ ◄── "Confirmed. Confirm you got this" ── │
│ │
│ A receives. But can A be sure B will │
│ know A got the confirmation? │
│ │
│ ── "Got your confirmation. Confirm" ──► │
│ │
│ ... this goes on forever ... │
No matter how many rounds of confirmation you add,
the last message is always unconfirmed.
The final actor always has uncertainty about whether
their last message arrived. There is no protocol that
solves this with a finite number of messages.
The key insight is that the problem does not go away with more messages. Whoever sends the final message cannot know whether it arrived, so one side always acts under uncertainty.
You cannot solve the Two Generals Problem by being cleverer. There is a formal proof that no protocol using a finite number of messages can guarantee agreement over an unreliable channel. Better hardware does not help. More retries do not help. The only way out is to change the problem — typically by weakening the requirement from "certain" to "probable enough for your use case."
What This Means in Practice
Every time your service makes a call to another service and waits for a response, you are living inside the Two Generals Problem. When the timeout fires, you are in the exact position of General A whose last messenger has not returned.
This has real engineering consequences:
You cannot write code that says "if we didn't hear back, the operation failed." The operation may have succeeded. Your request may have arrived, been processed, and only the response got lost. If you retry, you might execute the operation twice.
This is why idempotency matters so much. If your operation is idempotent — if doing it twice has the same effect as doing it once — then retrying on timeout is safe. The Two Generals Problem does not go away, but its consequences become acceptable. We will spend an entire chapter on idempotency later in this book. But it starts here.
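To make this concrete, here is a minimal sketch of retry-on-timeout made safe by an idempotency key. Everything specific in it (the endpoint URL, the `Idempotency-Key` header, the use of the `requests` library) is an illustrative assumption, not any real payment API:

```python
import uuid
import requests  # assumed HTTP client; any would do

def pay_with_retry(amount_cents: int, max_attempts: int = 3) -> dict:
    # One key per logical payment, generated BEFORE the first attempt.
    # Every retry reuses it, so the server can detect duplicates.
    idempotency_key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            resp = requests.post(
                "https://bank.example.com/payments",   # hypothetical endpoint
                json={"amount_cents": amount_cents},
                headers={"Idempotency-Key": idempotency_key},
                timeout=5,
            )
            return resp.json()
        except requests.Timeout:
            # We do NOT know whether the payment happened. Retrying with the
            # same key is safe only because the server deduplicates on it.
            continue
    raise RuntimeError("payment outcome unknown; reconcile out of band")
```

Notice the final error: after exhausting retries, the honest status is "unknown", not "failed". That distinction is the Two Generals lesson in code.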
This is also why distributed transactions are hard. A two-phase commit protocol tries to get multiple nodes to agree on whether to commit or abort. But when the coordinator crashes mid-protocol, the participants are stuck — they received a "prepare" message but not a "commit" or "abort". They cannot know what the coordinator decided. They are General B waiting for confirmation that will never arrive.
The Two Generals Problem teaches you to ask a different question. Instead of "how do I guarantee this operation happened exactly once?", ask "how do I make the system behave correctly whether this operation happened zero or one times?" That shift — from certainty to graceful handling of uncertainty — is the foundation of good distributed system design.
The Happens-Before Relation
Here is a question that sounds simple but is surprisingly tricky: if two events happen on two different computers, which one happened first?
Your first instinct is probably to look at the timestamps. Event A happened at 10:00:00.421 on server 1. Event B happened at 10:00:00.419 on server 2. So B happened before A, right?
Wrong. You cannot trust this.
Why Physical Clocks Cannot Be Trusted
Every computer has a clock. But clocks drift. Two computers that were perfectly synchronized at noon will be slightly different by 12:01. NTP (Network Time Protocol) corrects for this, but it cannot make clocks perfect — it just reduces the error. In practice, servers in the same data center can have clock skew of tens of milliseconds. Across data centers, skew can be in the hundreds of milliseconds.
So when you compare timestamps from two different machines, you are comparing two imprecise measurements. You might conclude B happened before A when actually A happened first, just on a clock that was running slightly ahead.
Server 1 (clock running slightly fast)
─────────────────────────────────────────────────────
10:00:00.000  Write X=1
10:00:00.100  Write X=2      ← timestamp says this is later

Server 2 (clock running slightly slow)
─────────────────────────────────────────────────────
09:59:59.950  Read X         ← timestamp says this is earlier
10:00:00.050  Write X=3

If you order by timestamp:
  09:59:59.950  Read X      (Server 2)
  10:00:00.000  Write X=1   (Server 1)
  10:00:00.050  Write X=3   (Server 2)
  10:00:00.100  Write X=2   (Server 1)

But causally, Write X=1 on Server 1 may have caused Write X=3
on Server 2 — and that ordering is now lost.
This is not a theoretical concern. Real systems have had data corruption bugs because they sorted events by timestamp and made the wrong conclusion about ordering. The problem is especially nasty because it works fine most of the time — it only bites you when clocks happen to be skewed at exactly the wrong moment.
Lamport's Solution: Logical Clocks
In 1978, Leslie Lamport published a paper called "Time, Clocks, and the Ordering of Events in a Distributed System." It is one of the most cited computer science papers ever written. Lamport's insight was simple but profound: you do not need to know the real time. You only need to know the causal order of events.
He defined the happens-before relation, written as A → B, which means "A happened before B and could have influenced B." Three rules define it:
Rule 1 (Same process): If A and B are events in the same process, and A comes before B in that process's execution, then A → B.
Rule 2 (Message passing): If A is the sending of a message and B is the receipt of that same message, then A → B.
Rule 3 (Transitivity): If A → B and B → C, then A → C.
That is it. Three rules. No physical clocks involved.
With these rules you can construct a logical clock: a counter that each process increments before every event, and includes in every message it sends. When a process receives a message, it sets its own counter to the maximum of its current counter and the received counter, then increments before processing. This gives every event a number, and those numbers respect the happens-before relation.
Process A Process B Process C
────────── ────────── ──────────
L=1 event a1
L=2 event a2 ──── msg(L=2) ────►
L=3 event b1 (max(0,2)+1)
L=4 event b2 ──── msg(L=4) ────►
L=5 event c1 (max(0,4)+1)
L=3 event a3
L=5 event b3
Key property:
If A → B then L(A) < L(B)
But the converse is NOT guaranteed:
L(A) < L(B) does NOT mean A → B
Events with no happens-before relationship are concurrent.
Neither caused the other. Both orderings are "valid."
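The algorithm described above fits in a few lines. Here is a minimal Lamport clock sketch; the class and method names are our own choices:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self) -> int:
        # Increment before every local event.
        self.counter += 1
        return self.counter

    def send(self) -> int:
        # The sender ticks and stamps the message with its counter.
        return self.tick()

    def receive(self, msg_counter: int) -> int:
        # On receipt: jump to max(local, received), then tick.
        self.counter = max(self.counter, msg_counter)
        return self.tick()

# Reproducing the diagram above:
a, b, c = LamportClock(), LamportClock(), LamportClock()
a.tick()              # a1: L=1
m = a.send()          # a2: L=2, message carries 2
print(b.receive(m))   # b1: max(0,2)+1 = 3
m2 = b.send()         # b2: L=4, message carries 4
print(c.receive(m2))  # c1: max(0,4)+1 = 5
print(a.tick())       # a3: L=3 — concurrent with b2 and c1
```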
Concurrent Events: The Normal Case
When neither A → B nor B → A, we say A and B are concurrent. This does not mean they happened at the exact same physical instant. It means neither one could have influenced the other — there was no chain of messages connecting them.
Concurrent events are not an edge case. In any actively used distributed system, most writes happening across different nodes are concurrent. The question of "which one wins" when two concurrent writes conflict with each other is a design decision, not a fact about the world.
This is why databases have to make explicit choices about conflict resolution. "Last write wins" is one answer — but it requires clocks, and we just established that clocks lie. Vector clocks can track concurrent versions and surface the conflict for the application to resolve. CRDTs (Conflict-free Replicated Data Types) define merge operations so that concurrent writes always produce a deterministic result. Each approach involves a real trade-off, and the choice starts with understanding what "concurrent" actually means.
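To make the CRDT option tangible, here is a minimal sketch of a grow-only counter (a G-counter), one of the simplest CRDTs. The class shape is invented for illustration; the essential part is `merge`, which makes concurrent increments commute:

```python
class GCounter:
    def __init__(self):
        self.counts = {}  # node_id -> that node's local increment total

    def increment(self, node_id: str, n: int = 1) -> None:
        self.counts[node_id] = self.counts.get(node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Entry-wise max: merging in any order yields the same state,
        # so concurrent increments never conflict.
        for node, n in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), n)

# Two replicas accept concurrent increments, then merge in either order.
r1, r2 = GCounter(), GCounter()
r1.increment("node-1", 3)
r2.increment("node-2", 5)
r1.merge(r2)
r2.merge(r1)
assert r1.value() == r2.value() == 8  # deterministic, no conflict to resolve
```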
Lamport clocks give you only a one-way guarantee: if A → B then L(A) < L(B), but seeing L(A) < L(B) cannot tell you whether A actually caused B or the two are concurrent. Vector clocks fix this. Each process tracks a vector of counters, one per process. When you compare two vector clocks, you can tell exactly whether one event caused another, or whether they are concurrent. The cost: each message carries O(n) data, where n is the number of processes. For large clusters, this gets expensive, which is why most practical systems use approximate methods, or design around the need for vector clocks in the first place.
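A sketch of that comparison, assuming each event carries a dict mapping process id to counter:

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    # A ≤ B entry-wise means A could have influenced B.
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "A -> B"       # A causally precedes B
    if b_le_a:
        return "B -> A"
    return "concurrent"       # neither dominates: no causal path exists

print(compare({"p1": 2, "p2": 0}, {"p1": 2, "p2": 1}))  # A -> B
print(compare({"p1": 2, "p2": 0}, {"p1": 1, "p2": 1}))  # concurrent
```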
Linearizability vs. Serializability
These two terms are among the most misused in distributed systems. They sound similar, they are both described as "the strongest" consistency guarantee in their respective domains, and they are completely different things. Getting them confused leads to real bugs.
Serializability: Transactions Appear to Run One at a Time
Serializability is about transactions — operations that touch multiple objects. It is a property of a database's transaction isolation.
A database provides serializability if the result of any concurrent execution of transactions is the same as the result of some serial (one-at-a-time) execution of those transactions. The serial order does not have to match real time — it just has to exist.
Let's make this concrete. Two transactions T1 and T2 run at the same time. T1 transfers $100 from account A to account B. T2 reads the balance of both accounts to display to the user. If the database is serializable, there is some ordering — either "T1 then T2" or "T2 then T1" — that produces the same result as what actually happened. The user either sees both accounts before the transfer or both after. They never see account A debited but account B not yet credited.
Serializability does not say anything about which serial order was used. It only says one exists. It also does not say anything about the real-time order in which transactions completed — T2 might have started and finished before T1, but serializability allows the outcome to be as if T1 ran first.
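A toy sketch of the definition, with invented transaction bodies and balances: an observed read is serializable exactly when some ordering of the transactions reproduces it.

```python
from itertools import permutations

def run(order):
    state = {"A": 500, "B": 300}   # illustrative starting balances
    read = None
    for t in order:
        if t == "T1":              # T1: transfer $100 from A to B
            state["A"] -= 100
            state["B"] += 100
        else:                      # T2: read both balances
            read = (state["A"], state["B"])
    return read

for observed in [(500, 300), (400, 400), (400, 300)]:
    ok = any(run(order) == observed for order in permutations(["T1", "T2"]))
    print(observed, "explained by some serial order" if ok else "NOT serializable")
```

Only the "half-done transfer" (400, 300) fails: no serial order of T1 and T2 can produce it.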
Linearizability: Single Operations Appear Instantaneous
Linearizability is about individual operations on a single object (like a register, a key-value entry, or a queue). It is a property of a distributed system's consistency, not specifically about transactions.
A system is linearizable if every operation appears to take effect atomically at a single point in time, between when the operation was invoked and when it completed. And the order of these points must be consistent with real-time order: if operation A completed before operation B started, then A must appear before B in the linearized order.
The key difference from serializability: linearizability cares about real time. If you write a value at 10:00:00 and another client reads at 10:00:01, linearizability guarantees that the read sees your write (or something even newer). Serializability gives no such guarantee — the "serial order" chosen by the database can place that read before your write.
Client 1:  write(x=1) ─────────────────────►   done at T=5
Client 2:       read(x) ──────► ?              starts at T=3, ends at T=7
Client 3:                  read(x) ► ?         starts at T=6, ends at T=8

Linearizable system:
- Client 2's read overlaps with the write. It may return 0 or 1. Both are
  valid. (The "linearization point" of the write can fall before or after
  Client 2's read.)
- Client 3's read starts AFTER the write completed. It MUST return 1.
  (Because the write's linearization point must be before Client 3's start.)

Non-linearizable (but "eventually consistent") system:
- Client 3 might still return 0, because it read from a replica that hasn't
  received the write yet.
- This is the root cause of stale reads in eventually consistent databases.
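The real-time constraint can be checked mechanically. Here is a brute-force sketch for the three-operation history in the diagram; the timestamps and helper names are ours:

```python
from itertools import permutations

# Each operation: (name, kind, value, invoked_at, completed_at)
history = [
    ("W1", "write", 1, 0, 5),    # Client 1: write(x=1)
    ("R2", "read", None, 3, 7),  # Client 2: overlaps the write
    ("R3", "read", None, 6, 8),  # Client 3: starts after the write completes
]

def linearizable(read_results: dict) -> bool:
    for order in permutations(history):
        pos = {op[0]: i for i, op in enumerate(order)}
        # Real-time constraint: if a completed before b was invoked,
        # a must come earlier in the linearization.
        if any(a[4] < b[3] and pos[a[0]] > pos[b[0]]
               for a in history for b in history):
            continue
        # Semantic constraint: each read sees the latest preceding write.
        x, ok = 0, True                  # x starts at 0
        for name, kind, value, _, _ in order:
            if kind == "write":
                x = value
            elif read_results[name] != x:
                ok = False
        if ok:
            return True                  # found a valid linearization
    return False

for r2 in (0, 1):
    for r3 in (0, 1):
        verdict = linearizable({"R2": r2, "R3": r3})
        print(f"R2={r2} R3={r3}: {'OK' if verdict else 'NOT linearizable'}")
```

Running it confirms the diagram: any outcome where Client 3 reads 0 has no valid linearization, because the write completed before Client 3's read began.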
The Crucial Difference: One Example
Imagine a distributed lock. You acquire the lock, do some work, and release it. Another client then acquires it. With a linearizable system, the second client is guaranteed to see all changes made by the first client while it held the lock. With a serializable-but-not-linearizable system, this is not guaranteed — the second client might be "placed" in the serial order before the first client's writes become visible.
This is why distributed coordination requires linearizability, not just serializability. Leader election, distributed locks, and unique ID generation all need linearizability because they depend on the real-time ordering of operations. If two nodes both try to acquire a lock and the system is only serializable, both might succeed — which defeats the purpose of having a lock.
| Property | Serializability | Linearizability |
|---|---|---|
| Scope | Multi-object transactions | Single-object operations |
| Ordering | Some serial order exists (real time ignored) | Order must respect real-time precedence |
| Domain | Database isolation levels | Distributed system consistency models |
| Typical use | ACID databases, financial transactions | Distributed locks, leader election, coordination |
| Cost | Throughput (locking or MVCC overhead) | Latency (must contact a quorum or leader) |
| CAP theorem | Not directly addressed | Consistency in CAP = Linearizability |
Strict Serializability: Both at Once
If a system provides both serializability and linearizability, it is called strictly serializable. This is the strongest guarantee you can get. Spanner (Google's globally distributed database) provides strict serializability. It does so using GPS and atomic clocks exposed through an API called TrueTime, which we will examine in Chapter 15. The cost is latency: every commit must wait for the clock uncertainty window to pass before returning.
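A minimal sketch of that commit-wait idea, with an invented uncertainty bound. This illustrates where the latency cost comes from; it is not Spanner's actual implementation:

```python
import time

EPSILON = 0.007  # assumed clock-uncertainty bound (±7 ms); invented for illustration

def commit(apply_write) -> float:
    ts = time.time()     # pick the commit timestamp
    apply_write(ts)
    # Commit wait: block until ts is definitely in the past on every clock
    # in the system, so real-time order matches timestamp order.
    time.sleep(EPSILON)
    return ts            # only now is it safe to acknowledge the commit
```

The smaller the uncertainty bound, the shorter the wait, which is why Spanner invests in specialized clock hardware.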
Most databases do not provide strict serializability. PostgreSQL's serializable isolation provides serializability but not linearizability. DynamoDB's strong consistency reads provide linearizability for single-item operations but not serializability across items. Knowing which you have — and which you need — prevents entire classes of bugs.
The Dual of a System
This last mental model is not a formal result — it is a thinking habit. But it might be the most practically useful idea in this chapter.
The idea is simple: for every guarantee a system provides, there is something it gives up. Guarantees are not free. Name the cost explicitly, and you have a much more honest picture of what a system actually is.
A Few Examples
Linearizability. The guarantee: reads always see the most recent write. The cost: every read must contact a quorum or go through a single leader. This means reads are slower, and during a network partition, the system must become unavailable rather than serve potentially stale data. This is the C and A trade-off in CAP.
Serializability. The guarantee: transactions appear to run one at a time. The cost: the database must detect and abort conflicting transactions, or hold locks until commit. Both approaches reduce the number of transactions that can run concurrently. Throughput goes down as contention goes up.
Durability. The guarantee: once you say a write succeeded, the data survives a crash. The cost: the write must be flushed to disk (or replicated to another node) before acknowledging success. Latency increases. If you want faster writes, you are accepting that some acknowledged writes might be lost on a crash.
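The durability dual is visible in a few lines of code. Here is a minimal sketch of a write path that acknowledges only after `fsync`; remove the `fsync` call and writes get faster, but an acknowledged write can vanish on a crash:

```python
import os

def durable_append(path: str, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)  # the latency cost: wait for stable storage before acking
    finally:
        os.close(fd)
```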
Exactly-once delivery. The guarantee: a message is processed exactly one time. The cost: the system must store enough state to detect duplicates. This means a deduplication store, a window of time over which duplicates are tracked, and coordination overhead. "Exactly-once" is also a lie at the network level — what you actually get is at-least-once delivery plus idempotent processing that together appear exactly-once to the consumer.
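Here is a minimal sketch of that contract on the consumer side, using SQLite as a stand-in deduplication store; the table shape and message fields are invented for illustration:

```python
import sqlite3

db = sqlite3.connect("consumer.db")
db.execute("CREATE TABLE IF NOT EXISTS processed (message_id TEXT PRIMARY KEY)")

def apply_effect(payload: dict) -> None:
    ...  # your business logic; must run inside the same transaction as the insert

def handle(message_id: str, payload: dict) -> None:
    try:
        with db:  # one transaction: dedup record and effect commit together
            db.execute("INSERT INTO processed VALUES (?)", (message_id,))
            apply_effect(payload)
    except sqlite3.IntegrityError:
        pass  # duplicate delivery from an at-least-once retry: already handled
```

Note the dual hiding even here: the `processed` table grows without bound, so real systems expire old entries — and that expiry is exactly the deduplication window discussed below.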
When someone adds a guarantee to a design document, ask: "What does this cost?" If they cannot answer, the guarantee has not been fully thought through. When someone removes a guarantee to improve performance, ask: "What breaks?" If they cannot answer, the removal has not been fully thought through. The dual of a system keeps both questions visible at the same time.
Applying This to Real Systems
Here is where this mental model becomes practically powerful. When you are choosing between databases, message queues, or consistency models, you are really choosing between different dualities. Framing it this way forces honesty.
Consider a team choosing between a strongly consistent database (like Spanner) and an eventually consistent one (like Cassandra). A naive framing: "Spanner is better because it's consistent." The dual framing: "Spanner guarantees consistent reads but costs you latency and availability during partitions. Cassandra guarantees low latency and always-on availability but costs you the ability to make assumptions about what you'll read back." Neither is better. It depends on what your application actually needs.
The same applies within a single system. When you decide to cache query results, the guarantee you gain is fast reads. The dual — what you give up — is that reads might be stale. When you decide to use optimistic locking instead of pessimistic locking, the guarantee you gain is higher throughput under low contention. The dual is that under high contention, transactions will frequently retry and throughput actually degrades.
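The optimistic-locking dual shows up directly in code. Here is a minimal sketch against an assumed `accounts(id, balance, version)` table; under contention the `UPDATE` matches zero rows and the loop retries, which is exactly the degradation described above:

```python
def add_to_balance(db, account_id: int, delta: int, max_retries: int = 5) -> None:
    for _ in range(max_retries):
        balance, version = db.execute(
            "SELECT balance, version FROM accounts WHERE id = ?",
            (account_id,),
        ).fetchone()
        cur = db.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance + delta, account_id, version),
        )
        if cur.rowcount == 1:  # nobody raced us: the compare-and-set won
            db.commit()
            return
        db.rollback()          # lost the race: re-read and try again
    raise RuntimeError("too much contention; giving up")
```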
Systems often promise a guarantee and hide its dual in the fine print. "Exactly-once message delivery" sounds complete, but the fine print is "within a 24-hour deduplication window." After 24 hours, the same message can be processed again. If your consumer processes payments and a message is delayed by 25 hours due to a bug, you might charge a customer twice. The guarantee was real; you just did not read the dual carefully.
Always look for the deduplication window, the consistency scope, the failure case that the guarantee does not cover. That is where the dual lives.
Putting It Together: A Framework for Evaluating Systems
When you evaluate a new system — a database, a message queue, a coordination service — run through these questions:
1. What consistency model does it provide? Linearizability? Serializability? Causal consistency? Eventual consistency? If the documentation does not say, that is a red flag. The answer is probably "whatever was convenient to implement."
2. What happens during a network partition? Does the system become unavailable? Does it serve stale data? Does it accept writes on both sides and merge them later? Every system makes a choice here. Find out what it is.
3. What is the happens-before scope? Within a single node? Within a single partition? Globally? A system might be linearizable within a partition but not across partitions. A write to partition A and a write to partition B have no guaranteed ordering.
4. What failure modes does the guarantee not cover? This is the dual. Every guarantee has a scenario where it does not hold. Find that scenario. If you cannot find it, you have not read the documentation carefully enough.
Bringing It Together
These four mental models — the Two Generals Problem, happens-before, linearizability vs. serializability, and the dual of a system — are not independent ideas. They form a single coherent framework for thinking about distributed systems.
The Two Generals Problem tells you that certainty is impossible. So you design for handling uncertainty gracefully.
The happens-before relation tells you that real time is unreliable. So you track causality explicitly through message passing, not timestamps.
Linearizability vs. serializability gives you a precise vocabulary for what "correct" means in different contexts, so you can evaluate whether a system provides what you actually need.
The dual of a system reminds you that every guarantee has a cost. So you only pay for guarantees you actually need — and you know exactly what you are giving up when you choose a weaker one.
With these four tools, you can look at a distributed system — any distributed system — and ask productive questions about it. That is what the rest of this book does. Each part examines a different aspect of distributed systems, but the reasoning always comes back to these foundations.
You may have heard of the CAP theorem (Consistency, Availability, Partition-tolerance: pick two). We have deliberately not made it the centerpiece of this chapter, because CAP is often taught in a way that oversimplifies. The "C" in CAP specifically means linearizability — not just any consistency. And partition tolerance is not really optional: partitions happen in any network, so "picking" consistency and availability without partition tolerance means you are just designing for a world where the network never fails. We will examine CAP carefully and honestly in Chapter 16.
The Key Principle of This Chapter
The hardest part of distributed systems is not writing the code — it is building the right mental model of what the system can and cannot promise, and designing around the gaps rather than pretending they are not there.
The Most Common Mistake
Treating a timeout as proof of failure. When a network call times out, you know the call did not complete within your deadline — you do not know whether the operation itself succeeded or failed. Code that retries on timeout without idempotency, or that marks an operation as failed without a way to reconcile later, is one of the most common sources of data inconsistency in distributed systems.
Three Questions for Your Next Design Review
- 01 If this service calls another service and gets a timeout, what does the calling code do? Does it retry? Does it treat the operation as failed? Is the operation idempotent? What happens if it actually succeeded?
- 02 What consistency model does each database or data store in this design provide? Is it what you need, or are you assuming stronger guarantees than the system actually gives you?
- 03 For each guarantee this design relies on, name its dual — what does the system give up to provide that guarantee? Is the cost acceptable for this use case?