Imagine you walk into a design review. Someone has drawn a beautiful architecture diagram — multiple regions, synchronous replication, full ACID transactions across services, sub-10ms p99 latency, and a team of three engineers to build it.
Every individual component looks reasonable. But taken together, the design is making promises it cannot keep. Strong consistency across regions means synchronous replication, which means latency is bounded by the speed of light between datacenters. Full ACID across services requires coordination protocols that add overhead. A team of three cannot operate the complexity this architecture demands.
The designer didn't make bad choices in isolation. They ignored the tensions — the forces that, when you pull one lever, automatically push another in the wrong direction. This chapter names those forces so you stop being surprised by them.
"The goal isn't to win against the trade-offs. The goal is to decide consciously which ones you're willing to lose."
Consistency vs. Availability
This is the most talked-about tension in distributed systems, and also the most misunderstood. Let's start with what it actually means — not the theorem, but the underlying reality.
The core problem: two copies of data that can disagree
The moment you have more than one copy of a piece of data — for redundancy, for performance, for geographic reach — you have a consistency problem. If a client writes to replica A and another client immediately reads from replica B, does replica B have the new value yet?
If you say "yes, always," you've committed to strong consistency. That commitment has a cost. If you say "eventually," you've chosen availability and performance at the cost of predictability. Neither answer is free. The question is which cost you're willing to pay.
What CAP theorem actually says
In 2000, Eric Brewer proposed what became known as the CAP theorem: a distributed system can provide at most two of three properties — Consistency, Availability, and Partition tolerance.
Consistency (C) — every read sees the most recent write or an error. Not "eventually" — immediately.
Availability (A) — every request gets a response (not an error), though it might be stale.
Partition tolerance (P) — the system continues to function when network messages between nodes are lost or delayed.
The catch is that partition tolerance is not optional. Networks do partition — messages get delayed, dropped, routed wrong. If you build a system that breaks when a network partition happens, you've built a system that will break in production. So in practice, you're always choosing between C and A during a partition.
Client writes "x = 5" to Node A.
Another client reads "x" from Node B.
CP system: Node B returns an error (cannot guarantee freshness → refuse to answer)
AP system: Node B returns "x = 3" (stale value, but at least it answers)
There is no option where Node B returns "x = 5" with zero coordination cost.
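To make the two behaviors concrete, here is a minimal sketch in Python, assuming a toy `Replica` class and a flag that simulates whether the node can reach its peers. It illustrates the two read policies, not a real replication protocol:

```python
from dataclasses import dataclass

class StaleReadRefused(Exception):
    """CP behavior: refuse to answer rather than risk serving stale data."""

@dataclass
class Replica:
    name: str
    data: dict                     # this node's local copy: key -> value
    can_reach_quorum: bool = True  # simulates whether it can reach its peers

def read_cp(replica, key):
    # CP sketch: without a quorum of peers the replica cannot prove its
    # copy is fresh, so it returns an error instead of a possibly-stale value.
    if not replica.can_reach_quorum:
        raise StaleReadRefused(f"node {replica.name}: partitioned, refusing to read {key!r}")
    return replica.data[key]

def read_ap(replica, key):
    # AP sketch: always answer from the local copy, even if it might be stale.
    return replica.data[key]

# Node A has already taken the write x = 5; Node B still holds x = 3
# and has been cut off from A by a partition.
node_b = Replica("B", {"x": 3}, can_reach_quorum=False)

print(read_ap(node_b, "x"))        # 3 -- stale, but the system stays available
try:
    read_cp(node_b, "x")
except StaleReadRefused as err:
    print(err)                     # error -- consistent, but unavailable
```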
Why CAP is necessary but not sufficient
CAP describes the extreme scenario — a network partition. But most of the time your network isn't partitioned. Most of the time, nodes can talk to each other. And in that normal case, CAP says nothing. That's where a newer model called PACELC (proposed by Daniel Abadi) becomes more useful.
If there's a Partition (P): choose between Availability (A) and Consistency (C)
Else (E), when the system is running normally: choose between Latency (L) and Consistency (C)
Written as: PAC / ELC
The ELC part is the important one for most systems in practice. Even when nothing is broken, if you want strong consistency, you have to pay in latency — because coordinating across multiple nodes takes time. Every write has to be acknowledged by multiple nodes before you can tell the client "done." Every read might need to check multiple nodes to ensure it has the latest value.
This is why DynamoDB, Cassandra, and CouchDB are designed as "PA/EL" systems by default — they sacrifice some consistency to keep latency low and availability high. Meanwhile, Google Spanner and CockroachDB are "PC/EC" systems — they provide strong consistency but you pay for it in latency, especially across geographic regions.
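One way to see the ELC cost in numbers: a strongly consistent write typically waits for acknowledgments from a quorum of replicas, so the client-visible latency is set by the slowest ack in that quorum. A small sketch with invented per-replica round-trip times (not any particular database's protocol):

```python
# Sketch: the latency a client sees for a write that must be acknowledged
# by some number of replicas. Round-trip times are invented for illustration;
# in a multi-region deployment the cross-region hops dominate.
replica_rtt_ms = {"us-east": 2, "us-west": 65, "eu-west": 85}

def write_latency_ms(rtts, acks_required):
    # The client can report "done" after the acks_required-th fastest ack.
    return sorted(rtts)[acks_required - 1]

rtts = list(replica_rtt_ms.values())
print(write_latency_ms(rtts, 1))   # 2  -- ack from the nearest replica only
print(write_latency_ms(rtts, 2))   # 65 -- majority quorum of 3
print(write_latency_ms(rtts, 3))   # 85 -- wait for every replica
```

With one ack the client sees 2 ms; with a majority quorum it sees 65 ms. That gap is the latency price of consistency, paid on every write.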
The consistency spectrum — it's not binary
Real systems don't flip between "perfectly consistent" and "anything goes." There's a spectrum of consistency models, each with a different set of guarantees and a different cost.
| Model | What it guarantees | Latency cost |
|---|---|---|
| Eventual | Writes will propagate — eventually. No guarantee on when. Conflicts resolved by the system (often by timestamp or version vector). | Lowest |
| Causal | If operation A caused operation B, all nodes see A before B. Unrelated ops can be in any order. | Low |
| Read-your-writes | After you write something, you will always read your own write. Others might not see it yet. | Low to medium |
| Serializable | All transactions appear to execute in some serial (one-at-a-time) order, even though they ran in parallel. | Medium to high |
| Linearizable | Every operation appears to take effect instantaneously at some point between its start and end. The gold standard — and the most expensive. | Highest |
The practical skill here is matching the consistency level to what your application actually needs — not picking the strongest one "to be safe." Most applications don't need linearizability. A social media feed works fine with eventual consistency. A bank balance does not.
User profile photo update: eventual consistency is fine. If someone updates their photo and another user sees the old one for 5 seconds, nobody is harmed.
Inventory count (e-commerce): serializable at minimum. Two people buying the last item in stock at the same time must not both succeed.
Financial ledger: linearizable. The balance must reflect every debit and credit immediately and correctly, with no possibility of reading a stale intermediate state.
Shopping cart contents: read-your-writes is enough. You must see your own additions, but you don't need to see other sessions' carts in real time.
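To make one of these levels concrete, here is a minimal read-your-writes sketch. The `Session` class and version-token scheme are hypothetical; real systems often get the same guarantee by pinning a session's reads to the replica that took its writes:

```python
class Session:
    def __init__(self, replicas):
        self.replicas = replicas  # each replica: dict of key -> (value, version)
        self.written = {}         # highest version this session has written per key

    def write(self, key, value):
        primary = self.replicas[0]
        _, version = primary.get(key, (None, 0))
        primary[key] = (value, version + 1)
        self.written[key] = version + 1

    def read(self, key):
        # Accept the first replica whose copy is at least as new as our own
        # writes; other sessions' writes may still be missing, and that's fine.
        needed = self.written.get(key, 0)
        for replica in self.replicas:
            value, version = replica.get(key, (None, 0))
            if version >= needed:
                return value
        raise TimeoutError(f"no replica has caught up to version {needed} for {key!r}")

# Two replicas; the second one lags behind the first.
primary = {"cart": ("empty", 1)}
lagging = {"cart": ("empty", 1)}
session = Session([primary, lagging])

session.write("cart", "1x book")   # lands on primary as version 2
print(session.read("cart"))        # "1x book" -- we never see our own stale write
```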
A common pitfall: defaulting to "eventual consistency" because it sounds cheaper, without defining what "eventual" means for your system. In some systems it means milliseconds. In others it means minutes, or "until the next anti-entropy run." "Eventual" is not a guarantee — it's a category. Before accepting eventual consistency, ask: what is the worst-case staleness window, and does my application break if reads are that stale?
Latency vs. Throughput
These two words are often used interchangeably by engineers who are new to performance thinking. They are not the same thing, and optimizing for one often hurts the other.
Latency — how long a single operation takes. Measured in milliseconds. Experienced by one user at a time.
Throughput — how many operations the system can complete per second. Measured in requests/sec, messages/sec, bytes/sec. A property of the system as a whole.
Here's a simple intuition. Imagine a highway. Throughput is how many cars cross the city per hour. Latency is how long it takes one car to get from one end to the other. A highway with narrow lanes but 20 of them might move thousands of cars per hour (high throughput) but each car is stuck in slow-moving traffic (high latency). A wide open highway at 3am has low throughput (few cars) but excellent latency for each car.
How they trade off in practice
Batching is the clearest example of the tension.
Suppose you're writing events to a database. If you write each event individually as it arrives, each write is a separate round-trip — low latency per write. But you pay the connection and transaction overhead once per event, which caps your throughput.
If instead you buffer 1,000 events and write them in a single batch, your throughput (events per second written to the database) goes up dramatically. But every event now waits up to the batch window before it's persisted. Latency has increased.
Individual writes:
E1→DB E2→DB E3→DB (3 round trips, low latency per event)
Batched writes (batch size = 3):
E1 E2 E3 → wait → [E1,E2,E3]→DB (1 round trip, higher latency per event, higher throughput)
Kafka, for example, lets you tune linger.ms (how long to wait to fill a batch) and batch.size (max bytes per batch) — directly exposing this trade-off.
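To make the mechanics concrete, here is a minimal sketch of a size-or-time micro-batcher. The `flush` callback, `max_batch`, and `linger_ms` names are hypothetical illustrations, not Kafka's API, though the last two play the same roles as batch.size and linger.ms:

```python
import time

class MicroBatcher:
    """Buffer events and flush when the batch is full or when the oldest
    buffered event has waited linger_ms. Bigger batches raise throughput;
    the linger window is the worst-case latency added to each event."""

    def __init__(self, flush, max_batch=1000, linger_ms=50):
        self.flush = flush            # callable that writes a list of events
        self.max_batch = max_batch
        self.linger_ms = linger_ms
        self.buffer = []
        self.oldest = None

    def submit(self, event):
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(event)
        waited_ms = (time.monotonic() - self.oldest) * 1000
        if len(self.buffer) >= self.max_batch or waited_ms >= self.linger_ms:
            self.drain()

    def drain(self):
        # One round trip for the whole batch. (A real batcher also runs a
        # timer so a quiet stream still flushes; this sketch checks only
        # on submit, plus an explicit drain at shutdown.)
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []

batcher = MicroBatcher(lambda batch: print(f"wrote {len(batch)} events"),
                       max_batch=3, linger_ms=50)
for event in range(7):
    batcher.submit(event)
batcher.drain()  # flush the tail
```

The design choice is the linger window: raising it buys throughput (bigger batches, fewer round trips) and pays for it in worst-case per-event latency.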
Parallelism and fan-out introduce a different angle.
If a user request requires fetching data from five different services, you can do it sequentially (low parallelism, easy to reason about) or in parallel (lower latency, but now you need to coordinate five concurrent operations). Parallel fan-out reduces latency — but the total system throughput might not increase if the downstream services become your bottleneck.
Worse: with fan-out, your latency is now determined by the slowest of the five services, not the average. This is where tail latency becomes critical.
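Here is a small asyncio sketch of both points, using a hypothetical `fetch` coroutine with simulated latencies: parallel fan-out pays roughly the maximum of the five calls rather than their sum, and that maximum is whatever the slowest service does:

```python
import asyncio, random, time

async def fetch(service: str) -> str:
    # Hypothetical downstream call; latency is simulated, with a rare slow outlier.
    delay = random.uniform(0.010, 0.030)
    if random.random() < 0.05:
        delay = 0.5
    await asyncio.sleep(delay)
    return f"{service}: ok"

async def sequential(services):
    return [await fetch(s) for s in services]                   # latency ~ SUM of calls

async def parallel(services):
    return await asyncio.gather(*(fetch(s) for s in services))  # latency ~ MAX of calls

async def timed(label, coro):
    start = time.monotonic()
    await coro
    print(f"{label}: {time.monotonic() - start:.3f}s")

services = ["users", "orders", "inventory", "pricing", "recs"]
asyncio.run(timed("sequential", sequential(services)))
asyncio.run(timed("parallel  ", parallel(services)))
```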
Tail latency: the number that actually matters
Most engineers monitor average latency. Average latency is a lie.
Imagine 99 requests that complete in 10ms and one request that takes 10,000ms. The average is about 110ms. But one percent of your users are waiting ten seconds. If your application makes multiple backend calls per user request, the chance that at least one of them hits that slow outlier grows quickly.
If one service call has a 1% chance of hitting a slow outlier (p99 latency),
and a single user request makes 100 independent calls to that service,
the probability that at least one call hits the slow case is:
1 − (0.99)^100 ≈ 63%
Your average latency looks fine. But over half your users are experiencing the worst case.
This is why large-scale systems obsess over p99 and p999, not averages.
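The same arithmetic for a few fan-out sizes, as a quick sketch:

```python
# Chance that at least one of n independent calls lands in the slow 1% tail.
for n in (1, 5, 10, 100):
    print(f"{n:>3} calls -> {1 - 0.99 ** n:5.1%} of requests hit the tail")
# 1 -> 1.0%, 5 -> 4.9%, 10 -> 9.6%, 100 -> 63.4%
```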
Tail latency is caused by many things: garbage collection pauses, lock contention, disk seeks, context switches, network jitter. You often can't eliminate the cause — but you can design around it.
One technique is hedged requests: you send the same request to two replicas simultaneously, and use whichever responds first. You're spending extra throughput (you're making 2x requests to get 1x responses) to buy better tail latency. It's a direct trade of one for the other.
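A minimal asyncio sketch of hedging, with a hypothetical `call_replica` coroutine and simulated latencies. Production implementations usually wait a short delay (say, the p95 latency) before sending the hedge, rather than firing both copies at once:

```python
import asyncio, random

async def call_replica(name: str) -> str:
    # Hypothetical replica call; latency is simulated, with a rare slow outlier.
    delay = random.uniform(0.005, 0.020)
    if random.random() < 0.01:
        delay = 1.0
    await asyncio.sleep(delay)
    return f"response from {name}"

async def hedged_request() -> str:
    # Send the same request to two replicas, keep the first answer,
    # cancel the loser: 2x the downstream work for a much better tail.
    tasks = [asyncio.create_task(call_replica(r)) for r in ("replica-1", "replica-2")]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request()))
```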
Little's Law: the math behind queues
There's a beautiful, simple equation that governs every queue in every system — from a database connection pool to a Kafka consumer group to a checkout line at a grocery store.
L = λ × W
Where:
L = average number of items in the system (queue depth)
λ (lambda) = average arrival rate (items per second)
W = average time each item spends in the system (latency)
Rearranged: W = L / λ
What this means in practice: if your system has a queue of 1,000 requests and your throughput is 100 requests/sec, each request spends an average of 10 seconds waiting. If you want to reduce that latency to 1 second without changing arrival rate, you have to reduce queue depth to 100 — which means either processing faster or shedding load.
This law is why latency degrades sharply as a system approaches its throughput limit. At 50% capacity, your queues are short and latency is fine. At 90% capacity, small bursts cause queues to spike. At 100%, latency grows without bound. The system looks healthy right up until it doesn't.
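You can see the blow-up with a simplifying model that is not in the text: assume an M/M/1 queue (Poisson arrivals, a single exponential server), where the average time in system is W = 1/(μ − λ). The exact model matters less than the shape of the curve:

```python
# W = 1 / (mu - lam): average time in system for an M/M/1 queue.
# The model is an illustrative assumption, not from the text; real services
# are messier, but the shape of the curve is the point.
mu = 100.0                            # service rate: requests/sec at full tilt
for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    lam = utilization * mu            # arrival rate
    w_ms = 1000.0 / (mu - lam)        # average latency, in milliseconds
    print(f"{utilization:4.0%} utilization -> {w_ms:7.1f} ms average latency")
# 50% -> 20 ms, 90% -> 100 ms, 99% -> 1000 ms: latency explodes near capacity
```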
A common pitfall: running a system at high utilization because "it looks fine." A system running at 85% capacity has almost no headroom for traffic spikes. When a spike hits, the queue depth explodes and latency blows up — often before your auto-scaling has time to react. Design for 50-60% steady-state utilization so you have room to absorb bursts.
Simplicity vs. Capability
This is the tension engineers talk about least and underestimate most. It doesn't show up in benchmarks. It doesn't have a theorem named after it. But it is responsible for more production incidents, more team burnout, and more failed projects than the other two tensions combined.
"Complexity is the enemy of reliability. A system you don't fully understand is a system you can't fully trust."
What complexity actually costs
Every piece of complexity you add to a system has a recurring tax. That tax is paid:
In understanding. Every new engineer who joins the team must build a mental model of the system. Each component they don't understand is a component they can't debug, can't improve, and can't safely change. Complex systems make onboarding slow and mistakes more likely.
In operations. Every extra moving part is something that can fail at 3am. Kafka, ZooKeeper, a service mesh, a caching layer, a message router, a schema registry — each one is a service that needs monitoring, tuning, upgrading, and incident response. Complexity compounds.
In blast radius. Simple systems fail in simple ways. Complex systems fail in complex ways. An interaction between component A and component B that nobody anticipated, triggered by a timing condition that only appears under load — these are the incidents that last 12 hours and produce 40-page post-mortems.
The capability seduction
Capability is seductive because it's visible. A new feature, a higher throughput number, support for a new use case — these are things you can show on a slide. The cost of the complexity that came with them is invisible until it isn't.
This is why systems accumulate complexity over time. Each individual addition is justified. The team needed that feature. That performance improvement was real. That edge case had to be handled. But the sum of ten individually reasonable decisions can be an unreasonably complex system.
A team adds Redis as a caching layer in front of their database. Read latency drops by 80%. Great result. Then they add cache warming logic for cold starts. Then a TTL management system for different data types. Then a cache invalidation service for when data changes. Then a fallback strategy for when Redis is down. Then monitoring for cache hit rate, eviction rate, memory pressure.

Each addition was justified. But now the "simple caching layer" is a distributed system in its own right, with its own failure modes, its own operational burden, and a dozen subtle consistency bugs waiting to be discovered. The original latency problem could have been solved by a read replica — one moving part instead of six.
The two types of complexity
Fred Brooks, in his 1986 paper No Silver Bullet, distinguished between two kinds of complexity in software. The distinction is just as useful in systems design.
Essential complexity — complexity that is inherent in the problem itself. A payment processing system that handles fraud, refunds, multi-currency, and chargebacks is genuinely complex. You can't make it simple without making it wrong.

Accidental complexity — complexity that you introduced through your choices. A clever abstraction that nobody understands. A protocol that handles cases that never occur. An architecture that was designed for 100x current scale on day one.
Your job as a systems designer is to minimize accidental complexity while absorbing the essential complexity with as clean a design as possible. But you can only do this if you've taken the time to understand what the essential complexity of your problem actually is — not what you imagine it might be in some hypothetical future.
You ain't gonna need it (YAGNI) at the systems level
YAGNI is usually talked about in the context of code. Don't build the feature you might need. Write the code you need today. The same principle applies at the architecture level, but the stakes are higher because architectural decisions are harder to undo than code decisions.
"We might need to support multiple regions someday" is not a reason to build a multi-region architecture today if you have 500 users and two engineers. "We might need to process 10 billion events per day" is not a reason to build a Kafka cluster today if you currently process 50,000.
The right question is not "what might we need?" but "what do we need to be true in order to grow from where we are to the next meaningful scale threshold, and is the complexity of building it now worth more than the cost of adding it later?"
Sometimes the answer is yes — database schema decisions, API contracts, and data model choices are hard to change later and worth thinking through carefully upfront. But most of the time, the answer is no. The cost of unnecessary complexity today is certain. The value of the capability you're buying is speculative.
"Designing for scale" without a specific, near-term reason to believe you'll need it. This produces systems that are complex to build, hard to operate, and often wrong — because the requirements at 1000x scale turn out to be different from what you imagined at 1x scale. It's almost always cheaper to build a simple system now and redesign a specific component when you actually hit the scaling constraint.
Surfacing Your Real Constraints
All three tensions share a common root cause when they go wrong: the team didn't take the time to articulate their actual constraints before making design decisions. They optimized for a property they didn't need, neglected one they did, or added capability that nobody was going to use.
Before you design anything, answer these questions in writing. Not in your head — in writing, because writing forces precision.
The constraint worksheet
| Constraint Type | The question to ask | Why it matters |
|---|---|---|
| Consistency | If a user reads stale data, what is the worst thing that happens? Is it a minor annoyance or a correctness violation? | Determines which consistency model you actually need |
| Availability | What is the business impact of 5 minutes of downtime? 1 hour? Can users retry, or is the window of time critical? | Sets your real availability target — often much less than 99.99% |
| Latency | At what latency does the user experience become unacceptable? Is this a human-facing request or a background job? | A background job that runs in 30 seconds doesn't need sub-100ms p99 |
| Throughput | What is your peak load today? What is a reasonable projection for 12 months? For 24 months? | Determines whether you need partitioning, sharding, or whether a single node is fine |
| Operational load | How many engineers will own this system? What is their operational bandwidth? | A three-person team cannot operate a 12-component distributed system effectively |
| Durability | What happens if you lose the last N minutes of data? Can you reconstruct it? Is it acceptable? | Determines replication factor and write confirmation requirements |
Once you have real answers to these questions, the right architecture often becomes obvious. Most systems have one or two genuinely tight constraints and several where a simple solution is fine. The work is figuring out which is which.
When you're choosing between two designs and they trade off against each other:
1. Name the trade-off explicitly — "Option A gives us lower latency but weaker consistency. Option B gives us strong consistency but higher latency."
2. Map each option to your constraints — Does the weaker consistency of Option A violate any of your consistency requirements? If not, it might be fine.
3. Estimate the cost of being wrong — If you choose Option A and later realize you needed stronger consistency, how hard is it to switch? If the answer is "rebuild from scratch," you need more confidence before choosing.
4. Write it down — Put the trade-off and your reasoning in the design doc. Six months from now, when someone asks "why did we do it this way?", the answer exists.
Chapter Summary
Every distributed system design is a set of conscious trade-offs between consistency, availability, latency, throughput, and simplicity. There is no globally optimal design — only the right design for your specific constraints.
A common pitfall: designing to maximise all properties simultaneously — or defaulting to the strongest guarantees "to be safe" — without checking whether those guarantees are actually required. The cost of over-engineering is real. So is the cost of under-engineering. The only way to avoid both is to know your constraints before you design.

Three questions to pressure-test any design:
1. What is the worst thing that happens if a user reads stale data for 5 seconds? (Consistency requirement)
2. What is the system's expected load 12 months from now, and does this architecture handle it without a rewrite? (Throughput requirement)
3. How many engineers will operate this system at 2am? Does the system's complexity match their operational bandwidth? (Simplicity requirement)