The Failure Taxonomy
Not all failures look alike. A machine that crashes is easy to reason about. A machine that silently drops every tenth message is much harder. A machine that sends deliberately wrong answers is hardest of all. The system you build depends entirely on which failures you're defending against.
What's in this chapter
- The five classes of failure — and why the distinction matters for design
- Crash failures: simple, detectable, and still surprisingly tricky
- Omission failures: the invisible enemy that looks like success
- Timing failures: when correctness has a deadline attached
- Byzantine failures: when a node lies to you
- Human failures: the most common class, and the least designed-for
- How failures compound and cascade across a distributed system
- A practical decision guide: what to design for given your actual threat model
Key Learnings — Read This First
Short on time? Read these before anything else. Come back to the full chapter when a specific type bites you in production.
1. Most systems only plan for crash failures, which are the easiest kind to handle. Omission and timing failures are far more common in production and far more insidious.
2. A slow node is harder to handle than a dead node. A dead node stops sending responses. A slow node keeps sending them — just too late to be useful — and your timeout logic has to figure out the difference.
3. Omission failures are silent by definition. No error is raised. No exception is thrown. The message was sent — it just never arrived. Your system has to be designed to notice the absence of a response, not the presence of an error.
4. You almost certainly don't need to defend against Byzantine failures in a data center environment with trusted machines. But you do need to plan for them at system boundaries: third-party APIs, user-supplied data, and hardware at the end of its life.
5. Human failures cause the majority of outages. Yet most distributed systems are designed exclusively around machine failures. Runbooks, access controls, deployment guards, and feature flags are failure-tolerance mechanisms just as much as redundancy and replication are.
6. Failures compound. A timing failure under load causes a retry storm that causes an omission failure on a downstream service that causes a crash failure on its database. Design your failure model around the cascade, not the individual component.
7. Your failure model is a contract. The algorithms you choose, the guarantees you provide, and the SLAs you promise are all only valid within the bounds of the failure model you declared. Step outside it and all bets are off.
Why You Need a Vocabulary for Failure
Here's a question that sounds simple: "What happens when a node in your system fails?"
If you're like most engineers, your instinct is to say something like "the node goes down, we detect it, we reroute traffic." That's a fine answer for one specific kind of failure. But "failure" is not one thing. A node can fail in radically different ways, and each of them requires a different defensive strategy. Using one word for all of them is like a doctor using one word — "sick" — for both a broken leg and a viral infection. The word is technically accurate, but it doesn't tell you what treatment to apply.
The failure taxonomy we're about to walk through isn't academic. Every design decision you make about retries, timeouts, replication, and consistency assumes something about the kind of failures you're protecting against. If your assumptions are wrong, your design is solving the wrong problem.
In a single-machine program, failure is usually all-or-nothing. The process crashes, or it doesn't. In a distributed system, you have many components, and any subset of them can fail independently while the rest keep running. This is called a partial failure, and it's the defining characteristic — and fundamental difficulty — of distributed systems. All of the failure types below are really categories of partial failure.
Crash Failures — The Easy Case
A crash failure is the simplest kind: a node stops working and never does anything again. It doesn't send incorrect data. It doesn't behave intermittently. It just goes quiet.
Crash failures are sometimes called fail-stop failures, because the node stops — and importantly, other nodes can eventually figure out that it stopped. If you send a request to a crashed node and it never responds, you time out, declare it dead, and move on.
What causes them
The obvious ones: a power outage hits the data center, the operating system kernel panics, a process runs out of memory and the OOM killer terminates it, a hardware failure causes the disk controller to go offline. These are clean failures. The machine is gone.
Less obvious: a process can effectively crash from a software perspective while the machine is still physically running. A Java process that gets stuck in a full garbage collection pause for 60 seconds is functionally crashed from the perspective of anyone waiting on it — even though the OS reports it as running.
Why they're the "easy" case
Crash failures are easy to reason about because they satisfy a useful property: a crashed node doesn't lie. It doesn't send you the wrong data. It doesn't acknowledge a write it didn't actually persist. It doesn't vote on both sides of a consensus decision. The moment it crashes, it stops participating entirely.
This makes them tractable. Want to survive one node crashing? Replicate to at least two. Want to survive two crashing simultaneously? Replicate to three. The math is clean.
A pure crash failure assumes the node is gone forever. Real systems have crash-recovery failures: a node crashes, loses its in-memory state, and comes back online. Now you have a node that is running, but behind on events, potentially replaying old data, and unaware of what it committed before the crash. This is significantly harder to handle correctly. We cover the recovery side in depth in the chapters on replication (Ch 7) and consensus (Ch 12).
The detection problem
Even the simple crash failure has one non-obvious trap: you cannot reliably distinguish a crashed node from a very slow node or a network partition. When you send a request and don't get a response, you don't know whether the node crashed, the request got lost in the network, the response got lost in the network, or the node is just very slow and is still computing. From where you're sitting, all of these look identical: silence.
This is why timeouts don't tell you what happened — they only tell you that too much time passed. What you do after a timeout depends on your assumptions about which of those scenarios is most likely. We'll come back to this when we discuss timing failures below.
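To make the ambiguity concrete, here is a minimal Python sketch of a timeout-based health probe, using the `requests` library and a hypothetical internal endpoint. Note that the failure branch cannot name a cause; it can only report that time ran out:

```python
import requests

def check_replica(url: str, timeout_s: float = 2.0) -> bool:
    """Probe a replica. False means "no answer in time", nothing more."""
    try:
        requests.get(url, timeout=timeout_s)
        return True
    except requests.exceptions.RequestException:
        # All of these look identical from the caller's side:
        #   - the node crashed before receiving the request
        #   - the request was lost on the way there
        #   - the response was lost on the way back
        #   - the node is alive but slow (GC pause, overload, ...)
        return False

# "Dead" here really means "silent for longer than we chose to wait".
alive = check_replica("http://replica-3.internal:8080/health")  # hypothetical endpoint
```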
Omission Failures — The Invisible Enemy
An omission failure occurs when a component fails to respond to some requests — not all of them, just some. The node is still running. It still processes many requests correctly. But a subset of messages get dropped somewhere along the way: in the sender's buffer, in the network, or in the receiver's input queue.
This is more dangerous than a crash failure, because there is no signal. No exception is thrown. No error is returned. The message was sent — your code returned successfully after the send() call — and then it vanished. The only way to know something went wrong is to notice that you never received an expected reply.
Where messages actually get dropped
Understanding where drops happen helps you understand how to defend against them:
- Send buffer overflow: If a node is producing messages faster than the network can carry them, the OS send buffer fills up. New messages get dropped silently. `send()` may still return success, depending on the socket mode you're using. (A tiny demonstration follows this list.)
- Network congestion: Routers drop packets when their queues fill up. TCP will retransmit these, but only up to its own timeout. UDP drops them with no recovery at all. Even TCP connections can get reset during extreme congestion.
- Receive buffer overflow: If the receiving process isn't reading from its socket fast enough, the receive buffer fills up. The operating system starts dropping incoming packets. Again, no error is returned to the sender.
- Firewall or load balancer rules: A firewall rule changed mid-deployment, or a load balancer is routing certain request types to a backend that can't handle them. Some requests make it through, others don't.
- Process-level drops: A server that is overloaded may accept the TCP connection (so the sender thinks the request was received) and then simply never process it — the request sits in the application's work queue until it times out and is discarded.
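The demonstration promised above, as a sketch: a datagram socket aimed at a hypothetical unroutable address. The return value of a send reports local buffer acceptance and nothing more:

```python
import socket

# sendto() returning a byte count means only that the local OS accepted
# the datagram into its send buffer. Nothing here proves delivery.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sent = sock.sendto(b"ping", ("10.255.255.1", 9))  # hypothetical blackholed address
print(f"sendto() reported {sent} bytes; whether they arrived is unknowable here")
```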
The partial write that looked like a success
A service writes to Kafka. The Kafka client's send() call returns a Future that resolves successfully. The service moves on. But Kafka was configured with acks=1, meaning only the leader acknowledged the write. Before the followers replicated it, the leader crashed. The message is gone. No error was ever raised. The service has no idea. This is a send omission failure combined with replication lag — and it's entirely silent.
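One way to close that window is to demand acknowledgement from all in-sync replicas and block on the result. Below is a sketch using the kafka-python client, with a hypothetical broker address and topic; note that the broker-side `min.insync.replicas` setting must also be raised for `acks="all"` to provide real durability:

```python
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",  # hypothetical broker address
    acks="all",   # leader AND all in-sync replicas must confirm the write
    retries=5,    # client-side retries on transient broker errors
)

future = producer.send("orders", b'{"order_id": 42}')  # hypothetical topic
# Block until the broker confirms. A resolved future is the evidence of
# durability; the send() call returning proves nothing by itself.
metadata = future.get(timeout=10)
print(f"written to partition {metadata.partition} at offset {metadata.offset}")
```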
How to design around omissions
The fundamental defense against omission failures is acknowledgement plus retry. You treat a message as delivered only when the receiver explicitly confirms it. If confirmation doesn't arrive within a timeout, you resend. This sounds simple; making it correct is not.
The immediate problem with naive retry is that you might deliver the message twice: once originally (it did arrive, the acknowledgement just got lost) and once on the retry. This is why idempotency — making it safe to process the same message more than once — is foundational to distributed systems design. We dedicate a full chapter to it (Ch 18) because it's that important.
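Here is a minimal in-process sketch of the pattern, with a hypothetical sender and receiver and the network hop elided. The essential moves: one idempotency key per logical operation (not per attempt), and receiver-side deduplication:

```python
import uuid

processed: set[str] = set()  # receiver-side dedup store (a durable table in practice)

def apply_side_effect(payload: str) -> None:
    print(f"applied: {payload}")  # stand-in for real business logic

def handle(request_id: str, payload: str) -> str:
    """Receiver: process each idempotency key at most once."""
    if request_id in processed:
        return "ok"               # re-acknowledge, but do not re-apply
    apply_side_effect(payload)
    processed.add(request_id)
    return "ok"

def send_with_retry(payload: str, attempts: int = 3) -> str:
    """Sender: reuse one key across retries, so duplicates are harmless."""
    request_id = str(uuid.uuid4())  # one key per logical operation, not per attempt
    for attempt in range(attempts):
        try:
            return handle(request_id, payload)  # imagine a lossy network hop here
        except TimeoutError:
            continue  # the ack may have been lost; resending the same key is safe
    raise RuntimeError("gave up; the operation's outcome is unknown")
```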
Teams running stream processing pipelines often assume their event log is complete because no errors were recorded. In an at-least-once delivery system, errors are only raised for messages that the system knew about. Silently dropped messages never enter the system and never generate an error. If your pipeline doesn't have end-to-end accounting (reconciling input record counts with output record counts), you may have been quietly losing data for months.
Timing Failures — When Late Is Wrong
In a real-time or time-sensitive system, a response that arrives too late is functionally wrong — even if the data in it is perfectly correct. A navigation system that tells you to turn right 500 meters after the turn is not a working navigation system. A stock trade confirmation that arrives after market close is not a confirmation that matters.
But timing failures matter even in systems without hard real-time requirements. Any system that uses timeouts to detect failures is implicitly relying on timing assumptions. And those assumptions are almost always more fragile than people think.
The deep problem with time in distributed systems
Every node in your system has its own clock. Those clocks drift apart. They can jump backwards when NTP synchronizes them. A virtual machine can be paused by the hypervisor for seconds (or minutes) and then resume with no idea that time passed. A process can be paused by a stop-the-world garbage collector. A kernel can delay a process for an arbitrarily long time due to CPU scheduling.
The consequence is profound: you cannot use wall-clock time to determine the order of events across different machines. If machine A records an event at 10:00:00.000 and machine B records an event at 10:00:00.001, you cannot conclude that A's event happened first. A's clock might be running 10 milliseconds fast. B's NTP update might have just corrected a drift. You simply don't know.
"The most insidious timing failures are the ones your system was designed to tolerate — at the wrong threshold."
— A recurring lesson from large-scale incident post-mortems
Synchronous vs. asynchronous systems
A synchronous distributed system is one where every message is delivered within a known, bounded time. Every node responds within a bounded time. Clocks drift by no more than a known amount. In such a system, timing failures are well-defined and detectable: if a response doesn't arrive within the bound, something went wrong.
An asynchronous distributed system makes none of these guarantees. Messages can take arbitrarily long to arrive. Nodes can pause for arbitrarily long periods. There are no bounds.
The internet, and most modern cloud infrastructure, is asynchronous. Your network might deliver 99.99% of messages in under 10ms. But occasionally it takes 30 seconds. Occasionally it takes minutes. You can't rule it out. This means timeouts — which you absolutely need — are always making a guess, not a measurement. When a timeout fires, you know that too much time passed. You don't know why, and you don't know the state of the request at the other end.
The timeout dilemma
Setting the right timeout value is harder than it looks:
- Too short: You declare nodes dead that are merely slow. Under load, when nodes slow down, your system starts treating them as failed, removing them from the pool, increasing load on the remaining nodes, which then slow down further. This is a textbook cascade into total failure.
- Too long: Your failure detection is sluggish. When something does fail, your system takes a long time to recover. Users wait, queues build up, and the downstream effects spread while you wait out your timeout.
The right answer is not a fixed number. It is a timeout calibrated to your system's actual response time distribution, with headroom for the tail. Many teams set timeouts based on average latency, then get surprised when p99 latency is 10x the average under load. Your timeout needs to account for the tail, not the average.
A reasonable starting point: set your timeout to 2–3× your p99 latency under expected load. Then test it under 2× expected load, because that's when you'll actually need it. If your p99 at normal load is 80ms but 600ms under peak load, a 200ms timeout will cause cascade failures at exactly the moment you can least afford them.
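As a sketch of that rule of thumb, assuming you have raw latency samples from a load test (the numbers below are invented):

```python
import statistics

def calibrated_timeout(latencies_ms: list[float], multiplier: float = 2.5) -> float:
    """Timeout = multiplier x p99, per the rule of thumb above."""
    p99 = statistics.quantiles(latencies_ms, n=100)[-1]  # last cut point is p99
    return multiplier * p99

# Samples should come from a load test at ~2x expected traffic,
# not from an idle staging environment.
samples = [12.0, 15.5, 11.2, 80.4, 14.1, 13.3, 95.0, 12.8] * 50  # invented data
print(f"timeout: {calibrated_timeout(samples):.0f} ms")
```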
Clocks and ordering
Timing failures also appear in a subtler form when you try to use timestamps to establish event ordering. Say two services both write to a shared database with a "last write wins" conflict resolution policy. Service A writes at 12:00:00.100 and service B writes at 12:00:00.090. B's timestamp is earlier, so A's write wins — right?
Only if their clocks are perfectly synchronized. In practice, B's clock might be 50ms slow relative to A's. Then B's write stamped 12:00:00.090 might have physically happened after A's write at 12:00:00.100. You just threw away the more recent data. This isn't a theoretical concern — it's a well-documented production failure mode in systems that use wall-clock timestamps for ordering. Chapter 15 covers the alternatives: logical clocks and hybrid logical clocks, which don't have this problem.
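A toy illustration with invented timestamps: last-write-wins by wall clock quietly discards the physically newer write.

```python
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    wall_clock_ms: int  # timestamp taken from the writing node's own clock

# Physically, A writes first and B overwrites it 40ms later,
# but B's clock runs 50ms slow, so B's timestamp reads as "earlier".
a = Write("v1-from-A", wall_clock_ms=100)  # true time: t=100ms, accurate clock
b = Write("v2-from-B", wall_clock_ms=90)   # true time: t=140ms, clock 50ms slow

winner = max([a, b], key=lambda w: w.wall_clock_ms)  # last-write-wins by timestamp
print(winner.value)  # "v1-from-A": the physically newer write was silently discarded
```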
Byzantine Failures — When a Node Lies to You
A Byzantine failure is the most general and most dangerous kind. A Byzantine node can do anything: send different answers to different nodes, send data that is internally inconsistent, selectively drop messages to some nodes but not others, respond correctly most of the time and incorrectly at rare intervals, or actively try to manipulate the system's behavior.
The name comes from the Byzantine Generals Problem, a thought experiment from 1982. Imagine several army generals surrounding a city, who need to agree on a plan: attack or retreat. They communicate by messenger. Some of the generals are traitors who will send different messages to different generals to cause disagreement. The question is: can the loyal generals reach agreement despite the traitors?
The answer is yes — but only if fewer than one third of the generals are traitors, and only with specific algorithms. If a third or more are traitors, agreement is impossible. This mathematical result has direct implications for distributed systems: defending against Byzantine failures is fundamentally more expensive than defending against crash failures.
What actually causes Byzantine behavior in practice
Byzantine failures aren't just academic. In real systems, they arise from:
- Hardware errors: Cosmic rays flipping bits in RAM (no, really — this has caused Bitcoin mining pools to accept invalid blocks). Faulty NICs that corrupt packets before the checksum is computed. SSDs that silently return wrong data (sometimes called "bit rot").
- Software bugs: A database that returns stale cached data for some queries but fresh data for others. A service with a race condition that produces inconsistent responses to concurrent requests. A library that behaves differently under memory pressure than under normal conditions.
- External inputs you don't control: A third-party API that returns inconsistent data during a degraded state. A user who sends carefully crafted input designed to exploit your system.
- Malicious actors: Relevant mainly for blockchains, financial systems, and multi-tenant infrastructure, but worth naming explicitly.
When do you actually need to defend against Byzantine failures?
The honest answer: rarely in the interior of a standard data center environment. If you're running your own Kubernetes cluster on machines you own and trust, Byzantine fault tolerance (BFT) is probably not your problem. The overhead of BFT consensus algorithms is significant — they require at least 3f+1 nodes to tolerate f faulty nodes, and their message complexity is much higher than crash fault-tolerant algorithms like Raft.
However, you almost certainly do need Byzantine-style defenses at these specific boundary points:
- Any input from users or external callers. Never trust data that crosses a system boundary. Validate it, sanitize it, and treat it as potentially adversarial.
- Third-party service integrations. An external API can start returning unexpected data formats, inconsistent state, or errors disguised as successes. Your integration layer should validate responses before trusting them.
- Hardware at end of life or under stress. Aging storage systems, overheating machines, and hardware with detected (but not yet repaired) errors can produce silent data corruption. If you're storing anything important, checksums and data integrity validation are your Byzantine defense.
- Multi-tenant or multi-party systems. If your system involves parties with conflicting interests who can influence each other's outputs, Byzantine tolerance is a hard requirement. Blockchains exist precisely because of this.
For most systems, you don't need full Byzantine fault-tolerant consensus. What you do need is: end-to-end checksums on stored data (so silent corruption is detected), input validation at system boundaries (so malformed data is caught early), and idempotency keys (so a system sending duplicate or conflicting requests can't corrupt your state). These three practices protect you against the Byzantine failures you're actually likely to encounter.
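The first of those three, end-to-end checksums, fits in a few lines. A sketch using SHA-256; in practice the digest would be persisted alongside the data and verified on every read:

```python
import hashlib

def store(blob: bytes) -> tuple[bytes, str]:
    """Compute the checksum before the write and keep it with the data."""
    return blob, hashlib.sha256(blob).hexdigest()

def load(blob: bytes, expected_digest: str) -> bytes:
    """Verify on every read: silent corruption becomes a loud error."""
    if hashlib.sha256(blob).hexdigest() != expected_digest:
        raise IOError("checksum mismatch: storage returned corrupted data")
    return blob

data, digest = store(b"important record")
load(data, digest)  # passes
try:
    load(b"important recore", digest)  # one corrupted byte
except IOError as e:
    print(e)  # the corruption is detected instead of propagating
```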
Human Failures — The One Everyone Forgets
Every taxonomy of distributed systems failures lists crash, omission, timing, and Byzantine failures. Almost none of them prominently list human failures. This is a significant omission, because study after study of real-world outages finds that the majority of serious incidents involve a human action as either the direct cause or a contributing factor.
Human failures include: running a database migration script on production instead of staging. Deploying a config change that disables authentication on an internal endpoint. Accidentally deleting a storage bucket. Running a backfill job that unexpectedly saturates your database's write throughput. Changing a feature flag during a traffic spike. Pushing an emergency hotfix that bypasses testing.
These aren't failures of competence. The engineers who caused them were, in virtually every post-mortem, experienced people making a reasonable decision in a stressful moment with incomplete information. The failures were failures of system design — the system didn't make the dangerous action hard enough to do by accident.
Why human failures are different
Machine failures are random. A disk fails based on its error rate and age. A network packet drops based on congestion. These failures are not correlated with what your system is doing at the time — they're essentially background noise.
Human failures are not random. They cluster around deployments, maintenance windows, incident responses, and high-pressure moments. This is exactly when your system is already stressed. A human failure during an incident response can turn a P3 into a P1. The failure arrives precisely when you're least equipped to handle it.
Designing for human failures
The good news is that human failures are among the most tractable to design around, because they follow predictable patterns:
- Make dangerous operations require explicit confirmation. A `DROP TABLE` that requires you to type the table name. A deletion API that requires an extra confirmation field. A deploy script that prints "You are about to deploy to PRODUCTION. Type 'yes' to continue." (A minimal guard of this shape is sketched after this list.)
- Separate environments aggressively. Staging and production should have different credentials, different access controls, and ideally different enough naming that a confused human realizes something is off. "prod-us-east-db.internal" versus "staging-db.internal" is better than "db1" versus "db2".
- Soft deletes before hard deletes. Instead of deleting data immediately, mark it as deleted and keep it for 30 days. This turns a catastrophic human error into a recoverable one.
- Rate limiting and blast radius controls on operational tools. A backfill script shouldn't be able to consume more than 20% of database capacity by default. A migration shouldn't be able to lock a table for more than 1 second at a time.
- Canary deployments and feature flags. A bad deploy that only touches 1% of traffic is a human-recoverable event. A bad deploy that touches 100% of traffic simultaneously might not be.
- Runbooks that work under stress. The steps for handling an incident should be short, numbered, and executable by someone who is tired and scared. Prose documentation that requires judgment is documentation that will be misread at 3am.
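A minimal sketch of the first item in that list: a confirmation guard that makes the operator retype the resource name before anything destructive runs (`execute_drop` is a hypothetical stand-in):

```python
def execute_drop(table_name: str) -> None:
    print(f"dropped {table_name}")  # stand-in for the real destructive call

def guarded_drop(table_name: str) -> None:
    """Require the operator to retype the exact resource name."""
    typed = input(
        f"You are about to DROP '{table_name}' in PRODUCTION.\n"
        f"Type the table name to continue: "
    )
    if typed != table_name:
        raise SystemExit("confirmation mismatch; nothing was dropped")
    execute_drop(table_name)
```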
The incident that started with a typo
An engineer responding to a slow database runs a query to identify locked transactions. They type the wrong table name. The query runs an unintentional full scan on the largest table in the database, saturating I/O. The database slows further. Other engineers, seeing the slow database, kick off their own diagnostic queries. Within three minutes, a slow database has become a completely unresponsive one. The original incident was a 5-minute annoyance. The human failure cascade turned it into a 90-minute outage. The fix: query timeout guardrails on the production database that prevented diagnostic queries from running for more than 10 seconds.
Compound Failures — The Real Danger
We've described five failure types as if they're separate categories. In production, they rarely arrive alone. The most serious outages in distributed systems are almost always the result of multiple failures interacting with each other — where each individual failure is survivable, but the combination is not.
This is sometimes called a failure cascade. The slow, partial degradation that feeds it (components limping rather than failing outright) is what researchers call gray failure. The system degrades gradually, with each small failure making the next one more likely, until a tipping point is reached and the whole thing collapses suddenly. From the outside, it looks like the system "suddenly went down." From the inside, there were hours of warning signs, none of which crossed an alarm threshold on its own.
A realistic failure cascade
Consider this sequence — every step is individually survivable:
1. A background batch job runs during peak traffic hours (human failure: nobody set a schedule constraint). It saturates the database's read I/O.
2. The database slows down. Read latencies climb from 5ms to 200ms (timing failure begins).
3. A downstream service has a 150ms timeout on database reads. Some requests now time out. The service retries. Each retry generates a new database query. Database load increases further.
4. The retry storm overwhelms the database's connection pool. New connections start being refused. Some requests see a connection refused error (omission failure).
5. Connection failures are not handled gracefully by one service — it crashes (crash failure).
6. A health check declares that service unhealthy and removes it from the load balancer. Traffic redistributes to the remaining instances, which are now handling 50% more load each.
7. The remaining instances, already stressed, start timing out on their own downstream calls. The cascade spreads laterally.
The root cause was a poorly scheduled batch job. The failure mode that actually took down the system was the retry storm amplifying database load. Fixing only the root cause doesn't fix the system — you also need exponential backoff with jitter on retries, connection pool limits, and database query timeouts with maximum concurrency controls.
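A sketch of the retry-side fix: exponential backoff with "full" jitter, where each wait is drawn uniformly between zero and a capped exponential bound. The randomness is what stops a thousand failing clients from retrying in lockstep:

```python
import random
import time

def retry_with_backoff(op, max_attempts: int = 5,
                       base_s: float = 0.1, cap_s: float = 10.0):
    """Retry op(), sleeping a random time in [0, min(cap, base * 2^attempt)]."""
    for attempt in range(max_attempts):
        try:
            return op()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            # Full jitter: the randomness spreads retries out in time so
            # clients that failed together do not retry together.
            time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
```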
A system is metastable when it has a stable state under low load and a different stable state under high load — and a self-reinforcing feedback loop can tip it from one to the other. Retry storms are the classic example: load increases → latency increases → timeouts fire → retries happen → load increases further. The system can't recover on its own even after the original cause is removed, because the retries are the cause now. The only way out is to actively reduce load — by dropping requests, shedding retries, or adding capacity rapidly.
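And a sketch of the "actively reduce load" escape hatch: a crude in-flight limiter that rejects excess work immediately instead of queueing it (names and limits are illustrative):

```python
import threading

class OverloadedError(Exception):
    """Tells the caller to back off (e.g., an HTTP 503 at the edge)."""

class LoadShedder:
    """Cap in-flight work; reject the excess immediately rather than queue it."""
    def __init__(self, max_in_flight: int = 100):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, handler, request):
        if not self._slots.acquire(blocking=False):
            raise OverloadedError("shedding load")  # fail fast, do not queue
        try:
            return handler(request)
        finally:
            self._slots.release()
```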
Your Failure Design Guide
Given all of the above, here is a practical framework for deciding what you need to design for. Start by answering these questions honestly about your system:
| Failure Type | You must design for this if... | Core defense | Common mistake |
|---|---|---|---|
| Crash | Any component is stateful, or availability matters at all | Replication and automatic failover | Detecting crashes too slowly (timeout too long) |
| Omission | You use any network call, queue, or async messaging | Acknowledgements, retries, idempotency | Assuming send() success means delivery |
| Timing | Always — all distributed systems have timing issues | Calibrated timeouts, backpressure, logical clocks for ordering | Setting timeouts based on average latency, not tail latency |
| Byzantine | External inputs, third-party APIs, multi-party systems, long-lived storage | Input validation, checksums, Byzantine consensus where necessary | Designing for trusted internals but forgetting external boundaries |
| Human | Always — humans are in the loop in every system | Soft deletes, confirmation guards, rate limits on ops tools, runbooks | Treating this as a "process problem" rather than a design problem |
The failure model is a contract
Every algorithm that provides a distributed systems guarantee — Raft, Paxos, two-phase commit, every consistency model — is only correct within an assumed failure model. Raft guarantees consensus among honest nodes assuming crash failures. It does not guarantee anything if a node starts sending conflicting votes to different peers (Byzantine behavior). ZooKeeper's session and failure-detection machinery assumes clocks drift within known bounds — run it on VMs whose clocks pause or jump wildly and your applications may see anomalies.
When you choose an algorithm or a component, you are implicitly signing a contract that says: "I will not expose this component to failure modes outside its assumed model." Step outside that model — even briefly, even accidentally — and the guarantees disappear. This is not a failure of the algorithm. It is a failure to honor the contract.
Write your failure model down. Explicitly. Include it in your design document. "This system is designed to tolerate crash failures among up to f replicas simultaneously. It assumes omission failures are handled at the transport layer by TCP. It does not tolerate Byzantine failures within the cluster, but validates all inputs at the ingress boundary." Stating this clearly is not bureaucratic overhead. It is the specification that tells you when your guarantees are valid and when they are not.
The failure type determines the defense. Designing a system without naming the failures it tolerates is like writing code without specifying what inputs are valid — correct by accident at best.
The most common anti-pattern is designing exclusively for crash failures (easy, visible, clean) while ignoring omission and timing failures (silent, partial, hard to reproduce in tests). Most production incidents are timing or omission failures, or human failures cascading through a system that was only designed for crashes. Three questions worth asking in every design review:
- What happens if this component becomes slow but doesn't crash — does your system detect it, and what does it do?
- Is every message in this system acknowledged? What is the retry behavior, and is every operation idempotent?
- What is the most likely human failure mode here, and have you made it at least somewhat hard to do by accident?