The Problem with Partial Failure
When a machine crashes completely, at least the failure is obvious. The process is gone. Other services stop getting responses. Their circuit breakers trip. Alerts fire. Someone looks at a dashboard. Life is simple.
But what happens when a service doesn't crash — it just gets slow? Maybe its database is under load and queries that normally take 5 milliseconds now take 5 seconds. Maybe there's a network partition that drops one in ten packets. Maybe a downstream API is returning responses, but those responses are malformed, and parsing them consumes 100% of a CPU core.
These are partial failures. They are harder to detect, harder to diagnose, and — most importantly — far more likely to spread. A completely dead service fails fast. A slow service holds resources.
The fundamental insight is this: failures don't stay where they start. They travel upstream through resource exhaustion. Threads fill up. Connection pools drain. Memory grows. Request queues back up. What started as a single slow service eventually takes down everything that depends on it — and everything that depends on those things.
The goal of the techniques in this chapter is to contain failure to its origin. If Service B is slow, Service A should degrade gracefully — returning partial results, or a cached response, or an error it can explain to the caller — without losing its ability to handle unrelated requests.
Bulkheads: Containing the Blast
The name comes from ship design. A ship's hull is divided into sealed compartments. If the hull is breached in one section, water floods that compartment — not the whole ship. The ship stays afloat.
In distributed systems, a bulkhead is any mechanism that prevents a failure in one part of a system from consuming the shared resources of another part.
The Problem: Shared Thread Pools
The most common place where bulkheads are missing is in shared thread pools. Here is a typical setup, one that many services run without realizing the risk:
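A minimal sketch of that setup, with illustrative class and method names (not taken from any particular framework):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderService {
    // One pool, shared by every outbound call this service makes.
    private final ExecutorService sharedPool = Executors.newFixedThreadPool(200);

    Future<String> callPaymentService() { return sharedPool.submit(this::paymentCall); }
    Future<String> callUserService()    { return sharedPool.submit(this::userCall); }
    Future<String> callCatalogService() { return sharedPool.submit(this::catalogCall); }

    // Placeholders for blocking HTTP calls to each downstream service.
    private String paymentCall() { return "payment-response"; }
    private String userCall()    { return "user-response"; }
    private String catalogCall() { return "catalog-response"; }
}
```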
If the payment service becomes slow, payment calls start holding threads. As more requests come in, more threads are consumed by waiting payment calls. Within seconds, all 200 threads are blocked waiting for payment. Now user service calls and catalog service calls — which are working perfectly fine — also start failing, because there are no threads left to run them.
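The fix is to give each dependency its own pool. A sketch of the bulkheaded version, again with illustrative sizes:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderServiceWithBulkheads {
    // One pool per downstream dependency: a slow dependency can only
    // exhaust its own threads, never its neighbors'.
    private final ExecutorService paymentPool = Executors.newFixedThreadPool(50);
    private final ExecutorService userPool    = Executors.newFixedThreadPool(75);
    private final ExecutorService catalogPool = Executors.newFixedThreadPool(75);

    Future<String> callPaymentService() { return paymentPool.submit(this::paymentCall); }
    Future<String> callUserService()    { return userPool.submit(this::userCall); }
    Future<String> callCatalogService() { return catalogPool.submit(this::catalogCall); }

    private String paymentCall() { return "payment-response"; }
    private String userCall()    { return "user-response"; }
    private String catalogCall() { return "catalog-response"; }
}
```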
Now if payment becomes slow, it exhausts its own pool of 50 threads. User calls and catalog calls continue running normally. Your service degrades in one dimension — payment is broken — but it does not fall over entirely.
Connection Pool Bulkheads
The same logic applies to database connection pools and HTTP client connection pools. If you have one connection pool shared across all database operations, a slow query that holds connections will starve fast queries of connections. Split your pools by criticality or by operation type:
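A sketch of what that split might look like, using HikariCP as one example; pool names and sizes are illustrative:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class DatabasePools {
    // Latency-sensitive queries get their own pool...
    static HikariDataSource fastPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.internal:5432/app");
        config.setPoolName("fast-queries");
        config.setMaximumPoolSize(30);
        return new HikariDataSource(config);
    }

    // ...so a slow reporting query can never starve them of connections.
    static HikariDataSource reportingPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.internal:5432/app");
        config.setPoolName("reporting-queries");
        config.setMaximumPoolSize(10);
        return new HikariDataSource(config);
    }
}
```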
Bulkheads at the Service Level
The ultimate bulkhead is a separate process. Microservices — when done for the right reasons — are a form of bulkhead. If your checkout flow and your recommendation engine run in separate services, a memory leak in recommendations doesn't take down checkout.
The trade-off is operational overhead. Every service boundary you add is a service you have to deploy, monitor, and operate. Don't add service boundaries just for isolation. Add them where the blast radius of a failure justifies the operational cost.
Bulkheads are most valuable for dependencies that are slow and unpredictable — third-party APIs, shared internal services with variable SLAs, or any dependency that does heavy computation. For dependencies that either work or fail fast (like a local cache), the bulkhead overhead often isn't worth it.
Timeouts: The Most Common Mistake
Ask an engineer if their service has timeouts on outbound calls. They will almost certainly say yes. Then ask them what the timeout values are, and why. Most of the time, the answer is some version of "I think it's 30 seconds, it came from a config file someone wrote years ago."
Timeouts are everywhere and almost universally misconfigured. Too long, or missing entirely.
Why Missing Timeouts Are a Crisis Waiting to Happen
Without a timeout, a thread waiting on a slow network call waits forever. Not for a long time. Forever. Or until the OS-level TCP keepalive kicks in — which defaults to hours on most Linux systems. During that time, the thread is occupied. It cannot serve other requests. If enough such threads pile up, your service grinds to a halt.
Many HTTP client libraries have no default timeout, or a default of several minutes. If you don't explicitly set one, you have no timeout. Always check the documentation for the specific client library you're using.
Two Timeouts You Need to Set
Most network calls have two distinct phases, and you need to set a timeout for each:
| Timeout Type | What It Covers | Typical Value | What Happens If Too Long |
|---|---|---|---|
| Connect Timeout | Time to establish the TCP connection | 1–3 seconds | Threads stuck during network partition or host unreachable |
| Read Timeout | Time to receive the first byte (or full response) after connection | Varies by operation | Threads stuck waiting on a slow or silent server |
Connect timeouts can usually be short — a few seconds. If a TCP connection isn't established in 3 seconds, something is wrong with the network path and waiting longer won't help. Read timeouts depend on what the operation does. A simple cache lookup might be 50ms. A report generation call might be 10 seconds. The point is that you must consciously decide the value, not inherit a default.
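As one concrete illustration, Java's built-in HttpClient (Java 11+) exposes both knobs; the endpoint and values here are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TimeoutsExample {
    public static void main(String[] args) throws Exception {
        // Connect timeout: how long to wait for the TCP connection to be established.
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();

        // Request timeout: how long to wait for the response; an HttpTimeoutException
        // is thrown if it doesn't arrive within this duration.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://catalog.internal/api/products/42"))
                .timeout(Duration.ofMillis(500))
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```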
How to Pick the Right Timeout Value
A common mistake is setting timeouts based on average latency. Don't. Set them based on percentile latency — specifically, look at the p99 (the 99th percentile, meaning 99% of calls are faster than this value). Then add a small buffer. If a dependency's p99 is 120 milliseconds, for example, a timeout around 150 to 200 milliseconds gives legitimately slow calls room to finish while still cutting off pathological ones.
This means you need to know your dependencies' latency distributions. If you don't have this data, getting it is the first step. Instrument every outbound call. Build a dashboard. Once you have the data, the timeout value becomes a reasoned choice rather than a guess.
The Timeout Chain Problem
Here is a subtle problem that trips up many engineers. Your system is a call chain: a user's browser calls your frontend, which calls an API service, which calls a backend service, which queries a database.
Each layer should time out faster than its caller. If a user's browser will give up after 10 seconds, your frontend should give up after 8 seconds, your API service after 6 seconds, your backend service after 4 seconds. This way, by the time the user gives up, you've already freed the downstream resources. You never waste work on requests whose results will never be used.
This pattern is called deadline propagation. gRPC has built-in support for it — the original caller sets a deadline, and it gets forwarded through every hop in the chain. Each service checks the remaining time budget before doing work. If the budget is already exhausted, it skips the work and returns an error immediately. Implementing this yourself is more effort, but the idea is simple: pass a "time remaining" value through your request context.
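gRPC gives you this for free. If you have to do it by hand, a minimal sketch might look like the following, with a hypothetical X-Deadline-Millis header carrying the budget between services:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public final class Deadline {
    private final Instant expiresAt;

    private Deadline(Instant expiresAt) { this.expiresAt = expiresAt; }

    // Created once at the edge from the caller's declared budget,
    // then forwarded (e.g. as a header) on every downstream call.
    public static Deadline withinMillis(long budgetMillis) {
        return new Deadline(Instant.now().plusMillis(budgetMillis));
    }

    public long remainingMillis() {
        return Instant.now().until(expiresAt, ChronoUnit.MILLIS);
    }

    public boolean expired() { return remainingMillis() <= 0; }
}

// Sketch of one hop in the chain (helper names are hypothetical):
//   Deadline deadline = Deadline.withinMillis(budgetFromHeader);   // "X-Deadline-Millis"
//   if (deadline.expired()) return error504();                     // nobody wants this result anymore
//   long timeout = Math.min(deadline.remainingMillis(), 500);      // never exceed the remaining budget
//   callDownstream(timeout);
```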
The Timeout Principle
Every network call must have a timeout. The timeout value must be based on measured p99 latency of the dependency, not on a guess. Your timeout must be shorter than your caller's timeout. These three rules, followed consistently, prevent most slow-dependency cascades.
Circuit Breakers: Failing Fast on Purpose
Even with timeouts in place, a slow dependency is expensive. If your timeout is 200ms and the dependency is fully down, every call still waits 200ms before failing. At high traffic volumes, those 200ms stalls consume significant thread capacity and add noticeable latency.
A circuit breaker solves this by tracking the health of a dependency and, when things look bad, stopping calls to it entirely — returning an error immediately without even attempting the network call. This frees your threads instantly and lets the failing service stop receiving traffic, which gives it a chance to recover.
The name comes from electrical engineering. A circuit breaker in your home cuts power when it detects a dangerous condition — rather than letting current flow and start a fire. Once you've fixed the problem, you flip the breaker back on.
The Three States
Closed: Normal Operation
In the closed state, requests flow through normally. The circuit breaker watches the results. It counts successes and failures within a rolling time window. As long as the failure rate stays below a threshold — say, 10% — the breaker stays closed.
Open: Fail Fast
When failures cross the threshold, the breaker opens. Now, instead of making a network call, every request gets an immediate error — usually something like CircuitBreakerOpenException. No threads are held. No network packets are sent. The error comes back in microseconds.
This is good for two reasons. First, your service stays fast and responsive on requests that don't depend on the broken dependency. Second, the broken service stops receiving a flood of traffic from you — which may be exactly what it needs to recover. If your service is hammering a database that's already struggling, stopping that traffic reduces load and may allow the database to stabilize.
Half-Open: Testing Recovery
The breaker can't stay open forever. Services recover. After a configured wait period (say, 60 seconds), the breaker moves to half-open. In this state, it lets a small number of probe requests through. If those requests succeed, the service has recovered — the breaker closes and normal traffic resumes. If they fail, the service is still down — the breaker opens again and the wait period restarts.
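A deliberately simplified sketch of the three-state machine. It trips on consecutive failures rather than a rolling failure rate and ignores thread safety, both of which production libraries such as Resilience4j handle for you:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final Duration openDuration;  // how long to stay open before probing

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt = Instant.MIN;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public <T> T call(Supplier<T> operation) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;   // wait period elapsed: allow a probe through
            } else {
                throw new IllegalStateException("circuit breaker is open"); // fail fast, no network call
            }
        }
        try {
            T result = operation.get();
            consecutiveFailures = 0;
            state = State.CLOSED;          // success (or successful probe): resume normal traffic
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;        // back to failing fast
                openedAt = Instant.now();
            }
            throw e;
        }
    }
}
```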
Be deliberate about what triggers your circuit breaker. Timeouts and 5xx errors should count. But a 404 (resource not found) or a 400 (bad request) is usually a client-side error — tripping the breaker on these would mean your own bad requests mark a healthy service as down. Define your failure predicate carefully.
Circuit Breakers Are Per-Dependency
A circuit breaker is scoped to a specific downstream dependency. You have one breaker for the payment service, one for the user service, one for the inventory service. If you had a single global breaker, any misbehaving dependency would stop all outbound calls from your service — which is far too broad.
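In code, that usually means a small registry keyed by dependency name; a sketch building on the breaker above, with illustrative defaults:

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CircuitBreakerRegistry {
    private final Map<String, CircuitBreaker> breakers = new ConcurrentHashMap<>();

    // One breaker per downstream dependency, created lazily on first use.
    public CircuitBreaker forDependency(String name) {
        return breakers.computeIfAbsent(
                name, n -> new CircuitBreaker(5, Duration.ofSeconds(60)));
    }
}

// Usage (sketch):
//   registry.forDependency("payment-service").call(() -> paymentClient.charge(order));
```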
What to Return When the Breaker Is Open
When a circuit breaker trips, you have a few options:
| Strategy | When to Use | Example |
|---|---|---|
| Return cached data | Data is available and tolerable if slightly stale | Product catalog, user preferences |
| Return a default | A safe fallback exists | "Recommendations" returns an empty list, not an error |
| Fail the request | The dependency is critical and no fallback makes sense | Payment service down — tell user to try again |
| Degrade the feature | Feature is optional | Hide the "recently viewed" section entirely |
The worst option is to propagate the error silently and return data that looks correct but isn't. Always make degradation visible — either to the user with a clear message, or to the calling code with an unambiguous error.
The Circuit Breaker Principle
A circuit breaker doesn't make your system more reliable by fixing failures. It makes your system more reliable by responding to failures faster. The goal is to reduce the impact window of a failing dependency from "minutes until threads are exhausted" to "milliseconds until the breaker trips."
Retry Strategies: Don't Make It Worse
When a call fails, the instinct is to try again. This is often the right call. Many failures are transient — a brief network glitch, a server restart, a momentary load spike. A single retry resolves them cleanly.
But retries can also destroy a recovering service. Done wrong, they turn a moment of high load into an extended outage. This section is about doing them right.
Why Immediate Retries Are Dangerous
Imagine a service handles 1000 requests per second. Under sudden load, it starts failing 30% of requests. Every client retries immediately. Those retried requests become an additional 300 requests per second. The service now handles 1300 requests per second — more load than before, on an already struggling service. More requests fail. More retries. The load climbs further.
This is a retry storm. The service never gets a moment to recover, because every time it tries to shed load, the retried requests replenish it. The outage extends from seconds to minutes.
Exponential Backoff
The fix is to wait before retrying, and to wait longer each time:
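A sketch of the delay schedule, with an illustrative 100ms base and a cap so the wait never grows without bound:

```java
public class Backoff {
    // attempt 0 -> 100ms, 1 -> 200ms, 2 -> 400ms, 3 -> 800ms ... capped at maxMillis.
    static long exponentialDelayMillis(int attempt, long baseMillis, long maxMillis) {
        return Math.min(baseMillis * (1L << attempt), maxMillis);
    }
}
```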
With exponential backoff, the retry load decreases over time rather than compounding. The service gets some breathing room.
Jitter: The Essential Ingredient
Exponential backoff alone has a hidden problem. If 500 clients all start retrying at the same moment (because they all hit the same outage), they all back off for 100ms — and then all retry at the same time again. The load arrives in synchronized waves.
Jitter adds a random component to the delay, spreading retries across time:
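One common variant is full jitter: pick a uniformly random delay between zero and the exponential ceiling. A sketch building on the helper above:

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredBackoff {
    // Compute the exponential ceiling as before, then pick a random delay below it,
    // so clients that failed together do not all retry at the same instant.
    static long fullJitterDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long ceiling = Math.min(baseMillis * (1L << attempt), maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```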
AWS published a study showing that adding full jitter to backoff reduced retry collisions by an order of magnitude compared to exponential backoff without jitter. The randomness is doing real work.
Only Retry What Is Safe to Retry
This is the most important rule, and the most frequently violated. A retry is only safe if the operation is idempotent — meaning running it twice produces the same result as running it once.
| Operation | Safe to Retry? | Why |
|---|---|---|
| GET /users/123 | Yes | Read-only. Same result every time. |
| PUT /users/123 (full replace) | Yes | Setting to a known value. Running twice is fine. |
| POST /payments | No (without idempotency key) | Each call creates a new payment. Retry = double charge. |
| POST /payments with idempotency-key header | Yes | Server deduplicates by key. Safe to retry. |
| INSERT INTO orders ... | No (without deduplication) | May insert a duplicate row. |
| DELETE /sessions/abc | Yes | Deleting something already deleted is a no-op. |
For non-idempotent operations that you need to retry, the answer is to make them idempotent by design — usually through an idempotency key (a unique ID the client generates and the server uses to deduplicate). Chapter 18 covers this in full detail.
What Not to Retry
Not all errors are worth retrying. Before retrying, check what kind of error you got: timeouts, connection failures, and 5xx responses are usually transient and worth another attempt, while 4xx responses mean the request itself is wrong and will fail the same way every time.
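A sketch of a retry predicate along those lines; the status-code boundaries are common conventions, not universal rules:

```java
import java.net.ConnectException;
import java.net.http.HttpTimeoutException;

public class RetryPolicy {
    // Retry only failures that are plausibly transient and server-side.
    static boolean isRetryable(Throwable error, Integer httpStatus) {
        if (error instanceof ConnectException || error instanceof HttpTimeoutException) {
            return true;                     // network glitch or timeout: likely transient
        }
        if (httpStatus == null) return false;
        if (httpStatus == 429) return true;  // throttled: retry, but only after backing off
        if (httpStatus >= 500) return true;  // server-side failure: may clear up on its own
        return false;                        // 4xx and friends: the request itself is wrong
    }
}
```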
Also: if you're already past your deadline — if the caller has already timed out or the user has given up — stop retrying. Check the remaining time budget before each retry attempt. Continuing to retry work whose result will never be used is pure waste.
The Retry Amplification Problem
Here is a problem that becomes very real in deep service call chains. Say each layer retries up to 3 times. You have 3 layers. One user request at the top can multiply into 3 × 3 × 3 = 27 calls against whatever sits at the bottom of the chain, because each layer's attempts multiply through every hop below it.
The solution has two parts. First, cap retries aggressively — 2 or 3 is usually enough. Second, prefer retry budgets at the service level over per-request retry counts. A retry budget says "as a service, we will retry at most 10% of outbound calls per second." This caps total retry volume regardless of how many individual requests are failing. It's a more robust form of throttling.
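A sketch of a service-wide retry budget. The counters here are cumulative for brevity; a real implementation would use a sliding window:

```java
import java.util.concurrent.atomic.AtomicLong;

public class RetryBudget {
    private final double maxRetryRatio;            // e.g. 0.10 = retries <= 10% of calls
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    public RetryBudget(double maxRetryRatio) { this.maxRetryRatio = maxRetryRatio; }

    // Record every outbound call, retry or not.
    public void recordCall() { calls.incrementAndGet(); }

    // Ask permission before each retry; denied when the budget is spent.
    public boolean tryAcquireRetry() {
        long c = calls.get();
        long r = retries.get();
        if (c == 0 || (double) (r + 1) / c > maxRetryRatio) {
            return false;                          // budget exhausted: fail instead of retrying
        }
        retries.incrementAndGet();
        return true;
    }
}
```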
The Retry Principle
Retries are not free. Every retry adds load to a system that is already struggling. Always use exponential backoff with jitter. Always check idempotency before retrying a write. Always cap the number of retries. And always stop retrying when the caller's deadline has passed.
Putting It Together: A Resilient Call Stack
These four tools — bulkheads, timeouts, circuit breakers, and retries — are not alternatives to each other. They work together, each covering a different failure mode:
| Tool | Protects Against | Key Setting |
|---|---|---|
| Bulkhead | Resource exhaustion spreading across dependencies | Thread/connection pool size per dependency |
| Timeout | Slow calls blocking threads indefinitely | Based on p99 latency; shorter than caller's timeout |
| Circuit Breaker | Repeated calls to a known-bad dependency wasting resources | Failure rate threshold, wait duration before half-open |
| Retry w/ Backoff | Transient failures that would resolve on their own | Max retries, base delay, jitter, idempotency check |
A reasonable default configuration for any outbound call:
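The values below are illustrative starting points only, not recommendations for your specific dependency:

```java
public class OutboundCallDefaults {
    // Bulkhead: threads (or connections) reserved for this one dependency.
    static final int BULKHEAD_POOL_SIZE = 20;

    // Timeouts: connect short and fixed; read derived from the dependency's measured p99.
    static final long CONNECT_TIMEOUT_MILLIS = 2_000;
    static final long READ_TIMEOUT_MILLIS = 500;

    // Circuit breaker: open at a 10% failure rate, probe again after 30 seconds.
    static final double BREAKER_FAILURE_RATE_THRESHOLD = 0.10;
    static final long BREAKER_OPEN_WAIT_MILLIS = 30_000;

    // Retries: idempotent operations only, exponential backoff with full jitter.
    static final int MAX_RETRIES = 2;
    static final long RETRY_BASE_DELAY_MILLIS = 100;
}
```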
Instrument each of these independently. You want to know: how often is the bulkhead pool at capacity? How often are timeouts firing? How often is the circuit breaker open, and for how long? How often are retries succeeding vs. exhausted? Each of these metrics tells you something different about how your dependency is behaving.
What These Tools Don't Do
It's worth being honest about the limits of this chapter's tools.
They protect you from a failing downstream. They don't fix the downstream service. If the payment service is down, circuit breakers and bulkheads mean your service stays healthy — but the payment service is still down and users still can't pay. You've contained the blast radius, but the fire is still burning.
They also don't help with data consistency. If you made a partial write before the failure, you may have left data in an inconsistent state. That's a different problem — one addressed by idempotency (Chapter 18) and the saga pattern (Chapter 13).
And they require ongoing maintenance. The "right" timeout value today may be wrong next year as your dependency's latency profile changes. Circuit breaker thresholds that work at current traffic may need tuning at 10x traffic. Build dashboards. Review settings periodically.