Part III — Fault Tolerance
Chapter 11

Designing for Partial Failure

In a distributed system, things will break. Not everything at once — just one service, or one network link, or one database connection. The question is never "will something fail?" It is "when something fails, how much of the system fails with it?" This chapter is about keeping the answer small.


Key Learnings at a Glance

Short on time? These are the ideas you shouldn't leave this chapter without.

A single slow dependency — not a dead one — is the most dangerous failure mode. Slow consumes resources. Dead fails fast.

Bulkheads isolate failures to their origin. If Service A and Service B share a thread pool, a problem in A can starve B. Separate the pools.

Every network call must have a timeout. No exceptions. A missing timeout is a thread that waits forever and a cascade waiting to happen.

Your timeout should be shorter than your caller's timeout. If caller waits 5 seconds, you must time out in under 5 seconds — otherwise the caller gives up and your request is wasted work.

Circuit breakers fail fast when a service is down. Instead of waiting for a timeout, they return an error immediately — freeing threads and giving the downstream service breathing room to recover.

Immediate retries hurt recovering services. Use exponential backoff with jitter. Without jitter, every client retries at the same time and creates a new spike.

Only retry idempotent operations. Retrying a payment or a database insert that already succeeded creates duplicates — which is often worse than the original failure.

Retry amplification is real. If each layer retries 3 times and your call chain is 3 layers deep, one user request can become 27 backend requests. Plan for this.

The Problem with Partial Failure

When a machine crashes completely, at least the failure is obvious. The process is gone. Other services stop getting responses. Their circuit breakers trip. Alerts fire. Someone looks at a dashboard. Life is simple.

But what happens when a service doesn't crash — it just gets slow? Maybe its database is under load and queries that normally take 5 milliseconds now take 5 seconds. Maybe there's a network partition that drops one in ten packets. Maybe a downstream API is returning responses, but those responses are malformed, and parsing them consumes 100% of a CPU core.

These are partial failures. They are harder to detect, harder to diagnose, and — most importantly — far more likely to spread. A completely dead service fails fast. A slow service holds resources.

How a slow dependency cascades
```
User → Frontend → API Gateway → Service A → Service B (slow, 8s response)
```

Service A has 200 threads. Each request to Service B holds a thread for 8 seconds. At 25 requests/second, all 200 threads fill up in 8 seconds. Service A stops responding to everything — even requests that don't touch Service B. The frontend starts queuing requests. The API gateway starts queuing requests. The entire system is now down because one downstream service got slow.

The fundamental insight is this: failures don't stay where they start. They travel upstream through resource exhaustion. Threads fill up. Connection pools drain. Memory grows. Request queues back up. What started as a single slow service eventually takes down everything that depends on it — and everything that depends on those things.

The goal of the techniques in this chapter is to contain failure to its origin. If Service B is slow, Service A should degrade gracefully — returning partial results, or a cached response, or an error it can explain to the caller — without losing its ability to handle unrelated requests.


Bulkheads: Containing the Blast

The name comes from ship design. A ship's hull is divided into sealed compartments. If the hull is breached in one section, water floods that compartment — not the whole ship. The ship stays afloat.

In distributed systems, a bulkhead is any mechanism that prevents a failure in one part of a system from consuming the shared resources of another part.

The Problem: Shared Thread Pools

The most common place where bulkheads are missing is in shared thread pools. Here is a very typical setup that many services have without realizing the risk:

Without Bulkheads — Dangerous
```python
# A single shared thread pool handles ALL outbound calls
thread_pool = ThreadPool(size=200)

# Service A calls three different downstream services
def handle_request(req):
    # All three calls share the same 200 threads
    user = thread_pool.submit(call_user_service, req.user_id)
    payment = thread_pool.submit(call_payment_service, req.order_id)
    catalog = thread_pool.submit(call_catalog_service, req.item_id)
    return merge(user, payment, catalog)
```

If the payment service becomes slow, payment calls start holding threads. As more requests come in, more threads are consumed by waiting payment calls. Within seconds, all 200 threads are blocked waiting for payment. Now user service calls and catalog service calls — which are working perfectly fine — also start failing, because there are no threads left to run them.

With Bulkheads — Isolated
```python
# Separate thread pool per downstream dependency
user_pool = ThreadPool(size=50)
payment_pool = ThreadPool(size=50)
catalog_pool = ThreadPool(size=50)

def handle_request(req):
    user = user_pool.submit(call_user_service, req.user_id)
    payment = payment_pool.submit(call_payment_service, req.order_id)
    catalog = catalog_pool.submit(call_catalog_service, req.item_id)
    return merge(user, payment, catalog)
```

Now if payment becomes slow, it exhausts its own pool of 50 threads. User calls and catalog calls continue running normally. Your service degrades in one dimension — payment is broken — but it does not fall over entirely.

Connection Pool Bulkheads

The same logic applies to database connection pools and HTTP client connection pools. If you have one connection pool shared across all database operations, a slow query that holds connections will starve fast queries of connections. Split your pools by criticality or by operation type:

Separate connection pools by criticality
```python
# Critical path — small pool, tight timeout, fails fast
critical_pool = ConnectionPool(max_size=20, timeout_ms=100)

# Background / analytics — generous timeout, tolerant of latency
background_pool = ConnectionPool(max_size=10, timeout_ms=5000)
```

Bulkheads at the Service Level

The ultimate bulkhead is a separate process. Microservices — when done for the right reasons — are a form of bulkhead. If your checkout flow and your recommendation engine run in separate services, a memory leak in recommendations doesn't take down checkout.

The trade-off is operational overhead. Every service boundary you add is a service you have to deploy, monitor, and operate. Don't add service boundaries just for isolation. Add them where the blast radius of a failure justifies the operational cost.

Design Insight

Bulkheads are most valuable for dependencies that are slow and unpredictable — third-party APIs, shared internal services with variable SLAs, or any dependency that does heavy computation. For dependencies that either work or fail fast (like a local cache), the bulkhead overhead often isn't worth it.


Timeouts: The Most Common Mistake

Ask an engineer if their service has timeouts on outbound calls. They will almost certainly say yes. Then ask them what the timeout values are, and why. Most of the time, the answer is some version of "I think it's 30 seconds, it came from a config file someone wrote years ago."

Timeouts are everywhere and almost universally misconfigured. Too long, or missing entirely.

Why Missing Timeouts Are a Crisis Waiting to Happen

Without a timeout, a thread waiting on a slow network call waits forever. Not for a long time. Forever. Or until the OS-level TCP keepalive kicks in — which defaults to hours on most Linux systems. During that time, the thread is occupied. It cannot serve other requests. If enough such threads pile up, your service grinds to a halt.

Common Mistake

Many HTTP client libraries have no default timeout, or a default of several minutes. If you don't explicitly set one, you have no timeout. Always check the documentation for the specific client library you're using.

Two Timeouts You Need to Set

Most network calls have two distinct phases, and you need to set a timeout for each:

| Timeout Type | What It Covers | Typical Value | What Happens If Too Long |
|---|---|---|---|
| Connect Timeout | Time to establish the TCP connection | 1–3 seconds | Threads stuck during a network partition or when a host is unreachable |
| Read Timeout | Time to receive the first byte (or full response) after connecting | Varies by operation | Threads stuck waiting on a slow or silent server |

Connect timeouts can usually be short — a few seconds. If a TCP connection isn't established in 3 seconds, something is wrong with the network path and waiting longer won't help. Read timeouts depend on what the operation does. A simple cache lookup might be 50ms. A report generation call might be 10 seconds. The point is that you must consciously decide the value, not inherit a default.
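At the socket level, the two phases really are two separate settings. Here is a minimal stdlib sketch — the function name and the specific values are illustrative, not recommendations:

```python
import socket

def open_with_timeouts(host, port, connect_timeout_s=3.0, read_timeout_s=0.25):
    # Phase 1: connect timeout — bounds the TCP handshake
    sock = socket.create_connection((host, port), timeout=connect_timeout_s)
    # Phase 2: read timeout — bounds each recv() once connected
    sock.settimeout(read_timeout_s)
    return sock
```

Most HTTP client libraries expose the same pair under different names; Python's `requests`, for example, accepts a `(connect, read)` tuple for its `timeout` argument. Whatever the spelling, set both explicitly.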

How to Pick the Right Timeout Value

A common mistake is setting timeouts based on average latency. Don't. Set them based on percentile latency — specifically, look at the p99 (the 99th percentile, meaning 99% of calls are faster than this value). Then add a small buffer.

Timeout = p99 latency + small buffer
```python
# Example: inventory service latency data
p50_latency_ms = 12
p95_latency_ms = 45
p99_latency_ms = 120

# Bad: timeout based on p50 — kills half your calls during any load spike
timeout_ms = 15       # too aggressive

# Bad: 30 seconds "to be safe" — threads stack up for 30s when the service goes slow
timeout_ms = 30_000

# Good: p99 + buffer. Catches real outliers, fails fast on real problems.
timeout_ms = 200      # 120ms p99 + 80ms buffer
```

This means you need to know your dependencies' latency distributions. If you don't have this data, getting it is the first step. Instrument every outbound call. Build a dashboard. Once you have the data, the timeout value becomes a reasoned choice rather than a guess.

The Timeout Chain Problem

Here is a subtle problem that trips up many engineers. Your system is a call chain:

Timeout chain — the wrong way
```
User (browser, 10s patience)
  → Frontend (timeout: 10s on API call)
    → API service (timeout: 10s on backend call)
      → Backend service (timeout: 10s on database)
        → Database goes slow...

t=0   User makes the request
t=10  Database call times out in the backend service
t=10  Backend tries to respond, but the API service timed out at t=10 too
t=10  Frontend times out simultaneously
```

All three layers waste their full 10 seconds of resources on a request whose result will never be used — the user gave up at t=10.

Each layer should time out faster than its caller. If a user's browser will give up after 10 seconds, your frontend should give up after 8 seconds, your API service after 6 seconds, your backend service after 4 seconds. This way, by the time the user gives up, you've already freed the downstream resources. You never waste work on requests whose results will never be used.

This pattern is called deadline propagation. gRPC has built-in support for it — the original caller sets a deadline, and it gets forwarded through every hop in the chain. Each service checks the remaining time budget before doing work. If the budget is already exhausted, it skips the work and returns an error immediately. Implementing this yourself is more effort, but the idea is simple: pass a "time remaining" value through your request context.
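A hand-rolled version can be sketched in a few lines. The names (`DeadlineExceeded`, `call_downstream`) and the 50ms reserve are illustrative — the only real ideas are "derive one deadline at the edge" and "check the remaining budget before every outbound call":

```python
import time

class DeadlineExceeded(Exception):
    pass

def call_downstream(fn, deadline_ts, reserve_s=0.05):
    """Skip the call entirely if the budget is gone; otherwise cap the
    outbound timeout at the remaining budget minus a small reserve
    for our own response handling."""
    budget = deadline_ts - time.monotonic()
    if budget <= reserve_s:
        raise DeadlineExceeded("no time left to do useful work")
    return fn(timeout_s=budget - reserve_s)

# At the edge of the system, derive one deadline and thread it through:
#   deadline = time.monotonic() + 2.0   # user-facing budget: 2 seconds
#   result = call_downstream(backend_call, deadline)
```

Each hop subtracts the time it spends, so the timeout naturally shrinks down the chain — the same effect as hand-tuning "10s, 8s, 6s, 4s" but without the manual bookkeeping.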

The Timeout Principle

Every network call must have a timeout. The timeout value must be based on measured p99 latency of the dependency, not on a guess. Your timeout must be shorter than your caller's timeout. These three rules, followed consistently, prevent most slow-dependency cascades.


Circuit Breakers: Failing Fast on Purpose

Even with timeouts in place, a slow dependency is expensive. If your timeout is 200ms and the dependency is fully down, every call still waits 200ms before failing. At high traffic volumes, those 200ms stalls consume significant thread capacity and add noticeable latency.

A circuit breaker solves this by tracking the health of a dependency and, when things look bad, stopping calls to it entirely — returning an error immediately without even attempting the network call. This frees your threads instantly and lets the failing service stop receiving traffic, which gives it a chance to recover.

The name comes from electrical engineering. A circuit breaker in your home cuts power when it detects a dangerous condition — rather than letting current flow and start a fire. Once you've fixed the problem, you flip the breaker back on.

The Three States

Circuit Breaker State Machine
CLOSED ──→ OPEN when: failure rate exceeds threshold (e.g. 50% failures in a 10-second window)
OPEN ──→ HALF-OPEN when: configured wait period expires (e.g. 60 seconds — giving the service time to recover)
HALF-OPEN ──→ CLOSED when: probe requests succeed — the service has recovered, traffic resumes
HALF-OPEN ──→ OPEN when: probe requests fail — the service is still broken, wait again

Closed: Normal Operation

In the closed state, requests flow through normally. The circuit breaker watches the results. It counts successes and failures within a rolling time window. As long as the failure rate stays below a threshold — say, 10% — the breaker stays closed.

Open: Fail Fast

When failures cross the threshold, the breaker opens. Now, instead of making a network call, every request gets an immediate error — usually something like CircuitBreakerOpenException. No threads are held. No network packets are sent. The error comes back in microseconds.

This is good for two reasons. First, your service stays fast and responsive on requests that don't depend on the broken dependency. Second, the broken service stops receiving a flood of traffic from you — which may be exactly what it needs to recover. If your service is hammering a database that's already struggling, stopping that traffic reduces load and may allow the database to stabilize.

Half-Open: Testing Recovery

The breaker can't stay open forever. Services recover. After a configured wait period (say, 60 seconds), the breaker moves to half-open. In this state, it lets a small number of probe requests through. If those requests succeed, the service has recovered — the breaker closes and normal traffic resumes. If they fail, the service is still down — the breaker opens again and the wait period restarts.
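The three states fit in a few dozen lines. This is a simplified sketch — it counts consecutive failures rather than tracking a rolling failure-rate window, and the class and method names are illustrative:

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitBreakerOpen(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, wait_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_s = wait_s
        self.clock = clock          # injectable for testing
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == OPEN:
            if self.clock() - self.opened_at >= self.wait_s:
                self.state = HALF_OPEN       # wait period over: let a probe through
            else:
                raise CircuitBreakerOpen()   # fail fast, no network call
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = OPEN                # probe failed, or threshold crossed
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = CLOSED                  # probe (or normal call) succeeded
        self.failures = 0
```

Production libraries add rolling windows, minimum call counts, and limits on concurrent probes, but the state transitions are exactly these.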

What Counts as a Failure?

Be deliberate about what triggers your circuit breaker. Timeouts and 5xx errors should count. But a 404 (resource not found) or a 400 (bad request) is usually a client-side error — tripping the breaker on these would mean your own bad requests mark a healthy service as down. Define your failure predicate carefully.

Circuit Breakers Are Per-Dependency

A circuit breaker is scoped to a specific downstream dependency. You have one breaker for the payment service, one for the user service, one for the inventory service. If you had a single global breaker, any misbehaving dependency would stop all outbound calls from your service — which is far too broad.

What to Return When the Breaker Is Open

When a circuit breaker trips, you have a few options:

| Strategy | When to Use | Example |
|---|---|---|
| Return cached data | Data is available and tolerable if slightly stale | Product catalog, user preferences |
| Return a default | A safe fallback exists | "Recommendations" returns an empty list, not an error |
| Fail the request | The dependency is critical and no fallback makes sense | Payment service down — tell the user to try again |
| Degrade the feature | The feature is optional | Hide the "recently viewed" section entirely |

The worst option is to propagate the error silently and return data that looks correct but isn't. Always make degradation visible — either to the user with a clear message, or to the calling code with an unambiguous error.

The Circuit Breaker Principle

A circuit breaker doesn't make your system more reliable by fixing failures. It makes your system more reliable by responding to failures faster. The goal is to reduce the impact window of a failing dependency from "minutes until threads are exhausted" to "milliseconds until the breaker trips."


Retry Strategies: Don't Make It Worse

When a call fails, the instinct is to try again. This is often the right call. Many failures are transient — a brief network glitch, a server restart, a momentary load spike. A single retry resolves them cleanly.

But retries can also destroy a recovering service. Done wrong, they turn a moment of high load into an extended outage. This section is about doing them right.

Why Immediate Retries Are Dangerous

Imagine a service handles 1000 requests per second. Under sudden load, it starts failing 30% of requests. Every client retries immediately. Those retried requests become an additional 300 requests per second. The service now handles 1300 requests per second — more load than before, on an already struggling service. More requests fail. More retries. The load climbs further.

This is a retry storm. The service never gets a moment to recover, because every time it tries to shed load, the retried requests replenish it. The outage extends from seconds to minutes.

Exponential Backoff

The fix is to wait before retrying, and to wait longer each time:

Exponential backoff
```python
def call_with_retry(fn, max_retries=4, base_delay_ms=100):
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise                  # out of retries, give up
            delay = base_delay_ms * (2 ** attempt)
            # attempt 0: wait 100ms
            # attempt 1: wait 200ms
            # attempt 2: wait 400ms
            # (the final attempt raises instead of waiting)
            sleep(delay)
```

With exponential backoff, the retry load decreases over time rather than compounding. The service gets some breathing room.

Jitter: The Essential Ingredient

Exponential backoff alone has a hidden problem. If 500 clients all start retrying at the same moment (because they all hit the same outage), they all back off for 100ms — and then all retry at the same time again. The load arrives in synchronized waves.

Jitter adds a random component to the delay, spreading retries across time:

Exponential backoff with jitter
```python
import random

def backoff_with_jitter(attempt, base_ms=100, max_ms=10000):
    exponential = base_ms * (2 ** attempt)
    capped = min(exponential, max_ms)  # cap so the delay doesn't grow forever
    return random.uniform(0, capped)   # random delay in [0, cap] — "full jitter"

# Result: clients spread their retries across the window
# instead of spiking all at once
```

AWS published a study showing that adding full jitter to backoff reduced retry collisions by an order of magnitude compared to exponential backoff without jitter. The randomness is doing real work.

Only Retry What Is Safe to Retry

This is the most important rule, and the most frequently violated. A retry is only safe if the operation is idempotent — meaning running it twice produces the same result as running it once.

| Operation | Safe to Retry? | Why |
|---|---|---|
| GET /users/123 | Yes | Read-only. Same result every time. |
| PUT /users/123 (full replace) | Yes | Sets a known value. Running it twice is fine. |
| POST /payments | No (without an idempotency key) | Each call creates a new payment. Retry = double charge. |
| POST /payments with an idempotency-key header | Yes | The server deduplicates by key. Safe to retry. |
| INSERT INTO orders ... | No (without deduplication) | May insert a duplicate row. |
| DELETE /sessions/abc | Yes | Deleting something already deleted is a no-op. |

For non-idempotent operations that you need to retry, the answer is to make them idempotent by design — usually through an idempotency key (a unique ID the client generates and the server uses to deduplicate). Chapter 18 covers this in full detail.
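The server-side half of the pattern is just deduplication by key. A sketch — the in-memory dict stands in for the durable store a real system would need, and the names are illustrative:

```python
import uuid

# idempotency_key -> stored result (a real system uses durable storage,
# typically with an expiry)
_processed = {}

def create_payment(idempotency_key, amount_cents):
    if idempotency_key in _processed:
        return _processed[idempotency_key]   # duplicate: replay, don't re-execute
    payment_id = str(uuid.uuid4())           # the actual side effect happens once
    result = {"payment_id": payment_id, "amount_cents": amount_cents}
    _processed[idempotency_key] = result
    return result

# Client side: generate the key once, reuse it across ALL retries.
#   key = str(uuid.uuid4())
#   create_payment(key, 1999)   # original attempt
#   create_payment(key, 1999)   # retry returns the same payment — no double charge
```

The crucial client-side discipline is that the key is generated per logical operation, not per attempt — a fresh key on each retry defeats the whole mechanism.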

What Not to Retry

Not all errors are worth retrying. Before retrying, check what kind of error you got:

Retry only on appropriate errors
```python
def should_retry(error):
    if isinstance(error, TimeoutError):
        return True   # transient — retry
    if isinstance(error, ConnectionError):
        return True   # network blip — retry
    if error.status_code == 503:   # Service Unavailable
        return True   # server asked us to retry
    if error.status_code == 429:   # Too Many Requests
        return True   # rate limited — back off and retry
    if error.status_code == 400:   # Bad Request
        return False  # your request is malformed — retrying won't help
    if error.status_code == 401:   # Unauthorized
        return False  # credentials are wrong — retrying won't help
    if error.status_code == 404:   # Not Found
        return False  # resource doesn't exist — retrying won't help
    return False      # default: don't retry unknown errors
```

Also: if you're already past your deadline — if the caller has already timed out or the user has given up — stop retrying. Check the remaining time budget before each retry attempt. Continuing to retry work whose result will never be used is pure waste.

The Retry Amplification Problem

Here is a problem that becomes very real in deep service call chains. Say each layer retries up to 3 times. You have 3 layers. One user request at the top can generate:

Retry multiplication through layers
```
User request (1)
  → Frontend retries 3× against the API service  =  3 requests
    → API service retries 3× against the backend =  9 requests
      → Backend retries 3× against the database  = 27 database calls
```

One user request generates 27 database calls during an outage. With 1000 users retrying, that is 27,000 database calls per second — guaranteed to extend the outage.

The solution has two parts. First, cap retries aggressively — 2 or 3 is usually enough. Second, prefer retry budgets at the service level over per-request retry counts. A retry budget says "as a service, we will retry at most 10% of outbound calls per second." This caps total retry volume regardless of how many individual requests are failing. It's a more robust form of throttle.


The Retry Principle

Retries are not free. Every retry adds load to a system that is already struggling. Always use exponential backoff with jitter. Always check idempotency before retrying a write. Always cap the number of retries. And always stop retrying when the caller's deadline has passed.


Putting It Together: A Resilient Call Stack

These four tools — bulkheads, timeouts, circuit breakers, and retries — are not alternatives to each other. They work together, each covering a different failure mode:

| Tool | Protects Against | Key Setting |
|---|---|---|
| Bulkhead | Resource exhaustion spreading across dependencies | Thread/connection pool size per dependency |
| Timeout | Slow calls blocking threads indefinitely | Based on p99 latency; shorter than caller's timeout |
| Circuit Breaker | Repeated calls to a known-bad dependency wasting resources | Failure rate threshold, wait duration before half-open |
| Retry w/ Backoff | Transient failures that would resolve on their own | Max retries, base delay, jitter, idempotency check |

A reasonable default configuration for any outbound call:

Sane defaults for a resilient outbound call
```python
# 1. Bulkhead: dedicated thread pool, sized to traffic
pool = ThreadPool(size=30)

# 2. Timeout: based on measured p99 + buffer
timeout_ms = 250

# 3. Circuit breaker: trip on >40% failures in 10s, wait 30s before probing
breaker = CircuitBreaker(
    failure_threshold=0.4,
    window_seconds=10,
    wait_seconds=30,
)

# 4. Retries: only on transient errors, exponential backoff + jitter
#    max_attempts=3 means one initial try plus at most two retries
retry_policy = RetryPolicy(
    max_attempts=3,
    backoff_base_ms=50,
    jitter=True,
    retryable_on=[TimeoutError, ConnectionError, HTTP503],
)
```
Good Practice

Instrument each of these independently. You want to know: how often is the bulkhead pool at capacity? How often are timeouts firing? How often is the circuit breaker open, and for how long? How often are retries succeeding vs. exhausted? Each of these metrics tells you something different about how your dependency is behaving.


What These Tools Don't Do

It's worth being honest about the limits of this chapter's tools.

They protect you from a failing downstream. They don't fix the downstream service. If the payment service is down, circuit breakers and bulkheads mean your service stays healthy — but the payment service is still down and users still can't pay. You've contained the blast radius, but the fire is still burning.

They also don't help with data consistency. If you made a partial write before the failure, you may have left data in an inconsistent state. That's a different problem — one addressed by idempotency (Chapter 18) and the saga pattern (Chapter 13).

And they require ongoing maintenance. The "right" timeout value today may be wrong next year as your dependency's latency profile changes. Circuit breaker thresholds that work at current traffic may need tuning at 10x traffic. Build dashboards. Review settings periodically.

Chapter 11 — Summary

The Principle in One Sentence

Partial failure is worse than total failure — contain it to its origin using bulkheads, timeouts, circuit breakers, and disciplined retries so that one failing dependency cannot bring down the whole system.

The Most Common Mistake

Setting timeouts to 30 seconds "to be safe," or not setting them at all. A missing or oversized timeout is a thread that will block for minutes during an outage, exhausting your entire service's capacity one slow call at a time.

Three Questions for Your Next Design Review

  • Do all outbound calls have explicit timeouts, and are those values based on measured p99 latency — not guesses?
  • If the payment service (or any critical dependency) becomes slow — not dead, just slow — which thread pools, connection pools, or queues does it exhaust, and what falls over as a result?
  • When your circuit breaker trips on a dependency, what does the user experience? Does your service degrade gracefully, or does the open circuit breaker just produce a different kind of error?