Chapter 10  ·  Part III: Fault Tolerance

The Spectrum of Availability

Availability is not binary. A system is not simply "up" or "down" — it exists on a spectrum, and the difference between 99.9% and 99.99% is not a rounding error. It is the difference between nearly nine hours of downtime per year and fifty-two minutes. This chapter gives you the vocabulary, the math, and the organizational tools to reason about availability precisely.

What's in This Chapter

We start by translating availability percentages into real downtime numbers — because "five nines" sounds impressive until you know what it actually means. Then we build a precise vocabulary: SLIs, SLOs, and SLAs are not interchangeable, and confusing them causes real production incidents.

From there we cover error budgets — the single most powerful tool for managing the tension between shipping fast and keeping the lights on. Finally, we look at why availability is a property of a call chain, not a single service, and what that means when you're designing a system with ten microservices.

What Availability Actually Means

Let's start with the question that almost never gets asked clearly: available to do what?

A web server could be "up" in the sense that the process is running, but returning 500 errors to every request. A database could be accepting connections but taking 30 seconds to respond to each query. A payment service could be processing transactions but silently dropping 5% of them. Are any of these systems "available"?

The honest answer is: it depends on how you define it. And this is not a philosophical point. Systems get shipped with no clear definition of availability, and then engineers argue for months about whether a 3-hour incident "counted" as an outage or just "degraded performance." That argument is unwinnable because the terms were never defined upfront.

A useful working definition is: a system is available to a user if, from that user's perspective, it responds correctly within an acceptable time. Every word in that sentence is load-bearing:

From the user's perspective — not from the load balancer's perspective, not from the health check endpoint's perspective. If users are getting errors, the system is unavailable, even if your infra dashboards show everything as green.

Responds correctly — returning a 200 with the wrong data is not "available." Returning a cached stale response that's three days old might also not be "available," depending on what the user needs.

Within an acceptable time — a request that takes 60 seconds is functionally unavailable for most interactive use cases. Slow is the new down.

The Real Numbers Behind the Percentages

When someone says "we need four nines of availability," they often have not stopped to think about what that means in practice. Here is the math, laid out plainly.

Availability    Downtime / Year    Downtime / Month    Downtime / Week
99%             87.6 hours         7.3 hours           1.7 hours
99.5%           43.8 hours         3.6 hours           50 minutes
99.9%           8.76 hours         43.8 minutes        10 minutes
99.95%          4.38 hours         21.9 minutes        5 minutes
99.99%          52.6 minutes       4.4 minutes         1 minute
99.999%         5.26 minutes       26 seconds          6 seconds
99.9999%        31.5 seconds       2.6 seconds         0.6 seconds

Look at the jump from 99.9% to 99.99%. You go from 43 minutes of allowed downtime per month to 4 minutes. That's a 10x reduction in the amount of time you're allowed to be broken. In practice, that means you cannot do a slow rolling deployment that takes 5 minutes. You cannot run a database migration that causes 8 minutes of degraded performance. A single on-call engineer who takes 6 minutes to respond to a page and fix an issue has blown through your entire monthly budget.
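
If you want to reproduce these numbers rather than trust the table, the conversion is one line of arithmetic. A minimal sketch in Python (the targets in the loop are illustrative):

# Convert an availability target into allowed downtime per window.
# Assumes an average month of 365.25 / 12 days, matching the table above.

HOURS_PER_YEAR = 365.25 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12
HOURS_PER_WEEK = 7 * 24

def allowed_downtime_minutes(availability: float, window_hours: float) -> float:
    """Minutes of downtime permitted in a window at the given availability."""
    return (1.0 - availability) * window_hours * 60

for target in (0.999, 0.9995, 0.9999):
    print(
        f"{target:.4%}: "
        f"{allowed_downtime_minutes(target, HOURS_PER_YEAR):7.1f} min/year, "
        f"{allowed_downtime_minutes(target, HOURS_PER_MONTH):6.1f} min/month, "
        f"{allowed_downtime_minutes(target, HOURS_PER_WEEK):5.1f} min/week"
    )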

The engineering cost to go from 99.9% to 99.99% is not 10% more work. It is often a fundamentally different architecture — redundancy at every layer, automated failover, zero-downtime deployments, and significantly more operational overhead.

The Right Question to Ask

Before agreeing to any availability target, ask: "What is the cost of going from 99.9% to 99.99% for this specific system?" Then ask: "What is the business cost of 43 minutes of downtime per month?" If the answer to the second question is "mild inconvenience," you do not need four nines. Save the engineering budget for something that matters.

SLIs, SLOs, and SLAs — A Precise Vocabulary

These three terms get used interchangeably in most engineering conversations, and that causes real problems. They are not interchangeable. They describe three different things at three different levels of the system.

SLI — The Measurement

An SLI (Service Level Indicator) is a concrete measurement of something your system does. It is a number, produced by instrumentation. Examples: the percentage of requests that return a 2xx response, the percentage of requests answered within 300 ms, the fraction of payment transactions that complete successfully.

An SLI is not a target. It is just a measurement. You can have a very accurate SLI that looks terrible — the measurement is fine, the system is bad.

Choosing the right SLI is harder than it looks. The most common mistake is measuring something that is easy to instrument rather than something that reflects the user's experience. CPU utilization is easy to measure. Whether users are actually having a good experience is harder. Always prefer user-facing SLIs — error rate, latency, and throughput as experienced at the application layer — over infrastructure metrics like CPU, memory, or disk.

Good vs. Bad SLI Choice

Bad SLI: "Pod CPU utilization is below 80%." — This tells you nothing about whether users are getting good responses.

Better SLI: "Percentage of requests returning a 2xx response." — Closer to the user, but misses latency.

Best SLI: "Percentage of requests returning a 2xx response within 300ms." — This captures both correctness and speed. A request that takes 45 seconds is not a success.
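
To make the "best" choice concrete, here is a minimal sketch of how that SLI could be computed from raw request records. The record fields and the 300 ms threshold are illustrative, not tied to any particular monitoring stack:

from dataclasses import dataclass

@dataclass
class RequestRecord:
    status_code: int      # HTTP status returned to the user
    latency_ms: float     # time to respond, as the user experienced it

def good_request_ratio(requests: list[RequestRecord],
                       latency_threshold_ms: float = 300) -> float:
    """SLI: fraction of requests that returned 2xx within the latency threshold."""
    if not requests:
        return 1.0  # no traffic: treat as meeting the target (a policy choice)
    good = sum(
        1 for r in requests
        if 200 <= r.status_code < 300 and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)

# Example: two fast successes, one very slow success, one server error -> 50%
sample = [
    RequestRecord(200, 120), RequestRecord(200, 80),
    RequestRecord(200, 45_000),   # a 45-second "success" is not a success
    RequestRecord(500, 90),
]
print(f"SLI: {good_request_ratio(sample):.2%}")   # 50.00%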

SLO — The Target

An SLO (Service Level Objective) is a target for an SLI over a measurement window. It is an internal engineering commitment, not a customer contract. Examples: "99.9% of requests return a 2xx response within 300 ms, measured over a rolling 30-day window," or "99.95% of checkout requests succeed in each calendar month."

Notice the SLO has three components: a metric (the SLI), a threshold (the number), and a window (the time range). All three matter.

The window is where teams often make subtle mistakes. A 30-day rolling window is more forgiving than a calendar-month window during the first days of the month (you have buffer from the previous 29 days). A 5-minute window is unforgiving — a 10-second blip shows up as a 3% error rate. Think carefully about what window matches how users actually experience your system.

SLOs should be set slightly tighter than what you actually deliver, but not so tight that you're always in violation. A useful heuristic: set your SLO at a level where missing it is a real signal that something needs to change, not just background noise.
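
One way to keep all three components explicit is to record the SLO as structured data rather than a sentence in a wiki. A sketch, with illustrative field names and targets:

from dataclasses import dataclass

@dataclass
class SLO:
    sli_description: str     # the metric: which SLI this target applies to
    target: float            # the threshold, e.g. 0.999
    window_days: int         # the measurement window

    def is_met(self, measured_sli: float) -> bool:
        return measured_sli >= self.target

checkout_slo = SLO(
    sli_description="fraction of requests returning 2xx within 300 ms",
    target=0.999,
    window_days=30,          # rolling 30-day window
)

print(checkout_slo.is_met(0.9992))   # True
print(checkout_slo.is_met(0.9987))   # False: a real signal, not background noise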

SLA — The Contract

An SLA (Service Level Agreement) is a contract between you and your customers. It has consequences when you miss it — usually financial, in the form of service credits. Examples: "If monthly uptime falls below 99.9%, customers receive a 10% credit on their bill."

The critical relationship between SLOs and SLAs: your SLA should always be weaker than your internal SLO. If you have an internal SLO of 99.9% availability and you sign an SLA promising 99.9%, you have no margin. The moment you slip even slightly below your internal target, you are also in breach of the contract. You need the SLO to be tighter — say 99.95% internally — so that even if you occasionally miss the internal target, you still stay above the external SLA commitment.

The Buffer Rule

Internal SLO > external SLA ≥ actual need. Example: the business needs 99.9%, so you set the SLA at 99.9% and the internal SLO at 99.95%.

A Common and Painful Mistake

Teams set their SLO equal to their SLA, then wonder why they're always issuing service credits. The SLA is the floor, not the target. If your floor is also your target, you are designing to barely stay out of penalty territory — and you won't, because systems always have bad days.

Error Budgets — The Most Useful Tool in Reliability Engineering

Here is a tension that exists in nearly every software organization: the engineering team wants to ship features fast. The reliability team wants to prevent incidents. These goals feel like they are in conflict, and so they become political. Engineers argue "this deployment is safe." Reliability engineers argue "we've had too many incidents lately." Nobody wins, and both sides feel frustrated.

The error budget is a tool that makes this conversation mathematical instead of political.

What an Error Budget Is

If your SLO is 99.9% availability over a 30-day window, that means you are allowed to be unavailable for 0.1% of that time. 0.1% of 30 days is about 43 minutes (43.2, to be precise). Those 43 minutes are your error budget — the amount of unreliability you are allowed to "spend" in a month.

SLO target: 99.9%
Monthly error budget (30-day window): ~43 minutes
Weekly error budget: ~10 minutes
Budget remaining after a single 45-minute outage: 0 minutes

This budget can be spent in any way. A planned deployment that causes 5 minutes of degraded performance uses 5 minutes of the budget. An unexpected incident that causes 30 minutes of high error rates uses 30 minutes. A bad config push that takes 8 minutes to roll back uses 8 minutes. When the budget is gone, it's gone — you've used up your allowed unreliability for the month.
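
The bookkeeping is simple enough to sketch directly. The helper below assumes a 30-day window and counts budget in minutes of full unavailability, using the same 5 + 30 + 8 minute spend described above:

def monthly_error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed downtime for the window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60

def remaining_budget_minutes(slo_target: float, downtime_spent_min: float,
                             window_days: int = 30) -> float:
    return monthly_error_budget_minutes(slo_target, window_days) - downtime_spent_min

budget = monthly_error_budget_minutes(0.999)   # ~43.2 minutes
spent = 5 + 30 + 8                             # deploy + incident + bad config push
print(f"budget: {budget:.1f} min, spent: {spent} min, "
      f"remaining: {remaining_budget_minutes(0.999, spent):.1f} min")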

How to Use the Budget to Make Decisions

The power of error budgets is that they turn the reliability vs. velocity trade-off into an explicit, shared resource with clear rules:

When the budget is healthy — you have plenty of error budget left — go fast. Ship features. Run experiments. Deploy frequently. You have headroom and you should use it. Sitting on unused error budget is waste; it means you're being more reliable than you need to be, and that extra reliability has a cost (slower deployments, more conservative changes).

When the budget is low — you have consumed most of your error budget — slow down. Stop non-critical deployments. Focus the team on reliability work: fixing the flaky service, adding better circuit breakers, improving alerting. The budget being low is a signal, not a punishment.

When the budget is exhausted — you have spent all of it — stop. No new feature deployments until the window resets or you fix the underlying reliability problem. This is not a suggestion; it is the agreed-upon rule, established before any incident happened.
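
Because the regimes are defined in advance, they can be encoded as a plain function. The sketch below uses an illustrative 25% threshold for the "low budget" regime; the real threshold belongs in your written error budget policy:

def deployment_policy(budget_remaining_fraction: float) -> str:
    """Map remaining error budget to an agreed-upon release posture.

    The 25% threshold is an example; the real number should come from the
    error budget policy agreed before any incident happened.
    """
    if budget_remaining_fraction <= 0.0:
        return "freeze: no new feature deployments until the window resets"
    if budget_remaining_fraction < 0.25:
        return "slow down: pause non-critical deploys, prioritize reliability work"
    return "go fast: ship features, run experiments, deploy frequently"

print(deployment_policy(0.80))   # healthy budget
print(deployment_policy(0.10))   # most of the budget is gone
print(deployment_policy(-0.05))  # overspent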

Why This Matters Organizationally

Without error budgets, the reliability conversation is always qualitative: "I think we've been too risky lately." With error budgets, it's quantitative: "We've burned 38 of our 43 minutes this month, so we're pausing non-critical deploys for the next two weeks." No one can argue with the math. The rules were set in advance. The budget is either there or it's not.

This is why error budgets are as much an organizational tool as a technical one. They resolve a political argument by replacing it with an agreed-upon metric.

The Error Budget Policy

The error budget itself is just a number. The error budget policy is the written agreement that describes what happens at different budget levels. Without a policy, the budget is just a dashboard that no one acts on.

A good policy answers questions like these: At what level of remaining budget do non-critical deployments pause? Who has the authority to grant an exception, and how is it recorded? What work takes priority while the budget is exhausted, and when does normal release velocity resume?

This policy must be written and agreed upon by both the product team and the engineering team before a crisis. If you try to establish the policy during an incident, you'll get arguments instead of alignment.

Measuring the Budget Correctly

There is a subtle but important question: what counts toward burning the error budget? There are a few schools of thought.

Everything counts — any time your SLI falls below the SLO threshold, you burn budget. This includes planned maintenance, voluntary degradation, external dependency failures, and your own bugs. The argument for this: users don't care why they experienced downtime. From their perspective, a planned maintenance window and an unexpected incident are both "the service was unavailable."

Some things are excluded — some teams exclude budget burn from causes outside their control (e.g., a cloud provider's regional outage). The argument: your team should not be penalized for failures they couldn't have prevented. The counter-argument: this creates perverse incentives to blame external causes.

There is no universally right answer, but the more important principle is: whatever you decide, be consistent. Changing the rules mid-month when things look bad destroys the whole point of having an objective measure.

Availability Is a Property of a Call Chain

Here is one of the most important and most overlooked facts about availability in distributed systems: the availability of a system that calls other systems is always lower than the availability of any single component.

If your service calls three other services to handle a request, and all three need to succeed for the request to succeed, then your service can be no more available than the product of those three availabilities, which is lower than any one of them.

Availability of a Serial Dependency Chain

A(system) = A(service₁) × A(service₂) × A(service₃) × ...

Each service in the critical path multiplies down the total availability.

Let's make this concrete. Suppose you have a request that passes through five services, each with 99.9% availability:

// Five services, each at 99.9% availability
A = 0.999 × 0.999 × 0.999 × 0.999 × 0.999
A = 0.999⁵
A ≈ 0.995   // 99.5% — you've lost half a nine

// Ten services, each at 99.9%
A = 0.999¹⁰ ≈ 0.990   // 99.0%

// Twenty services, each at 99.9%
A = 0.999²⁰ ≈ 0.980   // 98.0% — nearly two full nines gone

This is the hidden cost of microservices that rarely gets discussed in architecture reviews. Every service you add to the call path of a critical request reduces your end-to-end availability — even if each individual service is rock solid.

This is why a monolith can actually have higher availability than a microservices architecture, even if individual microservices are more reliable than the equivalent components in the monolith. The monolith has fewer network hops, fewer failure boundaries, fewer points of failure in a single request's critical path.

How to Improve Call Chain Availability

There are four main strategies, and they're not mutually exclusive:

1. Shorten the chain. Every service you remove from the critical path improves end-to-end availability. Ask whether every dependency is truly necessary for the primary path, or whether it can be made asynchronous.

2. Make dependencies asynchronous. If Service B doesn't need to respond before you can reply to the user, don't put it in the synchronous critical path. Write to a queue and process it later. Your availability is no longer coupled to B's availability for that request.

3. Use fallbacks and degraded modes. If Service C is unavailable, can you return a useful response without it — perhaps a cached result, a default value, or a simplified version? This is covered in depth in Chapter 14, but the key point here is: a service that can degrade gracefully when a dependency is down contributes much less to end-to-end unavailability.

4. Set higher SLOs for shared dependencies. If twenty services depend on Service X, and Service X is at 99.9%, it is putting a ceiling on the availability of all twenty of those services. Shared, high-fan-in dependencies need to be held to a higher standard than leaf services. They should have tighter SLOs, more redundancy, and more careful deployment practices.

The Shared Infrastructure Trap

It is very common for teams to build a "platform service" used by every other team — an auth service, a configuration service, a feature flag service — and then treat it with the same availability bar as any other service. This is a mistake. If fifty services depend on your auth service and it goes down for 5 minutes, you haven't had one 5-minute outage. You've had fifty simultaneous 5-minute outages. The blast radius of a high-fan-in dependency is enormous. Its availability requirements should be set proportionally.

Parallel Dependencies vs. Serial Dependencies

Not all call chains are serial. Sometimes you call multiple services in parallel and need at least one to succeed (or need all of them). The math is different in each case.

For parallel calls where all must succeed, the formula is the same as serial — multiply. It does not matter that the calls happen simultaneously; if any one fails, the request fails.

For parallel calls where only one needs to succeed (e.g., you have a primary and a fallback), the math is much more forgiving:

Availability with a Redundant Fallback

A = 1 − (1 − A₁) × (1 − A₂)

Two 99.9% services in a fallback configuration → 1 − (0.001 × 0.001) = 0.999999, or 99.9999%.

Two independent services at 99.9% in a primary/fallback configuration give you six nines. This is why redundancy is so powerful — and why the default response to "we need higher availability" is usually "add a redundant instance," not "make the single instance more reliable."
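
Both formulas are easy to check numerically. The sketch below reproduces the serial numbers from earlier in the chapter and the fallback number above:

from math import prod

def serial_availability(availabilities: list[float]) -> float:
    """All services must succeed: availabilities multiply down."""
    return prod(availabilities)

def redundant_availability(availabilities: list[float]) -> float:
    """Only one needs to succeed: multiply the failure probabilities instead."""
    return 1.0 - prod(1.0 - a for a in availabilities)

print(f"{serial_availability([0.999] * 5):.4f}")        # 0.9950 -> ~99.5%
print(f"{serial_availability([0.999] * 20):.4f}")       # 0.9802 -> ~98.0%
print(f"{redundant_availability([0.999, 0.999]):.6f}")  # 0.999999 -> six nines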

How to Measure Availability in Practice

You cannot manage what you cannot measure, and measuring availability correctly is harder than it looks.

Synthetic Monitoring vs. Real User Monitoring

Synthetic monitoring means you have external probes that send requests to your service at regular intervals and record success/failure. These probes look like users, but they are not users. They hit the same few endpoints, with the same simple request, from a small number of known locations. They are good for detecting complete outages quickly.

Real user monitoring (RUM) means measuring success and failure rates in your actual production traffic. This is more accurate — it reflects what real users actually experience — but it requires you to have production traffic, and errors in low-traffic time windows can be statistically noisy.

The right answer is usually both. Synthetic monitoring catches outages fast, before real users are affected at scale. Real user monitoring catches subtle degradation that probes miss — like a particular request pattern that only real users trigger, or a geographic region that's degraded but not completely down.

The Window Problem

Availability is always measured over a time window, and the choice of window changes what you can see.

A short window (minutes to hours) is sensitive to brief spikes — a 2-minute incident looks catastrophic in a 5-minute window. This is good for alerting, where you want to be paged quickly. It is bad for SLO measurement, where a 2-minute incident out of 30 days should not look like you're violating your SLO.

A long rolling window (28 or 30 days) smooths out brief incidents. This is appropriate for SLO measurement and error budget tracking. The rolling window is better than a calendar month because it does not reset to zero at midnight on the 1st — your error budget carries over continuously.

A very long window (quarters or years) is useful for understanding trends but too slow for operational decisions. If your availability has been degrading for six months and you only notice at the quarterly review, it is too late.

In practice, most teams maintain two views: a short-window view for alerting and incident response, and a rolling 30-day view for SLO tracking and error budget management.
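
A sketch of the rolling-window view, tracking availability as a ratio of good requests to total requests over the last 30 days of daily counts (the data shape is illustrative; real systems usually aggregate these counts from a metrics store):

from collections import deque

class RollingAvailability:
    """Tracks availability over a rolling window of daily request counts."""

    def __init__(self, window_days: int = 30):
        self.days = deque(maxlen=window_days)   # (good_requests, total_requests)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def availability(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else 1.0

tracker = RollingAvailability()
for day in range(30):
    tracker.record_day(good=999_000, total=1_000_000)   # a steady 99.9% day
print(f"{tracker.availability():.4%}")                   # 99.9000%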

The Most Common Mistakes

Having seen how availability should work in theory, here are the places where things go wrong in practice — repeatedly, predictably, at companies of all sizes.

Setting SLOs Too Tight

There is a temptation, especially when talking to customers or executives, to set ambitious SLOs. "We target five nines!" sounds impressive. But if you are a team of 8 engineers running a service that you deploy twice a week, five nines is simply not achievable without heroic operational work — and that heroism has a cost. Engineer burnout, slow development, and constant fire-fighting are all symptoms of an SLO that is set too high for the team's actual capacity.

The right SLO is one that represents genuinely good service for your users without requiring unsustainable operational effort. For most internal services, 99.9% is perfectly reasonable. For customer-facing services at significant scale, 99.95% is ambitious but achievable. For critical financial infrastructure, 99.99% requires dedicated reliability engineering investment. Five nines is appropriate for a handful of the most critical systems in the world.

Measuring the Wrong Thing

Your health check endpoint returns 200 OK. Your monitoring says availability is 100%. But users are getting errors on every checkout because the payment service integration is broken — and your health check doesn't test that path. This is measuring the wrong thing.

The health check measures whether the server is up. It does not measure whether users are having a good experience. Always validate that your SLI measurement actually correlates with user outcomes. A useful test: if your SLI says 99.9% but users are filing complaints about errors, your SLI is measuring the wrong thing.
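
One way to close that gap is to probe the user-facing path, not just process liveness. The sketch below contrasts the two; the /healthz and /checkout/test endpoints are hypothetical stand-ins for whatever your service actually exposes:

import urllib.request

def process_is_up(base_url: str) -> bool:
    """Shallow check: only proves the process answers /healthz with a 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def checkout_path_works(base_url: str) -> bool:
    """Deeper probe: exercises the path users actually care about.

    /checkout/test is a hypothetical endpoint that runs a synthetic,
    non-charging order through the payment integration.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/checkout/test", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

# A service can pass the first check and fail the second; only the second
# tells you whether users can actually complete a checkout.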

No Error Budget Policy

Many teams define SLOs and even track error budgets, but never write a policy for what to do when the budget runs low. The budget becomes a number on a dashboard that everyone glances at and forgets. When you finally exhaust the budget, there is no agreed-upon response, so you get the same political argument that error budgets were supposed to prevent.

Write the policy before you need it. Make it short, specific, and pre-agreed. "If we burn more than 50% of our monthly error budget in a single week, we pause all non-critical deployments for the following week." That is actionable. "We should be more careful about availability" is not.

Treating Availability as a Single-Service Property

The most common mistake at the system design level: teams optimize their individual service to have high availability and consider the job done. But no user experiences a single service in isolation. They experience the full request path. If your service is at 99.99% but the database it depends on is at 99.9%, the users experience 99.9% availability. Your extra nines are invisible to them.

When reasoning about availability targets, always think about the full call path from the user's request to the response. Map every dependency, identify the weakest link, and either accept that the chain's availability is bounded by that weak link or invest in making that dependency more reliable.