Part VII Chapter 27

The On-Call Experience as a Design Constraint

The way you design a system today determines how much sleep your team loses tomorrow. On-call burden is not an operations problem — it is a direct consequence of the decisions made in code review.

What's Coming in This Chapter

We start by defining toil — the class of work that scales with traffic but produces no lasting value — and why eliminating it is real engineering work, not a nice-to-have. Then we look at alert fatigue: what causes it, why it is a design failure rather than an operations problem, and what a well-designed alert actually looks like. We cover how to write runbooks that someone will actually use at 3am. We walk through the structure and culture of blameless post-mortems, including what "blameless" does and doesn't mean. And we finish with the five-year test — a simple forcing function for evaluating whether the system you are building is operationally honest.

Key Learnings — Quick Glance

  1. On-call burden is a design output. The amount of pain your system causes at 2am was determined by choices made during design and code review — not after deployment.
  2. Toil is a tax on growth. Manual, repetitive operational work scales with traffic. Engineering work scales your system's value. Budget explicitly for eliminating toil or it will crowd out everything else.
  3. Alert fatigue is a signal-to-noise failure. Every noisy alert trains your team to ignore pages, and the resulting delay in response is what turns minor problems into incidents.
  4. Only alert on symptoms, not causes. "Error rate is 5%" is a symptom. "Database CPU is 80%" is a cause that may or may not matter. Alert on what users experience, not what machines report.
  5. A runbook that needs an expert to interpret it is not a runbook. If a new engineer cannot follow it at 3am without calling someone, rewrite it.
  6. Blameless post-mortems look for system failures, not human failures. If a reasonable engineer made a mistake that caused an incident, the system allowed that mistake to happen. Fix the system.
  7. The five-year test exposes operational debt early. Ask: would I be comfortable on-call for this system in five years, with 5× the current traffic, and a team that has never spoken to me?
  8. Three alert tiers, not one. "Wake me up at 3am," "tell me in the morning," and "log for weekly review" are fundamentally different severities. Collapsing them into one channel destroys signal.

Introduction

There is a version of a post-mortem that almost every engineering team has written. The incident happened at 2am on a Tuesday. An alert fired. The on-call engineer woke up and spent 45 minutes figuring out what the alert even meant. They found a dashboard that was confusing. They restarted a service, which fixed it temporarily. They went back to sleep. The same alert fired again at 5am.

What caused this? You might say it was a bad alert threshold, or poor documentation, or an unstable service. But these are symptoms of the same root cause: the engineers who designed and built this service never considered the on-call experience as part of the design. They thought their job ended at deployment. It doesn't.

The operational experience of a system — how it behaves when it fails, how easy it is to diagnose, how clear it is to fix — is as much a product of design decisions as its performance or correctness. A system that performs perfectly under normal conditions but turns every outage into a mystery is not a well-designed system. It is a system with deferred costs, and those costs show up in the sleep and sanity of whoever is on-call.

This chapter is about making the on-call experience a first-class design concern, not an afterthought.

1 — Toil: The Work That Scales With Traffic

Google's Site Reliability Engineering book popularized the term toil for a specific class of work. Toil is work that is manual, repetitive, automatable, and produces no lasting improvement. The defining characteristic is this: as your system grows, toil grows proportionally. Engineering work, on the other hand, produces leverage — you do it once and the benefit compounds.

Here are some concrete examples of toil:

Manually restarting a service
The service has a memory leak you haven't fixed. Every few days someone restarts it. If traffic doubles, you restart it twice as often. No lasting value is produced.
Scaling up by hand
Traffic spikes every Friday evening. An engineer manually increases the replica count in the Kubernetes deployment, then decreases it Sunday night. Every week, the same steps (a minimal automation sketch follows this list).
Manually clearing a queue
A downstream service occasionally falls behind and a queue fills up. Someone SSHes into the worker, identifies the stuck jobs, and deletes them. No automation, no self-healing.
Running a data migration script on a schedule
Every Monday morning, someone runs a Python script to backfill a column that should have been handled by the write path. It just never got fixed.
Responding to alerts that require the same fix every time
The alert fires. The runbook says "run this command." The engineer runs the command. Problem resolved. Every single time. No one has automated the fix.
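
To make the contrast concrete, here is a minimal sketch of what automating the Friday scaling from the list above might look like: a small script run by a scheduler (cron, or a Kubernetes CronJob) instead of a human. The deployment name, namespace, and replica counts are hypothetical, and in many setups a Horizontal Pod Autoscaler is the better long-term fix.

    # Minimal sketch: replace the manual Friday/Sunday scaling with a scheduled
    # script. Deployment name, namespace, and replica counts are hypothetical.
    import subprocess
    import sys

    def scale(deployment: str, namespace: str, replicas: int) -> None:
        """Set the replica count for a deployment via kubectl."""
        subprocess.run(
            ["kubectl", "scale", f"deployment/{deployment}",
             f"--replicas={replicas}", "-n", namespace],
            check=True,
        )

    if __name__ == "__main__":
        # The scheduler invokes `scale_workers.py up` on Friday evening and
        # `scale_workers.py down` on Sunday night.
        target = {"up": 8, "down": 3}[sys.argv[1]]
        scale("worker", "production", target)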

What all of these have in common is that they are work with zero return. You invest time and get nothing lasting back. Worse, the cost compounds in the wrong direction: as your system grows, you spend more time on them, leaving less time for work that actually improves things.

Measuring Your Toil

Most teams do not know how much toil they have, because toil is invisible. It happens in Slack DMs, in quick SSH sessions, in someone heroically "just handling it." The first step to eliminating toil is making it visible.

A simple technique: ask every engineer on-call to keep a log for two weeks. Every time they take an action — responding to an alert, running a manual process, fixing something in the console — they write it down. At the end of two weeks, categorize each action. Anything you did more than twice without automation is toil. Anything that required no judgment (just execute these steps) is a candidate for elimination.
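
A minimal sketch of the categorization step, assuming the log was kept as a CSV with one row per action; the column names here are illustrative, not a prescribed format.

    # Tally a two-week on-call log kept as a CSV with columns:
    # date, action, required_judgment ("yes"/"no"). Column names are illustrative.
    import csv
    from collections import Counter

    def toil_candidates(path: str) -> dict:
        counts = Counter()
        judgment_free = set()
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                counts[row["action"]] += 1
                if row["required_judgment"].strip().lower() == "no":
                    judgment_free.add(row["action"])
        # Done more than twice, or requiring no judgment: candidate for elimination.
        return {a: n for a, n in counts.items() if n > 2 or a in judgment_free}

    for action, n in sorted(toil_candidates("oncall_log.csv").items(),
                            key=lambda kv: -kv[1]):
        print(f"{n:3d}x  {action}")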

Insight

In healthy teams, toil should be less than 50% of on-call time. In many teams it is 70–80%. The difference is years of accumulated shortcuts that no one had the budget to fix. That budget has to be created explicitly — toil elimination does not happen naturally, because there is always something more urgent to build.

Toil Elimination Is Engineering Work

This sounds obvious but it is consistently underprioritized. Automating the Friday scaling is a real engineering task. Fixing the memory leak is a real engineering task. Building a self-healing mechanism for the stuck queue is a real engineering task. These should be on the roadmap, tracked in your project management tool, estimated, and delivered — not handled as informal cleanup work between "real" features.

The reason they are usually not is that they do not have visible users or stakeholders pushing for them. Features have PMs. Toil has no one. So the engineering team has to advocate for itself and frame toil elimination in terms leadership understands: "We spend 10 hours a week on manual scaling. Automating it will take 3 days and save 40+ hours per month." That math is usually easy to sell.

Common Mistake

Treating the toil log as a "nice to have if we have spare cycles" exercise. Spare cycles never come. You have to cut something to make room for toil elimination, and that trade-off should be made explicitly, not left to chance.

2 — Alert Fatigue Is a Design Failure

You have probably been on a team where the on-call phone rang several times a night and people started sleeping through it. Or where an alert fired so often that everyone had learned to glance at it and immediately close it. This is alert fatigue, and it is one of the most dangerous states an on-call rotation can be in.

The danger is not that engineers get annoyed. The danger is that when a real, severity-one incident starts, the response is delayed because the team has been trained — through hundreds of false positives — that alerts are probably not serious. The alert that signals a true 10-minute-to-outage situation gets the same initial skepticism as the one that has fired every day for three months and never required action.

How Alert Fatigue Develops

Alert fatigue almost always develops through the same pattern. A team starts with a few good alerts. They add more alerts over time, because more alerts feels like more safety. Some of these are well-calibrated. Some are not — the threshold is too sensitive, or the condition it detects does not actually require action, or the problem it monitors self-resolves. But alerts are rarely deleted. They accumulate.

Over 12 to 18 months, a team that started with 10 meaningful alerts now has 60. Of those 60, perhaps 15 are genuinely actionable. The rest are noise. But no one knows which 15 those are without digging in each time. The signal-to-noise ratio has collapsed.

The Asymmetry Problem

Adding an alert takes 5 minutes. Evaluating whether an alert should be deleted requires looking at its history, understanding its original intent, and being willing to accept the risk of missing something. This asymmetry means alerts accumulate naturally and only get pruned through deliberate effort.

The Real Cost of a Noisy Alert

It is worth being precise about what a noisy alert costs, because the cost is often underestimated.

If an alert pages an engineer at 3am and no action is required, the cost is not just "5 minutes of their time." It is the time to wake up, the time to orient, the time to investigate enough to confirm no action is needed, and then the time to fall back asleep — which frequently does not happen cleanly. Cognitive research suggests a full sleep interruption costs roughly 90 minutes of effective sleep and impairs performance for the following day. A single false-positive page has a real, measurable impact on human beings.

Multiply this across a week, across a team, and you start to see why noisy on-call rotations lead to burnout, turnover, and eventually production incidents caused by fatigued engineers making mistakes.

The Three-Tier Alert Model

One of the most practical improvements a team can make is to stop treating all alerts as equal. There are three fundamentally different levels of urgency:

Page
User-facing impact is happening now. Requires immediate human action regardless of time of day. Delivered as a page (PagerDuty / phone call). Example: error rate above SLO threshold for 5 minutes.
Ticket
Something needs attention, but it can wait until business hours. No immediate user impact. Delivered as a Slack message or Jira ticket. Example: disk usage at 75%; at the current growth rate it will hit 90% in 3 days.
Log
Worth knowing about during a weekly review but not worth interrupting anyone. Delivered via a dashboard or weekly digest. Example: P95 latency slightly elevated on Tuesdays.

Most teams have only one tier — everything goes to the same channel. This forces engineers to apply the mental overhead of triage to every notification. Separating these three tiers means the phone only rings when a human needs to act right now.
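
As a sketch of what the separation can look like in practice, the routing below sends each tier to a different channel; the delivery functions are placeholders for whatever paging, ticketing, and dashboard tools your team actually uses.

    # Sketch: route each alert to the channel its tier deserves, instead of
    # sending everything to one place. Delivery functions are placeholders.
    from enum import Enum

    class Tier(Enum):
        PAGE = "page"      # wake a human now
        TICKET = "ticket"  # can wait until business hours
        LOG = "log"        # weekly review only

    def send_page(title: str) -> None:       # placeholder: PagerDuty / phone
        print(f"[PAGE]   {title}")

    def create_ticket(title: str) -> None:   # placeholder: Jira / Slack
        print(f"[TICKET] {title}")

    def add_to_digest(title: str) -> None:   # placeholder: dashboard / digest
        print(f"[DIGEST] {title}")

    ROUTES = {Tier.PAGE: send_page, Tier.TICKET: create_ticket, Tier.LOG: add_to_digest}

    def deliver(tier: Tier, title: str) -> None:
        ROUTES[tier](title)

    deliver(Tier.PAGE, "HTTP 500 rate above SLO threshold for 5 minutes")
    deliver(Tier.TICKET, "Disk at 75%; at current growth, 90% in ~3 days")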

Alert Hygiene: The Quarterly Review

Good alert hygiene is not a one-time project. It is a recurring practice. Once a quarter, go through every alert and ask three questions:

1
Has this alert fired in the last 90 days?

If it has never fired, either the condition is impossible in production or the threshold is miscalibrated. Either way, investigate before keeping it.

2
When it fired, did someone take action?

Pull the on-call logs. If the alert fired 40 times and the only documented response is "investigated, no action needed," the alert is noise. Raise the threshold or delete it.

3
Is there a runbook for this alert?

If there is no runbook, the alert is relying on tribal knowledge. If the person who knows what to do leaves the team, this alert becomes an incident waiting to happen.
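
The first question is easy to answer mechanically if your alerts come from Prometheus, which records firing alerts in the built-in ALERTS series. A sketch, assuming a reachable Prometheus server (the URL below is hypothetical):

    # Which alerts have fired at all in the last 90 days? Assumes Prometheus;
    # the server URL is hypothetical.
    import requests

    PROM_URL = "http://prometheus.internal:9090"

    def alerts_that_fired(window: str = "90d") -> set:
        query = ('count by (alertname) '
                 f'(count_over_time(ALERTS{{alertstate="firing"}}[{window}]))')
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
        resp.raise_for_status()
        return {r["metric"]["alertname"] for r in resp.json()["data"]["result"]}

    # Any alert defined in your rules but missing from this set has not fired
    # in 90 days -- investigate before keeping it.
    print(sorted(alerts_that_fired()))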

3 — Anatomy of a Good Alert

A good alert has three properties. It is actionable — when you receive it, you know what to do. It is symptom-based — it describes what users are experiencing, not what machines are doing. And it is calibrated — the threshold is set to minimize both false positives and false negatives.

Alert on Symptoms, Not Causes

This is the single most important principle in alert design, and the one most often violated.

Causes are things like high CPU, memory pressure, disk usage, queue depth, thread pool saturation. Symptoms are things like elevated error rate, increased latency, requests failing, users unable to complete actions. The distinction matters because causes frequently do not produce symptoms — a CPU spike during a batch job is expected and harmless. And symptoms can happen with no obvious cause — a latency spike might come from a network partition that no internal metric reflects.

If you alert on causes, you produce a huge volume of pages that may or may not require action. If you alert on symptoms, you page only when users are actually experiencing something. The mental model shift is this: your alerts should represent your service's obligations to its users, not its internal implementation.

Concrete Example

Cause-based alert: "Database connection pool utilization > 80%." This fires whenever traffic is high. Most of the time, the service is handling requests fine and users notice nothing. The on-call engineer investigates, finds no user impact, and goes back to sleep. Ten times a month.

Symptom-based replacement: "HTTP 500 error rate > 0.5% for 5 minutes." This fires only when users are receiving errors. When it fires, action is clearly required. And when the database connection pool actually causes user-facing errors, this alert catches it.

SLO-Based Alerting and Burn Rate

A more sophisticated version of symptom-based alerting ties your alerts directly to your Service Level Objectives. The idea is straightforward: instead of alerting "error rate is high right now," you ask "are we consuming our error budget fast enough that we will miss our monthly SLO?"

Suppose your SLO is 99.9% availability over 30 days. That gives you a budget of roughly 43 minutes of downtime. If your error rate spikes to 100% for 5 minutes, you have consumed 5/43 = ~12% of your monthly budget in 5 minutes. That is a burn rate of roughly 1,000× the sustainable rate; at that pace, the entire monthly budget would be gone in about 43 minutes. You should definitely be paged.

But if your error rate ticks up to 0.2% for an hour — which is above your normal baseline — you are burning budget at only about twice the sustainable rate; if it continued, you would exhaust the budget in roughly 15 days. Still meaningful, but not an emergency. This belongs in the "ticket" tier, not a page.

The burn rate model lets you tune alert sensitivity based on how quickly you are exhausting your SLO budget, rather than reacting to instantaneous metric spikes. This dramatically reduces noise from brief, self-resolving conditions.
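
The arithmetic is simple enough to write down. Assuming a 30-day window and a 99.9% availability SLO, the burn rate is just the observed error rate divided by the error rate the SLO allows:

    # Burn-rate arithmetic for a 99.9% availability SLO over a 30-day window.
    SLO = 0.999
    WINDOW_MINUTES = 30 * 24 * 60                   # 43,200 minutes
    BUDGET_MINUTES = (1 - SLO) * WINDOW_MINUTES     # ~43.2 minutes of downtime

    def burn_rate(error_rate: float) -> float:
        """How many times faster than sustainable the budget is being consumed."""
        return error_rate / (1 - SLO)

    def budget_consumed(error_rate: float, duration_minutes: float) -> float:
        """Fraction of the monthly error budget consumed by an episode."""
        return (error_rate * duration_minutes) / BUDGET_MINUTES

    # 100% errors for 5 minutes: ~12% of the budget, burn rate 1000x -> page.
    print(budget_consumed(1.0, 5), burn_rate(1.0))        # 0.1157..., 1000.0
    # 0.2% errors for an hour: ~0.3% of the budget, burn rate 2x -> ticket.
    print(budget_consumed(0.002, 60), burn_rate(0.002))   # 0.00277..., 2.0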

The Four Golden Signals

If you are not sure where to start with alert design, Google's SRE book offers a useful framework. For any user-facing service, monitor these four signals:

Latency
How long does it take to service a request? Track successful requests and failed requests separately — high latency on errors usually points to a different problem than high latency on successes.
Traffic
How much demand is your system receiving? Requests per second, active connections, messages per second. Useful context when diagnosing other signals — a latency spike is different when traffic is also 3× normal.
Errors
The rate of requests that are failing. Explicit failures (HTTP 500) and implicit failures (HTTP 200 with wrong data, or requests that time out) both count.
Saturation
How "full" is your service? CPU, memory, disk, queue depth. This is the one cause-based signal worth monitoring — but watch for it as a leading indicator rather than paging on it directly.

4 — Runbooks That Actually Get Used

A runbook is a document that tells an on-call engineer exactly what to do when a specific alert fires. The word "exactly" is doing a lot of work in that sentence.

Most runbooks fail the most important test: can a competent engineer who has never seen this service follow this document at 3am and resolve the problem without calling anyone? If the answer is no — if the runbook assumes knowledge the reader might not have, or if it says "escalate to the service team" at every decision point, or if it has not been updated since the architecture changed six months ago — it is not a runbook. It is a false sense of security.

Structure of an Effective Runbook

A runbook should have a consistent structure so that someone can orient quickly. Here is a template that works well in practice:

1
Alert name and what it means

One or two sentences explaining what the alert is monitoring and what it means when it fires. Not how the system works internally — what the user is experiencing or about to experience.

2
Immediate triage — the first three things to check

Exact links to the dashboards to open, exact queries to run, exact log patterns to search for. No ambiguity. The engineer should be able to go from "alert fired" to "I understand what is happening" in under 5 minutes.

3
Decision tree

"If you see X in the logs, do Y. If you see Z, do W." Not prose. An explicit decision tree. This is what allows a non-expert to navigate to the right action without guessing.

4
Known fixes with exact commands

For common failure modes, the exact commands to run. Not "scale up the workers" — kubectl scale deployment worker --replicas=8 -n production. Exact, copy-pasteable, with a note about what each command does.

5
Escalation path

When to escalate, who to escalate to, and what information to bring when you do. Escalation should be the last resort in the decision tree, not the first step.
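
To make item 3 concrete, here is a hedged example of what a runbook decision tree might look like; the conditions, thresholds, and fix names are entirely hypothetical.

    If error rate is elevated and pods show recent OOMKilled restarts:
        -> memory exhaustion; apply Known Fix 1 (raise the memory limit, restart).
    If error rate is elevated and queue depth is above 10,000:
        -> workers cannot keep up; apply Known Fix 2 (scale the worker deployment).
    If error rate is elevated and the upstream payment gateway's latency is also elevated:
        -> the dependency is the problem; open an incident and escalate to the gateway on-call.
    Otherwise:
        -> collect the last 30 minutes of error logs and escalate per section 5.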

The Decay Problem

Runbooks go stale. The dashboard URL changes. The service is renamed. A new fix procedure replaces an old one. If runbooks are not maintained, they become actively harmful — an engineer follows outdated steps, wastes time, and loses confidence in the documentation.

A few techniques help with this. First, link to dashboards by title and search term rather than hard-coded URLs — URLs change, names usually don't. Second, make runbook updates part of the definition of done for any change that affects the failure modes of a service: if you change how the service behaves under load, update the runbook. Third, review runbooks as part of every post-mortem — if an incident revealed that the runbook was wrong or incomplete, fix it before you close the post-mortem.

Testing Your Runbooks

The most effective test of a runbook is to have a new team member — someone with no prior knowledge of the service — follow it in a controlled environment (staging, or a game day) and see if they can resolve a simulated incident. Whatever confuses them is a gap in the runbook. This is also a good onboarding exercise: it forces documentation improvement while simultaneously building the new engineer's knowledge.

Insight

Many teams treat runbooks as documentation they write after building a system. The more effective habit is to write the runbook before you finish building the system — specifically, before the first production deployment. Writing the runbook forces you to think through the failure modes. You will often discover, while writing the runbook, that you have not instrumented something you need to diagnose the system, or that a recovery procedure you planned is not actually possible. Far better to find this at writing time than at incident time.

5 — Blameless Post-Mortems

When something goes wrong in production, there are two things you can do with it. You can use it as evidence to assign fault. Or you can use it as data to improve your system. These are not the same thing, and a culture that does the first consistently prevents the second.

The blameless post-mortem is the practice of writing a detailed analysis of what went wrong, how it was detected, how it was fixed, and what needs to change — without attributing the incident to individual mistakes or naming people as the cause of the problem.

Why Blame is Counterproductive

When engineers expect blame, they do three things that are bad for reliability. They hide problems — if telling people about a near-miss means getting questioned, the incentive is to not tell people. They take fewer risks — if trying something new could result in an incident that gets attributed to you personally, the rational response is to not try new things. And they do not engage honestly in post-mortems — if a post-mortem is a mechanism for assigning fault, people focus on justifying their decisions rather than examining what the system got wrong.

All three of these behaviors make systems less reliable over time. Hiding problems means they go unfixed. Avoiding risk means the system does not improve. Dishonest post-mortems mean the same incidents recur.

The Key Insight

If a reasonable, experienced engineer made a decision that contributed to an incident, the system allowed that decision to be made. Maybe the deployment tooling did not prevent a bad config from reaching production. Maybe the monitoring did not make it obvious that something was degrading. Maybe the process did not require a second pair of eyes on that change. The individual made a mistake — that is true. But the system created the conditions for that mistake. Fix the system.

Structure of a Blameless Post-Mortem

A post-mortem document does not need to be long. It needs to be honest and precise. Here is what it should contain:

Impact Summary
What users experienced. How many were affected. How long the impact lasted. Written in user terms, not system terms. "Users could not complete checkout for 34 minutes" rather than "the payment service returned 500s."
Timeline
A chronological sequence of events in UTC, from the first sign of trouble to full resolution. Just facts. "14:23 — alert fired. 14:31 — on-call engineer began investigation. 14:47 — root cause identified." No editorializing, no blame.
Contributing Factors
The conditions that made this incident possible. Note the plural: real incidents almost always have multiple contributing factors. A bug is not a contributing factor by itself — a bug plus no test coverage plus a deployment process that does not gate on test failures is a system that allowed the bug to reach production.
Root Cause
The deepest systemic issue — the thing that, if fixed, would prevent this class of incident from recurring. Use the "five whys" technique: keep asking "why did that happen?" until you reach a systemic answer rather than an individual one.
Action Items
Concrete changes to prevent recurrence or improve detection and response. Each action item should have a single named owner, a target date, and a link to the tracking issue. Vague action items ("improve monitoring") are not action items.

The Five Whys in Practice

The five whys is a technique for drilling through surface-level causes to find systemic ones. Here is an example:

Example: Five Whys

Incident: Production database ran out of disk space, causing write failures for 20 minutes.

Why did the database run out of disk space? — Disk usage grew faster than expected due to an unindexed query generating a large temp table on every request.

Why was there an unindexed query? — A developer added a new filter last week without adding a corresponding index.

Why wasn't the missing index caught? — Our code review checklist does not include a prompt to check query plans for new database queries.

Why don't we have disk usage alerts? — We have alerts for application metrics but not for infrastructure metrics on the database host.

Why don't we have infrastructure alerts? — When we migrated to the managed database service six months ago, we carried over application alerts but did not audit what infrastructure monitoring the managed service provides and what we need to add.

Root cause: Our alert coverage audit process did not extend to managed infrastructure migrations.

Notice how different this root cause is from "a developer forgot to add an index." The developer made a mistake. But the system had five layers of process that could have caught it and did not. The post-mortem now has clear, systemic action items: add a query plan review step to code review, add infrastructure disk alerts, and add a managed service migration checklist that includes alert coverage audits.

What Blameless Does Not Mean

Blameless culture is sometimes misunderstood as having no accountability. That is not right. Engineers are still responsible for the quality of their work. Someone who repeatedly ships bugs because they are not testing, or who consistently skips process steps, is not protected by "blamelessness." Consistent patterns of poor practice are a management conversation, not a post-mortem topic.

Blameless means that a single mistake by a reasonable engineer in reasonable circumstances is treated as a data point about the system, not a character flaw. It means the post-mortem does not name individuals in the root cause section. It means people feel safe enough to report near-misses before they become incidents.

The Post-Mortem Theater Problem

Many teams write post-mortems and then do nothing with them. The action items sit in a document. No one tracks them. The same incident happens six months later. The team writes another post-mortem. This is worse than useless — it teaches people that post-mortems are a ritual to satisfy management, not a tool for improvement. The result is that people stop engaging honestly.

The fix is simple but requires discipline: every action item from a post-mortem is tracked in the same system as engineering work. It gets a ticket, an owner, a priority, and a sprint. It is reported on in team status updates. The post-mortem is not closed until the action items are done or explicitly deferred with a reason. This is not bureaucracy — this is the difference between a post-mortem culture and a post-mortem theater culture.

6 — The Five-Year Test

Here is a question worth asking about every significant system you build: would you be comfortable being on-call for this system in five years, with five times the current traffic, and a team that has completely turned over and never spoken to you?

This question is a forcing function. It makes the operational future of the system concrete. And it exposes a specific kind of technical debt that rarely shows up in code review: operational debt.

What Fails the Five-Year Test

Systems that fail this test share a few common properties:

Tribal knowledge dependencies
The system works because someone knows which config flag to set in which order, or because a specific person knows the one query that reveals what is actually wrong. When that person leaves, the knowledge leaves with them. A system that requires its author to operate it is not operationally honest.
Invisible failure modes
When the system fails, you cannot tell from the outside what is wrong. No error messages, no metrics, no logs — just requests timing out. The on-call engineer has to read source code to diagnose production incidents. This is a design failure, not a monitoring gap.
No documented recovery procedures
The team knows how to recover from a failure because they have done it before. But it exists only in memory. A new team member inheriting this system inherits a mystery box.
Complexity without proportionate value
The system has ten services where three would do, or it has five layers of caching with subtle interaction effects that only appear under specific traffic patterns. Complexity has a direct operational cost. Every component that can fail independently is a failure mode someone has to understand.
No tested rollback path
The system has been running for two years. No one has ever tested rolling back a deployment because it has never been needed. When it is needed — and it will be — the procedure is theoretical, untested, and likely broken.

Using the Test During Design

The five-year test is most useful as a design-time question, not a post-deployment evaluation. When you are writing a design document or reviewing a proposed architecture, ask explicitly: what would it be like to operate this system in five years?

Walk through the failure scenarios. If the primary database fails, can someone who has never seen this system recover it from the runbook? If a queue fills up, does the system self-heal or does it require manual intervention? If a deployment goes bad, how long does rollback take and does it work? These are not hypothetical concerns. These are operational requirements that should be designed in, not bolted on.

A Useful Reframe

Think of your future on-call engineer as a user of your system. Just as you design APIs to be intuitive for the developers who call them, design your operational surface to be intuitive for the engineers who will own it. Good operations UX means: failure modes are obvious, recovery procedures are documented, alerts are meaningful, and common actions are automatable. You are building a product for two audiences — your external users, and your future self at 3am.

7 — Putting It Together: The Production Readiness Review

Many teams formalize these concerns in a production readiness review (PRR) — a checklist that a service must pass before it is allowed to go to production. The specific items vary by organization, but a solid PRR for on-call readiness looks something like this:

Observability
Are the four golden signals instrumented? Are logs structured? Is there a service dashboard with meaningful panels?
Alerting
Does every alert have a runbook? Is every page-worthy alert symptom-based? Have alert thresholds been validated against historical data?
Runbooks
Is there a runbook for each alert? Has a new team member been able to follow it in staging? Is it linked from the alert itself?
Failure modes
What are the top five ways this service can fail? Is each one detectable? Is each one recoverable without the original author?
Rollback
Has a rollback been tested? How long does it take? Does it require downtime?
On-call rotation
Is there an on-call rotation? Does it have at least two engineers? Do all on-call engineers have production access?
Capacity
What is the expected peak load? Has the service been load-tested beyond that? What happens when capacity is exhausted — does it fail gracefully?
Dependencies
What does this service depend on? What happens to this service if each dependency fails? Is there a fallback or circuit breaker?

A PRR is not a bureaucratic gate. It is a forcing function for conversations that should happen before production traffic reveals the gaps. It is much cheaper to discover "we have no rollback procedure" during a PRR than during an incident.

The goal is not to check every box perfectly before launch. The goal is to make the operational risks explicit and ensure that the team has consciously accepted each one, rather than discovered it by accident. Some items on the checklist might legitimately be deferred — and that is fine, as long as it is a deliberate decision with a plan, not an oversight.

Key Principle

The operational experience of a system is a direct output of the decisions made during design and code review. Every choice that makes a system harder to diagnose, harder to recover from, or harder to hand off is technical debt — and it will be paid with interest by whoever is on-call at 3am.

Most Common Mistake

Treating on-call burden as an operations problem rather than an engineering problem. Teams add more monitoring dashboards, add more alerts, spin up more incident channels — while the underlying system remains undocumented, un-runbooked, and opaque. The monitoring layer cannot compensate for a system that was not designed to be observable. The fix has to happen at the design level, not the operations level.

Three Questions for Your Next Design Review
  1. If this service fails at 3am six months from now and the engineer on-call has never seen it before, what are the first three things they will need to know — and where exactly is that information written down today?
  2. For each alert we plan to add, can we describe the specific action the on-call engineer should take when it fires? If we cannot, should we be alerting on it at all?
  3. Apply the five-year test: if we handed this system to a new team tomorrow, with 5× the current traffic, what would their first painful on-call week reveal about what we have not documented, not automated, or not thought through?