Chapter 38  ·  Part X: The Human System

Building a Culture of Reliability

Reliable systems are not built by heroic engineers who stay up all night fixing things. They are built by teams that have made the right things easy to do, the wrong things hard to do, and the painful things safe to talk about. This chapter is about how to build that kind of team.

What's in this chapter

Key Learnings


Heroism is a warning sign, not a virtue. When one engineer regularly saves the day, it means your system depends on that person being available and motivated forever. That is not reliability — it is a single point of human failure.

📋 Production readiness reviews are about shared accountability, not checklists. The checklist is just the artifact. The real goal is getting every team to ask: "Have we thought through what happens when this breaks at 3 AM?"

📊 Error budgets turn reliability into a conversation. Instead of arguing about "is this reliable enough?", teams argue about how to spend a shared budget. That is a much more productive argument.

🔍 Blameless does not mean consequence-free. It means the post-mortem asks "what in the system made this failure easy to cause?" before it asks "who made the mistake?" The goal is systemic fixes, not exoneration.

📝 A post-mortem that produces no action items is just a feelings meeting. Every post-mortem should end with specific, owned, time-bounded changes to the system, the process, or the monitoring.

🎮 Game days expose the gap between your runbooks and reality. The first time your team responds to a database failure should not be during an actual database failure. Practice in a controlled environment first.

🧹 Toil is not just annoying — it crowds out reliability work. Every hour an engineer spends on manual, repetitive operational tasks is an hour they do not spend making those tasks unnecessary.

👔 Management behavior sets the culture ceiling. A team will never be more blameless than its manager's reaction to the last incident. Engineering culture flows from leadership behavior, not mission statements.

The Heroism Trap

Most engineering teams have at least one person who is, quietly, responsible for keeping the system alive. They get paged first. They know where the bodies are buried. When something breaks badly, everyone waits for them to look at it. Their laptop is always open at dinner.

Teams often celebrate this person. Their manager calls them invaluable. They get good performance reviews. And the whole time, the team is sitting on a time bomb.

Here is the problem: reliability that depends on one person being available and motivated is not reliability. It is a single point of human failure. It has no redundancy. It does not scale. It burns out the person in the center of it. And it quietly teaches every other engineer on the team that operational excellence is someone else's job.

"If your system requires a hero to survive, your system is not reliable. Your hero is."

The heroism trap forms because it is easier to call the expert than to fix the underlying problem. Every time you call the expert instead of fixing the underlying problem, you make the expert more necessary and the system more fragile. The short-term cost of fixing the root cause feels higher than the short-term cost of the hero solving the incident. But the long-term cost compounds badly.

What heroism is hiding

When a hero regularly saves the day, it usually means at least one of these is true:

  • Critical knowledge about the system lives in one person's head rather than in runbooks and documentation.
  • The alerts and dashboards are not good enough for anyone else to diagnose the problem quickly.
  • Recurring failures are patched in the moment instead of being fixed at the root.
  • On-call responsibility is not genuinely shared across the team.

Every one of these is a systemic problem, not a people problem. And every one of them can be fixed. The hero did not create these problems, but their availability to patch over them has allowed the problems to persist.

The right question

After any incident that required the hero, ask: "What would have happened if this person had been on a plane with no internet?" If the honest answer is "we would have been in serious trouble," you have found your highest-priority reliability work.

The goal is not to remove the hero. The goal is to make the hero unnecessary for routine incidents, so they can focus on the genuinely hard problems that actually do require their depth of knowledge.


Production Readiness Reviews

A production readiness review (PRR) is a structured checkpoint before a system goes live. Done badly, it is a bureaucratic gate that teams resent and rush through. Done well, it is the moment where a team forces themselves to think through what they have not thought about yet.

The key insight is that the checklist is not the point. The point is the conversation the checklist forces.

What a PRR is actually checking

Most teams default to checklists that look like this: "Do you have a README? Do you have tests? Do you have a deployment pipeline?" These are fine hygiene checks but they do not capture what you actually need to know before something goes to production.

A useful PRR asks different questions — questions that force the team to have thought about their system in production, not just in development:

Failure modes: What are the top 5 ways this system can fail? What does the user experience when each one happens? Have you tested each one?

Observability: If this is broken at 3 AM, what will the alert say? Can the on-call engineer — who may have never touched this system — follow a runbook to diagnose and resolve it?

Rollback: If the first deployment breaks something, what is the exact rollback procedure? How long does it take? Have you practiced it?

Dependencies: If any one of your upstream dependencies goes down, what happens to your service? Does it degrade gracefully or does it cascade?

Scale: What is the expected traffic? What is the maximum you have load-tested to? What happens when you exceed that?

Data: If your database goes down and comes back up, do you need to replay anything? Can you? How long does it take?

Security: What can an attacker do with access to this service? Have you reviewed the OWASP Top 10 for your specific threat model?

On-call: Who is the primary on-call for this system after launch? Has that person agreed to it? Are they ready?

The two failure modes of PRRs

Too lenient: The PRR becomes a rubber stamp. Teams fill in answers quickly, reviewers do not push back, the checklist gets ticked and nobody is really safer. This happens when the PRR is treated as a compliance exercise rather than a genuine safety checkpoint.

Too strict: The PRR becomes a blocker. Teams dread it, try to avoid it, or launch without going through it because the process takes too long. This happens when the PRR tries to achieve perfect readiness instead of adequate readiness.

The right calibration is: a PRR should catch the things that will definitely cause incidents, and defer everything else to post-launch improvements. A system with known gaps that are documented and tracked is far better than a system with unknown gaps.
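
To make "documented and tracked" concrete, here is one way PRR findings might be recorded as explicit gaps; the Gap structure, severity labels, and names are illustrative, not part of any standard PRR tooling.

```python
# A sketch of recording PRR findings as explicit, tracked gaps.
# The Gap structure, severity labels, and names are illustrative assumptions.
from dataclasses import dataclass
from datetime import date


@dataclass
class Gap:
    description: str
    severity: str      # "launch-blocking" or "post-launch"
    owner: str         # one named person, not a team
    target_date: date


def launch_blocked(gaps: list[Gap]) -> bool:
    """Only gaps that will definitely cause incidents block the launch;
    everything else is deferred but stays visible and tracked."""
    return any(g.severity == "launch-blocking" for g in gaps)


gaps = [
    Gap("Rollback procedure has never been practiced", "launch-blocking",
        "Priya", date(2025, 6, 1)),
    Gap("Load test only covers 2x expected traffic", "post-launch",
        "Sam", date(2025, 7, 15)),
]
print("Launch blocked:", launch_blocked(gaps))
```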

Real Example Pattern

Google's SRE organization runs a launch readiness review in which SREs commit to supporting a service only after it meets specific standards. The incentive is not fear of the gate — it is that once you pass, you get expert SRE support. The review is framed as a benefit to earn, not a hurdle to clear.

Not every team has dedicated SRE capacity to offer this incentive, but the framing principle holds: make passing the PRR feel like gaining something, not just avoiding punishment.

Who should run the PRR

The team that built the system should not be the only ones running their PRR. It is too easy to have blind spots about your own work. The best PRR involves at least one person who has on-call experience with similar systems and who is willing to ask "but what if this specific thing goes wrong?" without worrying about being annoying.


The SRE Model — and What It Actually Means

Site Reliability Engineering (SRE) is often described as "what happens when a software engineer designs operations." That framing is useful but incomplete. SRE is fundamentally about making reliability a shared engineering responsibility, not an operational afterthought.

There are two main ways teams implement SRE principles:

Centralized SRE teams

A dedicated team of reliability engineers who support multiple product teams. They set standards, own the platform layer (monitoring, deployment pipelines, cluster infrastructure), and are brought in to support high-stakes launches.

The upside: deep expertise concentrated in one place. The downside: it is easy for product teams to treat reliability as "the SRE team's problem." When something breaks at 2 AM, the product engineer who wrote the code is not the one getting paged — and so they have less incentive to write code that does not break at 2 AM.

Embedded SRE (or the "you build it, you run it" model)

Reliability engineering is part of every product team's job. Engineers who build features are also on-call for those features. There is no separate operations team to hand off to.

This creates the strongest possible incentive alignment. Engineers who have been paged at 2 AM for their own code write very different code the next day. The on-call experience is the fastest feedback loop for operational quality that exists.

Common Mistake

Teams adopt "you build it, you run it" without first giving engineers the tools and training to do the running part. The result is burned-out engineers, not reliable systems. The model only works if teams also invest in runbooks, good alerting, and time to fix the things that keep paging them.

Error budgets — the most powerful idea in SRE

An error budget is the flip side of an SLO. If your SLO says "99.9% of requests must succeed," then your error budget is the 0.1% of requests that are allowed to fail. That is about 43 minutes of downtime per month.

Here is why this matters: error budgets transform reliability from an abstract goal into a shared resource that teams make real trade-off decisions about.

Without error budgets, the reliability vs. velocity conversation sounds like this: "We need to ship this feature." "But it might make the system less reliable." "How much less reliable?" "...I don't know." "So we'll ship it."

With error budgets, the same conversation sounds like this: "We need to ship this feature." "Our error budget is 80% consumed this month. If we ship and there's a rough rollout, we'll burn through the rest and have to freeze new features until next month. Is the feature worth that trade-off?"

That second conversation is productive. Both people are arguing about the same concrete thing. The error budget is the shared language.

Key Mechanic

When you have budget remaining, you can move fast — ship features, take risks, experiment. When your budget is nearly exhausted, you slow down — focus on stability, fix the things causing errors, hold off on risky changes. The budget itself is the throttle. You do not need a manager deciding when to go fast and when to be careful. The data decides.
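
As a rough sketch of that mechanic: the check below assumes total and failed request counts come from your monitoring system, and the 90% freeze threshold is an illustrative policy choice rather than a standard.

```python
# A sketch of the error-budget throttle. Request counts are assumed to come
# from your monitoring system; the 90% freeze threshold is a policy choice.
SLO = 0.999  # 99.9% of requests must succeed


def error_budget_status(total_requests: int, failed_requests: int) -> dict:
    allowed_failures = total_requests * (1 - SLO)   # the budget for this window
    consumed = failed_requests / allowed_failures if allowed_failures else 0.0
    return {
        "allowed_failures": round(allowed_failures),
        "budget_consumed": consumed,            # 0.8 means 80% of the budget is spent
        "freeze_risky_changes": consumed >= 0.9,
    }


# 10 million requests this month, 9,200 failures: ~92% of the budget is gone.
print(error_budget_status(10_000_000, 9_200))
```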

Toil — the hidden tax on your team's time

In SRE thinking, toil is a specific term: it means work that is manual, repetitive, automatable, and produces no lasting value. Toil is different from overhead (meetings, planning) and different from engineering work (building new things). Toil is the category of work that should not exist.

Examples of toil: manually restarting a service that crashes every week. Manually copying data from one system to another every morning. Manually approving every deployment. Manually updating a config file before every release.

Why does toil matter for reliability? Two reasons. First, high toil means your engineers spend less time on work that makes the system better. Second, toil is error-prone — manual steps get missed, done in the wrong order, or skipped when the engineer is tired. Automation is more reliable than humans for repetitive tasks.

Google's SRE team has a rule of thumb: if more than 50% of an engineer's time is toil, something is structurally wrong and needs to be fixed as a priority. Most teams do not measure toil explicitly, but the principle is sound: track it, set a target, and systematically eliminate it.
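
A spreadsheet is usually enough for this, but as a small sketch, tracking toil against a target can look like the following; the categories and hours are made up for illustration.

```python
# A sketch of tracking toil share against a target, in the spirit of the
# 50% rule of thumb above. The categories and hours are illustrative.
TOIL_TARGET = 0.5  # above this, something is structurally wrong


def toil_share(hours_by_category: dict[str, float]) -> float:
    total = sum(hours_by_category.values())
    return hours_by_category.get("toil", 0.0) / total if total else 0.0


week = {"toil": 22.0, "engineering": 12.0, "overhead": 6.0}
share = toil_share(week)
print(f"Toil share this week: {share:.0%}")
if share > TOIL_TARGET:
    print("Over target: schedule automation work for the biggest toil source.")
```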


Blameless Culture — What It Really Means

"Blameless post-mortems" has become something of a buzzword, to the point where many teams use the word without really practicing the thing. Let's be precise about what blameless means, and equally importantly, what it does not mean.

"Blameless does not mean consequence-free. It means the investigation asks 'what in the system made this mistake easy to make?' before it asks 'who made the mistake?'"

Why blame is counterproductive

When an incident happens and the first response is to identify who is responsible, a few things reliably occur:

People start protecting themselves rather than investigating honestly. The engineer who made the mistake may know the most about what happened and why, but they are now the least likely to share that knowledge openly. Information gets withheld. Partial truths get told. The root cause stays hidden.

Meanwhile, the focus on the person rather than the system means the systemic fix never gets made. The next engineer to touch that part of the code faces the same hidden traps, the same missing safeguards, the same environment that made the mistake easy to make in the first place. And you get the same incident again, with a different face attached.

The systemic lens

Blameless post-mortems assume that engineers are generally competent people trying to do good work. If a competent person made a mistake, the interesting question is not "why did they do that?" but "what in the environment made that the easy path?"

An engineer runs a database migration script in production instead of staging. The blame framing: "They should have checked which environment they were in." The systemic framing: "Why do production and staging share the same connection string format? Why does the script not require a confirmation prompt in production environments? Why does our tooling not make it obvious which environment you are connected to? Why did the code review not catch this?"

The blame framing produces one fix: tell engineers to be more careful. The systemic framing produces five fixes, each of which makes the mistake harder to make in the future. Only the systemic framing actually improves reliability.

What blameless does not mean

Blameless does not mean you ignore patterns of behavior. If the same engineer makes the same class of mistake repeatedly despite coaching and systemic improvements, that is a performance conversation, not a post-mortem conversation. Blameless culture is not about protecting people from accountability — it is about making sure the investigation happens before the judgment does.

It also does not mean you avoid describing what happened in detail. A good post-mortem clearly lays out the timeline of events, including which actions made things better or worse. Being accurate and factual is not the same as assigning blame.

The management behavior that makes or breaks this

Here is the harsh truth: a team will never be more blameless than its manager's reaction to the last incident.

If a manager, in private or in public, expresses frustration about the person who caused an incident — even once — engineers will notice. They will remember. The next time something breaks, the on-call engineer will spend the first ten minutes of the incident trying to figure out whether they are going to get in trouble, instead of spending those ten minutes diagnosing the problem. You have now made your incident response slower and your post-mortems less honest.

Building blameless culture requires managers to consistently model the right behavior: curiosity about what happened, focus on systemic causes, explicit protection of engineers who share difficult information honestly. This has to happen every time, not just when it is easy.


Post-Mortems That Actually Change Things

Most teams write post-mortems. Far fewer teams write post-mortems that are worth writing.

The telltale sign of a post-mortem that is not working: the action items are vague, unassigned, or never completed. "Improve monitoring" is not an action item. "Add an alert that pages when the error rate on /checkout exceeds 5% for more than 2 minutes, assigned to Sam, due by end of next sprint" is an action item.
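
For what that level of specificity looks like when expressed as a condition, here is a minimal sketch; the per-minute sample format is an assumption, and in practice the rule would live in your monitoring stack rather than in application code.

```python
# A sketch of the alert condition above: page when the error rate on /checkout
# exceeds 5% for more than 2 minutes. The per-minute sample format is an
# assumption; in practice this rule lives in the monitoring stack itself.
from dataclasses import dataclass

WINDOW_MINUTES = 2
ERROR_RATE_THRESHOLD = 0.05


@dataclass
class MinuteSample:
    requests: int
    errors: int


def should_page(samples: list[MinuteSample]) -> bool:
    """True when every minute in the trailing window exceeds the threshold."""
    recent = samples[-WINDOW_MINUTES:]
    if len(recent) < WINDOW_MINUTES:
        return False
    return all(s.requests and s.errors / s.requests > ERROR_RATE_THRESHOLD
               for s in recent)


checkout = [MinuteSample(1200, 30), MinuteSample(1100, 70), MinuteSample(1050, 65)]
print(should_page(checkout))  # the last two minutes are above 6%, so this pages
```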

The anatomy of a good post-mortem

A post-mortem document has roughly five parts, and they matter in this order:

  1. Impact statement — What was the user-visible impact? How long did it last? How many users or requests were affected? Be specific and honest. Do not soften the numbers.
  2. Timeline — What happened, in exact chronological order, from the first sign of the problem to full resolution. Include what people did, what tools showed, and when decisions were made. The timeline is the raw material for the analysis.
  3. Root cause analysis — What ultimately caused the incident? Use the "five whys" technique to get past the surface-level cause. "The service crashed" is not a root cause. "The service crashed because it ran out of memory, because a new feature allocated large objects per request without releasing them, because there was no load test that ran the new code path under sustained load" is closer to a root cause.
  4. Contributing factors — What else made this incident worse, longer, or harder to detect? Maybe alerts were too noisy to take seriously. Maybe the runbook pointed to a service that no longer exists. Maybe the on-call rotation had not been trained on this part of the system. These are all actionable findings.
  5. Action items — Specific, owned, time-bounded changes that will make this class of incident less likely, less severe, or faster to resolve. Each item has one owner (not a team), a due date, and a clear definition of done.
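One way to picture an action item that meets that bar, sketched with illustrative field names (in practice these live in your issue tracker, not in code):

```python
# A sketch of an action item that is specific, owned, and time-bounded.
# The field names are illustrative; real items belong in your issue tracker.
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str        # a specific change, not "improve monitoring"
    owner: str              # one person, not a team
    opened: date
    due: date
    definition_of_done: str
    done: bool = False

    def stale(self, today: date) -> bool:
        """Open for three months: prioritize it or consciously drop it."""
        return not self.done and (today - self.opened).days >= 90


item = ActionItem(
    description="Page when /checkout error rate exceeds 5% for more than 2 minutes",
    owner="Sam",
    opened=date(2025, 3, 3),
    due=date(2025, 3, 17),
    definition_of_done="Alert fires in a staged failure test and links to the runbook",
)
print(item.stale(date(2025, 6, 10)))  # True: surface it in the monthly review
```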

The five-why trap

The "five whys" technique is useful but can mislead you. If you always ask "why" in a linear chain, you will always end up at a single root cause — and often at a human decision. "Why did the engineer deploy on Friday afternoon?" is a question that leads to blaming the engineer.

Better post-mortems use the five whys to explore multiple causal chains, not just one. A real incident usually has several things that all had to go wrong simultaneously. The fix should address each of those things, not just the last link in one chain.

Following up on action items

The most common failure mode for post-mortems is not bad analysis — it is good analysis that produces action items that never get done. Two practices help here:

First, track post-mortem action items in the same place you track engineering work (your sprint board, your backlog). Post-mortem items that live only in the post-mortem document are invisible to the rest of the team and get deprioritized at the first sign of schedule pressure.

Second, review open post-mortem action items in your regular engineering meeting once a month. If an item has been open for three months, either it needs resources and prioritization, or it was not really important enough to fix. Both outcomes are fine — but you need to consciously make that choice rather than letting it quietly slip forever.


Game Days — Practicing Failure

A game day is a planned exercise where you deliberately break something in your system and then practice responding to it. The goal is simple: the first time your team has to deal with a database failover, a cache flush, or a dependency going down should not be during a real incident.

This is not a new idea. Firefighters train in burning buildings. Pilots practice engine failures in simulators. Surgeons learn on cadavers. Every high-stakes profession practices failure in a controlled environment before facing it in the real one. Software engineering is oddly resistant to this, perhaps because we like to believe that our systems are too different to fail in predictable ways. They are not.

What a game day looks like

A basic game day has four phases:

Game Day Structure

  1. Plan: choose a specific failure scenario, the environment, the blast radius, and who plays which role. Decide up front what a successful response looks like.
  2. Break: deliberately inject the failure, ideally in a staging environment that mirrors production closely enough to be realistic.
  3. Respond: the team treats it as a real incident and works from the alerts, runbooks, and communication channels they would actually use.
  4. Review: debrief while memory is fresh. Every gap between the runbook and what actually happened becomes an action item.

What game days reveal

Almost every team that runs their first game day is surprised by what they find. Common discoveries:

The runbook steps assume access to a tool that only one person has the credentials for. The alert for the broken thing does not fire, or fires ten minutes too late, or fires with a message so vague that nobody knows where to start. The rollback procedure works in theory but takes 45 minutes in practice — during which time the service is down. The team finds themselves needing to reach someone on Slack whose timezone means they are asleep.

None of these discoveries are failures of the game day. They are exactly the point. Finding them in a controlled exercise on a Tuesday afternoon is dramatically cheaper than finding them during a real incident on a Saturday night.

Chaos engineering vs. game days

Chaos engineering is the automated, continuous version of game days. Tools like Netflix's Chaos Monkey randomly terminate instances in production to ensure the system has learned to be resilient to instance failure. Game days are manual, intentional, and team-learning-focused. Chaos engineering is automated, continuous, and system-resilience-focused.

Start with game days. Get good at recovering from known failure modes. Then, once you have confidence in your system's resilience, add automated chaos to continuously verify that confidence as the system evolves.
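
For a rough sense of what the smallest step from manual game days toward automated chaos might look like, here is a sketch that terminates one instance of a service during business hours; the container names and the use of docker kill are assumptions about your environment, not a reference to any particular chaos tool.

```python
# A sketch of automating one known failure mode once game days have shown the
# team can recover from it. The container names and the use of "docker kill"
# are assumptions about your environment, not any specific chaos tool.
import random
import subprocess
from datetime import datetime

TARGETS = ["checkout-worker-1", "checkout-worker-2", "checkout-worker-3"]
BUSINESS_HOURS = range(9, 17)  # only inject failure when people are around to watch


def inject_instance_failure() -> None:
    if datetime.now().hour not in BUSINESS_HOURS:
        return  # never surprise the on-call engineer at night
    victim = random.choice(TARGETS)
    print(f"Chaos: terminating {victim}")
    subprocess.run(["docker", "kill", victim], check=True)


if __name__ == "__main__":
    inject_instance_failure()
```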

Before You Start Chaos Engineering

Chaos engineering in production requires mature observability, well-tested recovery paths, and organizational trust. Running it too early — before your runbooks are solid, before your alerts are trustworthy, before your rollback procedures are practiced — turns chaos engineering from a resilience tool into a reliability hazard. Get the fundamentals right first.


On-Call as a Feedback Loop

On-call is uncomfortable. Engineers get woken up, interrupted at dinner, stressed during incidents. Most teams treat on-call as an unavoidable cost of running software in production — something to be endured, not designed.

That framing misses something important. On-call is the most direct feedback loop between the quality of what you build and the quality of your life. Engineers who are on-call for the systems they build receive immediate, visceral feedback when those systems are hard to operate. That feedback, if you use it, is the fastest driver of operational improvement that exists.

Designing on-call for learning, not just coverage

A good on-call rotation is not just about covering the hours. It is about making sure the right information flows to the right people.

Engineers who only build should occasionally be on-call. Engineers who are always on-call and never build will eventually burn out and leave — and the system will be no better for their time. The ideal is a rotation where everyone who makes architectural decisions gets to experience the operational consequences of those decisions. Not constantly, but regularly enough that it stays real.

The on-call weekly review

One of the highest-leverage rituals a team can adopt is a weekly review of the previous week's on-call experience. Every alert that fired gets examined: was it actionable? Was the runbook current? Was the fix actually applied, or just the symptom patched? This review takes 30 minutes and generates more real reliability improvement than most sprint cycles.

The product of this review is not a report — it is a small set of concrete tasks: "Update runbook for X. Silence noisy alert for Y and fix root cause instead. Add alert for Z that we caught manually but should be automated." Tasks that go straight into the sprint.
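
To prepare that review, a short script can group last week's alerts and surface the noisiest, least actionable ones first; this is a sketch, and the alert record format is an assumption about what your paging tool exports.

```python
# A sketch of preparing the weekly on-call review: group last week's alerts and
# surface the noisiest, least actionable ones first. The record format is an
# assumption about what your paging tool exports.
from collections import Counter

alerts = [
    {"name": "HighCheckoutErrorRate", "actionable": True},
    {"name": "DiskSpaceWarning",      "actionable": False},
    {"name": "DiskSpaceWarning",      "actionable": False},
    {"name": "DiskSpaceWarning",      "actionable": False},
    {"name": "HighCheckoutErrorRate", "actionable": True},
]

fired = Counter(a["name"] for a in alerts)
noise = Counter(a["name"] for a in alerts if not a["actionable"])

for name, count in fired.most_common():
    print(f"{name}: fired {count}x, {noise.get(name, 0)} not actionable")
# Anything that fires often and is rarely actionable becomes a sprint task:
# fix the underlying cause, retune the threshold, or delete the alert.
```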


Making Culture Stick — The Long Game

Everything in this chapter — blameless post-mortems, game days, error budgets, production readiness reviews — is a practice. Practices do not stick from a single all-hands announcement or a policy document. They stick because of repeated behavior, visible leadership modeling, and structural reinforcement.

What leaders have to do

Reliability culture requires active, consistent leadership behavior. Not just permission, but participation.

Leaders who want a blameless culture have to visibly be blameless themselves — in public, in incident reviews, in private conversations that engineers will hear about anyway. Leaders who want teams to take game days seriously need to attend them, take them seriously, and fund the time to act on what they find. Leaders who want error budgets to change behavior need to respect those budgets in their own prioritization decisions, not override them whenever a product deadline feels important.

Culture is what leaders do, not what leaders say. Engineers are very good at noticing the difference.

The reliability flywheel

When reliability culture is working, it creates a flywheel effect. Better post-mortems produce better fixes. Better fixes reduce the on-call burden. A lighter on-call burden means engineers have more time for reliability work. More reliability work means fewer incidents. Fewer incidents mean lighter post-mortem load. The system improves itself.

Getting the flywheel spinning requires an initial investment — time for post-mortems, time for game days, time for runbook updates, time to automate toil. This investment always competes with feature work. It will always look, in the short term, like it is slowing down delivery. The teams that make this investment and hold the line on it are the ones that, twelve months later, have fewer incidents, faster response times, and engineers who are not burned out.

The teams that do not make this investment are the ones with the heroes. And the heroes are already starting to think about leaving.

Chapter Summary

The Key Principle

Reliability is an organizational property, not a technical one. You can have excellent distributed systems knowledge and still build unreliable systems if the team culture does not support honest post-mortems, shared ownership of operational quality, and deliberate practice of failure response.

The Most Common Mistake

Adopting the language of reliability culture — "blameless," "error budgets," "SRE" — without changing the behaviors that actually make culture. Blamelessness announced in an all-hands and then violated in a 1:1 is worse than no announcement, because engineers now know the stated values and the real values are different.

Three Questions for Your Next Review

  • Who are the heroes on your team, and what specific systemic problems does their heroism mask?
  • What did your last three post-mortems produce, and how many of those action items are actually done?
  • When did your team last deliberately break something on purpose to practice recovering from it?