What's in this chapter
We will look at documentation as an engineering discipline — not a chore you do at the end of a project, but a set of deliberate choices that determine whether your system remains operable years after you wrote it.
- Why most documentation efforts fail, and the four root causes
- Architecture Decision Records (ADRs) — the highest-value documentation practice most teams skip
- The difference between runbooks, playbooks, and post-mortems, and when each one matters
- The documentation that actually gets read: READMEs, API contracts, on-call guides
- Self-documenting systems — how observability, naming, and structure can replace prose
- The decay problem: why wrong documentation is worse than no documentation, and how to fight it
Why Documentation Fails
Most engineering teams know they should document their systems better. They try. They fail. Six months later, the documentation is wrong, no one trusts it, and engineers have stopped reading it. This cycle repeats on almost every team.
The reason is not laziness. The reason is that most teams treat documentation as an output — something you produce after you've built the system, to describe what you built. That framing is wrong, and it guarantees failure.
Documentation fails for four specific reasons. Understanding them tells you exactly what to fix.
No clear audience
A document written "for everyone" is useful to no one. The engineer on-call at 3am has completely different needs from the new hire trying to understand the system on day one, or the product manager trying to understand what the service can and cannot do.
Each of these people needs different information, presented at a different level of detail, with a different structure. A runbook and an architecture overview are both documentation. They have almost nothing else in common.
No owner
If a document has no single owner, it has no owner. "The team owns it" means no one will update it when the system changes, and no one will delete it when it becomes wrong. Someone's name needs to be on every piece of documentation — not as a bureaucratic formality, but because that person is responsible for keeping it correct.
No home
Documentation scattered across Confluence pages, Google Docs, internal wikis, Notion, README files, and Slack messages is documentation that cannot be found. Engineers will not search five systems for the answer to a question. They will ask a colleague instead — which means the knowledge stays in people's heads, which is exactly the problem documentation is supposed to solve.
Written once, never updated
A document written on the day a system ships is the most accurate it will ever be. From that point on, every code change, every infrastructure change, every team change makes it slightly less accurate. Without a process to update it, it decays. In a fast-moving system, it can become dangerously wrong within a year.
The antidote to all four failures is the same: treat documentation as a first-class engineering artifact. Give it an audience, an owner, a home close to the code it describes, and a review process. Write it before launch, not after.
Architecture Decision Records
An Architecture Decision Record — ADR for short — is a short document that captures one architectural decision. Not the whole system design. One decision. Why you faced it, what options you had, what you chose, and why.
ADRs are the single highest-value documentation practice most teams never do. The reason they matter so much is that code captures what you built, but not why. Two years after a decision was made, no one remembers why the system works the way it does. The original authors have moved on, or they remember their conclusion but not the reasoning. Newcomers inherit a system full of constraints and patterns that look arbitrary.
Without ADRs, history repeats. Someone new to the team sees what looks like a bad design decision and proposes to fix it. The team spends three weeks relitigating a decision that was made years ago, eventually rediscovers why the "obvious" fix doesn't work, and ends up back where they started — except now they're three weeks behind and slightly more frustrated.
What goes in an ADR
The format is simple. There are only a few sections that matter.
# ADR-014: Use Kafka for the event pipeline instead of direct DB writes

Status: Accepted (other states: Proposed, Deprecated, Superseded by ADR-031)
Date: 2024-03-12

## Context

The order service currently writes events directly to the database and the downstream services poll for changes. At 50k orders/day this is fine. At our projected 500k orders/day by Q4, polling will create unacceptable load on the primary DB. We need a different approach.

## Options Considered

Option A: DB polling with read replicas
- Pros: No new infrastructure. Simple to reason about.
- Cons: Still couples consumers to DB schema. Doesn't solve fan-out. At 10 consumers, read replica load grows 10x.

Option B: Change Data Capture (CDC) with Debezium
- Pros: Zero-latency events from DB changes. No app-level changes.
- Cons: Couples event schema to DB schema — hard to evolve. Debezium operationally complex; our SRE team has no experience with it.

Option C: Kafka (chosen)
- Pros: Decouples producers from consumers. Consumers can replay. Persistent log means consumers can catch up after downtime. Team has existing Kafka expertise from the analytics pipeline.
- Cons: New infrastructure dependency. More operational overhead than B.

## Decision

Use Kafka. The replay capability and consumer decoupling outweigh the operational cost, especially given existing team expertise.

## Consequences

- Order service must publish events to Kafka on every state change.
- All downstream consumers must migrate from DB polling by Q3.
- We accept eventual consistency between the event log and DB state.
- Ops team needs Kafka monitoring and alerting set up before launch.

## Review Date

Revisit if order volume exceeds 2M/day or if Kafka operational costs become a significant fraction of team time.
Notice a few things about this format. The context section describes the problem as it existed at the time. Future readers need to understand the constraints that were real at that moment — not the current state of the system. The options section shows your thinking, including the options you rejected and why. And the consequences section is honest about the trade-offs you accepted, not just the benefits you gained.
When to write an ADR
Not every decision needs an ADR. The rule of thumb is: write one when the decision is hard to reverse, when reasonable engineers could disagree about the right answer, or when the reasoning is not obvious from the code.
You do not need an ADR to decide which variable naming convention to use, or which HTTP status code to return for a validation error. You do need one to decide whether to build a new service or add functionality to an existing one, to choose a storage engine for a new data model, or to decide how to handle backwards compatibility for a public API.
Store ADRs in the same Git repository as the code they describe, in a folder called docs/decisions/ or adr/. This means they show up in code reviews, they are versioned alongside the code, and they are found by engineers who are already looking at the code. An ADR in Confluence will not be found by someone reading the code at midnight. An ADR in the repo will be.
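Because ADRs follow a fixed format, scaffolding the next one is easy to script. Here is a minimal sketch in Python; the docs/decisions/ folder, the ADR-NNN-title.md filename pattern, and the section list mirror the conventions above but are assumptions, not a standard.

```python
"""Scaffold the next ADR under docs/decisions/ (a sketch, not a standard tool)."""
import re
from datetime import date
from pathlib import Path

SECTIONS = ["Context", "Options Considered", "Decision", "Consequences", "Review Date"]

def next_adr_number(adr_dir: Path) -> int:
    """Highest existing ADR number plus one (1 for an empty folder)."""
    numbers = [int(m.group(1))
               for p in adr_dir.glob("ADR-*.md")
               if (m := re.match(r"ADR-(\d+)", p.name))]
    return max(numbers, default=0) + 1

def scaffold(title: str, adr_dir: Path = Path("docs/decisions")) -> Path:
    """Write a skeleton ADR with Status: Proposed and TODO sections."""
    adr_dir.mkdir(parents=True, exist_ok=True)
    n = next_adr_number(adr_dir)
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    lines = [f"# ADR-{n:03d}: {title}", "",
             "Status: Proposed", f"Date: {date.today().isoformat()}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    path = adr_dir / f"ADR-{n:03d}-{slug}.md"
    path.write_text("\n".join(lines))
    return path
```

In an empty folder, `scaffold("Use Kafka for the event pipeline")` creates `docs/decisions/ADR-001-use-kafka-for-the-event-pipeline.md` with every section stubbed out, so the author only has to fill in the reasoning.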
The "Options Considered" section is your defense shield
The most valuable part of an ADR is the options you rejected and why. This section prevents the same decision from being relitigated every time a new engineer joins the team or a new leader reviews the system.
Without it, when someone asks "why didn't you just use X?" you either have to remember the reasoning from three years ago, or you have to re-investigate from scratch. With a good ADR, the answer is: "Here's the document. Option B was considered. Here's why it was rejected."
This is not bureaucracy. It is institutional memory.
Operational Documentation
Operational documentation is the kind that matters most when things go wrong. It is read under stress, by people who may not be deeply familiar with the system, at times when every minute of delay has a direct cost.
Most teams conflate three things that are actually distinct: runbooks, playbooks, and post-mortems. They serve different purposes and have different audiences.
| Document | When it's read | Audience | Purpose |
|---|---|---|---|
| Runbook | During an incident, alert just fired | On-call engineer, possibly unfamiliar with the service | Step-by-step: what to check, what to do, how to verify it worked |
| Playbook | During incident planning or drills | Incident commander, senior engineer | Broader response plans for entire classes of incidents (e.g., data center failure) |
| Post-mortem | After an incident, during review | Team, leadership, other teams who depend on you | What happened, why, what we're doing so it doesn't happen again |
Writing runbooks that actually work at 3am
A runbook is a script for an engineer who has been woken up, is slightly disoriented, and needs to resolve an alert in the shortest time possible. It is not a tutorial. It is not a chance to explain how the system works. It is a set of steps.
The test for a good runbook is simple: hand it to a competent engineer who has never seen this service before. Can they follow it and resolve the alert? If they need to ask questions, the runbook has failed.
Good runbooks have a specific structure. They start with what the alert means — not technically, but in plain terms. They then give exact commands to diagnose the issue. They tell the engineer what they are looking for in the output. They give the remediation steps in order. And they end with either a verification step ("the alert should clear within 5 minutes") or an explicit escalation path ("if this doesn't resolve it, page the database team").
# Alert: OrderService — HighProcessingLatency

What this means
Order processing p99 latency is above 2000ms. New orders are still being accepted but customers may see slow confirmation times.

Step 1: Check if the issue is DB-related
$ kubectl exec -it orderservice-pod -- psql $DB_URL -c "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';"
If count > 5: jump to Section A (DB lock contention)
If count = 0: continue to Step 2

Step 2: Check the downstream payment service
$ curl -s https://payments-internal/health | jq '.latency_p99'
If latency_p99 > 1000: the payment service is degraded. Page the payments on-call (see #oncall-payments in Slack). This is NOT our issue to fix.
If latency_p99 < 200: continue to Step 3

Step 3: Check for pod memory pressure
$ kubectl top pods -n orders | sort -k3 -rn | head -10
If any pod is above 90% memory: restart it with:
$ kubectl rollout restart deployment/orderservice -n orders
Monitor for 5 minutes. Alert should clear.

Escalation
If none of the above resolves the alert within 15 minutes:
→ Page the Orders team lead (PagerDuty: @orders-lead)
→ Post in #incidents with current findings so far
Notice that this runbook does not explain how the order service works. It does not describe the architecture. It gives exact commands with exact thresholds and exact next steps depending on the output. That is what an engineer needs when they are half-asleep and an alert is firing.
Writing good post-mortems
A post-mortem serves two purposes. The first is immediate: it records what happened so the people involved can learn from it. The second is longer-term: it is a reference document that tells future engineers what kinds of failures this system has experienced and what was done about them.
A post-mortem should be written within 48 hours of an incident. Memory fades quickly. The on-call engineer who handled it should write the first draft. Other team members then add context and review for accuracy.
The structure matters. A good post-mortem has five sections. Timeline: what happened in what order, with exact timestamps. Impact: what was affected, for how long, how many users or requests. Root cause: not "a bug" or "human error" — a specific technical explanation of why the system behaved the way it did. Contributing factors: the conditions that made the bug possible or that made the impact worse. Action items: specific, assigned, time-bounded tasks to prevent recurrence.
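That five-section structure can also be enforced mechanically, for example as a CI check on the post-mortem repository. A sketch, assuming post-mortems are markdown files with `## ` section headings; the helper name is ours:

```python
"""CI-style check that a post-mortem contains all five required sections.
Assumes markdown files with `## ` headings; section names follow the text."""

REQUIRED_SECTIONS = ["Timeline", "Impact", "Root cause",
                     "Contributing factors", "Action items"]

def missing_sections(text: str) -> list[str]:
    """Return the required sections that have no matching `## ` heading."""
    headings = {line.removeprefix("## ").strip()
                for line in text.splitlines()
                if line.startswith("## ")}
    return [s for s in REQUIRED_SECTIONS if s not in headings]
```

A pre-merge hook that fails when `missing_sections` is non-empty cannot judge the quality of the writing, but it guarantees no post-mortem ships without, say, an Action items section.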
"Human error" is almost never a useful root cause. If an engineer ran the wrong command and caused an incident, the real question is: why was it possible to run that command in production without confirmation? Why did the runbook not prevent it? Why did the system not have a safeguard? Human error is a symptom. The root cause is always a missing safeguard, a missing process, or a missing check.
The post-mortem should also be blameless. This means the goal is not to identify who made a mistake. It is to understand what conditions made the mistake possible, and to change those conditions. Blameless post-mortems encourage engineers to report incidents fully and honestly. A culture of blame encourages hiding incidents, partial reporting, and a team that is afraid to touch sensitive parts of the system.
Documentation That Actually Gets Read
There is a category of documentation that engineers actually read, as opposed to documentation that gets written, filed, and ignored. The difference is mostly about timing and format.
The README: your system's front door
The README is the first thing a new engineer sees. It is also the document most likely to get read, because it lives in the repository and comes up naturally during onboarding. This makes it the most important documentation you have.
A good README passes the five-minute test: a competent engineer who has never seen this service before can read it and, within five minutes, understand what the service does, how to run it locally, and where to find more detailed information. It is not comprehensive — it is a starting point.
The sections that matter in a README are: what this service does (one paragraph), who depends on it and what it depends on, how to run it locally in three commands or fewer, where the architecture documentation lives, and how to reach the team. Everything else is optional.
Do not put operational procedures in the README. Do not put architecture deep-dives in the README. A README that tries to cover everything ends up being so long that new engineers skim it and miss the important parts. Keep it short. Link out to other documents for everything else.
API contracts as documentation
An API contract is the most important documentation a service can have for its consumers. It describes exactly what the service accepts and what it returns — the field names, the types, the error codes, the semantics.
The key insight about API documentation is that it is a commitment, not just a description. When you document an API, you are telling every caller what they can rely on. If you change it without updating the documentation, you are breaking that commitment — and potentially breaking their code.
The best API documentation is generated from the code. OpenAPI (Swagger) for REST APIs, proto files for gRPC, and GraphQL schemas are all forms of documentation that live in the code and are always in sync with it. When the code changes, the documentation changes with it. This is the only format of API documentation you can actually trust.
What generated documentation cannot capture is semantics. It can tell you that a field is a string. It cannot tell you that the string must be a valid ISO 4217 currency code, or that it defaults to "USD" when absent, or that it was deprecated in v2 and will be removed in v3. That context must be written by hand and kept up to date.
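One way to close that gap is to keep the hand-written semantics next to the type definition, so a docs generator can emit both together. A sketch using stdlib dataclass metadata; the `Charge` model and the `field_docs` helper are illustrative, not a real library API:

```python
"""Keep field semantics next to the type so generated docs can carry them.
The Charge model and field_docs helper are illustrative sketches."""
from dataclasses import dataclass, field, fields

@dataclass
class Charge:
    amount_cents: int = field(
        metadata={"doc": "Amount in minor units; must be > 0."})
    currency: str = field(
        default="USD",
        metadata={"doc": "ISO 4217 code. Defaults to USD when absent. "
                         "Deprecated in v2; will be removed in v3."})

def field_docs(model) -> dict[str, str]:
    """Collect the hand-written semantics for every field of a model."""
    return {f.name: f.metadata.get("doc", "") for f in fields(model)}
```

Here `field_docs(Charge)["currency"]` returns the ISO 4217 note, which a schema generator could print next to the bare `string` type. Because the note lives in the same diff as the field, a code review that changes the default is far more likely to update it.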
The on-call guide
An on-call guide is a companion document to the runbooks. Where a runbook covers a specific alert, the on-call guide gives an engineer the background they need to be effective on-call for an unfamiliar service. It is written for the engineer who is about to start their first on-call shift and has read the runbooks but still feels uncertain.
A good on-call guide covers: the service's critical paths (which requests matter most), the dependencies that are most likely to cause problems, the dashboards to look at first when something seems wrong, the known flaky behaviors that look alarming but are normal, and the escalation path for things that cannot be resolved quickly.
It is not a technical deep-dive. It is a confidence-builder that tells an engineer "you are ready to handle this, and here is where to start."
Self-Documenting Systems
The best documentation is the kind you do not need to write because the system explains itself. This sounds aspirational, but it is a concrete engineering goal with concrete techniques.
Observability as documentation
A well-instrumented system answers questions without requiring anyone to explain it. When an engineer looks at a dashboard and immediately understands what is happening, the metric names and dashboard layout are doing documentation work. When a log message includes the full context of what happened and why, it removes the need for a human to reconstruct that context later.
Consider two log messages for the same event:
ERROR: payment failed for user 44821
ERROR payment_processor.charge_failed {
"user_id": 44821,
"order_id": "ord_9f3k2",
"amount_cents": 4999,
"currency": "USD",
"payment_provider": "stripe",
"error_code": "card_declined",
"decline_reason": "insufficient_funds",
"attempt_number": 1,
"will_retry": true,
"next_retry_at": "2024-03-12T14:35:00Z",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
The second log message answers every question an engineer would have: what failed, for whom, how much, through which provider, why, and what happens next. No one needs to read documentation to understand this log entry. The log entry is the documentation.
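Producing the structured form takes very little code. A minimal sketch on top of the stdlib logger; the `log_event` helper and the `event {json}` line format are our own assumptions, not a standard API:

```python
"""Emit structured, context-rich log lines in the style of the example above.
A sketch: the helper name and line format are assumptions, not a standard."""
import json
import logging

logger = logging.getLogger("payment_processor")

def log_event(level: int, event: str, **context) -> str:
    """Log `event {json-context}` so every line carries its full context."""
    line = f"{event} {json.dumps(context, sort_keys=True)}"
    logger.log(level, line)
    return line

# Mirroring the charge_failed example:
# log_event(logging.ERROR, "payment_processor.charge_failed",
#           user_id=44821, error_code="card_declined", will_retry=True)
```

Because the context is machine-parseable JSON, the same lines feed both a human reading the log and a query like "all `charge_failed` events where `will_retry` is false".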
The same principle applies to metrics. A metric named http_requests_total is less useful than order_processing_requests_total. A metric with labels {status="success", payment_provider="stripe", order_type="subscription"} tells you exactly what it measures without any additional explanation.
Naming as documentation
Names are the most persistent documentation in a codebase. A function called process() requires documentation to explain what it does. A function called charge_customer_and_emit_order_event() mostly explains itself.
This is not an argument for long names everywhere. It is an argument for names that carry their meaning. The right question when naming anything — a function, a metric, a service, a Kafka topic, a database table — is: "If someone unfamiliar with this system sees only this name, what will they conclude?"
Consistent naming conventions act as documentation at the system level. If every Kafka topic is named {domain}.{entity}.{event}, then a topic named orders.payment.charged is immediately understood by anyone who knows the convention. No documentation needed.
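A convention like this is also cheap to enforce in CI. A sketch of a validator for the `{domain}.{entity}.{event}` pattern; the allowed character set per segment (lowercase letters, digits, underscores) is an assumption:

```python
"""Validate Kafka topic names against the {domain}.{entity}.{event}
convention. The per-segment character set is an assumption."""
import re

TOPIC_PATTERN = re.compile(r"^[a-z0-9_]+\.[a-z0-9_]+\.[a-z0-9_]+$")

def follows_convention(topic: str) -> bool:
    """True only for exactly three dot-separated lowercase segments."""
    return bool(TOPIC_PATTERN.match(topic))
```

Run against every topic at deploy time, a check like this turns the naming convention from a document people may not have read into a rule the system enforces.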
Structure as documentation
How a repository is organized communicates assumptions about the system. A repo with a flat structure of 200 files communicates that the system has no meaningful subdivisions. A repo organized by domain, with clear boundaries between modules, communicates ownership and separation of concerns.
When new engineers look at a well-structured codebase, they can navigate it without asking for help. That navigation ability is a form of documentation. It tells them where things belong, what is allowed to depend on what, and where to put new code.
The Decay Problem
Documentation decays. The moment you write it, it begins to drift from reality. A system that changes frequently can render a document misleading within months. This is not a failure of the people who wrote the documentation. It is the natural behavior of any artifact that is not automatically kept in sync with what it describes.
The decay problem is serious because stale documentation is not just useless — it actively misleads. An engineer who reads an outdated runbook and follows it step by step may execute commands against the wrong service, miss a critical step that was added after an incident, or fail to notice a dependency that has changed. Outdated API documentation causes integration bugs. Outdated architecture diagrams cause new engineers to build components in the wrong place.
Wrong documentation is worse than no documentation. At least when there is no documentation, engineers know to ask questions. When there is documentation that looks authoritative and is wrong, engineers follow it without questioning — straight into a problem.
Five strategies against decay
1. Documentation close to code
Documentation stored in the same repository as the code it describes gets updated alongside the code — or at least, it shows up in the diff when the code changes, reminding the author to update it. Documentation stored in a separate wiki gets updated only when someone remembers to update it, which is not often enough.
2. Docs-as-code
Treat documentation with the same rigor as code. Documentation changes go through code review. A pull request that changes behavior without updating the relevant documentation is incomplete. This sounds strict, but it is the only reliable way to keep operational documentation correct.
3. Explicit owners and review dates
Every document should have a named owner and a date at which it will be reviewed for accuracy. A runbook that has not been reviewed in 18 months should be treated as potentially outdated. Putting a quarterly review of critical operational documentation on the calendar catches most decay before it becomes dangerous.
4. Delete aggressively
The second-best response to outdated documentation is to delete it. An empty page is less dangerous than a wrong page. If you do not have the bandwidth to update a document, delete it and leave a note pointing to the person who knows the current state. This is uncomfortable but correct.
5. Generate what you can
Any documentation that can be automatically generated from the code should be. API schemas, dependency graphs, database schemas, configuration references — all of these can be generated. Generated documentation is always up to date by definition. This frees the team to focus manual documentation effort on the things that cannot be automated: reasoning, context, trade-offs, and history.
The documentation review in a production readiness checklist
The most effective way to fight decay at the system level is to make documentation a launch gate, not an afterthought. Before a new service goes to production, the following should exist and be reviewed:
- A README that passes the five-minute test
- Runbooks for every alert that the monitoring system can fire
- An on-call guide
- ADRs for every major architectural decision made during the build
- API documentation for every public interface
- A named owner for each of these documents
This is not a large amount of documentation for a new service. It is the minimum that makes the service operable by people who did not build it. Anything less is pushing a maintenance burden onto the future.
The cost of not documenting a system is not borne by the people who built it. It is borne by every engineer who works with the system after them. It shows up as slower onboarding, more incidents, more cautious and slower changes ("I don't understand this part well enough to touch it"), and more hours spent answering the same questions repeatedly. This cost compounds over years. A system maintained by ten engineers over five years will spend far more engineer-hours on undocumented confusion than it would have taken to write the documentation in the first place.
Chapter Summary
Key Principles
- Documentation is a system property, not a post-launch chore
- Code captures what. ADRs capture why
- Wrong documentation is worse than none
- The best docs are generated or self-evident
Most Common Mistakes
- Writing comprehensive docs with no owner, which decay into liabilities
- Runbooks that require expertise to follow
- "Human error" as a root cause in post-mortems
- Documentation that lives far from the code it describes
Three Questions for Your Next Design Review
- Who is the named owner of each document we ship with this service?
- Can an engineer who has never seen this service follow our runbooks at 3am?
- Have we written an ADR for every decision that a new engineer would find surprising?