What's in this chapter
We will look at documentation as an engineering discipline — not a chore you do at the end of a project, but a set of deliberate choices that determine whether your system remains operable years after you wrote it.
- Why most documentation efforts fail, and the four root causes
- Architecture Decision Records (ADRs) — the highest-value documentation practice most teams skip
- The difference between runbooks, playbooks, and post-mortems, and when each one matters
- The documentation that actually gets read: READMEs, API contracts, on-call guides
- Self-documenting systems — how observability, naming, and structure can replace prose
- The decay problem: why wrong documentation is worse than no documentation, and how to fight it
Why Documentation Fails
Most engineering teams know they should document their systems better. They try. They fail. Six months later, the documentation is wrong, no one trusts it, and engineers have stopped reading it. This cycle repeats on almost every team.
The reason is not laziness. The reason is that most teams treat documentation as an output — something you produce after you've built the system, to describe what you built. That framing is wrong, and it guarantees failure.
Documentation fails for four specific reasons. Understanding them tells you exactly what to fix.
No clear audience
A document written "for everyone" is useful to no one. The engineer on-call at 3am has completely different needs from the new hire trying to understand the system on day one, or the product manager trying to understand what the service can and cannot do.
Each of these people needs different information, presented at a different level of detail, with a different structure. A runbook and an architecture overview are both documentation. They have almost nothing else in common.
No owner
If a document has no single owner, it has no owner. "The team owns it" means no one will update it when the system changes, and no one will delete it when it becomes wrong. Someone's name needs to be on every piece of documentation — not as a bureaucratic formality, but because that person is responsible for keeping it correct.
No home
Documentation scattered across Confluence pages, Google Docs, internal wikis, Notion, README files, and Slack messages is documentation that cannot be found. Engineers will not search five systems for the answer to a question. They will ask a colleague instead — which means the knowledge stays in people's heads, which is exactly the problem documentation is supposed to solve.
Written once, never updated
A document written on the day a system ships is the most accurate it will ever be. From that point on, every code change, every infrastructure change, every team change makes it slightly less accurate. Without a process to update it, it decays. In a fast-moving system, it can become dangerously wrong within a year.
The antidote to all four failures is the same: treat documentation as a first-class engineering artifact. Give it an audience, an owner, a home close to the code it describes, and a review process. Write it before launch, not after.
Architecture Decision Records
An Architecture Decision Record — ADR for short — is a short document that captures one architectural decision. Not the whole system design. One decision. Why you faced it, what options you had, what you chose, and why.
ADRs are the single highest-value documentation practice most teams never do. The reason they matter so much is that code captures what you built, but not why. Two years after a decision was made, no one remembers why the system works the way it does. The original authors have moved on, or they remember their conclusion but not the reasoning. Newcomers inherit a system full of constraints and patterns that look arbitrary.
Without ADRs, history repeats. Someone new to the team sees what looks like a bad design decision and proposes to fix it. The team spends three weeks relitigating a decision that was made years ago, eventually rediscovers why the "obvious" fix doesn't work, and ends up back where they started — except now they're three weeks behind and slightly more frustrated.
What goes in an ADR
The format is simple. There are only a few sections that matter.
# ADR-014: Use Kafka for the event pipeline instead of direct DB writes

Status: Accepted (other states: Proposed, Deprecated, Superseded by ADR-031)
Date: 2024-03-12

## Context

The order service currently writes events directly to the database and the downstream services poll for changes. At 50k orders/day this is fine. At our projected 500k orders/day by Q4, polling will create unacceptable load on the primary DB. We need a different approach.

## Options Considered

Option A: DB polling with read replicas
- Pros: No new infrastructure. Simple to reason about.
- Cons: Still couples consumers to DB schema. Doesn't solve fan-out. At 10 consumers, read replica load grows 10x.

Option B: Change Data Capture (CDC) with Debezium
- Pros: Zero-latency events from DB changes. No app-level changes.
- Cons: Couples event schema to DB schema — hard to evolve. Debezium operationally complex; our SRE team has no experience with it.

Option C: Kafka (chosen)
- Pros: Decouples producers from consumers. Consumers can replay. Persistent log means consumers can catch up after downtime. Team has existing Kafka expertise from the analytics pipeline.
- Cons: New infrastructure dependency. More operational overhead than B.

## Decision

Use Kafka. The replay capability and consumer decoupling outweigh the operational cost, especially given existing team expertise.

## Consequences

- Order service must publish events to Kafka on every state change.
- All downstream consumers must migrate from DB polling by Q3.
- We accept eventual consistency between the event log and DB state.
- Ops team needs Kafka monitoring and alerting set up before launch.

## Review Date

Revisit if order volume exceeds 2M/day or if Kafka operational costs become a significant fraction of team time.
Notice a few things about this format. The context section describes the problem as it existed at the time. Future readers need to understand the constraints that were real at that moment — not the current state of the system. The options section shows your thinking, including the options you rejected and why. And the consequences section is honest about the trade-offs you accepted, not just the benefits you gained.
When to write an ADR
Not every decision needs an ADR. The rule of thumb is: write one when the decision is hard to reverse, when reasonable engineers could disagree about the right answer, or when the reasoning is not obvious from the code.
You do not need an ADR to decide which variable naming convention to use, or which HTTP status code to return for a validation error. You do need one to decide whether to build a new service or add functionality to an existing one, to choose a storage engine for a new data model, or to decide how to handle backwards compatibility for a public API.
Store ADRs in the same Git repository as the code they describe, in a folder called docs/decisions/ or adr/. This means they show up in code reviews, they are versioned alongside the code, and they are found by engineers who are already looking at the code. An ADR in Confluence will not be found by someone reading the code at midnight. An ADR in the repo will be.
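Because ADRs follow a fixed format, scaffolding the next one is easy to script. Here is a minimal sketch in Python; the docs/decisions/ folder, the ADR-NNN-title.md filename pattern, and the section list mirror the conventions above but are assumptions, not a standard.

```python
"""Scaffold the next ADR under docs/decisions/ (a sketch, not a standard tool)."""
import re
from datetime import date
from pathlib import Path

SECTIONS = ["Context", "Options Considered", "Decision", "Consequences", "Review Date"]

def next_adr_number(adr_dir: Path) -> int:
    """Highest existing ADR number plus one (1 for an empty folder)."""
    numbers = [int(m.group(1))
               for p in adr_dir.glob("ADR-*.md")
               if (m := re.match(r"ADR-(\d+)", p.name))]
    return max(numbers, default=0) + 1

def scaffold(title: str, adr_dir: Path = Path("docs/decisions")) -> Path:
    """Write a skeleton ADR with Status: Proposed and TODO sections."""
    adr_dir.mkdir(parents=True, exist_ok=True)
    n = next_adr_number(adr_dir)
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    lines = [f"# ADR-{n:03d}: {title}", "",
             "Status: Proposed", f"Date: {date.today().isoformat()}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    path = adr_dir / f"ADR-{n:03d}-{slug}.md"
    path.write_text("\n".join(lines))
    return path
```

In an empty folder, `scaffold("Use Kafka for the event pipeline")` creates `docs/decisions/ADR-001-use-kafka-for-the-event-pipeline.md` with every section stubbed out, so the author only has to fill in the reasoning.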
The "Options Considered" section is your defense shield
The most valuable part of an ADR is the options you rejected and why. This section prevents the same decision from being relitigated every time a new engineer joins the team or a new leader reviews the system.
Without it, when someone asks "why didn't you just use X?" you either have to remember the reasoning from three years ago, or you have to re-investigate from scratch. With a good ADR, the answer is: "Here's the document. Option B was considered. Here's why it was rejected."
This is not bureaucracy. It is institutional memory.
Operational Documentation
Operational documentation is the kind that matters most when things go wrong. It is read under stress, by people who may not be deeply familiar with the system, at times when every minute of delay has a direct cost.
Most teams conflate three things that are actually distinct: runbooks, playbooks, and post-mortems. They serve different purposes and have different audiences.
| Document | When it's read | Audience | Purpose |
|---|---|---|---|
| Runbook | During an incident, alert just fired | On-call engineer, possibly unfamiliar with the service | Step-by-step: what to check, what to do, how to verify it worked |
| Playbook | During incident planning or drills | Incident commander, senior engineer | Broader response plans for entire classes of incidents (e.g., data center failure) |
| Post-mortem | After an incident, during review | Team, leadership, other teams who depend on you | What happened, why, what we're doing so it doesn't happen again |
Writing runbooks that actually work at 3am
A runbook is a script for an engineer who has been woken up, is slightly disoriented, and needs to resolve an alert in the shortest time possible. It is not a tutorial. It is not a chance to explain how the system works. It is a set of steps.
The test for a good runbook is simple: hand it to a competent engineer who has never seen this service before. Can they follow it and resolve the alert? If they need to ask questions, the runbook has failed.
Good runbooks have a specific structure. They start with what the alert means — not technically, but in plain terms. They then give exact commands to diagnose the issue. They tell the engineer what they are looking for in the output. They give the remediation steps in order. And they end with either a verification step ("the alert should clear within 5 minutes") or an explicit escalation path ("if this doesn't resolve it, page the database team").
# Alert: OrderService — HighProcessingLatency

What this means
Order processing p99 latency is above 2000ms. New orders are still being accepted but customers may see slow confirmation times.

Step 1: Check if the issue is DB-related
$ kubectl exec -it orderservice-pod -- psql $DB_URL -c "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';"
If count > 5: jump to Section A (DB lock contention)
If count = 0: continue to Step 2

Step 2: Check the downstream payment service
$ curl -s https://payments-internal/health | jq '.latency_p99'
If latency_p99 > 1000: the payment service is degraded. Page the payments on-call (see #oncall-payments in Slack). This is NOT our issue to fix.
If latency_p99 < 200: continue to Step 3

Step 3: Check for pod memory pressure
$ kubectl top pods -n orders | sort -k3 -rn | head -10
If any pod is above 90% memory: restart it with:
$ kubectl rollout restart deployment/orderservice -n orders
Monitor for 5 minutes. Alert should clear.

Escalation
If none of the above resolves the alert within 15 minutes:
→ Page the Orders team lead (PagerDuty: @orders-lead)
→ Post in #incidents with current findings so far
Notice that this runbook does not explain how the order service works. It does not describe the architecture. It gives exact commands with exact thresholds and exact next steps depending on the output. That is what an engineer needs when they are half-asleep and an alert is firing.
Writing good post-mortems
A post-mortem serves two purposes. The first is immediate: it records what happened so the people involved can learn from it. The second is longer-term: it is a reference document that tells future engineers what kinds of failures this system has experienced and what was done about them.
A post-mortem should be written within 48 hours of an incident. Memory fades quickly. The on-call engineer who handled it should write the first draft. Other team members then add context and review for accuracy.
The structure matters. A good post-mortem has five sections. Timeline: what happened in what order, with exact timestamps. Impact: what was affected, for how long, how many users or requests. Root cause: not "a bug" or "human error" — a specific technical explanation of why the system behaved the way it did. Contributing factors: the conditions that made the bug possible or that made the impact worse. Action items: specific, assigned, time-bounded tasks to prevent recurrence.
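That five-section structure can also be enforced mechanically, for example as a CI check on the post-mortem repository. A sketch, assuming post-mortems are markdown files with `## ` section headings; the helper name is ours:

```python
"""CI-style check that a post-mortem contains all five required sections.
Assumes markdown files with `## ` headings; section names follow the text."""

REQUIRED_SECTIONS = ["Timeline", "Impact", "Root cause",
                     "Contributing factors", "Action items"]

def missing_sections(text: str) -> list[str]:
    """Return the required sections that have no matching `## ` heading."""
    headings = {line.removeprefix("## ").strip()
                for line in text.splitlines()
                if line.startswith("## ")}
    return [s for s in REQUIRED_SECTIONS if s not in headings]
```

A pre-merge hook that fails when `missing_sections` is non-empty cannot judge the quality of the writing, but it guarantees no post-mortem ships without, say, an Action items section.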
"Human error" is almost never a useful root cause. If an engineer ran the wrong command and caused an incident, the real question is: why was it possible to run that command in production without confirmation? Why did the runbook not prevent it? Why did the system not have a safeguard? Human error is a symptom. The root cause is always a missing safeguard, a missing process, or a missing check.
The post-mortem should also be blameless. This means the goal is not to identify who made a mistake. It is to understand what conditions made the mistake possible, and to change those conditions. Blameless post-mortems encourage engineers to report incidents fully and honestly. A culture of blame encourages hiding incidents, partial reporting, and a team that is afraid to touch sensitive parts of the system.
Documentation That Actually Gets Read
There is a category of documentation that engineers actually read, as opposed to documentation that gets written, filed, and ignored. The difference is mostly about timing and format.
The README: your system's front door
The README is the first thing a new engineer sees. It is also the document most likely to get read, because it lives in the repository and comes up naturally during onboarding. This makes it the most important documentation you have.
A good README passes the five-minute test: a competent engineer who has never seen this service before can read it and, within five minutes, understand what the service does, how to run it locally, and where to find more detailed information. It is not comprehensive — it is a starting point.
The sections that matter in a README are: what this service does (one paragraph), who depends on it and what it depends on, how to run it locally in three commands or fewer, where the architecture documentation lives, and how to reach the team. Everything else is optional.
Do not put operational procedures in the README. Do not put architecture deep-dives in the README. A README that tries to cover everything ends up being so long that new engineers skim it and miss the important parts. Keep it short. Link out to other documents for everything else.
API contracts as documentation
An API contract is the most important documentation a service can have for its consumers. It describes exactly what the service accepts and what it returns — the field names, the types, the error codes, the semantics.
The key insight about API documentation is that it is a commitment, not just a description. When you document an API, you are telling every caller what they can rely on. If you change it without updating the documentation, you are breaking that commitment — and potentially breaking their code.
The best API documentation is generated from the code. OpenAPI (Swagger) for REST APIs, proto files for gRPC, and GraphQL schemas are all forms of documentation that live in the code and are always in sync with it. When the code changes, the documentation changes with it. This is the only format of API documentation you can actually trust.
What generated documentation cannot capture is semantics. It can tell you that a field is a string. It cannot tell you that the string must be a valid ISO 4217 currency code, or that it defaults to "USD" when absent, or that it was deprecated in v2 and will be removed in v3. That context must be written by hand and kept up to date.
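One way to close that gap is to keep the hand-written semantics next to the type definition, so a docs generator can emit both together. A sketch using stdlib dataclass metadata; the `Charge` model and the `field_docs` helper are illustrative, not a real library API:

```python
"""Keep field semantics next to the type so generated docs can carry them.
The Charge model and field_docs helper are illustrative sketches."""
from dataclasses import dataclass, field, fields

@dataclass
class Charge:
    amount_cents: int = field(
        metadata={"doc": "Amount in minor units; must be > 0."})
    currency: str = field(
        default="USD",
        metadata={"doc": "ISO 4217 code. Defaults to USD when absent. "
                         "Deprecated in v2; will be removed in v3."})

def field_docs(model) -> dict[str, str]:
    """Collect the hand-written semantics for every field of a model."""
    return {f.name: f.metadata.get("doc", "") for f in fields(model)}
```

Here `field_docs(Charge)["currency"]` returns the ISO 4217 note, which a schema generator could print next to the bare `string` type. Because the note lives in the same diff as the field, a code review that changes the default is far more likely to update it.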
The on-call guide
An on-call guide is a companion document to the runbooks. Where a runbook covers a specific alert, the on-call guide gives an engineer the background they need to be effective on-call for an unfamiliar service. It is written for the engineer who is about to start their first on-call shift and has read the runbooks but still feels uncertain.
A good on-call guide covers: the service's critical paths (which requests matter most), the dependencies that are most likely to cause problems, the dashboards to look at first when something seems wrong, the known flaky behaviors that look alarming but are normal, and the escalation path for things that cannot be resolved quickly.
It is not a technical deep-dive. It is a confidence-builder that tells an engineer "you are ready to handle this, and here is where to start."
Self-Documenting Systems
The best documentation is the kind you do not need to write because the system explains itself. This sounds aspirational, but it is a concrete engineering goal with concrete techniques.
Observability as documentation
A well-instrumented system answers questions without requiring anyone to explain it. When an engineer looks at a dashboard and immediately understands what is happening, the metric names and dashboard layout are doing documentation work. When a log message includes the full context of what happened and why, it removes the need for a human to reconstruct that context later.
Consider two log messages for the same event:
ERROR: payment failed for user 44821
ERROR payment_processor.charge_failed {
"user_id": 44821,
"order_id": "ord_9f3k2",
"amount_cents": 4999,
"currency": "USD",
"payment_provider": "stripe",
"error_code": "card_declined",
"decline_reason": "insufficient_funds",
"attempt_number": 1,
"will_retry": true,
"next_retry_at": "2024-03-12T14:35:00Z",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
The second log message answers every question an engineer would have: what failed, for whom, how much, through which provider, why, and what happens next. No one needs to read documentation to understand this log entry. The log entry is the documentation.
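Producing the structured form takes very little code. A minimal sketch on top of the stdlib logger; the `log_event` helper and the `event {json}` line format are our own assumptions, not a standard API:

```python
"""Emit structured, context-rich log lines in the style of the example above.
A sketch: the helper name and line format are assumptions, not a standard."""
import json
import logging

logger = logging.getLogger("payment_processor")

def log_event(level: int, event: str, **context) -> str:
    """Log `event {json-context}` so every line carries its full context."""
    line = f"{event} {json.dumps(context, sort_keys=True)}"
    logger.log(level, line)
    return line

# Mirroring the charge_failed example:
# log_event(logging.ERROR, "payment_processor.charge_failed",
#           user_id=44821, error_code="card_declined", will_retry=True)
```

Because the context is machine-parseable JSON, the same lines feed both a human reading the log and a query like "all `charge_failed` events where `will_retry` is false".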
The same principle applies to metrics. A metric named http_requests_total is less useful than order_processing_requests_total. A metric with labels {status="success", payment_provider="stripe", order_type="subscription"} tells you exactly what it measures without any additional explanation.
Naming as documentation
Names are the most persistent documentation in a codebase. A function called process() requires documentation to explain what it does. A function called charge_customer_and_emit_order_event() mostly explains itself.
This is not an argument for long names everywhere. It is an argument for names that carry their meaning. The right question when naming anything — a function, a metric, a service, a Kafka topic, a database table — is: "If someone unfamiliar with this system sees only this name, what will they conclude?"
Consistent naming conventions act as documentation at the system level. If every Kafka topic is named {domain}.{entity}.{event}, then a topic named orders.payment.charged is immediately understood by anyone who knows the convention. No documentation needed.
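A convention like this is also cheap to enforce in CI. A sketch of a validator for the `{domain}.{entity}.{event}` pattern; the allowed character set per segment (lowercase letters, digits, underscores) is an assumption:

```python
"""Validate Kafka topic names against the {domain}.{entity}.{event}
convention. The per-segment character set is an assumption."""
import re

TOPIC_PATTERN = re.compile(r"^[a-z0-9_]+\.[a-z0-9_]+\.[a-z0-9_]+$")

def follows_convention(topic: str) -> bool:
    """True only for exactly three dot-separated lowercase segments."""
    return bool(TOPIC_PATTERN.match(topic))
```

Run against every topic at deploy time, a check like this turns the naming convention from a document people may not have read into a rule the system enforces.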
Structure as documentation
How a repository is organized communicates assumptions about the system. A repo with a flat structure of 200 files communicates that the system has no meaningful subdivisions. A repo organized by domain, with clear boundaries between modules, communicates ownership and separation of concerns.
When new engineers look at a well-structured codebase, they can navigate it without asking for help. That navigation ability is a form of documentation. It tells them where things belong, what is allowed to depend on what, and where to put new code.
The Decay Problem
Documentation decays. The moment you write it, it begins to drift from reality. A system that changes frequently can render a document misleading within months. This is not a failure of the people who wrote the documentation. It is the natural behavior of any artifact that is not automatically kept in sync with what it describes.
The decay problem is serious because stale documentation is not just useless — it actively misleads. An engineer who reads an outdated runbook and follows it step by step may execute commands against the wrong service, miss a critical step that was added after an incident, or fail to notice a dependency that has changed. Outdated API documentation causes integration bugs. Outdated architecture diagrams cause new engineers to build components in the wrong place.
Wrong documentation is worse than no documentation. At least when there is no documentation, engineers know to ask questions. When there is documentation that looks authoritative and is wrong, engineers follow it without questioning — straight into a problem.
Five strategies against decay
1. Documentation close to code
Documentation stored in the same repository as the code it describes gets updated alongside the code — or at least, it shows up in the diff when the code changes, reminding the author to update it. Documentation stored in a separate wiki gets updated only when someone remembers to update it, which is not often enough.
2. Docs-as-code
Treat documentation with the same rigor as code. Documentation changes go through code review. A pull request that changes behavior without updating the relevant documentation is incomplete. This sounds strict, but it is the only reliable way to keep operational documentation correct.
3. Explicit owners and review dates
Every document should have a named owner and a date at which it will be reviewed for accuracy. A runbook that has not been reviewed in 18 months should be treated as potentially outdated. Putting a quarterly review of critical operational documentation on the calendar catches most decay before it becomes dangerous.
4. Delete aggressively
The second-best response to outdated documentation is to delete it. An empty page is less dangerous than a wrong page. If you do not have the bandwidth to update a document, delete it and leave a note pointing to the person who knows the current state. This is uncomfortable but correct.
5. Generate what you can
Any documentation that can be automatically generated from the code should be. API schemas, dependency graphs, database schemas, configuration references — all of these can be generated. Generated documentation is always up to date by definition. This frees the team to focus manual documentation effort on the things that cannot be automated: reasoning, context, trade-offs, and history.
The documentation review in a production readiness checklist
The most effective way to fight decay at the system level is to make documentation a launch gate, not an afterthought. Before a new service goes to production, the following should exist and be reviewed:
- A README that passes the five-minute test
- Runbooks for every alert that the monitoring system can fire
- An on-call guide
- ADRs for every major architectural decision made during the build
- API documentation for every public interface
- A named owner for each of these documents
This is not a large amount of documentation for a new service. It is the minimum that makes the service operable by people who did not build it. Anything less is pushing a maintenance burden onto the future.
The cost of not documenting a system is not borne by the people who built it. It is borne by every engineer who works with the system after them. It shows up as slower onboarding, more incidents, more cautious and slower changes ("I don't understand this part well enough to touch it"), and more hours spent answering the same questions repeatedly. This cost compounds over years. A system maintained by ten engineers over five years will spend far more engineer-hours on undocumented confusion than it would have taken to write the documentation in the first place.
Chapter Summary
Key Principles
- Documentation is a system property, not a post-launch chore
- Code captures what. ADRs capture why
- Wrong documentation is worse than none
- The best docs are generated or self-evident
Most Common Mistakes
- Writing comprehensive docs with no owner, which decay into liabilities
- Runbooks that require expertise to follow
- "Human error" as a root cause in post-mortems
- Documentation that lives far from the code it describes
Three Questions for Your Next Design Review
- Who is the named owner of each document we ship with this service?
- Can an engineer who has never seen this service follow our runbooks at 3am?
- Have we written an ADR for every decision that a new engineer would find surprising?