Chapter 37 · Part X: The Human System

The Decision-Making Framework for System Design

Every distributed system is a long chain of decisions. Some were made in a meeting room. Some were made in a Slack message at 11pm. Some were never made at all — they just happened. This chapter is about making those decisions deliberately, at the right speed, with the right people, and in a way you can defend six months later.

What's in this chapter

Why reversibility is the most important property of a decision
How to run an RFC process that produces real alignment
When to prototype instead of design — and when not to
The cost-of-information framework for choosing how much to invest upfront
How to resolve technical disagreements without damaging the team
What escalation is for, and when it becomes a crutch
The decision log — the artifact that saves projects
How good decision-making compounds over time

Key Learnings — Read This First

KL 1 Reversibility is the most important dimension of a decision — not urgency, not impact. How hard is it to undo? That determines how much time to spend before deciding.

KL 2 Slow down for one-way doors, speed up for two-way doors. Most decisions are two-way. Engineers dramatically over-deliberate on things they can change tomorrow.

KL 3 The RFC process is alignment infrastructure, not bureaucracy. A well-run RFC surfaces disagreements when they're cheap — on paper — instead of when they're expensive — in production.

KL 4 Prototype when the question is "can we?" — design when the question is "should we?" These require different tools. Mixing them up wastes weeks.

KL 5 Technical disagreements rarely survive being written down. Ask both sides to write a one-page case for their position. Most disagreements dissolve — they were actually about different assumptions, not the decision itself.

KL 6 Escalation should move a blocked decision, not avoid making one. If you're escalating to get someone else to own the risk, that's not escalation — it's avoidance.

KL 7 Write every non-obvious decision in a decision log. Not for the current team — for the engineer who joins in 18 months and wonders "why on earth did they do it this way?"

KL 8 "Disagree and commit" only works if the disagreement was stated out loud first. Quiet resentment is not commitment. It's a time bomb.

The Problem With How We Make Decisions

Imagine you're designing a new data pipeline. Three engineers have three different opinions on whether to use a message queue or a direct RPC call between services. You talk about it in a meeting for forty minutes. Someone says "let's just go with Kafka, we know it." Everyone nods. The meeting ends.

Six months later, the pipeline is in production. A new engineer asks why you're using Kafka for a use case where direct calls would have been simpler and cheaper. Nobody in the room remembers the reasoning. The original engineer who suggested Kafka has moved teams. The decision lives only in someone's fading memory of a meeting.

This is how most architectural decisions get made. Not badly, exactly — but invisibly. And invisible decisions accumulate into systems that seem arbitrary to anyone who joins later.

Good decision-making in distributed systems is not about always making the right call. You won't. Systems are too complex, and you never have complete information. Good decision-making is about three things: deciding at the right speed, surfacing disagreements before they become production bugs, and leaving behind enough context that future engineers can understand and change what you built.

The core insight

A decision made fast with a clear written rationale is almost always better than a decision made slowly with no written rationale. Speed matters. Documentation matters more.

The Reversibility Axis

The single most useful dimension to assess in any decision is: how hard is it to reverse?

Jeff Bezos called these "one-way door" and "two-way door" decisions. The framing is useful because it's concrete. You can walk back through a two-way door. You cannot walk back through a one-way door.

Here is the principle: the less reversible a decision, the more time you should spend making it. The more reversible, the faster you should decide and move on. Most engineers apply the exact opposite rule — they over-deliberate on trivial decisions and under-deliberate on the ones that matter.

Classifying decisions by reversibility

Decision	Type	Cost to Reverse	How to Treat It
Variable naming, code structure	Two-way	Minutes	Decide in seconds. Don't discuss.
Which logging library to use	Two-way	Hours	Decide in 30 minutes. One person.
REST vs. gRPC for a new internal API	Moderate	Days–weeks	Small discussion, write down the reason.
Synchronous vs. async communication between services	Moderate	Weeks	RFC or design doc section, explicit sign-off.
Primary data model for a core entity	One-way	Months + migration	Full design doc, alternatives considered, senior review.
Build vs. buy for core infrastructure	One-way	Quarters + org change	Spike, RFC, exec alignment, written decision.
Service boundary / ownership split	One-way	Quarters + politics	Treat like an organizational decision, not a technical one.

The common mistake

Teams spend three hours debating which logging library to use and thirty minutes deciding their data model. The energy is exactly backwards. The data model will still be with you in five years. The logging library can be swapped on a Saturday afternoon.

A practical way to apply this: before any discussion, someone should say "is this a one-way door or a two-way door?" If it's two-way, time-box the discussion to fifteen minutes and make a call. If it's one-way, block time, write something down, and get the right people in the room.
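
The door question can be sketched as a small triage helper. This is a minimal sketch, not a prescribed rule: the `Door` enum, the `classify` and `triage` functions, and the cut-offs of 1 and 20 working days are illustrative assumptions loosely based on the table above.

```python
from enum import Enum


class Door(Enum):
    TWO_WAY = "two-way"
    MODERATE = "moderate"
    ONE_WAY = "one-way"


def classify(days_to_reverse: float) -> Door:
    """Bucket a decision by its estimated cost to reverse, in working days.

    The cut-offs (1 day, 20 days) are assumptions for illustration only.
    """
    if days_to_reverse < 0:
        raise ValueError("days_to_reverse must be non-negative")
    if days_to_reverse <= 1:
        return Door.TWO_WAY
    if days_to_reverse <= 20:
        return Door.MODERATE
    return Door.ONE_WAY


# Suggested handling per bucket, paraphrasing the table in this section.
TREATMENT = {
    Door.TWO_WAY: "Time-box to fifteen minutes and make a call.",
    Door.MODERATE: "Small discussion; write down the reason.",
    Door.ONE_WAY: "Block time, write a design doc, get senior review.",
}


def triage(days_to_reverse: float) -> str:
    """Return the suggested handling for a decision."""
    return TREATMENT[classify(days_to_reverse)]
```

Calling `triage(0.25)` for a logging-library swap lands in the two-way bucket, while `triage(90)` for a data-model change lands in the one-way bucket. The value of the exercise is less the exact thresholds than forcing someone to estimate the cost to reverse before the discussion starts.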

The RFC Process — Alignment Before Code

RFC stands for "Request for Comments." The term comes from the internet standards world, where protocols such as TCP/IP and HTTP were specified in written proposals circulated for comment. In a software team, the RFC process is simpler: before building something significant, write down what you plan to build and why, share it with the people it affects, and collect their objections before a line of code is written.

This is not bureaucracy. This is the cheapest possible way to surface problems. A comment on a document takes five minutes. Refactoring a deployed service because you didn't consult the right person takes five weeks.

What belongs in an RFC

An RFC is not a full design doc. It's lighter — typically two to five pages. Its job is to answer one question: given what I'm planning to do, is there anything obvious I'm missing or anything that would block another team?

RFC structure (5 sections)

1. Problem statement
   What are we trying to solve? Who is affected?
   One paragraph. Resist the urge to hint at the solution here.

2. Proposed approach
   What are you planning to do?
   High level — architecture diagram, data flow, key components.
   Not implementation details.

3. Alternatives considered
   What else did you think about and why did you reject it?
   Two or three options minimum. This is the section that earns
   you credibility with senior reviewers.

4. Open questions
   What are you explicitly NOT deciding in this RFC?
   What do you need feedback on specifically?

5. Impact on other teams
   Who needs to change something because of this?
   Name the teams and what you're asking of them.
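
If RFCs live as plain-text files, the five-section structure above can be checked mechanically before a draft goes out. The following is a rough sketch under that assumption; the function name `missing_sections` and the loose heading-matching rule are hypothetical, not part of any standard RFC tooling.

```python
REQUIRED_SECTIONS = (
    "Problem statement",
    "Proposed approach",
    "Alternatives considered",
    "Open questions",
    "Impact on other teams",
)


def missing_sections(rfc_text: str) -> list[str]:
    """Return the required section titles that have no heading line.

    A heading matches when a line equals the section title after removing
    a leading number such as "3." and surrounding whitespace. The match is
    case-insensitive.
    """
    headings = set()
    for line in rfc_text.splitlines():
        cleaned = line.strip().lstrip("0123456789.").strip().lower()
        headings.add(cleaned)
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


# Hypothetical draft that skipped two sections.
draft = """1. Problem statement
Consumers fall behind during deploys.
2. Proposed approach
Put a queue between the two services.
4. Open questions
Retention period?
"""
gaps = missing_sections(draft)
```

Here `gaps` names "Alternatives considered" and "Impact on other teams", the two sections most likely to be skipped and the two that reviewers tend to care about most.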

How to run the review

The review process matters as much as the document itself. Here is the sequence that works:

1. Share informally first
   Before the official review, send the draft to two or three people whose opinion you trust and who might have the sharpest objections. "I'd love early feedback before I share more broadly." This catches the fatal flaws before they become public.

2. Set a comment deadline
   Send the RFC with a specific deadline, usually five to seven business days out. "Comments by Thursday EOD." Without a deadline, reviews sit unread indefinitely. With one, people actually read it.

3. Respond to every substantive comment
   Even if you disagree. Especially if you disagree. Document your response in the doc itself, not just in Slack. Silence on a comment means the concern is still open.

4. Hold a 30-minute sync for unresolved items only
   Don't re-present the whole RFC. Only discuss the comments that haven't been resolved in writing. This keeps the meeting short and the agenda clear.

5. Record the final decision explicitly
   Add a "Decision" section at the top of the RFC: Approved / Approved with modifications / Rejected. Date it. Name the approvers. This closes the loop.
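
The comment deadline from the sequence above is easy to get wrong around weekends, so it can help to compute it. This is a small sketch: `comment_deadline` is a hypothetical helper, and it assumes a Monday-to-Friday working week with no public holidays.

```python
from datetime import date, timedelta


def comment_deadline(shared_on: date, business_days: int = 5) -> date:
    """Return the date that falls `business_days` working days after `shared_on`.

    Weekends are skipped. Public holidays are NOT handled, which is an
    assumption made to keep the sketch short.
    """
    if business_days < 1:
        raise ValueError("business_days must be at least 1")
    current = shared_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


# An RFC shared on Thursday 2024-03-14 with a five-business-day window.
deadline = comment_deadline(date(2024, 3, 14), business_days=5)
```

With these inputs the deadline falls on the following Thursday, 2024-03-21, which matches the "Comments by Thursday EOD" style of request.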

Real example — what happens without an RFC

A team spent eight weeks building a new authentication service. On the day they planned to deploy it, the security team said "we don't support that token format." It was a two-line comment that would have taken thirty seconds to raise on a document. Instead it caused a two-week slip, a tense cross-team conversation, and a full redesign of the token layer.

The security team didn't know about the project until the deploy was scheduled. Nobody had shared an RFC with them.

Prototype vs. Design — Knowing Which Tool to Pick

When you're facing an uncertain technical decision, you have two tools: you can think through it carefully and write a design, or you can build a quick prototype and let reality tell you the answer. Both are legitimate. Using the wrong one is expensive.

When to design (think first)

Design is the right tool when the question is "should we?" not "can we?". Specifically, design-first works when:

→ The decision involves multiple teams or has wide blast radius
→ You need alignment before you start, not after
→ The question is about trade-offs between known options, not about whether something is technically possible
→ The cost of being wrong is measured in months, not hours

When to prototype (build first)

Prototyping is the right tool when the answer requires empirical evidence. Specifically, when:

→ The question is "can we hit this latency target?" — you need a benchmark, not a doc
→ You're evaluating a third-party system or API and need to test its actual behavior
→ The design has a load-bearing assumption that nobody has actually verified
→ You need to show stakeholders something concrete to get buy-in

The key distinction

A prototype answers a factual question. A design answers a trade-off question. "Can our storage layer handle 50k writes per second?" is factual — prototype it. "Should we use a relational database or a document store for this use case?" is a trade-off — design it.

The cost-of-information framework

Here is a simple way to decide how much to invest before making a decision:

Ask two questions. First: what is the cost of making the wrong decision, measured in weeks of rework, money, or team credibility? Second: what is the cost of gathering better information, measured in days to run a spike or hours to write a design doc? Express both in the same unit so they can be compared.

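If the cost of gathering information is less than 10% of the cost of being wrong, gather the information. If it's more than 50%, just decide and accept the risk. In between, it is a judgment call: lean toward gathering information when the decision is hard to reverse.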

Applying the framework

You're choosing between two database technologies for a write-heavy pipeline. Getting the decision wrong would cost about four weeks of migration work (high cost of being wrong). Benchmarking both options takes two days (low cost of information). Two days against roughly twenty working days is about 10%, right at the threshold and far below the 50% mark where you would simply decide. Run the benchmark.

Alternatively: you're choosing between two open-source charting libraries for an admin dashboard. Getting it wrong costs half a day to swap. Research costs two days. Just pick one. The research cost exceeds the cost of being wrong.
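
The 10% and 50% thresholds translate directly into a few lines of code. This is a minimal sketch: the function name `information_advice` is hypothetical, both costs are assumed to be in working days, and the band between the two thresholds is returned as a judgment call.

```python
def information_advice(cost_of_wrong_days: float, cost_of_info_days: float) -> str:
    """Apply the 10% / 50% rule of thumb from this section.

    Both costs must use the same unit (working days here).
    """
    if cost_of_wrong_days <= 0:
        raise ValueError("cost_of_wrong_days must be positive")
    if cost_of_info_days < 0:
        raise ValueError("cost_of_info_days must be non-negative")
    ratio = cost_of_info_days / cost_of_wrong_days
    if ratio < 0.10:
        return "gather information"
    if ratio > 0.50:
        return "decide now"
    return "judgment call"


# Charting library from the text: half a day to swap, two days of research.
charting = information_advice(cost_of_wrong_days=0.5, cost_of_info_days=2)

# Hypothetical one-way door: eight weeks of rework, a one-day spike.
spike = information_advice(cost_of_wrong_days=40, cost_of_info_days=1)
```

For the charting library the research costs four times what a wrong choice would, so the answer is "decide now". For the hypothetical eight-week rework, a one-day spike is 2.5% of the downside, so the answer is "gather information".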

Resolving Technical Disagreements

Technical disagreements are inevitable on any interesting project. Two smart engineers can look at the same problem and reach completely different conclusions. This is not a sign something is wrong. It's a sign the problem is hard and both people are engaged.

The failure mode is not disagreement. The failure mode is disagreement that doesn't get resolved — it just goes underground. The engineers nominally agree but privately pursue different approaches. Or one engineer capitulates without actually changing their mind and does a half-hearted implementation of a decision they think is wrong.

Step one — make the disagreement concrete

Most technical disagreements, when you push on them, are actually disagreements about assumptions rather than about the decision itself. "We should use Kafka" vs. "we should use direct RPCs" often comes down to different beliefs about the expected message volume, the importance of durability, or how often the consumer will be unavailable. These are factual questions — they have answers.

The fastest way to make this visible: ask both engineers to independently write down their recommendation and the three assumptions behind it. Then compare the assumptions. You'll often find the disagreement was never about Kafka vs. RPCs — it was about whether the consumer needs to handle backlog during outages.
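
Comparing the two written positions is essentially a diff over assumptions. A rough sketch, assuming each engineer's assumptions are captured as topic and belief pairs; `find_crux` and the example topics are hypothetical.

```python
def find_crux(
    assumptions_a: dict[str, str], assumptions_b: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return the topics on which the two written positions differ.

    A topic only one side mentioned is reported with "(not stated)"
    for the other side, since an unstated assumption is also a gap.
    """
    crux = {}
    for topic in sorted(set(assumptions_a) | set(assumptions_b)):
        a = assumptions_a.get(topic, "(not stated)")
        b = assumptions_b.get(topic, "(not stated)")
        if a != b:
            crux[topic] = (a, b)
    return crux


# Hypothetical one-page cases for the Kafka vs. direct RPC debate.
kafka_case = {
    "message volume": "high",
    "durability": "required",
    "consumer outages": "frequent, backlog must be kept",
}
rpc_case = {
    "message volume": "high",
    "durability": "required",
    "consumer outages": "rare, retries are enough",
}
crux = find_crux(kafka_case, rpc_case)
```

In this example the two sides agree on volume and durability, and `crux` contains only the "consumer outages" topic. That is the factual question to go and answer.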

Insight

When you have a persistent technical disagreement, ask: "What would have to be true for the other side to be right?" If you can genuinely articulate that, you've found the crux of the disagreement. Write it down. That is what needs to be decided — not the surface-level argument.

Step two — define what "winning" looks like for each option

Before arguing, agree on evaluation criteria. What would make one approach clearly better than the other? Latency under some threshold? Fewer than N lines of operational code? Works without a dedicated operations engineer on call?

Write the criteria down before the debate starts. This prevents the goalposts from moving during the discussion. If someone introduces a new criterion after the leading option has been identified, you're allowed to ask: "Is this a criterion we all agreed on upfront, or is it a post-hoc objection?"
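
One way to keep the goalposts fixed is to record the criteria as immutable data before any measurements come in. The sketch below assumes each criterion reduces to a numeric threshold; `Criterion`, `evaluate`, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen so agreed criteria cannot be edited mid-debate
class Criterion:
    name: str
    threshold: float
    lower_is_better: bool = True

    def passes(self, measured: float) -> bool:
        if self.lower_is_better:
            return measured <= self.threshold
        return measured >= self.threshold


def evaluate(
    criteria: tuple[Criterion, ...], measurements: dict[str, float]
) -> dict[str, bool]:
    """Score one option against the criteria agreed before the debate.

    Raises ValueError if the measurements do not match the agreed
    criteria exactly, which flags both gaps and post-hoc additions.
    """
    agreed = {c.name for c in criteria}
    measured = set(measurements)
    if measured != agreed:
        raise ValueError(
            f"missing: {sorted(agreed - measured)}, "
            f"not agreed upfront: {sorted(measured - agreed)}"
        )
    return {c.name: c.passes(measurements[c.name]) for c in criteria}


# Hypothetical criteria and measurements, for illustration only.
CRITERIA = (
    Criterion("p99 latency (ms)", threshold=200),
    Criterion("operational code (lines)", threshold=500),
)
option_result = evaluate(
    CRITERIA, {"p99 latency (ms)": 150, "operational code (lines)": 800}
)
```

The option in this example passes the latency criterion and fails the operational-code one. Passing a measurement for a criterion that was never agreed raises an error, which is the programmatic version of asking whether an objection is post hoc.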

Step three — "disagree and commit," but do it right

Sometimes you've gone through all the steps and there is still a genuine difference of opinion. Neither side is provably wrong. The assumptions are stated, the criteria are written, and it still comes down to judgment.

In this case, someone with the authority to decide makes the call. And the losing side commits to the decision — not performatively, but genuinely. This is "disagree and commit."

But here is the part most people skip: the disagreement must be stated out loud, not just swallowed. The engineer who lost the argument should be explicitly invited to say "I disagree with this decision because X, and if we proceed this way I think we'll see Y." That gets written in the decision log. Then they commit to executing the decision fully.

When "disagree and commit" breaks down

If the engineer who disagrees was never given a real chance to state their case — if the decision was made before the discussion started — then "disagree and commit" is not a principle, it's a management tool for suppressing dissent. Engineers notice the difference immediately. The result is not commitment. It's compliance, which looks the same from the outside and produces dramatically worse outcomes.

The Escalation Ladder

Escalation is a legitimate and necessary tool. Some decisions genuinely require more authority than the people in the room have. Some disputes genuinely cannot be resolved at the working level. The key is knowing when to escalate and how to do it without destroying relationships.

When escalation is appropriate

✓ The decision has cross-team impact and the teams cannot agree after a good-faith attempt
✓ A blocker has been unresolved for more than two weeks and has a concrete timeline impact
✓ The decision involves resource allocation (headcount, budget) that the working team doesn't control
✓ There is a genuine conflict of organizational priorities — not a technical disagreement

When escalation is the wrong tool

✗ You haven't actually tried to resolve it at the working level first
✗ You want someone else to take ownership of the risk or the blame
✗ You're escalating as a pressure tactic to force the other side to cave
✗ The disagreement is a genuine technical trade-off that the working team is qualified to resolve

How to escalate well

When you do escalate, the format matters. You are not asking your manager or their manager to relitigate the entire discussion. You are asking them to make one specific decision that you and the other party cannot make together.

Escalation format (what to write or say)

Context:   We are trying to decide X for the Y project.

Options:   Team A recommends Option 1 because [reason].
            Team B recommends Option 2 because [reason].

Blocker:   We have discussed this for Z weeks and cannot reach agreement.
            The specific disagreement is [one sentence — the crux].

Request:   We need a decision by [date] because [timeline impact].
            We are asking you to decide, or to facilitate a decision.

What we're NOT asking for:
            A full technical review. We just need someone to break the tie.

If you can't fill in that template, you're not ready to escalate. The act of filling it in often resolves the disagreement — you realize it was about different assumptions, not an actual impasse.
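
The readiness test can be made explicit. This is a sketch only: the field names mirror the template above, while `unready_fields` and the rule that a leftover square-bracket placeholder counts as unfilled are assumptions.

```python
ESCALATION_FIELDS = ("context", "options", "blocker", "request")


def unready_fields(draft: dict[str, str]) -> list[str]:
    """Return template fields that are missing, blank, or still hold a
    [placeholder] copied from the template."""
    problems = []
    for field in ESCALATION_FIELDS:
        value = draft.get(field, "").strip()
        has_placeholder = "[" in value and "]" in value
        if not value or has_placeholder:
            problems.append(field)
    return problems


# Hypothetical draft that is not ready to send.
draft = {
    "context": "We are trying to decide the transport for the pipeline project.",
    "options": "Team A recommends Option 1 because [reason].",
    "blocker": "",
}
not_ready = unready_fields(draft)
```

For this draft, `not_ready` lists "options", "blocker", and "request". An empty list is the signal that the escalation is specific enough to send.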

The Decision Log

A decision log is a living document — usually a shared page or a folder of small files — where every significant architectural decision is recorded with its rationale. Not the implementation details. Just: what did we decide, why, what alternatives did we reject, and who was involved.

The decision log is not for the current team. The current team was in the room. The decision log is for the engineer who joins eighteen months from now, looks at a strange piece of the architecture, and types "why" into the search bar.

What goes in a decision log entry

Decision Log Entry — Template

Title:      Use Postgres for the primary user store instead of DynamoDB
Date:       2024-03-15
Status:     Accepted
Deciders:   Ana (staff eng), Marcus (tech lead), Priya (eng manager)

Context:
  We are building a new user profile service. The data is relational,
  with complex join patterns across users, organizations, and roles.
  We need strong consistency on writes.

Decision:
  Use Postgres as the primary store.

Reason:
  Query patterns are relational. Consistency requirements are strong.
  The team has deep Postgres operational experience. Scaling needs
  are well within what a single Postgres primary can handle for 2+ years.

Alternatives rejected:
  DynamoDB — insufficient query flexibility, harder to do ad-hoc joins
              for analytics, team has no operational expertise.
  MongoDB   — no strong consistency guarantees without extra config,
              doesn't match the relational data model.

Consequences:
  We will need to plan for read replicas at >10k QPS. We accept
  that horizontal write scaling will require sharding if we grow
  significantly beyond current projections.

One entry like this takes twenty minutes to write. It has saved teams weeks of confusion, months of misguided refactoring, and countless "why did they do it this way?" Slack threads.

Architecture Decision Records (ADRs)

A formalization of the decision log is the Architecture Decision Record (ADR), popularized by Michael Nygard. ADRs are small files — typically stored in the repository itself under a docs/decisions/ folder — that follow a consistent format and live alongside the code they describe.

The key property of ADRs is that they are immutable once accepted. If you change the decision, you don't edit the old ADR; you write a new one that supersedes it, and at most mark the old one's status as superseded. This means you can trace the full evolution of a decision over time. You can see that you chose Postgres, then later introduced sharding for horizontal write scaling, and understand why each transition happened.
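
The supersede-instead-of-edit rule forms a simple linked chain. Below is a minimal sketch of that idea: the `ADR` class, its fields, and the `history` function are hypothetical, and real ADRs are usually text files rather than Python objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: an accepted record's content is not edited in place
class ADR:
    number: int
    title: str
    supersedes: Optional[int] = None  # number of the ADR this one replaces


def history(adrs: list[ADR], latest: int) -> list[str]:
    """Follow the supersedes links back from `latest` and return the
    titles oldest first."""
    by_number = {a.number: a for a in adrs}
    chain = []
    seen = set()
    current: Optional[int] = latest
    while current is not None:
        if current in seen:
            raise ValueError("cycle in supersedes links")
        seen.add(current)
        adr = by_number[current]
        chain.append(f"ADR-{adr.number:04d}: {adr.title}")
        current = adr.supersedes
    return list(reversed(chain))


# Hypothetical records for illustration.
records = [
    ADR(1, "Use Postgres for the primary user store"),
    ADR(7, "Shard the user store for write scaling", supersedes=1),
]
trail = history(records, latest=7)
```

Here `trail` reads as a two-step story: the original Postgres decision, then the record that replaced it. Because the older record is never rewritten, the reasoning behind each step stays available.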

A small practice with large compounding returns

Teams that maintain good ADRs report that new engineers reach full productivity significantly faster. They spend less time reverse-engineering intent from code and more time contributing. The twenty minutes per decision compounds into weeks of onboarding time saved per new hire.

Decision-Making at Different Scales

The right process depends on the scope of the decision. Here is a rough guide to what level of process is appropriate at each scale:

Scope	Example	Process	Artifact
Within one function or class	Naming, data structure for a local cache	None — just decide	Inline comment if non-obvious
Within one service	New internal API shape, query optimization	Discuss with immediate team	Brief ADR or PR description
Across two or three services	New event format, shared library, data contract	RFC, 3-5 day review window	RFC document + ADR
Across a domain or platform	New data storage technology, auth model	Full design doc + RFC + senior review	Design doc + ADR
Org-wide / strategic	Cloud provider choice, new service mesh, monolith split	RFC + executive review + dedicated decision meeting	Design doc + ADR + exec sign-off

How Good Decision-Making Compounds

There is a second-order effect to good decision-making that takes a few years to become visible. Teams that make decisions well — fast on reversible, slow on irreversible, always documented — accumulate a structural advantage over teams that don't.

First, they accumulate context. Every recorded decision is a building block for the next one. New engineers arrive with a written history. Senior engineers don't have to carry everything in their heads.

Second, they accumulate trust. Teams that can point to a clear decision trail — "here is why we made this call, here is who was involved, here is what we were trading off" — earn credibility with stakeholders and leadership. When they say "we're confident in this architecture," people believe them.

Third, they accumulate speed. Counterintuitively, teams with more process around decisions move faster, not slower. Because they don't relitigate the same decisions. Because they don't spend months undoing architecture mistakes that a short RFC would have caught. Because new engineers contribute meaningfully in weeks, not months.

The compounding principle

A team that makes fifty decisions per quarter, documents each one, and builds on them is not just fifty decisions ahead of a team that doesn't. It is compounding those decisions into a system that is faster to change, easier to understand, and harder to accidentally break.

Chapter Summary

"The two most important properties of a design decision are how reversible it is and whether the reasoning behind it was written down. Everything else is secondary."

Most Common Mistakes

Over-deliberating on two-way-door decisions and under-deliberating on one-way-door decisions
Running RFCs that nobody reads because the review window is too short or too long
Prototyping when you need a design doc, or writing a design doc when you need a benchmark
Letting "disagree and commit" become "suppress dissent and pretend it's commitment"
Escalating as a way to avoid owning a decision, rather than to unblock a genuine impasse
Making decisions in meetings and leaving no written artifact behind

Three Questions for Your Next Design Review

For each major architectural choice: is this a one-way door or a two-way door? Am I spending a proportional amount of time on it?
When was the last time this team made a decision that it later reversed without significant pain? What would have caught the mistake earlier?
If the three most senior engineers on this team left tomorrow, where would the next engineer go to understand why the system is built the way it is?