The Decision-Making Framework for System Design
Every distributed system is a long chain of decisions. Some were made in a meeting room. Some were made in a Slack message at 11pm. Some were never made at all — they just happened. This chapter is about making those decisions deliberately, at the right speed, with the right people, and in a way you can defend six months later.
What's in this chapter
- Why reversibility is the most important property of a decision
- How to run an RFC process that produces real alignment
- When to prototype instead of design — and when not to
- The cost-of-information framework for choosing how much to invest upfront
- How to resolve technical disagreements without damaging the team
- What escalation is for, and when it becomes a crutch
- The decision log — the artifact that saves projects
- How good decision-making compounds over time
Key Learnings — Read This First
The Problem With How We Make Decisions
Imagine you're designing a new data pipeline. Three engineers have three different opinions on whether to use a message queue or a direct RPC call between services. You talk about it in a meeting for forty minutes. Someone says "let's just go with Kafka, we know it." Everyone nods. The meeting ends.
Six months later, the pipeline is in production. A new engineer asks why you're using Kafka for a use case where direct calls would have been simpler and cheaper. Nobody in the room remembers the reasoning. The original engineer who suggested Kafka has moved teams. The decision lives only in someone's fading memory of a meeting.
This is how most architectural decisions get made. Not badly, exactly — but invisibly. And invisible decisions accumulate into systems that seem arbitrary to anyone who joins later.
Good decision-making in distributed systems is not about always making the right call. You won't. Systems are too complex, and you never have complete information. Good decision-making is about three things: deciding at the right speed, surfacing disagreements before they become production bugs, and leaving behind enough context that future engineers can understand and change what you built.
A decision made fast with a clear written rationale is almost always better than a decision made slowly with no written rationale. Speed matters. Documentation matters more.
The Reversibility Axis
The single most useful dimension to assess in any decision is: how hard is it to reverse?
Jeff Bezos called these "one-way door" and "two-way door" decisions. The framing is useful because it's concrete. You can walk back through a two-way door. You cannot walk back through a one-way door.
Here is the principle: the less reversible a decision, the more time you should spend making it. The more reversible, the faster you should decide and move on. Most engineers apply the exact opposite rule — they over-deliberate on trivial decisions and under-deliberate on the ones that matter.
Classifying decisions by reversibility
| Decision | Type | Cost to Reverse | How to Treat It |
|---|---|---|---|
| Variable naming, code structure | Two-way | Minutes | Decide in seconds. Don't discuss. |
| Which logging library to use | Two-way | Hours | Decide in 30 minutes. One person. |
| REST vs. gRPC for a new internal API | Moderate | Days–weeks | Small discussion, write down the reason. |
| Synchronous vs. async communication between services | Moderate | Weeks | RFC or design doc section, explicit sign-off. |
| Primary data model for a core entity | One-way | Months + migration | Full design doc, alternatives considered, senior review. |
| Build vs. buy for core infrastructure | One-way | Quarters + org change | Spike, RFC, exec alignment, written decision. |
| Service boundary / ownership split | One-way | Quarters + politics | Treat like an organizational decision, not a technical one. |
Teams spend three hours debating which logging library to use and thirty minutes deciding their data model. The energy is exactly backwards. The data model will still be with you in five years. The logging library can be swapped on a Saturday afternoon.
A practical way to apply this: before any discussion, someone should say "is this a one-way door or a two-way door?" If it's two-way, time-box the discussion to fifteen minutes and make a call. If it's one-way, block time, write something down, and get the right people in the room.
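The triage question can be sketched as a small lookup. This is an illustration in Python, not a prescribed tool: the `Door` enum, the `classify` and `triage` functions, and the day cut-offs are hypothetical, chosen only to approximate the table above (minutes to hours is two-way, days to weeks is moderate, months or more is one-way).

```python
from enum import Enum


class Door(Enum):
    TWO_WAY = "two-way"
    MODERATE = "moderate"
    ONE_WAY = "one-way"


# Hypothetical cut-offs, in working days of effort needed to reverse a decision.
# They only approximate the table above; tune them for your own team.
MODERATE_FROM_DAYS = 1.0   # more than hours: REST vs. gRPC, sync vs. async
ONE_WAY_FROM_DAYS = 60.0   # months or more: data model, build vs. buy

TREATMENT = {
    Door.TWO_WAY: "Time-box the discussion to 15 minutes and make a call.",
    Door.MODERATE: "Hold a small discussion and write down the reason.",
    Door.ONE_WAY: "Block time, write a design doc, get the right people in the room.",
}


def classify(days_to_reverse: float) -> Door:
    """Map an estimated reversal cost to a door type."""
    if days_to_reverse < 0:
        raise ValueError("days_to_reverse must be non-negative")
    if days_to_reverse >= ONE_WAY_FROM_DAYS:
        return Door.ONE_WAY
    if days_to_reverse >= MODERATE_FROM_DAYS:
        return Door.MODERATE
    return Door.TWO_WAY


def triage(days_to_reverse: float) -> str:
    """Return the suggested treatment for a decision."""
    return TREATMENT[classify(days_to_reverse)]
```

Calling `triage(0.25)` for a logging-library swap returns the time-box advice, while `triage(120)` for a core data model returns the design-doc advice. The estimate itself is the hard part; the code only makes the proportionality rule explicit.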
The RFC Process — Alignment Before Code
RFC stands for "Request for Comments." The term comes from the internet standards world, where protocols such as TCP and IP were specified in written proposals circulated for comment. In a software team, the RFC process is simpler: before building something significant, write down what you plan to build and why, share it with the people it affects, and collect their objections before a line of code is written.
This is not bureaucracy. This is the cheapest possible way to surface problems. A comment on a document takes five minutes. Refactoring a deployed service because you didn't consult the right person takes five weeks.
What belongs in an RFC
An RFC is not a full design doc. It's lighter — typically two to five pages. Its job is to answer one question: given what I'm planning to do, is there anything obvious I'm missing or anything that would block another team?
How to run the review
The review process matters as much as the document itself. Here is the sequence that works:
- Share informally first. Before the official review, send the draft to two or three people whose opinion you trust and who might have the sharpest objections: "I'd love early feedback before I share more broadly." This catches the fatal flaws before they become public.
- Set a comment deadline. Send the RFC with a specific deadline, usually five to seven business days. "Comments by Thursday EOD." Without a deadline, reviews sit unread indefinitely. With one, people actually read it.
- Respond to every substantive comment. Even if you disagree. Especially if you disagree. Document your response in the doc itself, not just in Slack. Silence on a comment means the concern is still open.
- Hold a 30-minute sync for unresolved items only. Don't re-present the whole RFC. Only discuss the comments that haven't been resolved in writing. This keeps the meeting short and the agenda clear.
- Record the final decision explicitly. Add a "Decision" section at the top of the RFC: Approved / Approved with modifications / Rejected. Date it. Name the approvers. This closes the loop.
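The review sequence above can be sketched as a small state check. This is a minimal illustration, assuming an RFC is tracked as plain data. The `Rfc` and `Comment` classes and their method names are hypothetical, not part of any real RFC tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

DECISIONS = ("Approved", "Approved with modifications", "Rejected")


@dataclass
class Comment:
    author: str
    text: str
    response: Optional[str] = None  # written reply, recorded in the doc itself
    resolved: bool = False


@dataclass
class Rfc:
    title: str
    comment_deadline: date
    comments: List[Comment] = field(default_factory=list)
    decision: Optional[str] = None
    approvers: List[str] = field(default_factory=list)
    decided_on: Optional[date] = None

    def unanswered(self) -> List[Comment]:
        """Comments with no written response: the concern is still open."""
        return [c for c in self.comments if not c.response]

    def sync_agenda(self) -> List[Comment]:
        """Only answered-but-unresolved comments go to the 30-minute sync."""
        return [c for c in self.comments if c.response and not c.resolved]

    def can_close(self, today: date) -> bool:
        """Closable once the deadline day has passed, every comment is answered
        and resolved, and the decision is explicit, dated, and has named approvers."""
        return (
            today > self.comment_deadline
            and all(c.response and c.resolved for c in self.comments)
            and self.decision in DECISIONS
            and bool(self.approvers)
            and self.decided_on is not None
        )
```

The useful part is `sync_agenda`: it keeps the meeting limited to what writing could not settle, which is the point of the fourth step.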
A team spent eight weeks building a new authentication service. On the day they planned to deploy it, the security team said "we don't support that token format." It was a two-line comment that would have taken thirty seconds to raise on a document. Instead it caused a two-week slip, a tense cross-team conversation, and a full redesign of the token layer.
The security team didn't know about the project until the deploy was scheduled. Nobody had shared an RFC with them.
Prototype vs. Design — Knowing Which Tool to Pick
When you're facing an uncertain technical decision, you have two tools: you can think through it carefully and write a design, or you can build a quick prototype and let reality tell you the answer. Both are legitimate. Using the wrong one is expensive.
When to design (think first)
Design is the right tool when the question is "should we?" rather than "can we?" Specifically, design-first works when:
- The decision involves multiple teams or has a wide blast radius
- You need alignment before you start, not after
- The question is about trade-offs between known options, not about whether something is technically possible
- The cost of being wrong is measured in months, not hours
When to prototype (build first)
Prototyping is the right tool when the answer requires empirical evidence. Specifically, when:
- The question is "can we hit this latency target?" (you need a benchmark, not a doc)
- You're evaluating a third-party system or API and need to test its actual behavior
- The design has a load-bearing assumption that nobody has actually verified
- You need to show stakeholders something concrete to get buy-in
A prototype answers a factual question. A design answers a trade-off question. "Can our storage layer handle 50k writes per second?" is factual — prototype it. "Should we use a relational database or a document store for this use case?" is a trade-off — design it.
The cost-of-information framework
Here is a simple way to decide how much to invest before making a decision:
Ask two questions. First: what is the cost of making the wrong decision? (measured in weeks of rework, money, or team credibility). Second: what is the cost of gathering better information? (measured in days to run a spike, or hours to write a design doc).
If the cost of gathering information is less than 10% of the cost of being wrong, gather the information. If it's more than 50%, just decide and accept the risk. In between, use judgment, weighted by how reversible the decision is.
You're choosing between two database technologies for a write-heavy pipeline. Getting the decision wrong would cost about four weeks of migration work, roughly twenty working days (high cost of being wrong). Benchmarking both options costs two days (low cost of information). Two days is about 10% of twenty, right at the lower threshold and nowhere near 50%. Run the benchmark.
Alternatively: you're choosing between two open-source charting libraries for an admin dashboard. Getting it wrong costs half a day to swap. Research costs two days. Just pick one. The research cost exceeds the cost of being wrong.
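The rule of thumb is simple enough to write as a function. This is a sketch: the name `next_step`, the parameter names, and the `"judgment"` return value for the range between the two thresholds are assumptions added for illustration. Both costs must be expressed in the same unit, which is working days here.

```python
def next_step(cost_of_wrong_days: float,
              cost_of_info_days: float,
              gather_below: float = 0.10,
              decide_above: float = 0.50) -> str:
    """Apply the 10% / 50% rule of thumb.

    Returns "gather" (collect more information first), "decide" (just pick
    and accept the risk), or "judgment" for the range in between.
    """
    if cost_of_wrong_days <= 0:
        raise ValueError("cost_of_wrong_days must be positive")
    if cost_of_info_days < 0:
        raise ValueError("cost_of_info_days must be non-negative")
    ratio = cost_of_info_days / cost_of_wrong_days
    if ratio < gather_below:
        return "gather"
    if ratio > decide_above:
        return "decide"
    return "judgment"


# Charting library from the text: half a day to swap, two days of research.
print(next_step(cost_of_wrong_days=0.5, cost_of_info_days=2))  # decide
# Hypothetical numbers: six weeks (30 working days) of rework vs. a two-day spike.
print(next_step(cost_of_wrong_days=30, cost_of_info_days=2))   # gather
```

The inputs are estimates, so treat the output as a prompt for discussion rather than a verdict. A ratio that lands near a threshold mostly tells you to check the estimates.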
Resolving Technical Disagreements
Technical disagreements are inevitable on any interesting project. Two smart engineers can look at the same problem and reach completely different conclusions. This is not a sign something is wrong. It's a sign the problem is hard and both people are engaged.
The failure mode is not disagreement. The failure mode is disagreement that doesn't get resolved — it just goes underground. The engineers nominally agree but privately pursue different approaches. Or one engineer capitulates without actually changing their mind and does a half-hearted implementation of a decision they think is wrong.
Step one — make the disagreement concrete
Most technical disagreements, when you push on them, are actually disagreements about assumptions rather than about the decision itself. "We should use Kafka" vs. "we should use direct RPCs" often comes down to different beliefs about the expected message volume, the importance of durability, or how often the consumer will be unavailable. These are factual questions — they have answers.
The fastest way to make this visible: ask both engineers to independently write down their recommendation and the three assumptions behind it. Then compare the assumptions. You'll often find the disagreement was never about Kafka vs. RPCs — it was about whether the consumer needs to handle backlog during outages.
When you have a persistent technical disagreement, ask: "What would have to be true for the other side to be right?" If you can genuinely articulate that, you've found the crux of the disagreement. Write it down. That is what needs to be decided — not the surface-level argument.
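Comparing written assumptions is mechanical once they are on paper. The sketch below assumes each engineer records their assumptions as topic-to-belief pairs. The `Position` alias, the two function names, and the example beliefs are hypothetical.

```python
from typing import Dict, List, Tuple

# assumption topic -> what this engineer believes about it
Position = Dict[str, str]


def find_cruxes(a: Position, b: Position) -> List[Tuple[str, str, str]]:
    """Return (topic, belief_a, belief_b) for topics both sides named
    but hold different beliefs about."""
    shared = sorted(a.keys() & b.keys())
    return [(topic, a[topic], b[topic]) for topic in shared if a[topic] != b[topic]]


def unshared_topics(a: Position, b: Position) -> List[str]:
    """Topics only one side wrote down, which the other may not have considered."""
    return sorted(a.keys() ^ b.keys())


kafka_side: Position = {
    "message volume": "high",
    "durability": "required",
    "consumer downtime": "must absorb backlog during outages",
}
rpc_side: Position = {
    "message volume": "high",
    "durability": "required",
    "consumer downtime": "rare; retries are enough",
}

for topic, belief_a, belief_b in find_cruxes(kafka_side, rpc_side):
    print(f"{topic}: '{belief_a}' vs. '{belief_b}'")
```

In this made-up example the only line printed is the one about consumer downtime. That is the question to settle, and it is a factual one.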
Step two — define what "winning" looks like for each option
Before arguing, agree on evaluation criteria. What would make one approach clearly better than the other? Latency under some threshold? Fewer than N lines of operational code? Works without a dedicated operations engineer on-call?
Write the criteria down before the debate starts. This prevents the goalposts from moving during the discussion. If someone introduces a new criterion after the leading option has been identified, you're allowed to ask: "Is this a criterion we all agreed on upfront, or is it a post-hoc objection?"
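One way to keep the goalposts fixed is to treat the agreed criteria as data that cannot grow after the debate begins. The sketch below is an illustration only: the `Criterion` and `Evaluation` classes, the measurement keys, and the thresholds (50 ms, 500 lines) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Criterion:
    name: str
    passes: Callable[[Dict[str, float]], bool]


class Evaluation:
    """Criteria are fixed at construction time, before the debate starts."""

    def __init__(self, criteria: Iterable[Criterion]) -> None:
        self._criteria = tuple(criteria)

    def is_agreed(self, name: str) -> bool:
        """False means the criterion is a post-hoc addition."""
        return any(c.name == name for c in self._criteria)

    def score(self, measurements: Dict[str, float]) -> Dict[str, bool]:
        """Check one option's measurements against every agreed criterion."""
        return {c.name: c.passes(measurements) for c in self._criteria}


evaluation = Evaluation([
    Criterion("p99 latency under 50 ms", lambda m: m["p99_ms"] < 50),
    Criterion("under 500 lines of operational code", lambda m: m["ops_loc"] < 500),
])

option = {"p99_ms": 35, "ops_loc": 800}  # made-up measurements for one option
print(evaluation.score(option))
print(evaluation.is_agreed("vendor support"))  # False: raised after the fact
```

A criterion that fails `is_agreed` is not automatically invalid. It just has to be added openly, by everyone, rather than slipped in to rescue a losing option.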
Step three — "disagree and commit," but do it right
Sometimes you've gone through all the steps and there is still a genuine difference of opinion. Neither side is provably wrong. The assumptions are stated, the criteria are written, and it still comes down to judgment.
In this case, someone with the authority to decide makes the call. And the losing side commits to the decision — not performatively, but genuinely. This is "disagree and commit."
But here is the part most people skip: the disagreement must be stated out loud, not just swallowed. The engineer who lost the argument should be explicitly invited to say "I disagree with this decision because X, and if we proceed this way I think we'll see Y." That gets written in the decision log. Then they commit to executing the decision fully.
If the engineer who disagrees was never given a real chance to state their case — if the decision was made before the discussion started — then "disagree and commit" is not a principle, it's a management tool for suppressing dissent. Engineers notice the difference immediately. The result is not commitment. It's compliance, which looks the same from the outside and produces dramatically worse outcomes.
The Escalation Ladder
Escalation is a legitimate and necessary tool. Some decisions genuinely require more authority than the people in the room have. Some disputes genuinely cannot be resolved at the working level. The key is knowing when to escalate and how to do it without destroying relationships.
When escalation is appropriate
- ✓ The decision has cross-team impact and the teams cannot agree after a good-faith attempt
- ✓ A blocker has been unresolved for more than two weeks and has a concrete timeline impact
- ✓ The decision involves resource allocation (headcount, budget) that the working team doesn't control
- ✓ There is a genuine conflict of organizational priorities — not a technical disagreement
When escalation is the wrong tool
- ✗ You haven't actually tried to resolve it at the working level first
- ✗ You want someone else to take ownership of the risk or the blame
- ✗ You're escalating as a pressure tactic to force the other side to cave
- ✗ The disagreement is a genuine technical trade-off that the working team is qualified to resolve
How to escalate well
When you do escalate, the format matters. You are not asking your manager or their manager to relitigate the entire discussion. You are asking them to make one specific decision that you and the other party cannot make together.
If you can't write down that one specific decision, you're not ready to escalate. The act of writing it down often resolves the disagreement: you realize it was about different assumptions, not an actual impasse.
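One way to make an escalation request concrete is a structure with required fields. The field names below are hypothetical, inferred from the criteria in this section rather than taken from an established format.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EscalationRequest:
    decision_needed: str   # the one specific decision being asked for
    option_a: str          # what one party proposes
    option_b: str          # what the other party proposes
    attempts_so_far: str   # good-faith attempts at the working level
    timeline_impact: str   # concrete cost of staying blocked

    def missing_fields(self) -> List[str]:
        """Names of fields that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_ready(self) -> bool:
        """Ready to send only when every field has been filled in."""
        return not self.missing_fields()


request = EscalationRequest(
    decision_needed="Which team owns the shared event schema?",
    option_a="Platform team owns it",
    option_b="Each producer team owns its own events",
    attempts_so_far="",
    timeline_impact="Blocks the Q3 pipeline migration",
)
print(request.missing_fields())  # ['attempts_so_far']
```

A blank `attempts_so_far` is the most telling gap: it means the working-level attempt has not happened yet.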
The Decision Log
A decision log is a living document — usually a shared page or a folder of small files — where every significant architectural decision is recorded with its rationale. Not the implementation details. Just: what did we decide, why, what alternatives did we reject, and who was involved.
The decision log is not for the current team. The current team was in the room. The decision log is for the engineer who joins eighteen months from now, looks at a strange piece of the architecture, and types "why" into the search bar.
What goes in a decision log entry
A single entry takes about twenty minutes to write. That small investment saves weeks of confusion, months of misguided refactoring, and countless "why did they do it this way?" Slack threads.
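As a sketch of what an entry could hold, the structure below covers the four questions named earlier (what was decided, why, which alternatives were rejected, who was involved) plus any recorded dissent. The `DecisionEntry` class, its field names, and the example content are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DecisionEntry:
    title: str
    decided_on: date
    decision: str
    rationale: str
    alternatives_rejected: List[str]
    participants: List[str]
    dissent: List[str] = field(default_factory=list)  # stated disagreements

    def render(self) -> str:
        """Render the entry as plain text for a shared page or a small file."""
        lines = [
            f"{self.decided_on.isoformat()}  {self.title}",
            f"Decision: {self.decision}",
            f"Why: {self.rationale}",
            "Alternatives rejected: " + "; ".join(self.alternatives_rejected),
            "Involved: " + ", ".join(self.participants),
        ]
        if self.dissent:
            lines.append("Recorded dissent: " + "; ".join(self.dissent))
        return "\n".join(lines)


entry = DecisionEntry(
    title="Pipeline transport",
    decided_on=date(2024, 3, 4),
    decision="Use Kafka between the ingest and enrichment services",
    rationale="The consumer must absorb backlog during outages",
    alternatives_rejected=["Direct RPC with retries"],
    participants=["alice", "bob", "carol"],
    dissent=["bob: direct calls would be simpler at current volume"],
)
print(entry.render())
```

The format matters less than the habit. Whatever shape you pick, keep the rationale and the rejected alternatives, because those are the parts nobody can reconstruct from the code.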
Architecture Decision Records (ADRs)
A formalization of the decision log is the Architecture Decision Record (ADR), popularized by Michael Nygard. ADRs are small files (typically stored in the repository itself under a docs/decisions/ folder) that follow a consistent format and live alongside the code they describe.
The key property of ADRs is that they are immutable once accepted. If you change the decision, you don't edit the old ADR; you write a new one that supersedes it. This means you can trace the full evolution of a decision over time. You can see that you chose MySQL, then moved to Vitess for horizontal scaling, and understand why each transition happened.
Teams that maintain good ADRs report that new engineers reach full productivity significantly faster. They spend less time reverse-engineering intent from code and more time contributing. The twenty minutes per decision compounds into weeks of onboarding time saved per new hire.
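The supersession rule is what makes a decision's history traceable. The sketch below assumes each ADR stores the number of the ADR it replaces. The `Adr` class and the `history` function are hypothetical, and real ADR tooling may represent the link differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)  # frozen: an accepted ADR is never edited in place
class Adr:
    number: int
    title: str
    supersedes: Optional[int] = None  # number of the ADR this one replaces


def history(adrs: List[Adr], latest: int) -> List[Adr]:
    """Follow the supersedes links back from the latest ADR and return
    the chain oldest-first."""
    by_number: Dict[int, Adr] = {a.number: a for a in adrs}
    chain: List[Adr] = []
    seen = set()
    current: Optional[int] = latest
    while current is not None:
        if current in seen:
            raise ValueError(f"supersedes cycle at ADR {current}")
        seen.add(current)
        adr = by_number[current]
        chain.append(adr)
        current = adr.supersedes
    return list(reversed(chain))


records = [
    Adr(1, "Store orders in a single primary database"),
    Adr(5, "Adopt structured logging"),
    Adr(9, "Shard the orders database", supersedes=1),
]
print([a.number for a in history(records, latest=9)])  # [1, 9]
```

Because old records are never rewritten, the chain reads as a timeline: each entry explains the state of knowledge when that call was made.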
Decision-Making at Different Scales
The right process depends on the scope of the decision. Here is a rough guide to what level of process is appropriate at each scale:
| Scope | Example | Process | Artifact |
|---|---|---|---|
| Within one function or class | Naming, data structure for a local cache | None — just decide | Inline comment if non-obvious |
| Within one service | New internal API shape, query optimization | Discuss with immediate team | Brief ADR or PR description |
| Across two or three services | New event format, shared library, data contract | RFC, 3-5 day review window | RFC document + ADR |
| Across a domain or platform | New data storage technology, auth model | Full design doc + RFC + senior review | Design doc + ADR |
| Org-wide / strategic | Cloud provider choice, new service mesh, monolith split | RFC + executive review + dedicated decision meeting | Design doc + ADR + exec sign-off |
How Good Decision-Making Compounds
There is a second-order effect to good decision-making that takes a few years to become visible. Teams that make decisions well — fast on reversible, slow on irreversible, always documented — accumulate a structural advantage over teams that don't.
First, they accumulate context. Every recorded decision is a building block for the next one. New engineers arrive with a written history. Senior engineers don't have to carry everything in their heads.
Second, they accumulate trust. Teams that can point to a clear decision trail — "here is why we made this call, here is who was involved, here is what we were trading off" — earn credibility with stakeholders and leadership. When they say "we're confident in this architecture," people believe them.
Third, they accumulate speed. Counterintuitively, teams with more process around decisions move faster, not slower. Because they don't re-litigate the same decisions. Because they don't spend months undoing architecture mistakes that a one-day RFC would have caught. Because new engineers contribute meaningfully in weeks, not months.
A team that makes fifty decisions per quarter, documents each one, and builds on them is not just fifty decisions ahead of a team that doesn't. It is compounding those decisions into a system that is faster to change, easier to understand, and harder to accidentally break.
Chapter Summary
"The two most important properties of a design decision are how reversible it is and whether the reasoning behind it was written down. Everything else is secondary."
Most Common Mistakes
- Over-deliberating on two-way-door decisions and under-deliberating on one-way-door decisions
- Running RFCs that nobody reads because the review window is too short or too long
- Prototyping when you need a design doc, or writing a design doc when you need a benchmark
- Letting "disagree and commit" become "suppress dissent and pretend it's commitment"
- Escalating as a way to avoid owning a decision, rather than to unblock a genuine impasse
- Making decisions in meetings and leaving no written artifact behind
Three Questions for Your Next Design Review
- For each major architectural choice: is this a one-way door or a two-way door? Am I spending a proportional amount of time on it?
- When was the last time this team made a decision that it later had to reverse at significant cost? What would have caught it earlier?
- If the three most senior engineers on this team left tomorrow, where would the next engineer go to understand why the system is built the way it is?