Staff Engineer's Field Guide

Principles of Distributed Systems Design

A practitioner's guide to building scalable, fault-tolerant, reliable, maintainable, and operationally efficient systems — focused on judgment, not just mechanisms.

10 Parts
38 Chapters
4 Appendices
~600 Pages

Preface: Why Another Book

Most distributed systems books teach you mechanisms — how Raft works, what consistent hashing is. This book teaches you judgment — when to use which mechanism, what you're trading away, and how to reason about systems you've never seen before. The goal is to make you dangerous with a whiteboard, not just with a textbook.

01 — CORE TENSION

No Globally Optimal Design

Every distributed systems decision is a negotiation among consistency, availability, latency, throughput, and simplicity. No design is optimal on all of them at once; there is only the right design for your constraints.

02 — PRIMARY SKILL

Reduce Uncertainty Over Time

Every week, the project should be less ambiguous than the week before. If the list of unknowns keeps growing, something is structurally wrong.

03 — FORGOTTEN AXIOM

The System Includes Humans

Operational complexity, on-call burden, knowledge transfer cost — these are system properties, not afterthoughts. A system that only works when its creator is present is badly designed.

Table of Contents

Each chapter ends with: the key principle in one sentence, the most common mistake, and three questions for your next design review.

Epilogue: The Principles Behind the Principles

01

Make the Implicit Explicit

Undocumented assumptions are liabilities. Make every constraint, every trade-off, every failure mode visible — in design docs, in code, in runbooks, in conversations.

02

Design for the Failure You Haven't Imagined Yet

You will not enumerate all failure modes. Design systems that degrade gracefully under unknown failures, not just the ones you planned for.

03

The System Includes the Humans

Operational complexity, on-call burden, knowledge transfer cost — these are system properties, not afterthoughts. A system that works perfectly when its creator is available and fails mysteriously when they're not is a badly designed system.


Appendices

A

The Numbers Every Engineer Should Know — latency, throughput, availability, and cost reference card

B

Back-of-Envelope Estimation — 5 worked examples: URL shortener, social feed, video platform, location service, rate limiter

C

Recommended Reading — books, foundational papers, and engineering essays that changed how the field thinks

D

Glossary of Precise Terms — because "consistency" means four different things depending on who's talking