Staff Engineer's Field Guide

Principles of Distributed Systems Design

A practitioner's guide to building scalable, fault-tolerant, reliable, maintainable, and operationally efficient systems — focused on judgment, not just mechanisms.

10 Parts

38 Chapters

4 Appendices

~600 Pages

Preface: Why Another Book

Most distributed systems books teach you mechanisms — how Raft works, what consistent hashing is. This book teaches you judgment — when to use which mechanism, what you're trading away, and how to reason about systems you've never seen before. The goal is to make you dangerous with a whiteboard, not just with a textbook.

01 — CORE TENSION

No Globally Optimal Design

Every distributed systems decision is a negotiation between consistency, availability, latency, throughput, and simplicity. There is no globally optimal design, only the right design for your constraints.

02 — PRIMARY SKILL

Reduce Uncertainty Over Time

Every week, the project should be less ambiguous than the week before. If the list of unknowns keeps growing, something is structurally wrong.

03 — FORGOTTEN AXIOM

The System Includes Humans

Operational complexity, on-call burden, knowledge transfer cost — these are system properties, not afterthoughts. A system that only works when its creator is present is badly designed.

Table of Contents

Each chapter ends with: the key principle in one sentence, the most common mistake, and three questions for your next design review.

Part I. Foundations: How to Think, Not What to Think (4 chapters)

The Eight Fallacies Are Still Killing Projects

The network is unreliable, latency is not zero, bandwidth is finite. Most engineers know these facts. Almost none design as if they believe them.

The Three Fundamental Tensions

Consistency vs. Availability. Latency vs. Throughput. Simplicity vs. Capability. How to surface your actual constraints before you design.

The Failure Taxonomy

Crash, omission, timing, byzantine, and human failures. Not all failures are equal — the system you need depends entirely on which you're designing against.

Reasoning Under Uncertainty — The Mental Models

The Two Generals' Problem, the happens-before relation, linearizability vs. serializability. A precise vocabulary for what "correct" means.

Part II. Scalability: Handling More (5 chapters)

What "Scale" Actually Means

Data volume, read throughput, write throughput, request complexity, geographic reach. Each requires a different lever. Pulling the wrong one wastes engineering years.

Partitioning — Divide and Conquer, Carefully

Range vs. hash partitioning, consistent hashing, the partition key decision, hot partitions, and the cross-partition query problem.
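
As a rough illustration of the consistent hashing idea named above, here is a minimal sketch of a hash ring with virtual nodes. The class name `HashRing`, the `vnodes` parameter, the node names, and the choice of MD5 as the placement hash are all assumptions made for this example, not details from the book.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # MD5 is used only to spread keys evenly, not for security (an assumption).
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._points = []      # sorted list of (ring position, node name)
        self._positions = []   # ring positions only, kept parallel for bisect
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns several points so load spreads more evenly.
        for i in range(self._vnodes):
            self._points.append((_hash(f"{node}#{i}"), node))
        self._points.sort()
        self._positions = [position for position, _ in self._points]

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring has no nodes")
        # First point clockwise from the key's position, wrapping at the end.
        index = bisect.bisect(self._positions, _hash(key)) % len(self._points)
        return self._points[index][1]
```

The property worth noticing is that adding a node only pulls keys onto the new node; keys never shuffle between the existing ones, which is what distinguishes this from plain `hash(key) % n` partitioning.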

Replication — The Source of Most Complexity

Single-leader, multi-leader, leaderless. Replication lag. Read-your-writes. Conflict resolution and why last-write-wins is almost always wrong.

Caching — The Fastest Code Is Code That Doesn't Run

Cache invalidation, write strategies, thundering herd, stampedes, hot keys. And what caches hide until they're cold.

Load Distribution

Balancing algorithms, request hedging, backpressure — the right response to overload that most systems don't implement.

Part III. Fault Tolerance: Surviving Failure (5 chapters)

The Spectrum of Availability

What 99.9% vs. 99.999% uptime actually means. SLA vs. SLO vs. SLI. Error budgets. Availability as a property of a call chain, not a single service.
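
To make the call-chain point concrete, the arithmetic can be sketched as follows. The function names are hypothetical, and the chain calculation assumes every dependency is required for the request and that failures are independent, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def chain_availability(availabilities) -> float:
    """Availability of a request that needs every dependency in series.

    Assumes independent failures (an assumption of this sketch).
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


print(downtime_minutes_per_year(0.999))    # about 525.6 minutes (~8.8 hours)
print(downtime_minutes_per_year(0.99999))  # about 5.3 minutes
print(chain_availability([0.999] * 5))     # about 0.995, below any single hop
```

Five "three nines" services called in series deliver roughly 99.5% end to end, which is why the target has to be set for the chain rather than for each service in isolation.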

Designing for Partial Failure

Bulkheads, timeouts (where the most common mistakes are made), circuit breakers as state machines, retry strategies that don't make overload worse.
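
One common way to keep retries from amplifying overload is capped exponential backoff with random jitter and a hard attempt limit. The sketch below is one such approach, not necessarily the one the chapter recommends; `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the default delays are arbitrary.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return (rng or random).uniform(0.0, ceiling)


def call_with_retries(operation: Callable[[], T], max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry only transient failures, with a bounded number of attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempt budget exhausted: surface the failure
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The jitter spreads retries from many clients over time instead of synchronizing them into waves, and the attempt limit bounds how much extra load a failing dependency can receive.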

Consensus — The Hard Problem

FLP impossibility, Paxos, Raft — leader election, log replication, membership changes. When you actually need consensus vs. when it's a crutch.

Transactions Across Services

ACID limits, distributed transactions, Sagas, idempotency, and the dual-write problem you can't atomically solve — but can design around.

The Art of Graceful Degradation

Feature tiers, load shedding, fallback strategies, and chaos engineering — breaking things before they break themselves.

Part IV. Consistency and Correctness: Getting the Right Answer (4 chapters)

Time Is a Lie in Distributed Systems

Physical clocks, logical clocks, vector clocks, Hybrid Logical Clocks, and Google Spanner's TrueTime — what it takes to make time useful.
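
For readers new to logical clocks, a Lamport clock is the simplest of the mechanisms listed above. This is a minimal sketch with hypothetical class and method names; it orders causally related events but cannot tell concurrent events apart, which is the gap vector clocks address.

```python
class LamportClock:
    """Hypothetical minimal Lamport logical clock for one process."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is itself an event; the returned value travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Jump past the sender's timestamp so the receive orders after the send."""
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()
a.tick()
stamp = a.send()           # 3
received = b.receive(stamp)
print(stamp, received)     # 3 4
```

No wall-clock time is involved: the only guarantee is that if one event causally precedes another, its timestamp is smaller.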

The Consistency Model Landscape

Linearizability, sequential consistency, causal consistency, and eventual consistency: what each actually guarantees, what it costs, and who really needs it.

Event Sourcing and the Immutable Log

The append-only log, event sourcing, CQRS, the projections problem — and when event sourcing makes simple things hard.

Idempotency — The Superpower

Idempotency keys, at-most-once vs. at-least-once vs. exactly-once (exactly-once is not what you think), and the deduplication window.
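
The interplay between idempotency keys and the deduplication window can be sketched in a few lines. Everything here is an assumption for illustration: the `IdempotentHandler` name, the in-memory dictionary, and the single-process, single-threaded setting. A production version would need durable, shared storage and concurrency control.

```python
import time


class IdempotentHandler:
    """Hypothetical wrapper that replays stored results for repeated keys."""

    def __init__(self, handler, window_seconds=300.0, clock=time.monotonic):
        self._handler = handler
        self._window = window_seconds
        self._clock = clock
        self._seen = {}  # idempotency key -> (expires_at, result)

    def handle(self, key, request):
        now = self._clock()
        # Forget keys whose deduplication window has passed.
        self._seen = {k: v for k, v in self._seen.items() if v[0] > now}
        if key in self._seen:
            return self._seen[key][1]  # duplicate delivery: replay the result
        result = self._handler(request)
        self._seen[key] = (now + self._window, result)
        return result
```

With at-least-once delivery, a retried request inside the window returns the stored result instead of repeating the side effect. Outside the window the same key is processed again, so the window length is a correctness decision, not just a memory-tuning knob.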

Part V. Maintainability: Systems That Don't Become Traps (4 chapters)

Evolvability — Designing for Change You Can't Predict

Schema evolution, forward/backward compatibility, schema registries, API versioning costs, the strangler fig pattern.

Service Boundaries — The Decision That's Hardest to Undo

Conway's Law, DDD for finding boundaries, the distributed monolith, when microservices are actually a mistake.

Documentation as a System Property

ADRs, runbooks vs. playbooks vs. post-mortems, self-documenting systems through observability, and the decay problem.

Testing Distributed Systems

Property-based testing, contract tests, deterministic simulation testing (the FoundationDB approach), chaos testing.

Part VI. Extensibility: Systems That Grow With You (3 chapters)

API Design as a Distributed Systems Problem

REST vs. gRPC vs. GraphQL as a trade-off, not a religion. API contracts, versioning traps, pagination, filtering.

The Plugin Architecture and Extension Points

Designing for extensibility without infinite complexity. Webhooks, event buses, pipeline patterns, and where extension points become attack surfaces.

Platform Thinking — Building Systems Other Systems Build On

Service vs. platform, primitives vs. solutions, developer experience as a first-class concern, and the tax of platform ownership.

Part VII. Operational Efficiency: Running at Scale Without Burning Out (5 chapters)

Observability — You Can't Fix What You Can't See

Metrics (RED method), structured logs, distributed traces, continuous profiling. Observability-driven development from day one.

The On-Call Experience as a Design Constraint

Toil, alert fatigue as a design failure, runbooks that actually work, blameless post-mortems, the five-year test.

Capacity Planning — Thinking in Orders of Magnitude

Back-of-envelope estimation, the numbers every engineer should know, headroom, traffic patterns, cost as a target that cannot be set independently of capacity.

Deployment and Release Engineering

Feature flags, canary deployments, blue-green deployments, database migrations in live systems — the expand-contract pattern.

Cost as a System Property

Cloud costs are an architecture problem. Data transfer tax, storage tiering, build vs. buy through a cost lens, FinOps.

Part VIII. Data Systems: Where Everything Gets Hard (3 chapters)

Storage Engine Internals (What You Need to Know)

B-trees vs. LSM trees, the storage trilemma, compaction, column-oriented storage. The most load-bearing architectural decision for data-heavy systems.

Stream Processing vs. Batch Processing

Lambda vs. Kappa architectures, event time vs. processing time, watermarks, stateful stream processing, when batch is still right.

The Coordination Problem

Distributed locks, leader election, distributed rate limiting, distributed cron — harder than it looks, and often avoidable.

Part IX. Security as a System Property (2 chapters)

Security Is Not a Feature, It's a Constraint

Threat modeling, defense in depth, least privilege in service-to-service comms, secrets management, blast radius reasoning.

Trust in Distributed Systems

Zero trust networking, mTLS, certificate management at scale, SPIFFE/SPIRE, service identity, authentication vs. authorization at the system level.

Part X. The Human System (3 chapters)

Conway's Law and Organizational Design

You cannot design a system better than your team's communication structure. Inverse Conway Maneuver, Team Topologies, ownership gaps.

The Decision-Making Framework for System Design

Irreversible vs. reversible decisions, the RFC process, prototyping vs. designing, escalation, resolving technical disagreements.

Building a Culture of Reliability

Reliability through process, not heroism. Production readiness reviews, SRE model, blameless culture, game days.

Epilogue: The Principles Behind the Principles

01

Make the Implicit Explicit

Undocumented assumptions are liabilities. Make every constraint, every trade-off, every failure mode visible — in design docs, in code, in runbooks, in conversations.

02

Design for the Failure You Haven't Imagined Yet

You will not enumerate all failure modes. Design systems that degrade gracefully under unknown failures, not just the ones you planned for.

03

The System Includes the Humans

Operational complexity, on-call burden, knowledge transfer cost — these are system properties, not afterthoughts. A system that works perfectly when its creator is available and fails mysteriously when they're not is a badly designed system.

Appendices

The Numbers Every Engineer Should Know — latency, throughput, availability, and cost reference card

Back-of-Envelope Estimation — 5 worked examples: URL shortener, social feed, video platform, location service, rate limiter

Recommended Reading — books, foundational papers, and engineering essays that changed how the field thinks

Glossary of Precise Terms — because "consistency" means four different things depending on who's talking