Designing Data-Intensive Applications
Book
Martin Kleppmann (2017, O'Reilly)
The single best book on this subject. Kleppmann explains replication, partitioning, transactions, consistency models, and stream processing with the rare combination of technical depth and genuine clarity. If you read only one book from this list, read this one. Chapter 9 on consistency and consensus alone is worth the price.
The Google SRE Book
Book
Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy (2016, O'Reilly) — Free online
The operational side of large-scale systems, told by the people who ran them. Error budgets, toil, postmortem culture, the SLO framework — this book invented the vocabulary. It is occasionally repetitive but the core chapters on SLOs, eliminating toil, and incident management are foundational.
Database Internals
Book
Alex Petrov (2019, O'Reilly)
Goes deeper than DDIA on storage engines and consensus algorithms. If you want to understand B-trees vs. LSM trees at the implementation level, how Raft actually works in practice, and how distributed databases are built, this is the book. More technical and narrower in scope than DDIA — read it second.
Building Microservices
Book
Sam Newman (2nd ed. 2021, O'Reilly)
The most balanced treatment of service decomposition. Newman does not cheerlead for microservices — he lays out the genuine costs alongside the benefits. His chapters on service boundaries, data decomposition, and migration strategies are particularly useful for teams decomposing a monolith.
Fundamentals of Software Architecture
Book
Mark Richards & Neal Ford (2020, O'Reilly)
Practical architectural decision-making, not just patterns. The framework of architecture characteristics (scalability, reliability, testability, and so on) gives you a useful vocabulary for describing what you are optimizing for. Best read alongside DDIA — one focuses on data systems, this one on the broader architecture trade-off space.
Team Topologies
Book
Matthew Skelton & Manuel Pais (2019, IT Revolution)
The organizational design book for engineers, not managers. The four team types (stream-aligned, platform, enabling, complicated-subsystem) give you a language for the conversations that happen when Conway's Law is working against you. Short, dense, and immediately applicable.
A Philosophy of Software Design
Book
John Ousterhout (2nd ed. 2021)
The best book on managing complexity in software. Ousterhout's concept of "deep modules" — interfaces that are small but implementations that are large — is a powerful lens for distributed system design. Short (190 pages), opinionated, and usefully provocative on comments, error handling, and module depth.
These are the papers that introduced ideas now used in nearly every large distributed system. You do not need to implement them — but understanding what problem each one solved and why changes how you think about the design space.
The Google File System (GFS)
Paper
Ghemawat, Gobioff, Leung — SOSP 2003
The paper that showed you can build reliable storage from unreliable components at scale. GFS's design — large files, append-only, relaxed consistency — was shaped entirely by its workload. The paper is a masterclass in letting real requirements drive architecture rather than conventional wisdom. Read it to see how to reason from first principles.
MapReduce: Simplified Data Processing on Large Clusters
Paper
Dean & Ghemawat — OSDI 2004
The paper that made distributed batch processing accessible. MapReduce is no longer state of the art, but the underlying idea — hide fault tolerance from the programmer by automatically re-executing failed tasks — influenced every batch and stream processing system that followed. The simplicity of the programming model is the real lesson.
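The programming model is simple enough to sketch in-process. The word-count example below is illustrative only (it omits the distribution and fault-tolerance machinery that is the paper's real contribution); the function names are my own, not the paper's:

```python
from collections import defaultdict

def map_words(doc):
    # Map phase: emit a (key, value) pair per word.
    for word in doc.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    # Reduce phase: combine the values for a single key.
    return (key, sum(values))

def word_count(docs):
    pairs = (pair for doc in docs for pair in map_words(doc))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

print(word_count(["the log the log", "the lease"]))
# → {'the': 3, 'log': 2, 'lease': 1}
```

Everything the programmer writes is the two pure functions; the framework owns scheduling, shuffling, and re-executing failed tasks, which is why the model composes so well with unreliable hardware.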
Dynamo: Amazon's Highly Available Key-Value Store
Paper
DeCandia et al. — SOSP 2007
The paper that brought eventual consistency into mainstream engineering. Dynamo introduced consistent hashing, vector clocks, sloppy quorums, and anti-entropy via Merkle trees — techniques now embedded in Cassandra, Riak, and DynamoDB. More importantly, it showed the real trade-offs of choosing availability over consistency in a production system run at Amazon scale.
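Of those techniques, consistent hashing is the easiest to see in a few lines. Below is a minimal sketch (class and parameter names are mine; real rings also handle replication by walking to the next N distinct nodes):

```python
import hashlib
from bisect import bisect

def _hash(key):
    # Any well-distributed hash works; md5 is just convenient here.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions ("virtual nodes")
        # on the ring to smooth out the key distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    def node_for(self, key):
        # Walk clockwise to the first vnode at or after the key's hash,
        # wrapping around the ring. (Recomputing the point list per call
        # is wasteful; a real implementation caches it.)
        points = [p for p, _ in self._ring]
        idx = bisect(points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The payoff is incremental rebalancing: adding a node moves only the keys that now fall on its vnodes, roughly 1/N of the keyspace, instead of rehashing everything as `hash(key) % N` would.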
Bigtable: A Distributed Storage System for Structured Data
Paper
Chang et al. — OSDI 2006
The paper behind HBase, Cassandra's data model, and cloud NoSQL. The SSTable/memtable pattern for LSM storage originates here. Understanding Bigtable's data model — rows, column families, timestamps as a first-class dimension — explains design decisions in every wide-column store you will encounter.
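The memtable/SSTable write path can be sketched in miniature. The toy below is my own illustration of the pattern, not Bigtable's code: writes land in a mutable in-memory table, which is periodically frozen into an immutable sorted run, and reads check newest data first:

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}    # mutable, in-memory
        self.sstables = []    # immutable sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: freeze the memtable into a sorted, immutable run.
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:          # newest run wins
            for k, v in run:               # real SSTables use binary
                if k == key:               # search plus bloom filters,
                    return v               # not a linear scan
        return None
```

Left out here: the write-ahead log that makes the memtable crash-safe, and compaction, which merges runs so reads do not degrade as flushes accumulate.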
In Search of an Understandable Consensus Algorithm (Raft)
Paper
Ongaro & Ousterhout — USENIX ATC 2014
The consensus algorithm that was designed to be taught, not just implemented. If you want to understand how etcd, CockroachDB, TiKV, and dozens of other systems achieve distributed consensus, read this paper. Unlike the Paxos papers, it was explicitly written for comprehensibility — and it succeeds. After reading, you will understand leader election, log replication, and membership changes at an implementable level.
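As a taste of that implementability, here is the RequestVote decision as I read it from the paper's rules (dict-based state is my simplification; a real node also persists `current_term` and `voted_for` before replying):

```python
def grant_vote(state, req):
    """Decide a RequestVote RPC using the rules from the Raft paper."""
    if req["term"] < state["current_term"]:
        return False                        # stale candidate
    if req["term"] > state["current_term"]:
        state["current_term"] = req["term"]
        state["voted_for"] = None           # a new term resets our vote
    # The candidate's log must be at least as up to date as ours:
    # higher last-entry term wins; equal terms compare log length.
    log_ok = (req["last_log_term"], req["last_log_index"]) >= \
             (state["last_log_term"], state["last_log_index"])
    if state["voted_for"] in (None, req["candidate_id"]) and log_ok:
        state["voted_for"] = req["candidate_id"]
        return True
    return False
```

The log up-to-dateness check is the detail that prevents a stale follower from winning an election and overwriting committed entries.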
Spanner: Google's Globally-Distributed Database
Paper
Corbett et al. — OSDI 2012
The paper that demonstrated global external consistency is achievable — at a price. TrueTime (GPS + atomic clocks to bound clock uncertainty) is the mechanism. The paper is a useful corrective to the idea that you must always choose between consistency and availability — the real choice is between consistency and cost. CockroachDB and YugabyteDB are open-source implementations of the same ideas.
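The commit-wait idea is small enough to sketch. Below is an illustration under assumed machinery: a fake TrueTime interval with an invented uncertainty bound `EPSILON` standing in for the GPS/atomic-clock infrastructure:

```python
import time

EPSILON = 0.004  # assumed uncertainty bound (seconds); real TrueTime keeps this to a few ms

def tt_now():
    # TrueTime returns an interval guaranteed to contain true time.
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes):
    # Pick the commit timestamp at the top of the uncertainty interval...
    s = tt_now()[1]
    apply_writes(s)
    # ...then commit-wait: do not acknowledge until s is certainly in
    # the past everywhere, so no later-starting transaction can be
    # assigned a smaller timestamp.
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 4)
    return s
```

The cost of external consistency is right there in the loop: every read-write transaction pays roughly one uncertainty interval of extra latency, which is why shrinking clock uncertainty was worth dedicated hardware.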
The Log: What every software engineer should know about real-time data's unifying abstraction
Essay
Jay Kreps (2013, LinkedIn Engineering Blog)
The essay that explains why the append-only log is the foundation of modern distributed data systems. Kreps shows how databases, replication, stream processing, and event sourcing are all instances of the same abstraction. More than any paper, this essay explains the thinking behind Kafka and why it became infrastructure. Required reading before designing any event-driven system.
Kafka: a Distributed Messaging System for Log Processing
Paper
Kreps, Narkhede, Rao — NetDB 2011
The original Kafka paper. Much simpler than modern Kafka — read it to understand the core design decisions (sequential disk I/O, consumer-managed offsets, partition model) and why each one was made. The design choices that made Kafka fast are not obvious until you see them explained in the context of the alternatives they replaced.
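The consumer-managed offset model, in particular, is worth internalizing. A toy sketch (my own class names, single partition, no persistence) of the idea:

```python
class LogPartition:
    """Append-only record log; the broker never tracks consumer position."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1        # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]

class Consumer:
    """Each consumer owns its offset, so replay is just a rewind."""
    def __init__(self, partition):
        self.partition = partition
        self.offset = 0

    def poll(self):
        batch = self.partition.read(self.offset)
        self.offset += len(batch)
        return batch
```

Because the broker only appends and serves ranges, it stays stateless with respect to consumers: any number of independent readers can consume the same partition at their own pace, and recovering from a bad deploy is just resetting an offset.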
Time, Clocks, and the Ordering of Events in a Distributed System
Paper
Leslie Lamport — Communications of the ACM, 1978
Possibly the most important paper in distributed systems, written in 1978. Lamport introduces logical clocks and the happens-before relationship — the fundamental tool for reasoning about event ordering when you cannot trust physical clocks. Dense but short. Read it slowly. The ideas in this paper underpin vector clocks, CRDTs, and the consistency model vocabulary used throughout this book.
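The logical clock itself fits in a dozen lines. A minimal sketch of Lamport's rules (tick on local events, attach the time to messages, and on receipt jump past the sender's timestamp):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule 1: advance the counter on every local event.
        self.time += 1
        return self.time

    def send(self):
        # Attach the post-tick time to an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # Rule 2: on receipt, jump past the sender's timestamp, then tick.
        self.time = max(self.time, msg_time)
        return self.tick()
```

The guarantee is one-directional: if event a happens-before event b, then clock(a) < clock(b). The converse does not hold, which is exactly the gap vector clocks were later invented to close.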
CAP Twelve Years Later: How the "Rules" Have Changed
Paper
Eric Brewer — IEEE Computer, 2012
Brewer's own correction to the CAP theorem he introduced. The original theorem is frequently misused. Brewer's 2012 clarification explains what CAP actually says and what it does not; pair it with Abadi's PACELC model (which accounts for latency trade-offs even when there is no partition), often a more useful frame for real system design decisions. Read this after you think you understand CAP.
These are not academic papers — they are practitioners describing real systems with real constraints. The value is seeing how the abstract ideas above collide with production reality.
How Discord Stores Billions of Messages
Blog
Discord Engineering Blog (2017, updated 2023)
A clear, honest account of migrating from MongoDB to Cassandra to ScyllaDB as scale grew. Each migration is explained in terms of the specific failure modes encountered at the previous scale. A good illustration of how the right database choice is not universal — it depends on access patterns, scale, and operational capacity.
Lessons Learned from Running Apache Kafka at Scale
Blog
Confluent Engineering Blog (various years)
The gap between the Kafka paper and running Kafka in production is large. These posts cover consumer group rebalancing pitfalls, partition sizing, compaction issues, and cross-datacenter replication. Read when you are designing a Kafka-based system for the first time.
The Tail at Scale
Paper
Dean & Barroso — Communications of the ACM, 2013
Explains why tail latency is fundamentally different from average latency in distributed systems, and why it gets worse as you add more components. Introduces hedged requests and tied requests as techniques for attacking tail latency. Short and practical — directly applicable to any system with latency SLOs.
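A hedged request is simple to express. The sketch below is illustrative (the simulated `backend` latencies and the `hedge_after` threshold are invented; the paper suggests hedging at roughly the 95th-percentile latency to cap extra load at a few percent):

```python
import concurrent.futures as cf
import random
import time

_pool = cf.ThreadPoolExecutor(max_workers=8)

def backend(replica):
    # Stand-in for an RPC that is usually fast but occasionally very slow.
    time.sleep(random.choice([0.01, 0.01, 0.01, 0.5]))
    return f"reply from {replica}"

def hedged_call(replicas, hedge_after=0.05):
    """Send to one replica; if no reply by `hedge_after`, send the same
    request to a second replica and take whichever answers first."""
    first = _pool.submit(backend, replicas[0])
    done, _ = cf.wait([first], timeout=hedge_after)
    if done:
        return first.result()
    second = _pool.submit(backend, replicas[1])
    done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
    return done.pop().result()
```

The trick is that the hedge only fires on the slow tail, so the duplicate-request overhead stays small while the worst-case latency collapses toward the second replica's typical latency.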
Zanzibar: Google's Consistent, Global Authorization System
Paper
Pang et al. — USENIX ATC 2019
The paper behind every modern authorization-as-a-service system (Ory, SpiceDB, AuthZed). Zanzibar's approach to consistency in a globally distributed auth system — using "zookies" (consistency tokens) to enforce read-your-writes without full global consistency — is a clever solution to a problem every large platform eventually faces.
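A toy version of the tuple-plus-zookie idea, to make the mechanism concrete. Everything here is my simplification: real Zanzibar zookies are opaque encoded timestamps, the store is globally replicated, and checks also evaluate relation-rewrite rules rather than bare membership:

```python
class TupleStore:
    """Stores ⟨object, relation, user⟩ tuples in versioned snapshots."""
    def __init__(self):
        self.snapshots = []   # list of (timestamp, frozenset of tuples)
        self.tuples = set()
        self.clock = 0

    def write(self, obj, relation, user):
        self.tuples.add((obj, relation, user))
        self.clock += 1
        self.snapshots.append((self.clock, frozenset(self.tuples)))
        return self.clock     # the "zookie": an opaque consistency token

    def check(self, obj, relation, user, zookie=0):
        # Evaluate at the newest snapshot at least as fresh as the
        # zookie: read-your-writes for the caller who holds the token,
        # without forcing every replica to be globally consistent.
        for ts, snap in reversed(self.snapshots):
            if ts >= zookie:
                return (obj, relation, user) in snap
        return False
```

The client stores the zookie alongside the content it protects; any later check that presents it is guaranteed to observe at least that ACL write, while checks without a token can be served from whatever snapshot is cheapest.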
Testing Distributed Systems
Essay
Andrey Satarin (curated collection, regularly updated)
The most comprehensive survey of distributed systems testing approaches. Covers TLA+ specification, deterministic simulation (FoundationDB), fault injection, Jepsen, and formal verification. If you are trying to understand how to test a system that is genuinely hard to test, start here.
FoundationDB: A Distributed Unbundled Transactional Key Value Store
Paper
Zhou et al. — SIGMOD 2021
The paper behind one of the most rigorously tested distributed systems ever built. The simulation testing approach — running the entire system in a deterministic simulation to find bugs — is genuinely different from how most systems are tested and worth understanding as a target to aspire to, even if your team cannot implement it fully.
The field moves. Papers from 2010 describe systems that are now considered foundational. Here is where to follow current thinking:
USENIX ATC, OSDI, SOSP, VLDB Proceedings
Conferences
usenix.org · vldb.org (all proceedings are open access)
The four venues where most of the ideas in this book were first published. USENIX ATC and OSDI are the most accessible for practitioners. VLDB covers the database and data systems side more deeply. New papers from these conferences describe what large systems will look like in 3–5 years.
The Morning Paper
Blog
Adrian Colyer (hiatus since 2020, archive available)
Five years of daily summaries of important CS papers, written accessibly for practitioners. The archive is a goldmine. Search it for any topic in this book and you will find a 500-word summary that tells you whether the full paper is worth your time.
High Scalability Blog
Blog
Todd Hoff (highscalability.com)
Architecture breakdowns of real systems — Uber, Netflix, Twitter, Slack, Stripe — told from the "how does this actually work at scale" angle. Uneven in quality but consistently useful for seeing how the same problems get solved differently in different organizations.