Designing Data-Intensive Applications
Book
Martin Kleppmann (2017, O'Reilly)
The single best book on this subject. Kleppmann explains replication, partitioning, transactions, consistency models, and stream processing with the rare combination of technical depth and genuine clarity. If you read only one book from this list, read this one. Chapter 9 on consistency and consensus alone is worth the price.
The Google SRE Book
Book
Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy (2016, O'Reilly) — Free online
The operational side of large-scale systems, told by the people who ran them. Error budgets, toil, postmortem culture, the SLO framework — this book invented the vocabulary. It is occasionally repetitive but the core chapters on SLOs, eliminating toil, and incident management are foundational.
Database Internals
Book
Alex Petrov (2019, O'Reilly)
Goes deeper than DDIA on storage engines and consensus algorithms. If you want to understand B-trees vs. LSM trees at the implementation level, how Raft actually works in practice, and how distributed databases are built, this is the book. More technical and narrower in scope than DDIA — read it second.
Building Microservices
Book
Sam Newman (2nd ed. 2021, O'Reilly)
The most balanced treatment of service decomposition. Newman does not cheerlead for microservices — he lays out the genuine costs alongside the benefits. His chapters on service boundaries, data decomposition, and migration strategies are particularly useful for teams decomposing a monolith.
Fundamentals of Software Architecture
Book
Mark Richards & Neal Ford (2020, O'Reilly)
Practical architectural decision-making, not just patterns. The framework of architecture characteristics (scalability, reliability, testability, and so on) gives you a useful vocabulary for describing what you are optimizing for. Best read alongside DDIA — one focuses on data systems, this one on the broader architecture trade-off space.
Team Topologies
Book
Matthew Skelton & Manuel Pais (2019, IT Revolution)
The organizational design book for engineers, not managers. The four team types (stream-aligned, platform, enabling, complicated-subsystem) give you a language for the conversations that happen when Conway's Law is working against you. Short, dense, and immediately applicable.
A Philosophy of Software Design
Book
John Ousterhout (2nd ed. 2021)
The best book on managing complexity in software. Ousterhout's concept of "deep modules" — interfaces that are small but implementations that are large — is a powerful lens for distributed system design. Short (190 pages), opinionated, and usefully provocative on comments, error handling, and module depth.
These are the papers that introduced ideas now used in nearly every large distributed system. You do not need to implement them — but understanding what problem each one solved and why changes how you think about the design space.
The Google File System (GFS)
Paper
Ghemawat, Gobioff, Leung — SOSP 2003
The paper that showed you can build reliable storage from unreliable components at scale. GFS's design — large files, append-only, relaxed consistency — was shaped entirely by its workload. The paper is a masterclass in letting real requirements drive architecture rather than conventional wisdom. Read it to see how to reason from first principles.
MapReduce: Simplified Data Processing on Large Clusters
Paper
Dean & Ghemawat — OSDI 2004
The paper that made distributed batch processing accessible. MapReduce is no longer state of the art, but the underlying idea — hide fault tolerance from the programmer by automatically re-executing failed tasks — influenced every batch and stream processing system that followed. The simplicity of the programming model is the real lesson.
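The programming model is simple enough to sketch in-process. The word-count example below is illustrative only (it omits the distribution and fault-tolerance machinery that is the paper's real contribution); the function names are my own, not the paper's:

```python
from collections import defaultdict

def map_words(doc):
    # Map phase: emit a (key, value) pair per word.
    for word in doc.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    # Reduce phase: combine the values for a single key.
    return (key, sum(values))

def word_count(docs):
    pairs = (pair for doc in docs for pair in map_words(doc))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

print(word_count(["the log the log", "the lease"]))
# → {'the': 3, 'log': 2, 'lease': 1}
```

Everything the programmer writes is the two pure functions; the framework owns scheduling, shuffling, and re-executing failed tasks, which is why the model composes so well with unreliable hardware.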
Dynamo: Amazon's Highly Available Key-Value Store
Paper
DeCandia et al. — SOSP 2007
The paper that brought eventual consistency into mainstream engineering. Dynamo introduced consistent hashing, vector clocks, sloppy quorums, and anti-entropy via Merkle trees — techniques now embedded in Cassandra, Riak, and DynamoDB. More importantly, it showed the real trade-offs of choosing availability over consistency in a production system run at Amazon scale.
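Of those techniques, consistent hashing is the easiest to see in a few lines. Below is a minimal sketch (class and parameter names are mine; real rings also handle replication by walking to the next N distinct nodes):

```python
import hashlib
from bisect import bisect

def _hash(key):
    # Any well-distributed hash works; md5 is just convenient here.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions ("virtual nodes")
        # on the ring to smooth out the key distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    def node_for(self, key):
        # Walk clockwise to the first vnode at or after the key's hash,
        # wrapping around the ring. (Recomputing the point list per call
        # is wasteful; a real implementation caches it.)
        points = [p for p, _ in self._ring]
        idx = bisect(points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The payoff is incremental rebalancing: adding a node moves only the keys that now fall on its vnodes, roughly 1/N of the keyspace, instead of rehashing everything as `hash(key) % N` would.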
Bigtable: A Distributed Storage System for Structured Data
Paper
Chang et al. — OSDI 2006
The paper behind HBase, Cassandra's data model, and cloud NoSQL. The SSTable/memtable pattern for LSM storage originates here. Understanding Bigtable's data model — rows, column families, timestamps as a first-class dimension — explains design decisions in every wide-column store you will encounter.
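The memtable/SSTable write path can be sketched in miniature. The toy below is my own illustration of the pattern, not Bigtable's code: writes land in a mutable in-memory table, which is periodically frozen into an immutable sorted run, and reads check newest data first:

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}    # mutable, in-memory
        self.sstables = []    # immutable sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: freeze the memtable into a sorted, immutable run.
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:          # newest run wins
            for k, v in run:               # real SSTables use binary
                if k == key:               # search plus bloom filters,
                    return v               # not a linear scan
        return None
```

Left out here: the write-ahead log that makes the memtable crash-safe, and compaction, which merges runs so reads do not degrade as flushes accumulate.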
In Search of an Understandable Consensus Algorithm (Raft)
Paper
Ongaro & Ousterhout — USENIX ATC 2014
The consensus algorithm that was designed to be taught, not just implemented. If you want to understand how etcd, CockroachDB, TiKV, and dozens of other systems achieve distributed consensus, read this paper. Unlike the Paxos papers, it was explicitly written for comprehensibility — and it succeeds. After reading, you will understand leader election, log replication, and membership changes at an implementable level.
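As a taste of that implementability, here is the RequestVote decision as I read it from the paper's rules (dict-based state is my simplification; a real node also persists `current_term` and `voted_for` before replying):

```python
def grant_vote(state, req):
    """Decide a RequestVote RPC using the rules from the Raft paper."""
    if req["term"] < state["current_term"]:
        return False                        # stale candidate
    if req["term"] > state["current_term"]:
        state["current_term"] = req["term"]
        state["voted_for"] = None           # a new term resets our vote
    # The candidate's log must be at least as up to date as ours:
    # higher last-entry term wins; equal terms compare log length.
    log_ok = (req["last_log_term"], req["last_log_index"]) >= \
             (state["last_log_term"], state["last_log_index"])
    if state["voted_for"] in (None, req["candidate_id"]) and log_ok:
        state["voted_for"] = req["candidate_id"]
        return True
    return False
```

The log up-to-dateness check is the detail that prevents a stale follower from winning an election and overwriting committed entries.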
Spanner: Google's Globally-Distributed Database
Paper
Corbett et al. — OSDI 2012
The paper that demonstrated global external consistency is achievable — at a price. TrueTime (GPS + atomic clocks to bound clock uncertainty) is the mechanism. The paper is a useful corrective to the idea that you must always choose between consistency and availability — the real choice is between consistency and cost. CockroachDB and YugabyteDB are open-source implementations of the same ideas.
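The commit-wait idea is small enough to sketch. Below is an illustration under assumed machinery: a fake TrueTime interval with an invented uncertainty bound `EPSILON` standing in for the GPS/atomic-clock infrastructure:

```python
import time

EPSILON = 0.004  # assumed uncertainty bound (seconds); real TrueTime keeps this to a few ms

def tt_now():
    # TrueTime returns an interval guaranteed to contain true time.
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes):
    # Pick the commit timestamp at the top of the uncertainty interval...
    s = tt_now()[1]
    apply_writes(s)
    # ...then commit-wait: do not acknowledge until s is certainly in
    # the past everywhere, so no later-starting transaction can be
    # assigned a smaller timestamp.
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 4)
    return s
```

The cost of external consistency is right there in the loop: every read-write transaction pays roughly one uncertainty interval of extra latency, which is why shrinking clock uncertainty was worth dedicated hardware.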
The Log: What every software engineer should know about real-time data's unifying abstraction
Essay
Jay Kreps (2013, LinkedIn Engineering Blog)
The essay that explains why the append-only log is the foundation of modern distributed data systems. Kreps shows how databases, replication, stream processing, and event sourcing are all instances of the same abstraction. More than any paper, this essay explains the thinking behind Kafka and why it became infrastructure. Required reading before designing any event-driven system.
Kafka: a Distributed Messaging System for Log Processing
Paper
Kreps, Narkhede, Rao — NetDB 2011
The original Kafka paper. Much simpler than modern Kafka — read it to understand the core design decisions (sequential disk I/O, consumer-managed offsets, partition model) and why each one was made. The design choices that made Kafka fast are not obvious until you see them explained in the context of the alternatives they replaced.
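The consumer-managed offset model, in particular, is worth internalizing. A toy sketch (my own class names, single partition, no persistence) of the idea:

```python
class LogPartition:
    """Append-only record log; the broker never tracks consumer position."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1        # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]

class Consumer:
    """Each consumer owns its offset, so replay is just a rewind."""
    def __init__(self, partition):
        self.partition = partition
        self.offset = 0

    def poll(self):
        batch = self.partition.read(self.offset)
        self.offset += len(batch)
        return batch
```

Because the broker only appends and serves ranges, it stays stateless with respect to consumers: any number of independent readers can consume the same partition at their own pace, and recovering from a bad deploy is just resetting an offset.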
Time, Clocks, and the Ordering of Events in a Distributed System
Paper
Leslie Lamport — Communications of the ACM, 1978
Possibly the most important paper in distributed systems, written in 1978. Lamport introduces logical clocks and the happens-before relationship — the fundamental tool for reasoning about event ordering when you cannot trust physical clocks. Dense but short. Read it slowly. The ideas in this paper underpin vector clocks, CRDTs, and the consistency model vocabulary used throughout this book.
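The logical clock itself fits in a dozen lines. A minimal sketch of Lamport's rules (tick on local events, attach the time to messages, and on receipt jump past the sender's timestamp):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule 1: advance the counter on every local event.
        self.time += 1
        return self.time

    def send(self):
        # Attach the post-tick time to an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # Rule 2: on receipt, jump past the sender's timestamp, then tick.
        self.time = max(self.time, msg_time)
        return self.tick()
```

The guarantee is one-directional: if event a happens-before event b, then clock(a) < clock(b). The converse does not hold, which is exactly the gap vector clocks were later invented to close.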
CAP Twelve Years Later: How the "Rules" Have Changed
Paper
Eric Brewer — IEEE Computer, 2012
Brewer's own correction to the CAP theorem he introduced. The original theorem is frequently misused. Brewer's 2012 clarification explains what CAP actually says and what it does not; pair it with Abadi's PACELC model (which accounts for latency trade-offs even when there is no partition), often a more useful frame for real system design decisions. Read this after you think you understand CAP.
These are not academic papers — they are practitioners describing real systems with real constraints. The value is seeing how the abstract ideas above collide with production reality.
How Discord Stores Billions of Messages
Blog
Discord Engineering Blog (2017, updated 2023)
A clear, honest account of migrating from MongoDB to Cassandra to ScyllaDB as scale grew. Each migration is explained in terms of the specific failure modes encountered at the previous scale. A good illustration of how the right database choice is not universal — it depends on access patterns, scale, and operational capacity.
Lessons Learned from Running Apache Kafka at Scale
Blog
Confluent Engineering Blog (various years)
The gap between the Kafka paper and running Kafka in production is large. These posts cover consumer group rebalancing pitfalls, partition sizing, compaction issues, and cross-datacenter replication. Read when you are designing a Kafka-based system for the first time.
The Tail at Scale
Paper
Dean & Barroso — Communications of the ACM, 2013
Explains why tail latency is fundamentally different from average latency in distributed systems, and why it gets worse as you add more components. Introduces hedged requests and tied requests as techniques for attacking tail latency. Short and practical — directly applicable to any system with latency SLOs.
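A hedged request is simple to express. The sketch below is illustrative (the simulated `backend` latencies and the `hedge_after` threshold are invented; the paper suggests hedging at roughly the 95th-percentile latency to cap extra load at a few percent):

```python
import concurrent.futures as cf
import random
import time

_pool = cf.ThreadPoolExecutor(max_workers=8)

def backend(replica):
    # Stand-in for an RPC that is usually fast but occasionally very slow.
    time.sleep(random.choice([0.01, 0.01, 0.01, 0.5]))
    return f"reply from {replica}"

def hedged_call(replicas, hedge_after=0.05):
    """Send to one replica; if no reply by `hedge_after`, send the same
    request to a second replica and take whichever answers first."""
    first = _pool.submit(backend, replicas[0])
    done, _ = cf.wait([first], timeout=hedge_after)
    if done:
        return first.result()
    second = _pool.submit(backend, replicas[1])
    done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
    return done.pop().result()
```

The trick is that the hedge only fires on the slow tail, so the duplicate-request overhead stays small while the worst-case latency collapses toward the second replica's typical latency.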
Zanzibar: Google's Consistent, Global Authorization System
Paper
Pang et al. — USENIX ATC 2019
The paper behind every modern authorization-as-a-service system (Ory, SpiceDB, AuthZed). Zanzibar's approach to consistency in a globally distributed auth system — using "zookies" (consistency tokens) to enforce read-your-writes without full global consistency — is a clever solution to a problem every large platform eventually faces.
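A toy version of the tuple-plus-zookie idea, to make the mechanism concrete. Everything here is my simplification: real Zanzibar zookies are opaque encoded timestamps, the store is globally replicated, and checks also evaluate relation-rewrite rules rather than bare membership:

```python
class TupleStore:
    """Stores ⟨object, relation, user⟩ tuples in versioned snapshots."""
    def __init__(self):
        self.snapshots = []   # list of (timestamp, frozenset of tuples)
        self.tuples = set()
        self.clock = 0

    def write(self, obj, relation, user):
        self.tuples.add((obj, relation, user))
        self.clock += 1
        self.snapshots.append((self.clock, frozenset(self.tuples)))
        return self.clock     # the "zookie": an opaque consistency token

    def check(self, obj, relation, user, zookie=0):
        # Evaluate at the newest snapshot at least as fresh as the
        # zookie: read-your-writes for the caller who holds the token,
        # without forcing every replica to be globally consistent.
        for ts, snap in reversed(self.snapshots):
            if ts >= zookie:
                return (obj, relation, user) in snap
        return False
```

The client stores the zookie alongside the content it protects; any later check that presents it is guaranteed to observe at least that ACL write, while checks without a token can be served from whatever snapshot is cheapest.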
Testing Distributed Systems
Essay
Andrey Satarin (curated collection, regularly updated)
The most comprehensive survey of distributed systems testing approaches. Covers TLA+ specification, deterministic simulation (FoundationDB), fault injection, Jepsen, and formal verification. If you are trying to understand how to test a system that is genuinely hard to test, start here.
FoundationDB: A Distributed Unbundled Transactional Key Value Store
Paper
Zhou et al. — SIGMOD 2021
The paper behind one of the most rigorously tested distributed systems ever built. The simulation testing approach — running the entire system in a deterministic simulation to find bugs — is genuinely different from how most systems are tested and worth understanding as a target to aspire to, even if your team cannot implement it fully.
The field moves. Papers from 2010 describe systems that are now considered foundational. Here is where to follow current thinking:
USENIX ATC, OSDI, SOSP, VLDB Proceedings
Conferences
usenix.org · vldb.org (all proceedings are open access)
The four venues where most of the ideas in this book were first published. USENIX ATC and OSDI are the most accessible for practitioners. VLDB covers the database and data systems side more deeply. New papers from these conferences describe what large systems will look like in 3–5 years.
The Morning Paper
Blog
Adrian Colyer (hiatus since 2020, archive available)
Five years of daily summaries of important CS papers, written accessibly for practitioners. The archive is a goldmine. Search it for any topic in this book and you will find a 500-word summary that tells you whether the full paper is worth your time.
High Scalability Blog
Blog
Todd Hoff (highscalability.com)
Architecture breakdowns of real systems — Uber, Netflix, Twitter, Slack, Stripe — told from the "how does this actually work at scale" angle. Uneven in quality but consistently useful for seeing how the same problems get solved differently in different organizations.