Part VII: Operational Efficiency

Chapter 28: Capacity Planning
Thinking in Orders of Magnitude

You don't need a precise answer. You need to be right about which digit the answer starts with.

What's in this chapter

  • Why capacity planning fails — and the mindset shift that fixes it
  • The back-of-envelope method: a repeatable process, not a guess
  • The numbers every engineer should have memorized (with intuition, not just values)
  • Three worked examples: a read-heavy API, a write-heavy event pipeline, a storage system
  • How to size headroom — how much buffer is too little, how much is wasteful
  • Traffic patterns and seasonality: why "average load" will mislead you
  • Cost and performance as linked targets, not separate ones
  • How to turn an estimate into a plan your team can actually execute

Key Learnings — Read This First

  1. Orders of magnitude matter more than precision. Is the answer closer to 1,000 or 1,000,000? Knowing the right digit is what lets you make an architectural decision. Being off by 2× is fine; being off by 1,000× is catastrophic.
  2. Memorize ten numbers and derive everything else. A small reference table of latency and throughput values lets you sanity-check any system design in minutes.
  3. Design for peak, not average. Systems fail at peak load. Average load is nearly irrelevant for capacity decisions. Always find your peak multiplier.
  4. Headroom is not waste — it is insurance with a known premium. Running at 50% capacity in steady state costs money but buys you time to react. Running at 95% means one traffic spike wipes you out.
  5. Traffic patterns are spiky, not flat. Daily cycles, weekly cycles, product launches, seasonal events — each creates a multiplier on top of average. Ignore these and you will be paged.
  6. Cost and performance are the same knob. You cannot optimize one without a plan for the other. Treat cost as a constraint from day one, not a post-launch cleanup task.
  7. An estimate without assumptions written down is not an estimate. The assumptions are the most valuable part. When reality diverges from your plan, the assumption list tells you why.

Why Capacity Planning Goes Wrong

Most capacity planning efforts fail in one of two ways. The first is not doing it at all: teams launch a system, wait until it falls over under load, then scramble to add machines. The second is getting paralyzed by the attempt to be precise. Engineers spend weeks building spreadsheets with dozens of variables, trying to predict the exact number of servers needed in eighteen months. Neither approach works.

The problem with the first approach is obvious: you are reacting to failure instead of preventing it. The problem with the second is subtler. You are building a false sense of certainty. The inputs to a precise model — future traffic growth, query patterns, data distribution — are all guesses. Multiplying guesses together doesn't produce accuracy, it produces confident nonsense.

The right approach sits between these extremes. You want an estimate that is good enough to make a decision, produced quickly enough to be useful, with the uncertainty made explicit so you know what to watch.

Core Insight

Capacity planning is not forecasting. It is answering the question: "Given what we know, what is the cheapest way to ensure this system does not fall over?" Precision is not the goal. Adequate headroom is.

The order-of-magnitude mindset

When you're estimating, you don't need to know whether the answer is 4,200 or 4,800. You need to know whether it's closer to 1,000 or 10,000. The difference between those two is a factor of ten, which changes your architecture. The difference between 4,200 and 4,800 changes nothing.

This mental model, thinking in powers of ten, is what separates engineers who are comfortable with estimates from engineers who are paralyzed by them. Once you accept that "roughly ten thousand" is a useful answer, estimation becomes fast and approachable.

The question is never "what is the exact number?" It is always coarser: which power of ten is it closest to, and does that magnitude fit the architecture you have in mind?

The Numbers Every Engineer Should Know

Knowing a small set of hardware and network numbers by heart (not the exact values, but the rough magnitude) lets you sanity-check any design without looking anything up. Here is the reference table. Don't memorize the precise figures. Memorize the category each operation falls into.

Magnitude      Operation                           Intuition
~0.5 ns        L1 cache reference                  Fastest thing a CPU does. Sub-nanosecond.
~7 ns          L2 cache reference                  14× slower than L1. Still faster than almost everything else.
~100 ns        Main memory (RAM) reference         ~200× slower than L1. Your baseline for "fast".
~10 μs         SSD random read                     100× slower than RAM, but still fast enough for most use cases.
~1–10 ms       Same-datacenter network round trip  100–1,000× slower than SSD. This is your distributed-systems baseline.
~50–150 ms     Cross-region network round trip     Speed of light across continents. You cannot optimize below this.
~1 GB/s        SSD sequential read throughput      Modern NVMe SSDs can hit 3–7 GB/s. Use 1 GB/s as a safe lower bound.
~10 GB/s       Memory bandwidth                    Reading a 1 GB dataset from RAM takes roughly 100 ms at this rate.
~1–10 Gbit/s   Intra-datacenter network bandwidth  Between hosts in the same rack. More hops means less bandwidth.
~10 ms         HDD seek time                       Why random reads on spinning disks are catastrophic for latency-sensitive work.
~25 ns         Mutex lock/unlock (uncontended)     Contended locks cost far more. Adds up fast if your hot path takes locks frequently.
~1–10 ms       TCP handshake (same DC)             A full round trip before any data moves. Why connection pooling matters for anything called frequently.

The key mental ratios

Rather than memorizing individual values, internalize these ratios. They hold roughly across hardware generations:

Operation A     Operation B           Ratio   Implication
RAM access      SSD random read       ~100×   In-memory vs. on-disk is not a small difference.
SSD read        Same-DC network call  ~100×   A remote call costs about 100 SSD reads.
Same-DC call    Cross-region call     ~100×   Geographic distribution is expensive in latency.
L1 cache        RAM access            ~200×   Cache misses in tight loops destroy performance.
Rule of Thumb

Each major tier in the memory hierarchy is roughly 100× slower than the one above it. If your design requires an operation to drop down a tier unexpectedly, expect a 100× latency hit. That turns a 1ms response into a 100ms response.
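
If you want these numbers at hand while estimating, a lookup table in code is as good as one on paper. Below is a minimal sketch in Python; the values are the rough magnitudes from the table above, and the dictionary keys and helper name are ours, chosen for this example.

import math  # not needed here, but the later sketches reuse this style

# Rough reference latencies in nanoseconds: magnitudes, not precise values.
LATENCY_NS = {
    "l1_cache": 0.5,
    "l2_cache": 7,
    "mutex_uncontended": 25,
    "ram": 100,
    "ssd_random_read": 10_000,               # ~10 us
    "same_dc_round_trip": 1_000_000,         # ~1 ms, low end of the range
    "hdd_seek": 10_000_000,                  # ~10 ms
    "cross_region_round_trip": 100_000_000,  # ~100 ms
}

def slowdown(fast_op: str, slow_op: str) -> float:
    """How many times slower slow_op is than fast_op."""
    return LATENCY_NS[slow_op] / LATENCY_NS[fast_op]

# Each tier is roughly 100x slower than the one above it.
print(slowdown("ram", "ssd_random_read"))                 # 100.0
print(slowdown("ssd_random_read", "same_dc_round_trip"))  # 100.0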

The Back-of-Envelope Method

Back-of-envelope estimation is a skill, not a talent. It follows a repeatable process. The reason most engineers struggle with it is they try to estimate everything at once. The trick is to break the problem into small, independent pieces, estimate each piece, and multiply them together.

Here is the process step by step:

Step 1: Identify what you are estimating

Be specific. "How much capacity do we need?" is not estimable. "How many application server instances do we need to handle 50,000 requests per second at p99 < 100ms?" is estimable. Before you do any math, write the question down exactly.

Step 2: Write down your assumptions first

Before calculating, list everything you're assuming. Traffic growth rate, average request size, cache hit ratio, read/write ratio, average fanout. Write numbers next to each one. These assumptions become the most valuable part of the estimate — when your system behaves differently than planned, you return to this list to find which assumption was wrong.

Step 3: Break into independent sub-problems

A request to your system touches compute, network, storage, and memory. Estimate each separately. What is the CPU cost per request? What is the memory per open connection? What is the storage growth per day? Each of these can be estimated with simple multiplication.

Step 4: Round aggressively, then sense-check

Use round numbers throughout. 86,400 seconds in a day → call it 100,000. A 1% cache miss rate at 10,000 QPS → 100 misses per second. Round to the nearest power of ten where you can. At the end, sanity-check the result against something you already know. Does this number feel right? Is it in the right ballpark compared to similar systems?

Step 5: Find your load multiple

Your estimate at average load is not the one you design for. Find the ratio of your peak load to your average load. Design for peak. This single step is the one teams most often skip, and it is why systems fall over during product launches and traffic spikes.

Common Mistake

Estimating capacity at average load and then adding "a bit of buffer." This almost always produces a system that is fine on a Tuesday afternoon and falls over on a Friday evening, during a launch, or at the end of a billing cycle. You must find your actual peak multiplier.
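
Before the worked examples, here is the whole method compressed into a short Python sketch. The specific numbers are placeholder assumptions, not recommendations; the point is that every assumption is named and sits at the top where it can be challenged.

# Step 2: assumptions first, named and visible.
daily_active_users = 10_000_000
requests_per_user_per_day = 2
peak_to_average = 3        # measured from traffic data, not guessed
seconds_per_day = 100_000  # Step 4: round 86,400 aggressively

# Step 3: independent sub-problems, each a simple multiplication.
requests_per_day = daily_active_users * requests_per_user_per_day
average_rps = requests_per_day / seconds_per_day

# Step 5: design for peak, not average.
peak_rps = average_rps * peak_to_average
print(f"average ~{average_rps:,.0f} RPS, design for ~{peak_rps:,.0f} RPS")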

Three Worked Examples

Example 1: A read-heavy API

You are building a product recommendation API. Users query it every time they visit the homepage. You need to figure out how many servers to run.

Worked Example — Recommendation API
1 Traffic: 10 million daily active users. Each user visits the homepage twice a day on average. That's 20 million API calls per day.
2 QPS at average: 20M / 86,400 ≈ 230 requests/second. Round to 250 RPS.
3 Peak multiplier: Traffic is not flat. Evenings see 3× average load. So peak ≈ 750 RPS. Use 1,000 RPS to be safe.
4 Per-request cost: Each request does a cache lookup (hit 80% of time) and a model inference call (on miss). Cache hit: 2ms. Cache miss: 50ms inference + 5ms DB read = 55ms. Weighted average: 0.8 × 2 + 0.2 × 55 = 12.6ms per request.
5 Server capacity: Each server can hold ~500 concurrent requests while keeping p95 latency acceptable. At 12.6ms per request, one server handles 500 / 0.0126 ≈ 40,000 RPS in theory. Real-world overhead (GC, connection handling, logging) cuts this to ~20,000 RPS per server. At 1,000 RPS peak you need ~1 server. Add 2× headroom → 2–3 servers to start.
6 Sanity check: At 1,000 RPS with 3 servers, each handles ~330 RPS, far below the ~20,000 RPS practical per-server limit. The slack covers bursty traffic, deployment headroom, and GC pauses. This feels right.
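
The arithmetic above is simple enough to script, which makes it cheap to re-run when an assumption changes. A sketch of the same calculation, with the derating factor and redundancy floor written in as explicit assumptions:

import math

# Assumptions from the worked example.
peak_rps = 1_000
cache_hit_ratio = 0.8
hit_ms, miss_ms = 2, 55
concurrency_per_server = 500
real_world_derating = 0.5  # GC, connections, logging roughly halve theory
redundancy_floor = 2       # never run a single instance

weighted_ms = cache_hit_ratio * hit_ms + (1 - cache_hit_ratio) * miss_ms  # 12.6 ms
theoretical_rps = concurrency_per_server / (weighted_ms / 1000)           # ~40,000
practical_rps = theoretical_rps * real_world_derating                     # ~20,000
servers = max(redundancy_floor, math.ceil(2 * peak_rps / practical_rps))  # 2x headroom
print(f"{weighted_ms:.1f} ms/request, ~{practical_rps:,.0f} RPS/server, "
      f"start with {servers} servers")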

Example 2: A write-heavy event pipeline

You are building an event ingestion pipeline for user analytics. Every user action in the app emits an event. You need to size the Kafka cluster and the consumer fleet.

Worked Example — Event Pipeline
1 Event volume: 10M DAU, each generating ~50 events per session, 2 sessions/day. That's 1 billion events per day.
2 Ingest rate at average: 1B / 86,400 ≈ 11,600 events/second. Round to 12,000 EPS.
3 Peak: Events cluster in evenings. Peak is 4× average → 48,000 EPS. Design for 60,000 EPS.
4 Event size: Each event is a JSON blob ~500 bytes average. At 60,000 EPS → 30 MB/s ingest throughput. With 3× replication factor in Kafka → 90 MB/s total write throughput required across the cluster.
5 Kafka brokers: A modern broker can sustain ~500 MB/s write throughput. We need 90 MB/s → 1 broker in theory. Add headroom for reads, replication traffic, compaction → 3 brokers minimum (also the minimum for a fault-tolerant cluster).
6 Storage: Sizing at the peak rate for a full day (deliberately conservative): 30 MB/s × 86,400 s = ~2.5 TB/day. With 3× replication → 7.5 TB/day. For 7-day retention → 52 TB total. Plan for 75 TB to account for uneven partition distribution.
7 Consumers: Each consumer thread reads one partition, so the topic needs at least 300 partitions. If processing takes 5ms per event, one thread handles 200 EPS. At 60,000 EPS → 300 consumer threads, spread across 30 machines with 10 threads each.
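
The same pipeline sizing as a sketch. All inputs are the assumptions from the steps above; the output differs slightly from the prose because the script does not round intermediate values:

# Assumptions from the worked example.
peak_eps = 60_000
event_bytes = 500
replication = 3
retention_days = 7
processing_ms_per_event = 5

ingest_mb_s = peak_eps * event_bytes / 1e6             # 30 MB/s at peak
cluster_write_mb_s = ingest_mb_s * replication         # 90 MB/s across brokers
daily_tb = ingest_mb_s * 86_400 / 1e6                  # ~2.6 TB/day (peak rate all day)
retained_tb = daily_tb * replication * retention_days  # ~54 TB, vs. ~52 TB above
threads = int(peak_eps / (1000 / processing_ms_per_event))  # 300 consumer threads
print(f"{cluster_write_mb_s:.0f} MB/s cluster writes, "
      f"~{retained_tb:.0f} TB retained, {threads} threads")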

Example 3: A storage system

You are building a user-uploaded photo storage system. How much storage do you need, and how will you serve reads?

Worked Example — Photo Storage
1 Upload rate: 10M users. 10% upload at least one photo per week. 1M uploads/week → 143,000 uploads/day → 2 uploads/second average. Peak: 10× → 20 uploads/second.
2 Photo size: Original: 3 MB average. You transcode to 3 sizes (thumbnail 10KB, medium 300KB, full 3MB). Total per upload: ~3.3 MB stored.
3 Storage growth: 143,000 uploads/day × 3.3 MB = ~470 GB/day. In a year → ~170 TB. In 3 years → 500 TB. Plan your object store and tiering strategy around this growth curve.
4 Read traffic: Users view photos 10× more than they upload → 20 reads/second average, peak 200 reads/second. Serve via CDN — cache hit ratio for popular photos is ~95%, so origin traffic is only 10 reads/second at peak. Very manageable.
5 Metadata: Each photo has metadata (owner, timestamp, tags, dimensions) → ~1 KB. 143,000 uploads/day → 143 MB of metadata/day. After a year → 52 GB. Fits comfortably in a PostgreSQL instance.
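
And the storage example as a script. The prose rounds 1.65 uploads/second up to 2, so the script's outputs come in slightly under the numbers above:

uploads_per_day = 143_000
mb_per_upload = 3.3      # original + thumbnail + medium
reads_per_upload = 10
cdn_hit_ratio = 0.95
peak_multiplier = 10

daily_gb = uploads_per_day * mb_per_upload / 1000          # ~470 GB/day
three_year_tb = daily_gb * 365 * 3 / 1000                  # ~515 TB
avg_reads_s = uploads_per_day / 86_400 * reads_per_upload  # ~17 reads/s
origin_reads_s = avg_reads_s * peak_multiplier * (1 - cdn_hit_ratio)  # ~8/s
print(f"{daily_gb:.0f} GB/day, {three_year_tb:.0f} TB in 3 years, "
      f"{origin_reads_s:.0f} origin reads/s at peak")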

How Much Headroom Is the Right Amount

Once you have a capacity estimate for peak load, you need to decide how much spare capacity to maintain on top of that. This is called headroom, and it's a real cost with real value. Too little headroom and a traffic spike takes you down. Too much headroom and you're burning money on idle machines.

Headroom buys you three things

First: traffic spikes. Even your peak estimate is an average of peaks. Individual seconds will be spikier than your measurement window suggests. You need headroom to absorb these micro-spikes without requests failing.

Second: deployment time. When you're rolling out a new version, some of your capacity is temporarily unavailable. If you're running at 95% utilization and you take down 20% of your fleet for a deployment, you're now at 120% — and things start failing. You need headroom to deploy safely.

Third: reaction time. When you see utilization trending up — because traffic is growing or because a new feature is heavier than expected — it takes time to provision new capacity. In cloud environments this might be minutes. In on-premise environments it might be weeks. Headroom buys you that time.

The right target utilization

System type                       Target utilization    Why
Stateless services (API servers)  50–60% at peak        Spiky traffic; capacity is easy to add; deployments need headroom.
Stateful services (databases)     40–50% at peak        Much harder to scale quickly; failure is more severe; stay conservative.
Storage systems                   60–70% full           Provisioning more storage takes time; some systems degrade above 80%.
Message queues (Kafka)            50–60% of throughput  Must absorb producer spikes without backpressure propagating upstream.
Caches (Redis)                    60–70% of memory      Eviction under memory pressure causes latency spikes downstream.
Key Insight

The utilization target is not based on cost preferences. It's based on how long it takes you to add capacity and how bad the failure mode is when you run out. For a system that takes 30 minutes to scale out and has no graceful degradation, 50% is not too conservative — it's about right.
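
In code, the table above reduces to one rule: size the fleet so that expected peak load lands at the target utilization, not at 100%. A hypothetical helper (the function and parameter names are ours):

import math

def fleet_size(peak_load_rps: float, per_instance_rps: float,
               target_utilization: float) -> int:
    """Instances needed so that peak load sits at the target, not at 100%."""
    return math.ceil(peak_load_rps / (per_instance_rps * target_utilization))

# A stateless API tier: 12,000 RPS peak, 1,000 RPS per instance, 55% target.
print(fleet_size(12_000, 1_000, 0.55))  # 22 instances, not the naive 12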

The auto-scaling caveat

Many engineers assume that auto-scaling eliminates the need for capacity planning. It doesn't. Auto-scaling reduces the risk of under-provisioning, but it introduces its own problems: scale-out lag (new instances take time to warm up), cost spikes (sudden scale-out is expensive), and scaling limits (cloud providers have account-level limits on instance counts).

With auto-scaling, you still need to set minimum capacity (your base floor), maximum capacity (your cost and limit ceiling), and target utilization for scaling triggers. All of these require capacity planning. Auto-scaling does not replace the thinking — it just handles the mechanical provisioning.
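
Most autoscalers implement some variant of a proportional rule: adjust the replica count so utilization returns to target, clamped between the planned floor and ceiling. A sketch of that rule with illustrative numbers (Kubernetes' horizontal pod autoscaler works roughly this way):

import math

def desired_replicas(current: int, observed_util: float, target_util: float,
                     floor: int, ceiling: int) -> int:
    """Proportional rule: grow or shrink so utilization returns to target,
    clamped to a planned floor (base capacity) and ceiling (cost cap)."""
    desired = math.ceil(current * observed_util / target_util)
    return max(floor, min(ceiling, desired))

# 10 replicas running at 80% with a 50% target -> 16 replicas (within 4..40).
print(desired_replicas(10, 0.80, 0.50, floor=4, ceiling=40))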

Traffic Patterns: Why "Average" Will Mislead You

If there is one thing to take from this chapter, it is this: never design for average load. Systems fail at peak, not at average. The relationship between average and peak is the most underappreciated number in capacity planning.

The daily cycle

Almost every consumer-facing system has a daily traffic cycle. Traffic is lowest in the early hours of the morning (typically 2–4 AM in the primary timezone) and peaks in the evening (7–10 PM). The ratio between trough and peak can easily be 5:1 or 10:1. If you design for average, you are designing for something halfway between your trough and your peak — which means you are undersized for peak by a large factor.

Plot your traffic for a full week before making any capacity decision. Find your daily peak. That is the number you design for.
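
Finding the multiplier is a one-liner once you have the data. A sketch, using a toy day of hourly request rates:

def peak_multiplier(hourly_rps: list[float]) -> float:
    """Ratio of the busiest hour to the average over the window.
    Feed it at least a full week so weekly cycles show up."""
    return max(hourly_rps) / (sum(hourly_rps) / len(hourly_rps))

# A toy day: quiet overnight, evening peak.
day = [40, 30, 20, 20, 30, 60, 100, 150, 180, 200, 210, 220,
       230, 220, 210, 220, 250, 300, 380, 420, 400, 300, 150, 80]
print(f"peak is {peak_multiplier(day):.1f}x average")  # ~2.3x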

The weekly cycle

Consumer apps typically see lower traffic on weekdays and higher traffic on weekends. B2B tools show the opposite — lower on weekends, higher Monday through Thursday. If you run a mixed product, you may have a flat weekly pattern. Know which category you're in.

Event-driven spikes

Product launches, marketing campaigns, viral moments, news events: any of these can cause a spike that is 10×, 50×, or 100× your normal peak. You cannot fully capacity-plan for these. What you can do is prepare: pre-scale ahead of events you know about, build load shedding and graceful degradation for the ones you don't, and learn your scaling limits before a spike finds them for you.

Growth trends

Your system will have a baseline traffic growth rate. Understanding this rate matters because capacity planning is not a one-time exercise. A system that handles load fine today will be in trouble in six months if you haven't accounted for growth.

Example

Your API handles 10,000 RPS today at 60% CPU utilization. Traffic grows at 15% per month. After 6 months, traffic is 10,000 × 1.15⁶ ≈ 23,000 RPS. Your current infrastructure handles 10,000 / 0.6 ≈ 16,700 RPS at 100% utilization. You will hit your ceiling at month 4. If provisioning takes 2 weeks, you need to start the process at month 3.5. This is the math your capacity plan should be tracking.
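
The month-four answer falls out of a two-line formula: solve current × (1 + g)ⁿ = ceiling for n. A sketch using the numbers from the example:

import math

def months_until_ceiling(current_load: float, capacity: float,
                         monthly_growth: float) -> float:
    """Months until compound growth pushes load past capacity."""
    if current_load >= capacity:
        return 0.0
    return math.log(capacity / current_load) / math.log(1 + monthly_growth)

# The example above: 10,000 RPS today, 16,700 RPS ceiling, 15%/month growth.
m = months_until_ceiling(10_000, 16_700, 0.15)
print(f"ceiling in {m:.1f} months; start provisioning ~0.5 months earlier")  # ~3.7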

The compounding effect of percentile traffic

One subtle mistake: teams measure median (p50) latency at median load, then size their systems to handle those numbers plus headroom. But at peak load, everything gets worse simultaneously: latency increases, error rates rise, and tail latency (p99) diverges sharply from the median. Size for median conditions plus headroom and your system will be severely underprovisioned at peak, because peak latency is much higher than median latency.

The right approach: measure your p99 latency under peak load conditions, not median conditions. That is the latency number to design for.

Cost and Performance: The Same Knob

Engineers often treat performance optimization and cost optimization as separate concerns — sequential steps in a project. "First we make it fast, then we make it cheap." This framing is a mistake. Cost and performance are the same knob, just turned in different directions.

Adding machines improves throughput and reduces latency (up to a point), but it costs money. Using a faster cache reduces latency and offloads your database, but the cache itself costs money. Every architectural decision is implicitly both a performance decision and a cost decision.

The four cost levers in distributed systems

Compute: CPU and memory. The most obvious cost. Reducing per-request compute cost (through caching, more efficient algorithms, or smaller payloads) directly reduces the machine count needed at a given load.

Storage: Volume and IOPS. Raw storage is cheap. Fast storage (NVMe SSDs, provisioned IOPS) is expensive. Archiving cold data to object storage, compressing data aggressively, and tiering data by access frequency are the primary tools for controlling storage cost.

Network: Often the most surprising cost in cloud environments. Data transfer between availability zones is billed per GB. Data transfer out to the internet is billed per GB, often at higher rates. A system that moves a lot of data around — between tiers, between regions, out to clients — can have network costs that dwarf compute costs. Measure this early.

Licensing and managed service fees: Managed databases, message queues, and SaaS tools add a cost per unit of usage that scales linearly with load. At low scale these are convenient. At high scale they can be dramatically more expensive than self-hosted alternatives.

Cost as a design constraint from day one

The best time to think about cost is during architecture design, not after launch. Architectural decisions made for performance (e.g., keeping everything in memory, using synchronous cross-region replication, running a large number of small services) can have cost implications that only become visible at scale. By then, changing the architecture is expensive.

In practice: for every major architectural choice, add a cost column to your trade-off analysis. "If we go with option A, what does this cost at 10× current load? At 100×?" This doesn't require precise numbers — order-of-magnitude estimates are fine. But it should be part of the decision.

Rule of Thumb

If your cost per unit of load (cost per million API calls, cost per TB stored, cost per million events processed) is not decreasing as you scale, something in your architecture does not scale. Find it before your users find it.
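
This rule of thumb is easy to track mechanically. A sketch with hypothetical monthly figures that flags the moment unit cost stops falling:

# Hypothetical (monthly cost in dollars, millions of API calls served).
months = [(42_000, 310), (48_000, 390), (55_000, 480), (66_000, 540)]

unit_costs = [cost / calls for cost, calls in months]
for prev, cur in zip(unit_costs, unit_costs[1:]):
    if cur > prev:
        print(f"unit cost rose from ${prev:.2f} to ${cur:.2f} per million calls; "
              "something in the architecture is not scaling")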

The 80/20 of cost optimization

Cost is rarely evenly distributed. In most systems, a small number of operations account for most of the cost. Before optimizing anything, measure where your money is actually going. Typical findings: one or two endpoints dominate compute, a handful of expensive queries dominate database load, a single chatty service dominates cross-zone network transfer, and logging or metrics pipelines quietly dominate storage.

Identify your top three cost drivers and fix those first. Everything else is noise by comparison.

From Estimate to Plan

An estimate sitting in a document does not help anyone. The final step of capacity planning is turning your estimate into an actionable plan. That means four things, covered in the subsections below.

Define your scaling triggers

Pick the metric that most directly correlates with load on each bottleneck in your system — typically CPU utilization, memory utilization, request queue depth, or disk I/O. Define the thresholds at which you act: the "yellow" threshold where you start monitoring closely, and the "red" threshold where you provision more capacity immediately. Write these down in your runbook before you need them.
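
A trigger is only useful if it is written down precisely. A minimal sketch of what "written down" might look like; the class and threshold values here are illustrative, not a standard:

from dataclasses import dataclass

@dataclass
class ScalingTrigger:
    metric: str    # the metric that best tracks the bottleneck, e.g. CPU
    yellow: float  # start watching closely
    red: float     # provision more capacity immediately

    def evaluate(self, observed: float) -> str:
        if observed >= self.red:
            return f"RED: {self.metric} at {observed:.0%}, provision now"
        if observed >= self.yellow:
            return f"YELLOW: {self.metric} at {observed:.0%}, monitor closely"
        return "OK"

api_cpu = ScalingTrigger(metric="cpu_utilization", yellow=0.60, red=0.75)
print(api_cpu.evaluate(0.68))  # YELLOW: cpu_utilization at 68%, monitor closely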

Build a capacity timeline

Given your current capacity, your target utilization ceiling, and your expected traffic growth rate, calculate the date at which you will hit your ceiling. Work backward from that date by the time it takes to provision capacity, and put a calendar reminder at that moment. This turns the thinking you have already done into a trigger you cannot miss.

Current capacity ceiling:  16,700 RPS (at 100% utilization)
Target ceiling (60% util): 10,020 RPS
Current load:              10,000 RPS   ← you are already at your target ceiling
Monthly growth rate:       15%
Time to provision:         2 weeks

→ You needed to start provisioning last week.

Review the assumptions regularly

A capacity plan is a living document. Every time you run a major product launch, observe an unexpected traffic spike, or make a significant architecture change, revisit your assumptions. The value is not in the original estimate — it's in the updated estimate that incorporates what you've learned.

Share it with stakeholders

Capacity plans are not just engineering artifacts. Product managers planning launches need to know the load those launches will generate. Finance teams need to know infrastructure cost projections. Leadership needs to know if a strategic bet requires 3× the current infrastructure. The capacity plan, presented as a simple table of expected load vs. current capacity vs. cost, is one of the most useful cross-functional documents a team can produce.