What's in this chapter
- Why capacity planning fails — and the mindset shift that fixes it
- The back-of-envelope method: a repeatable process, not a guess
- The numbers every engineer should have memorized (with intuition, not just values)
- Three worked examples: a read-heavy API, a write-heavy event pipeline, a storage system
- How to size headroom — how much buffer is too little, how much is wasteful
- Traffic patterns and seasonality: why "average load" will mislead you
- Cost and performance as linked targets, not separate ones
- How to turn an estimate into a plan your team can actually execute
Key Learnings — Read This First
- Orders of magnitude matter more than precision. Is the answer closer to 1,000 or 1,000,000? Knowing the right digit is what lets you make an architectural decision. Being off by 2× is fine; being off by 1,000× is catastrophic.
- Memorize ten numbers and derive everything else. A small reference table of latency and throughput values lets you sanity-check any system design in minutes.
- Design for peak, not average. Systems fail at peak load. Average load is nearly irrelevant for capacity decisions. Always find your peak multiplier.
- Headroom is not waste — it is insurance with a known premium. Running at 50% capacity in steady state costs money but buys you time to react. Running at 95% means one traffic spike wipes you out.
- Traffic patterns are spiky, not flat. Daily cycles, weekly cycles, product launches, seasonal events — each creates a multiplier on top of average. Ignore these and you will be paged.
- Cost and performance are the same knob. You cannot optimize one without a plan for the other. Treat cost as a constraint from day one, not a post-launch cleanup task.
- An estimate without assumptions written down is not an estimate. The assumptions are the most valuable part. When reality diverges from your plan, the assumption list tells you why.
Why Capacity Planning Goes Wrong
Most capacity planning efforts fail in one of two ways. The first is not doing it at all — teams launch a system, wait until it falls over under load, then scramble to add machines. The second is getting paralyzed by the attempt to be precise. Engineers spend weeks building spreadsheets with dozens of variables, trying to predict the exact number of servers needed in eighteen months. Neither approach works.
The problem with the first approach is obvious: you are reacting to failure instead of preventing it. The problem with the second is subtler: you are building a false sense of certainty. The inputs to a precise model — future traffic growth, query patterns, data distribution — are all guesses. Multiplying guesses together doesn't produce accuracy; it produces confident nonsense.
The right approach sits between these extremes. You want an estimate that is good enough to make a decision, produced quickly enough to be useful, with the uncertainty made explicit so you know what to watch.
Capacity planning is not forecasting. It is answering the question: "Given what we know, what is the cheapest way to ensure this system does not fall over?" Precision is not the goal. Adequate headroom is.
The order-of-magnitude mindset
When you're estimating, you don't need to know if the answer is 4,200 or 4,800. You need to know if it's closer to 1,000 or 10,000. The difference between those two is a factor of ten, and a factor of ten changes your architecture. The difference between 4,200 and 4,800 doesn't change anything.
This mental model — thinking in powers of ten — is what separates engineers who are comfortable with estimates from engineers who are paralyzed by them. Once you accept that "roughly ten thousand" is a useful answer, estimation becomes fast and approachable.
The question is never "what is the exact number?" It is always one of these:
- Is this problem in the hundreds, thousands, or millions?
- Is this thing fast enough, or is it off by a factor of ten?
- Do we need one machine, ten machines, or a hundred?
The Numbers Every Engineer Should Know
Knowing a small set of hardware and network numbers by heart — not the exact values, but the rough magnitude — lets you sanity-check any design without looking anything up. Don't memorize precise figures. Memorize the category each operation falls into.
The key mental ratios
Rather than memorizing individual values, internalize these ratios. They hold roughly across hardware generations:
| Operation A | Operation B | Ratio | Implication |
|---|---|---|---|
| RAM access | SSD random read | ~100× | In-memory vs. on-disk is not a small difference |
| SSD read | Same-DC network call | ~100× | A remote call costs 100 SSD reads |
| Same-DC call | Cross-region call | ~100× | Geographic distribution is expensive in latency |
| L1 cache | RAM | ~200× | Cache misses in tight loops destroy performance |
Each major tier in the memory hierarchy is roughly 100× slower than the one above it. If your design requires an operation to drop down a tier unexpectedly, expect a 100× latency hit. That turns a 1ms response into a 100ms response.
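To make the tier arithmetic concrete, here is a minimal sketch in Python. The latency constants are illustrative round numbers chosen to match the ratios above; they are assumptions for estimation practice, not measurements of any particular hardware.

```python
# Illustrative latencies in nanoseconds -- round-number assumptions chosen
# to match the ~100x-per-tier ratios above, not real measurements.
LATENCY_NS = {
    "l1_cache": 0.5,
    "ram": 100,                        # ~200x L1
    "ssd_random_read": 10_000,         # ~100x RAM
    "same_dc_call": 1_000_000,         # ~100x SSD (round trip)
    "cross_region_call": 100_000_000,  # ~100x same-DC
}

def request_latency_ms(ops: dict) -> float:
    """Cost of a request expressed as a bag of operations, in milliseconds."""
    return sum(LATENCY_NS[op] * n for op, n in ops.items()) / 1_000_000

# A request served from RAM vs. the same request with one cross-region hop:
print(request_latency_ms({"ram": 10_000}))                          # 1.0 ms
print(request_latency_ms({"ram": 10_000, "cross_region_call": 1}))  # 101.0 ms
```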
The Back-of-Envelope Method
Back-of-envelope estimation is a skill, not a talent. It follows a repeatable process. The reason most engineers struggle with it is that they try to estimate everything at once. The trick is to break the problem into small, independent pieces, estimate each piece, and multiply them together.
Here is the process step by step:
Step 1: Identify what you are estimating
Be specific. "How much capacity do we need?" is not estimable. "How many application server instances do we need to handle 50,000 requests per second at p99 < 100ms?" is estimable. Before you do any math, write the question down exactly.
Step 2: Write down your assumptions first
Before calculating, list everything you're assuming. Traffic growth rate, average request size, cache hit ratio, read/write ratio, average fanout. Write numbers next to each one. These assumptions become the most valuable part of the estimate — when your system behaves differently than planned, you return to this list to find which assumption was wrong.
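One lightweight way to enforce this discipline is to write the assumptions down as named constants before any formula appears. A sketch, with every value a hypothetical placeholder:

```python
# Assumptions first. Every number below is a hypothetical input --
# when the system surprises you, this block is what you audit.
ASSUMPTIONS = {
    "daily_active_users": 10_000_000,
    "requests_per_user_per_day": 20,
    "avg_request_bytes": 2_000,
    "cache_hit_ratio": 0.90,
    "reads_per_write": 100,
    "peak_to_average": 3,   # measure yours; never guess this one silently
}
```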
Step 3: Break into independent sub-problems
A request to your system touches compute, network, storage, and memory. Estimate each separately. What is the CPU cost per request? What is the memory per open connection? What is the storage growth per day? Each of these can be estimated with simple multiplication.
Step 4: Round aggressively, then sense-check
Use round numbers throughout. 86,400 seconds in a day → call it 100,000. A 1% cache miss rate at 10,000 QPS → 100 misses per second. Round to the nearest power of ten where you can. At the end, sanity-check the result against something you already know. Does this number feel right? Is it in the right ballpark compared to similar systems?
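A sketch of steps 3 and 4 together, using hypothetical values in the spirit of the assumptions list above; each sub-problem is a single multiplication over rounded inputs:

```python
SECONDS_PER_DAY = 100_000   # 86,400 rounded -- round numbers keep math auditable
daily_users = 10_000_000    # hypothetical, from the assumptions list
req_per_user = 20
cache_hit = 0.90
req_bytes = 2_000

avg_qps = daily_users * req_per_user / SECONDS_PER_DAY   # 2,000 QPS
db_qps = avg_qps * (1 - cache_hit)                       # 200 QPS reach the database
egress_mbit_s = avg_qps * req_bytes * 8 / 1_000_000      # 32 Mbit/s
```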
Step 5: Find your load multiple
Your estimate at average load is not the one you design for. Find the ratio of your peak load to your average load. Design for peak. This single step is the one teams most often skip, and it is why systems fall over during product launches and traffic spikes.
A common trap is estimating capacity at average load and then adding "a bit of buffer." This almost always produces a system that is fine on a Tuesday afternoon and falls over on a Friday evening, during a launch, or at the end of a billing cycle. You must find your actual peak multiplier.
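A minimal way to compute the multiplier from monitoring data, assuming you have hourly (or finer) load samples over at least a full week:

```python
def peak_multiplier(samples_qps: list) -> float:
    """Ratio of peak load to average load over the measurement window."""
    return max(samples_qps) / (sum(samples_qps) / len(samples_qps))

# Illustrative hourly samples for one day (hypothetical numbers):
day = [500, 400, 300, 300, 400, 800, 1_500, 2_500, 3_000, 3_200,
       3_000, 2_800, 2_600, 2_800, 3_000, 3_400, 4_000, 4_800,
       5_000, 4_600, 3_800, 2_600, 1_500, 800]
avg = sum(day) / len(day)                  # 2,400 QPS
design_qps = avg * peak_multiplier(day)    # == max(day) = 5,000 QPS: design for this
```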
Three Worked Examples
Example 1: A read-heavy API
You are building a product recommendation API. Users query it every time they visit the homepage. You need to figure out how many servers to run.
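The inputs below are hypothetical assumptions chosen to show the mechanics; substitute your own measurements. A sketch of how the sizing might go:

```python
# Hypothetical inputs for illustration.
dau = 20_000_000                 # daily active users
visits_per_user_per_day = 5
seconds_per_day = 100_000        # 86,400 rounded

avg_qps = dau * visits_per_user_per_day / seconds_per_day   # 1,000 QPS
peak_qps = avg_qps * 3                                      # assumed 3x peak multiplier

per_server_qps = 500             # assumed capacity at p99 < 100ms (load-test this)
target_utilization = 0.6         # stateless-tier headroom target (see below)

servers = peak_qps / (per_server_qps * target_utilization)  # 10 instances
```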
Example 2: A write-heavy event pipeline
You are building an event ingestion pipeline for user analytics. Every user action in the app emits an event. You need to size the Kafka cluster and the consumer fleet.
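Again with hypothetical inputs, a sketch of the write-path arithmetic:

```python
# Hypothetical inputs for illustration.
dau = 5_000_000
events_per_user_per_day = 200
event_bytes = 1_000
seconds_per_day = 100_000

avg_events_s = dau * events_per_user_per_day / seconds_per_day   # 10,000 events/s
peak_events_s = avg_events_s * 4                                 # assumed 4x peak
ingress_mb_s = peak_events_s * event_bytes / 1_000_000           # 40 MB/s

# Brokers: run at ~50% of rated throughput; 3x replication amplifies writes.
per_broker_mb_s = 50             # assumed conservative per-broker write rate
brokers = ingress_mb_s * 3 / (per_broker_mb_s * 0.5)             # ~5 brokers

per_consumer_events_s = 2_000    # assumed per-consumer processing rate
consumers = peak_events_s / per_consumer_events_s                # 20 consumers
```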
Example 3: A storage system
You are building a user-uploaded photo storage system. How much storage do you need, and how will you serve reads?
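A sketch with hypothetical inputs, covering both halves of the question: storage growth and read serving:

```python
# Hypothetical inputs for illustration.
dau = 10_000_000
uploads_per_user_per_day = 2
photo_mb = 2                     # assumed average size after compression
replication = 3

daily_growth_tb = dau * uploads_per_user_per_day * photo_mb / 1_000_000  # 40 TB/day
year1_raw_pb = daily_growth_tb * 365 * replication / 1_000               # ~44 PB replicated

# Keep disks below ~70% full (see the headroom table below):
provisioned_pb = year1_raw_pb / 0.7                                      # ~63 PB

# Reads: assume ~20 views per upload with a CDN absorbing 90% (both assumptions).
origin_read_qps = dau * uploads_per_user_per_day * 20 * 0.1 / 100_000    # 400 QPS
```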
How Much Headroom Is the Right Amount
Once you have a capacity estimate for peak load, you need to decide how much spare capacity to maintain on top of that. This is called headroom, and it's a real cost with real value. Too little headroom and a traffic spike takes you down. Too much headroom and you're burning money on idle machines.
Headroom is for three things
First: traffic spikes. Even your peak estimate is an average of peaks. Individual seconds will be spikier than your measurement window suggests. You need headroom to absorb these micro-spikes without requests failing.
Second: deployment time. When you're rolling out a new version, some of your capacity is temporarily unavailable. If you're running at 95% utilization and you take down 20% of your fleet for a deployment, you're now at 120% — and things start failing. You need headroom to deploy safely.
Third: reaction time. When you see utilization trending up — because traffic is growing or because a new feature is heavier than expected — it takes time to provision new capacity. In cloud environments this might be minutes. In on-premise environments it might be weeks. Headroom buys you that time.
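The deployment constraint alone puts a hard ceiling on safe utilization. A small sketch of that arithmetic, where the spike allowance is an assumed tuning input:

```python
def max_safe_utilization(deploy_batch_fraction: float,
                         spike_factor: float = 1.2) -> float:
    """Highest steady-state utilization that survives a rolling deploy
    while absorbing micro-spikes (spike_factor is an assumption to tune)."""
    return (1.0 - deploy_batch_fraction) / spike_factor

# Deploying in 20% batches while allowing for 20% micro-spikes:
print(max_safe_utilization(0.20))   # ~0.67 -- so 50-60% targets leave real margin
```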
The right target utilization
| System Type | Target Utilization | Why |
|---|---|---|
| Stateless services (API servers) | 50–60% at peak | Spiky traffic, easy to add capacity, deployment headroom needed |
| Stateful services (databases) | 40–50% at peak | Much harder to scale quickly, failure is more severe, conservative target |
| Storage systems | 60–70% full | Need time to provision more storage; some systems degrade above 80% |
| Message queues (Kafka) | 50–60% of throughput | Must absorb producer spikes without backpressure propagating upstream |
| Caches (Redis) | 60–70% memory | Eviction at high memory causes unexpected latency spikes in downstream systems |
The utilization target is not based on cost preferences. It's based on how long it takes you to add capacity and how bad the failure mode is when you run out. For a system that takes 30 minutes to scale out and has no graceful degradation, 50% is not too conservative — it's about right.
The auto-scaling caveat
Many engineers assume that auto-scaling eliminates the need for capacity planning. It doesn't. Auto-scaling reduces the risk of under-provisioning, but it introduces its own problems: scale-out lag (new instances take time to warm up), cost spikes (sudden scale-out is expensive), and scaling limits (cloud providers have account-level limits on instance counts).
With auto-scaling, you still need to set minimum capacity (your base floor), maximum capacity (your cost and limit ceiling), and target utilization for scaling triggers. All of these require capacity planning. Auto-scaling does not replace the thinking — it just handles the mechanical provisioning.
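An auto-scaling policy is, in effect, a capacity plan in miniature. A sketch of the numbers it still needs, with illustrative values:

```python
# Auto-scaling still encodes planned numbers -- these values are illustrative.
AUTOSCALING_POLICY = {
    "min_instances": 10,      # floor: carries the normal daily peak unaided
    "max_instances": 100,     # ceiling: cost guardrail and provider account limits
    "target_cpu_util": 0.55,  # trigger: below the 60% peak target so scale-out
                              # lag (instance warm-up) doesn't eat the headroom
}
```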
Traffic Patterns: Why "Average" Will Mislead You
If there is one thing to take from this chapter, it is this: never design for average load. Systems fail at peak, not at average. The relationship between average and peak is the most underappreciated number in capacity planning.
The daily cycle
Almost every consumer-facing system has a daily traffic cycle. Traffic is lowest in the early hours of the morning (typically 2–4 AM in the primary timezone) and peaks in the evening (7–10 PM). The ratio between trough and peak can easily be 5:1 or 10:1. If you design for average, you are designing for something halfway between your trough and your peak — which means you are undersized for peak by a large factor.
Plot your traffic for a full week before making any capacity decision. Find your daily peak. That is the number you design for.
The weekly cycle
Consumer apps typically see lower traffic on weekdays and higher traffic on weekends. B2B tools show the opposite — lower on weekends, higher Monday through Thursday. If you run a mixed product, you may have a flat weekly pattern. Know which category you're in.
Event-driven spikes
Product launches, marketing campaigns, viral moments, news events — any of these can cause a spike that is 10×, 50×, or 100× your normal peak. You cannot fully capacity-plan for these. What you can do is:
- Make your stateless tier auto-scale so it can grow horizontally within minutes
- Design your stateful tier to degrade gracefully under load (rate limiting, queue-based back pressure) rather than fall over completely
- Have a load shedding plan — a defined point at which you start declining requests in a controlled way rather than letting failures cascade
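To make the load-shedding point concrete, here is a minimal sketch; the threshold and shed fraction are assumptions you would tune per system:

```python
import random

QUEUE_DEPTH_LIMIT = 1_000   # assumed threshold; derive from your latency budget

def should_shed(queue_depth: int, shed_fraction: float = 0.5) -> bool:
    """Probabilistically reject work once the queue passes the limit,
    so degradation is gradual instead of a cliff."""
    return queue_depth >= QUEUE_DEPTH_LIMIT and random.random() < shed_fraction

# Shed requests should get a fast, explicit rejection (e.g., HTTP 503 with
# Retry-After), not a slow timeout -- failing fast is what stops the cascade.
```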
Growth trends
Your system will have a baseline traffic growth rate. Understanding this rate matters because capacity planning is not a one-time exercise. A system that handles load fine today will be in trouble in six months if you haven't accounted for growth.
Your API handles 10,000 RPS today at 60% CPU utilization. Traffic grows at 15% per month. After 6 months, traffic is 10,000 × 1.15⁶ ≈ 23,000 RPS. Your current infrastructure handles 10,000 / 0.6 ≈ 16,700 RPS at 100% utilization. You will hit your ceiling at month 4. If provisioning takes 2 weeks, you need to start the process at month 3.5. This is the math your capacity plan should be tracking.
The compounding effect of percentile traffic
One subtle mistake: teams measure median (p50) latency and median load, then size their systems to handle those numbers with headroom. But at peak load, everything gets worse simultaneously. Latency increases, error rates increase, and tail latency (p99) diverges sharply from the median. If you size for median conditions plus headroom, your system will be severely underprovisioned at peak, because peak latency is much higher than median latency.
The right approach: measure your p99 latency under peak load conditions, not median conditions. That is the latency number to design for.
Cost and Performance: The Same Knob
Engineers often treat performance optimization and cost optimization as separate concerns — sequential steps in a project. "First we make it fast, then we make it cheap." This framing is a mistake. Cost and performance are the same knob, just turned in different directions.
Adding machines improves throughput and reduces latency (up to a point), but it costs money. Using a faster cache reduces latency and offloads your database, but the cache itself costs money. Every architectural decision is implicitly both a performance decision and a cost decision.
The four cost levers in distributed systems
Compute: CPU and memory. The most obvious cost. Reducing per-request compute cost (through caching, more efficient algorithms, or smaller payloads) directly reduces the machine count needed at a given load.
Storage: Volume and IOPS. Raw storage is cheap. Fast storage (NVMe SSDs, provisioned IOPS) is expensive. Archiving cold data to object storage, compressing data aggressively, and tiering data by access frequency are the primary tools for controlling storage cost.
Network: Often the most surprising cost in cloud environments. Data transfer between availability zones is billed per GB. Data transfer out to the internet is billed per GB, often at higher rates. A system that moves a lot of data around — between tiers, between regions, out to clients — can have network costs that dwarf compute costs. Measure this early.
Licensing and managed service fees: Managed databases, message queues, and SaaS tools add a cost per unit of usage that scales linearly with load. At low scale these are convenient. At high scale they can be dramatically more expensive than self-hosted alternatives.
Cost as a design constraint from day one
The best time to think about cost is during architecture design, not after launch. Architectural decisions made for performance (e.g., keeping everything in memory, using synchronous cross-region replication, running a large number of small services) can have cost implications that only become visible at scale. By then, changing the architecture is expensive.
In practice: for every major architectural choice, add a cost column to your trade-off analysis. "If we go with option A, what does this cost at 10× current load? At 100×?" This doesn't require precise numbers — order-of-magnitude estimates are fine. But it should be part of the decision.
If your cost per unit of load (cost per million API calls, cost per TB stored, cost per million events processed) is not decreasing as you scale, something in your architecture does not scale. Find it before your users find it.
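One way to keep this honest is to track unit cost as a first-class metric. A sketch with hypothetical dollar figures:

```python
def cost_per_million(monthly_cost_usd: float, monthly_units: float) -> float:
    """Unit cost -- it should trend DOWN as load grows if the architecture scales."""
    return monthly_cost_usd / (monthly_units / 1_000_000)

# Hypothetical checkpoints as traffic grows 10x:
print(cost_per_million(50_000, 2_600_000_000))     # ~$19 per million calls
print(cost_per_million(300_000, 26_000_000_000))   # ~$12 per million: scaling well
```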
The 80/20 of cost optimization
Cost is rarely evenly distributed. In most systems, a small number of operations account for most of the cost. Before optimizing anything, measure where your money is actually going. Typical findings:
- One or two query patterns account for 80% of database load
- One data type or path accounts for 70% of storage growth
- Cross-zone or cross-region traffic in one part of the system accounts for most network spend
- One expensive third-party API call is responsible for most of the external API bill
Identify your top three cost drivers and fix those first. Everything else is noise by comparison.
From Estimate to Plan
An estimate sitting in a document does not help anyone. The final step of capacity planning is turning your estimate into an actionable plan. This means:
Define your scaling triggers
Pick the metric that most directly correlates with load on each bottleneck in your system — typically CPU utilization, memory utilization, request queue depth, or disk I/O. Define the thresholds at which you act: the "yellow" threshold where you start monitoring closely, and the "red" threshold where you provision more capacity immediately. Write these down in your runbook before you need them.
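A runbook entry for this can be as simple as a threshold table per bottleneck. The values below are illustrative; yours should come from the headroom targets above and your provisioning lead times:

```python
SCALING_TRIGGERS = {
    "api_cpu_util":        {"yellow": 0.50, "red": 0.65},  # watch, then provision
    "db_cpu_util":         {"yellow": 0.40, "red": 0.55},  # stateful: conservative
    "storage_disk_full":   {"yellow": 0.60, "red": 0.75},
    "request_queue_depth": {"yellow": 100,  "red": 500},
}
```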
Build a capacity timeline
Given your current capacity, your target utilization ceiling, and your expected traffic growth rate, calculate the date at which you will hit your ceiling. Work backward from that date by the time it takes to provision capacity, and put a calendar reminder at the right moment. This is the automation of the thinking you've already done.
- Current capacity ceiling: 16,700 RPS (at 100% utilization)
- Target ceiling (60% utilization): 10,020 RPS
- Current load: 10,000 RPS ← you are already at your target ceiling
- Monthly growth rate: 15%
- Time to provision: 2 weeks
→ You needed to start provisioning last week.
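The same arithmetic as a reusable sketch; the inputs mirror the example above:

```python
import math

def months_until_ceiling(current_load: float, ceiling: float,
                         monthly_growth: float) -> float:
    """Months until compounding growth reaches the capacity ceiling."""
    if current_load >= ceiling:
        return 0.0
    return math.log(ceiling / current_load) / math.log(1.0 + monthly_growth)

# Against the 60% target ceiling: effectively no runway left.
print(months_until_ceiling(10_000, 10_020, 0.15))          # ~0.01 months
# Against the hard 100% ceiling, after subtracting 2 weeks to provision:
print(months_until_ceiling(10_000, 16_700, 0.15) - 0.5)    # ~3.2 months
```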
Review the assumptions regularly
A capacity plan is a living document. Every time you run a major product launch, observe an unexpected traffic spike, or make a significant architecture change, revisit your assumptions. The value is not in the original estimate — it's in the updated estimate that incorporates what you've learned.
Share it with stakeholders
Capacity plans are not just engineering artifacts. Product managers planning launches need to know the load those launches will generate. Finance teams need to know infrastructure cost projections. Leadership needs to know if a strategic bet requires 3× the current infrastructure. The capacity plan, presented as a simple table of expected load vs. current capacity vs. cost, is one of the most useful cross-functional documents a team can produce.