Appendix B

Back-of-Envelope Estimation Worked Examples

Estimation is not about getting the exact number. It is about getting the right order of magnitude fast enough to make a design decision. These worked examples show the process, not just the answers — because the process is what transfers to new problems.

The Estimation Mindset

Round aggressively. Use powers of 10. A 2x error is acceptable; you are trying to avoid 100x errors. The goal is to answer questions like: "Do we need a cache here?" "Will one database server handle this?" "Does this need sharding?" Getting within one order of magnitude is almost always enough to answer these questions correctly.

Useful Conversion Rules

Rule | What It Gives You
1 million / day | ~12 per second (1M ÷ 86,400 ≈ 11.6 req/s)
1 billion / day | ~12,000 per second
1 req/s | ~2.5 million requests/month
1 KB × 1M users | ~1 GB of storage
1 KB × 1B users | ~1 TB of storage
100 bytes × 1B events/day | ~100 GB/day = ~3 TB/month
1 Gbps sustained | ~10 TB/day = ~300 TB/month
8 hours peak = 3× off-peak | Peak QPS ≈ 3× (total / seconds in day)
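
These rules are easier to trust after checking them once yourself. A few lines of Python (decimal units and 30-day months assumed) reproduce each row:

```python
# Sanity-check the conversion rules with plain arithmetic (decimal units).
SECONDS_PER_DAY = 86_400

print(1_000_000 / SECONDS_PER_DAY)          # ~11.6   -> 1M/day is ~12 per second
print(1_000_000_000 / SECONDS_PER_DAY)      # ~11,574 -> 1B/day is ~12,000 per second
print(1 * SECONDS_PER_DAY * 30)             # 2,592,000 -> 1 req/s is ~2.5M requests/month
print(1_000 * 1_000_000 / 1e9)              # 1.0 GB  -> 1 KB x 1M users
print(100 * 1_000_000_000 / 1e9)            # 100 GB/day -> 100 bytes x 1B events/day
print((1e9 / 8) * SECONDS_PER_DAY / 1e12)   # ~10.8 TB/day -> 1 Gbps sustained
```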

Example 1: URL Shortener (Read-Heavy)

Design a URL shortening service like bit.ly. Estimate storage and QPS requirements.

Assumptions: URL Shortener Scale

100M new URLs shortened per day. 10:1 read-to-write ratio, so ~1B redirects per day. URLs kept for 5 years.

Write QPS

  • 100M writes/day ÷ 86,400 sec/day = ~1,160 writes/sec
  • Peak (3× average) = ~3,500 writes/sec

Read QPS

  • 100M writes × 10 reads = 1B reads/day
  • 1B ÷ 86,400 = ~11,600 reads/sec
  • Peak = ~35,000 reads/sec

Storage

  • Each record: short URL (7 chars) + long URL (~100 chars) + metadata (~50 bytes) ≈ 160 bytes; round up to ~500 bytes per record to cover indexes and row overhead
  • 100M writes/day × 365 days × 5 years = 182.5B records
  • 182.5B × 500 bytes = ~91 TB over 5 years
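
A short Python sketch, using only the assumptions above, reproduces these numbers:

```python
# URL shortener estimates: write/read QPS and 5-year storage.
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3                                   # peak ~= 3x average

writes_per_day = 100_000_000
reads_per_day = writes_per_day * 10               # 10:1 read-to-write ratio

write_qps = writes_per_day / SECONDS_PER_DAY      # ~1,160 writes/sec
read_qps = reads_per_day / SECONDS_PER_DAY        # ~11,600 reads/sec
peak_write_qps = write_qps * PEAK_FACTOR          # ~3,500 writes/sec
peak_read_qps = read_qps * PEAK_FACTOR            # ~35,000 reads/sec

bytes_per_record = 500                            # rounded-up record size from above
records_5yr = writes_per_day * 365 * 5            # ~182.5B records
storage_tb = records_5yr * bytes_per_record / 1e12   # ~91 TB over 5 years

print(f"{write_qps:,.0f} w/s avg, {peak_read_qps:,.0f} r/s peak, {storage_tb:,.0f} TB")
```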

Design Conclusions

  • Write QPS: ~1,200/s avg, ~3,500/s peak → single DB handles this easily (PostgreSQL does 5K+ TPS)
  • Read QPS: ~35,000/s peak → DB alone won't cut it; cache is mandatory (Redis at 100K+ ops/s)
  • Storage: ~91 TB over 5 years → sharding or tiered storage needed at year 2–3
  • Cache hit rate target: 90%+ would reduce effective DB reads to ~3,500/s — manageable

Example 2: Social Media Feed (Write-Heavy, Fan-out)

Design a Twitter-style feed. 500M users, 100M posts/day, average user follows 200 people.

Assumptions: Social Feed at Scale

500M total users, 100M DAU. 100M posts/day. Each user follows 200 accounts on average. Feed pre-computation via fan-out on write. Feeds retained for 7 days.

Post Write QPS

  • 100M posts/day ÷ 86,400 = ~1,160 posts/sec

Fan-out Write QPS (the hidden cost)

  • Each post fans out to 200 followers on average
  • 1,160 posts/sec × 200 = ~232,000 feed writes/sec
  • Peak (3×) = ~700,000 feed writes/sec
  • This is why fan-out on write requires a write queue (Kafka) and many workers, not a direct DB write

Feed Storage (7-day window)

  • Each feed entry: ~100 bytes (post_id + user_id + timestamp)
  • 232,000 writes/sec × 86,400 sec × 7 days = ~140B feed entries
  • 140B × 100 bytes = ~14 TB for feed index (just pointers, not content)

The Celebrity Problem

  • A user with 50M followers posting once = 50M fan-out writes in a burst
  • At 700K sustained fan-out writes/sec capacity, this single post takes ~70 seconds to fan out
  • Solution: hybrid model — pre-compute for normal users, pull-on-read for celebrity accounts
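
The same arithmetic in Python, using the 3× peak factor and 100-byte feed entries assumed above:

```python
# Social feed: post QPS, fan-out QPS, 7-day feed index size, celebrity burst time.
SECONDS_PER_DAY = 86_400

posts_per_day = 100_000_000
avg_followers = 200

post_qps = posts_per_day / SECONDS_PER_DAY          # ~1,160 posts/sec
fanout_qps = post_qps * avg_followers               # ~232,000 feed writes/sec
peak_fanout_qps = fanout_qps * 3                    # ~700,000 feed writes/sec

entries_7d = fanout_qps * SECONDS_PER_DAY * 7       # ~140B feed entries
feed_index_tb = entries_7d * 100 / 1e12             # ~14 TB of pointers

celebrity_followers = 50_000_000
burst_seconds = celebrity_followers / peak_fanout_qps   # ~70 s to fan out one post

print(f"{fanout_qps:,.0f} fanout/s, {feed_index_tb:.0f} TB index, {burst_seconds:.0f} s burst")
```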

Design Conclusions

  • Fan-out write pressure (700K/s peak) dwarfs post writes — the write queue is the critical path
  • 14 TB for the feed index fits in a Redis cluster; post content (~50 KB avg) adds ~5 TB/day (~35 TB over the 7-day window), stored separately
  • Celebrity accounts (>1M followers) need special-cased pull-on-read to avoid a thundering herd of fan-out writes

Example 3: Video Storage and Streaming

Design a YouTube-scale video platform. Estimate storage, transcoding, and bandwidth.

Assumptions: Video Platform at Scale

500 hours of video uploaded per minute (YouTube's real number circa 2023). Average video is 10 minutes at 1080p. 2B daily views, average watch time 7 minutes.

Upload Rate

  • 500 hours/min = 30,000 hours/hour = 720,000 hours/day
  • 1080p video ≈ 2 GB/hour of raw storage
  • 720,000 hours/day × 2 GB/hr = 1.44 PB/day of raw video

Transcoding Multiplier

  • Each video is transcoded into ~5 quality levels (360p, 480p, 720p, 1080p, 4K)
  • Average encoded size across all versions ≈ 3× raw (multiple bitrates, but encoded efficiently)
  • Total storage per day: 1.44 PB × 3 = ~4.3 PB/day
  • Per year: 4.3 PB × 365 = ~1.6 EB/year

Streaming Bandwidth

  • 2B views/day × 7 min avg = 14B minutes of video served/day
  • 14B min ÷ 1,440 min/day = ~9.7M simultaneous streams
  • Average stream at 720p ≈ 2.5 Mbps
  • 9.7M streams × 2.5 Mbps = ~24 Tbps of egress bandwidth
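
As a check, the upload, storage, and bandwidth figures in Python (decimal units; 3× transcoding multiplier and 2.5 Mbps per stream as assumed above):

```python
# Video platform: daily ingest, stored volume, yearly growth, and egress bandwidth.
hours_uploaded_per_day = 500 * 60 * 24            # 500 hrs/min -> 720,000 hrs/day
gb_per_hour_1080p = 2

raw_pb_per_day = hours_uploaded_per_day * gb_per_hour_1080p / 1e6   # ~1.44 PB/day
stored_pb_per_day = raw_pb_per_day * 3                              # ~4.3 PB/day with renditions
eb_per_year = stored_pb_per_day * 365 / 1000                        # ~1.6 EB/year

views_per_day = 2_000_000_000
avg_watch_minutes = 7
minutes_per_day = 24 * 60
concurrent_streams = views_per_day * avg_watch_minutes / minutes_per_day   # ~9.7M streams
egress_tbps = concurrent_streams * 2.5 / 1e6                               # ~24 Tbps at 2.5 Mbps each

print(f"{stored_pb_per_day:.1f} PB/day, {eb_per_year:.1f} EB/yr, "
      f"{concurrent_streams / 1e6:.1f}M streams, {egress_tbps:.0f} Tbps")
```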

Design Conclusions

  • Storage: ~1.6 EB/year → only possible with tiered object storage and aggressive cold-tiering of old content
  • Bandwidth: ~24 Tbps → impossible without a global CDN; origin servers see a tiny fraction of this
  • Transcoding: at 500 hrs/min of upload, you need thousands of parallel transcoding workers — this is a queue/worker problem, not a web request problem
  • At this scale, ~0.01% of videos get 90% of views — aggressive CDN caching of popular content is essential

Example 4: Ride-Sharing Location Service

Design a system to track the real-time location of 1M active drivers, queried by riders.

Assumptions: Real-Time Location at Scale

1M active drivers at peak, each sending location every 5 seconds. 10M active riders at peak, each querying nearby drivers every 10 seconds. Location stored as lat/lng + timestamp + driver_id.

Write QPS (driver location updates)

  • 1M drivers ÷ 5 sec interval = 200,000 location writes/sec
  • Each write: ~50 bytes (driver_id 8B + lat 8B + lng 8B + timestamp 8B + metadata ~18B)
  • Write bandwidth: 200K × 50B = 10 MB/sec — not the constraint

Read QPS (rider queries)

  • 10M riders ÷ 10 sec interval = 1,000,000 reads/sec
  • Each query: find all drivers within, say, 2 km radius
  • This is a geospatial range query — expensive on a standard DB at 1M QPS

In-Memory Footprint

  • Only the current location per driver needs to be hot (latest update, not history)
  • 1M drivers × 50 bytes = 50 MB — fits trivially in Redis
  • Redis GEO commands support geospatial indexing — this is the natural fit
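
In Python, with the update intervals and record size assumed above (the hot state is what would live in a Redis geospatial index):

```python
# Location service: write QPS, read QPS, inbound bandwidth, and hot-state size.
drivers = 1_000_000
riders = 10_000_000

write_qps = drivers / 5                  # one update per driver every 5 s -> 200,000/s
read_qps = riders / 10                   # one nearby-drivers query every 10 s -> 1,000,000/s

bytes_per_location = 50                  # driver_id + lat + lng + timestamp + metadata
write_mb_per_sec = write_qps * bytes_per_location / 1e6    # ~10 MB/s inbound
hot_state_mb = drivers * bytes_per_location / 1e6          # ~50 MB of current positions

print(f"{write_qps:,.0f} w/s, {read_qps:,.0f} r/s, "
      f"{write_mb_per_sec:.0f} MB/s in, {hot_state_mb:.0f} MB hot state")
```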

Design Conclusions

  • Current state (50 MB) fits in a single Redis instance with room to spare
  • 200K writes/sec and 1M reads/sec need a Redis Cluster (shard by geo region / geohash prefix)
  • Reads vastly outnumber writes → separate read replicas per shard reduce hot-spot risk
  • Historical location (for billing/audit) → write to a separate, cheaper, append-only store (Kafka → S3)

Example 5: Rate Limiter at Scale

Design a rate limiter for an API platform: 100M users, limit of 1,000 requests/user/hour.

Assumptions: Distributed Rate Limiter

100M users. 1,000 req/user/hour limit. 10% of users are active at any time. Each rate limit check adds at most 1 ms to request latency.

Peak QPS to the Rate Limiter

  • 10M active users × 1,000 req/hr ÷ 3,600 sec/hr = ~2.8M checks/sec
  • Peak (users burst at start of hour): ~8–10M checks/sec

State per User

  • Sliding window counter: user_id (8B) + count (4B) + window_start (8B) ≈ 20 bytes/user
  • 100M users × 20 bytes = 2 GB total state
  • Fits in a single large Redis node — but 10M QPS does not

Redis Sharding Need

  • Redis single-threaded: ~100K–500K simple ops/sec per node
  • 10M checks/sec ÷ 300K ops/node = ~33 Redis nodes minimum
  • Shard by user_id hash → each node handles ~3M users, ~300K ops/sec
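
The sizing arithmetic in Python; the 300K ops/sec per node and the 10M checks/sec start-of-hour burst are the conservative figures used above:

```python
# Rate limiter: check rate, total state, and Redis node count.
users = 100_000_000
active_fraction = 0.10
limit_per_hour = 1_000

avg_checks_per_sec = users * active_fraction * limit_per_hour / 3_600   # ~2.8M checks/sec
peak_checks_per_sec = 10_000_000                                        # start-of-hour burst

bytes_per_user = 20
state_gb = users * bytes_per_user / 1e9                                 # 2 GB of counters

ops_per_redis_node = 300_000                                            # conservative per-node throughput
nodes_needed = peak_checks_per_sec / ops_per_redis_node                 # ~33 nodes

print(f"{avg_checks_per_sec / 1e6:.1f}M avg checks/s, {state_gb:.0f} GB state, "
      f"{nodes_needed:.0f} Redis nodes")
```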

Design Conclusions

  • State is tiny (2 GB) but throughput (10M ops/s) requires ~33 Redis nodes in a cluster
  • Local token bucket cache per API server (in-process) can absorb 90% of checks without hitting Redis — only sync on bucket exhaustion (see the sketch below)
  • The 1 ms latency budget is easily met with same-datacenter Redis (~0.5 ms round trip) but not with cross-region calls
  • Rate limiter state is ephemeral — losing it on a Redis restart just resets limits, which is acceptable behavior
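
A minimal in-process token-bucket sketch of that idea follows. The class name, the burst capacity of 20, and the stubbed-out fallback to the shared limiter are illustrative assumptions, not a prescribed design:

```python
import time

class LocalTokenBucket:
    """Per-user bucket held on the API server to absorb most rate-limit checks.

    Only when the local allowance runs out would a real implementation fall
    back to the shared (e.g. Redis-backed) limiter; that step is stubbed here.
    """

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Local allowance exhausted: consult the shared limiter before rejecting.
        return False

# 1,000 req/hour is ~0.28 tokens/sec; allow short local bursts of up to 20.
bucket = LocalTokenBucket(capacity=20, refill_per_sec=1_000 / 3_600)
print(bucket.allow())   # True while the local allowance lasts
```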

The Estimation Process in 5 Steps

Use the same structure for any new problem:

Step | What to Do | Watch Out For
1. State assumptions | Write down the numbers you are assuming: DAU, data size, request frequency. Do this first. Do not calculate first and assume second. | Assuming too-round numbers that are off by 10× from reality.
2. Derive QPS | Convert daily/monthly numbers to per-second. Separate read and write QPS. Calculate peak as 2–3× average. | Forgetting that peak load ≠ average load. Systems sized for average will fail at peak.
3. Derive storage | Estimate bytes per record. Multiply by record count. Apply retention period. Add replication factor (typically 3×). | Forgetting replication. A 1 TB dataset stored 3× is 3 TB.
4. Compare to system limits | Look up (or recall from Appendix A) the throughput limit of the storage system you are considering. Is your QPS within that limit? | Comparing to the theoretical max, not the safe operational limit (~70% of max).
5. Draw the design conclusion | The numbers tell you: do you need a cache? Do you need sharding? Do you need a queue? Does a single machine work? Make the architecture follow the math, not the other way around. | Designing the architecture first and then selectively using numbers to justify it.
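
The five steps can be collapsed into a reusable sketch like the one below. The function name, the default 3× replication and peak factors, and the safe limit of 70% of 5,000 TPS (the PostgreSQL figure from Example 1) are assumptions to adjust per problem:

```python
SECONDS_PER_DAY = 86_400

def estimate(requests_per_day: float, read_ratio: float, bytes_per_record: float,
             retention_days: int, replication: int = 3, peak_factor: float = 3.0,
             safe_limit_qps: float = 0.7 * 5_000) -> dict:
    """Steps 1-5 of the process as one pass of arithmetic over stated assumptions."""
    write_qps = requests_per_day / SECONDS_PER_DAY                 # step 2: derive QPS
    peak_read_qps = write_qps * read_ratio * peak_factor           # peak != average
    storage_tb = (requests_per_day * retention_days *
                  bytes_per_record * replication) / 1e12           # step 3: storage, replicated
    return {
        "write_qps": round(write_qps),
        "peak_read_qps": round(peak_read_qps),
        "storage_tb": round(storage_tb),
        "needs_cache_or_sharding": peak_read_qps > safe_limit_qps,  # steps 4-5: compare, conclude
    }

# Example 1 (URL shortener), now with 3x replication folded into storage.
print(estimate(100_000_000, read_ratio=10, bytes_per_record=500, retention_days=365 * 5))
```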

The Most Common Mistake

Estimating storage and forgetting bandwidth — or estimating bandwidth and forgetting IOPS. A system that fits the storage budget can still fail if the I/O pattern requires 10× more IOPS than the storage tier supports. Always check both dimensions.
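
A toy check of both dimensions, with deliberately made-up numbers for the workload and the disk:

```python
# Storage fits; IOPS do not. Numbers are illustrative assumptions only.
dataset_tb = 2                    # comfortably fits on a 4 TB volume
random_reads_per_sec = 20_000     # small, cache-missing reads hitting the disk
volume_capacity_tb = 4
volume_iops_limit = 5_000         # assumed ceiling for a single volume

fits_storage = dataset_tb <= volume_capacity_tb
fits_iops = random_reads_per_sec <= volume_iops_limit

print(fits_storage, fits_iops)    # True, False -> capacity is fine, I/O is the bottleneck
```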
