Chapter 8 · Part II: Scalability

Caching — The Fastest Code Is Code That Doesn't Run

Caching is one of the most powerful tools in distributed systems. It is also one of the most dangerous. Done right, it makes your system ten times faster. Done wrong, it silently serves stale data, collapses under traffic spikes, and hides problems that only surface when the cache goes cold.

Key Learnings — If You Only Have 5 Minutes

1. A cache is a bet on temporal locality. You're betting that data read once will be read again soon. When that bet is wrong — uniform access patterns, constantly changing data — caching adds overhead without benefit.

2. Cache invalidation is hard because it's a distributed coordination problem in disguise. You need two systems — the cache and the database — to agree on what's current, without a shared transaction boundary.

3. Write-through gives consistency but hurts write latency. Write-behind gives performance but risks data loss. Cache-aside gives flexibility but creates a race condition window. There is no free lunch.

4. Thundering herd happens when your cache saves you from load — until it doesn't. When a popular key expires, every request that was hiding behind it hits the database simultaneously. The solution is locking, jitter, or probabilistic early expiration.

5. Hot keys are a partitioning problem, not a caching problem. If a single key gets 500,000 requests/sec and your cache node handles 100,000 req/sec, adding more cache nodes doesn't help — you need to replicate the hot key or restructure the data.

6. Test your system with a cold cache. Many systems are architected for the steady state (cache warm, hit rate high) and completely untested for the start state. A cold cache at deploy time is the most common cause of post-deploy incidents.

7. Cache hit rate is a lagging indicator. By the time your hit rate drops, users are already experiencing degraded performance. Track it, alert on it, but also build your system to handle a sustained low hit rate without falling over.

Why Caching Works

The core idea behind caching is simple: if you've computed something once, save the result so you don't have to compute it again. That's it. The rest of this chapter is about all the ways that simple idea gets complicated in real systems.

Caching works because of a property called temporal locality — data that was accessed recently is likely to be accessed again soon. Think about a news website. The homepage article gets read by millions of people. The database query to fetch it is the same every time. Without caching, you run that query millions of times. With caching, you run it once, store the result, and serve it from memory for the next hour.

The speedup is dramatic because there's a massive gap between the cost of reading from memory and the cost of reading from a database across a network. Memory access takes roughly 100 nanoseconds. A database read across a network might take 1–10 milliseconds. That's a 10,000x to 100,000x difference. Even a 90% cache hit rate cuts your database read load by 90% and drops your average response time from around 10 ms to roughly 1.5 ms (0.9 × 0.5 ms for cache hits plus 0.1 × 10 ms for misses).

// Latency comparison — why caching matters so much

L1 cache read ~0.5 ns
Main memory (RAM) read ~100 ns
Redis / Memcached (same AZ) ~0.5 ms (500,000 ns)
Database query (simple, warm) ~2–5 ms
Database query (complex/cold) ~50–500 ms
Disk read (SSD) ~0.1 ms

// A cache hit (Redis) is 4–1000x faster than a database query.
// At scale, that difference compounds into real money and real user experience.

When Caching Doesn't Work

Before you add a cache, ask: does my access pattern actually have temporal locality? There are situations where it doesn't, and caching in those situations wastes memory and adds complexity for no gain.

Uniformly distributed access patterns — if each user reads completely unique data (e.g. personalized financial reports), no single item gets read twice. Cache hit rate will be near zero. You're paying the overhead of the cache layer for nothing.

Data that changes faster than the TTL — if a stock price updates every 100ms and your cache TTL is 1 second, you're serving stale data most of the time. Whether that's acceptable depends entirely on the use case.

Write-heavy workloads — if most of your requests are writes, caching reads doesn't help much. The writes still hit the database, and each write potentially invalidates cached data, adding churn.

Very small datasets — if your entire dataset fits in the database's buffer pool, the database is effectively already a cache. Adding another layer in front of it adds latency for cache misses without much benefit.

The Four Write Strategies

When you add a cache to your system, you immediately face a question: when you write data, what do you write first — the cache or the database? And what happens if one of them fails? There are four main strategies, each with a different answer to these questions.

1. Cache-Aside (Lazy Loading)

This is the most common pattern. The application code manages the cache explicitly. When reading, the application checks the cache first. If the data is there (a cache hit), it returns it. If not (a cache miss), it reads from the database, puts the result in the cache, and returns it.

// Cache-aside read path

READ:
1. Check cache for key K
2. HIT: return cached value (fast)
3. MISS: read from database
4. write value to cache with TTL
5. return value

WRITE:
1. Write to database
2. Invalidate (delete) key K from cache
(or just let the TTL expire it naturally)

Cache-aside has an important property: the cache only contains data that has actually been requested. You're not pre-loading things that might never be read. This keeps memory usage efficient and makes cold starts relatively smooth — the cache fills up naturally as real traffic arrives.
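
To make the flow concrete, here is a minimal cache-aside sketch in Python. It assumes a redis-py client and a db object with read_user and update_user methods standing in for your data access layer; all of these names are illustrative, not a prescribed API.

# Cache-aside in Python: minimal sketch, not production code
import json
import redis

r = redis.Redis()          # assumes Redis running locally; adjust for your environment
USER_TTL = 300             # seconds; illustrative value

def get_user(user_id, db):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:                      # hit: the database is never touched
        return json.loads(cached)
    user = db.read_user(user_id)                # miss: fall through to the database
    r.set(key, json.dumps(user), ex=USER_TTL)   # populate the cache for later readers
    return user

def update_user(user_id, fields, db):
    db.update_user(user_id, fields)             # write the source of truth first
    r.delete(f"user:{user_id}")                 # then invalidate; the next read repopulates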

The Race Condition

Cache-aside has a subtle race condition. Thread A reads from the database, gets value V1. Before thread A writes V1 to the cache, thread B writes a new value V2 to the database and deletes the cache key. Then thread A writes the old V1 to the cache. Now the cache has stale data and no one will invalidate it until the TTL expires. This window is usually small, but in high-concurrency systems it's real.

The standard mitigation is to use a short TTL. Even if stale data gets cached, it expires quickly. For most applications, a window of 30–60 seconds of potentially stale data is acceptable. If it's not acceptable for your use case, you need write-through.

2. Write-Through

With write-through, every write goes to both the cache and the database, synchronously, before the write is acknowledged to the caller. The cache and database are always in sync.

// Write-through

WRITE:
1. Write to cache AND database (synchronously, both must succeed)
2. Acknowledge success to caller

READ:
1. Check cache
2. HIT: return (cache always has current data)
3. MISS: data was never written or was evicted — read from DB, populate cache

The big advantage: the cache is always fresh. There's no stale data window. If data is in the cache, you know it's current.

The cost: every write is slower because it touches two systems. If your write latency was 5ms to the database, now it's 5ms to database + 0.5ms to cache = 5.5ms minimum. That's usually acceptable. But there's a worse problem: what do you do if the cache write succeeds but the database write fails? Now your cache has data the database doesn't. You need to either roll back the cache write or accept the inconsistency.

Write-through has a second cost: every piece of data gets written to the cache whether or not it will ever be read, a problem sometimes called cache pollution. If you write a billion records and only 1% of them are ever read, you've wasted memory caching the other 99%.
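
A write-through sketch under the same assumptions (a redis-py client, an illustrative db layer). It writes the database first, so a cache failure leaves a missing or stale entry rather than a phantom one; how to handle that failure is exactly the design decision described above.

# Write-through: minimal sketch; the failure handling shown is one possible choice
import json
import redis

r = redis.Redis()
USER_TTL = 300

def save_user(user_id, user, db):
    key = f"user:{user_id}"
    db.write_user(user_id, user)                       # source of truth first
    try:
        r.set(key, json.dumps(user), ex=USER_TTL)      # keep the cache in lockstep
    except redis.RedisError:
        # Cache and database now disagree. Dropping the key is the least-bad recovery:
        # the next reader takes a miss and repopulates from the database.
        try:
            r.delete(key)
        except redis.RedisError:
            pass                                       # accept a stale window until the TTL expires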

3. Write-Behind (Write-Back)

Write-behind is aggressive about performance. Writes go to the cache immediately and are acknowledged to the caller right away. Writing to the database is deferred — it happens asynchronously in the background after a delay.

// Write-behind (write-back)

WRITE:
1. Write to cache immediately
2. Acknowledge success to caller (fast!)
3. Background process later writes to database (async, batched)

// If the cache node crashes between step 2 and step 3...
→ Data is permanently lost

Write-behind is used in situations where write latency is critical and some data loss is acceptable. SSDs use this pattern internally — writes are cached in a fast buffer and flushed to the slower storage medium later. Your keyboard also uses it: keystrokes are buffered and sent in batches.
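
A toy write-behind sketch, shown only to make the mechanics and the risk visible, not as something to put in front of a relational database. The db.write_batch method is an illustrative stand-in.

# Write-behind (write-back) toy sketch: note what happens if the process dies mid-queue
import queue
import threading

class WriteBehindCache:
    def __init__(self, db):
        self.db = db                          # illustrative data layer with a write_batch() method
        self.store = {}                       # the cache's in-memory copy
        self.pending = queue.Queue()          # writes waiting to be flushed
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, key, value):
        self.store[key] = value               # the caller sees success immediately
        self.pending.put((key, value))        # the database write is deferred

    def _flusher(self):
        while True:
            batch = [self.pending.get()]      # block until there is something to flush
            while not self.pending.empty():   # then drain whatever has accumulated
                batch.append(self.pending.get_nowait())
            self.db.write_batch(batch)        # anything still queued when the process dies is lost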

Data Loss Risk

Write-behind caches are not durable. If the cache node fails before it flushes to the database, those writes are gone. For most application-level use cases — user data, orders, transactions — this risk is unacceptable. Write-behind belongs in storage systems that manage their own durability, not application caches sitting in front of a relational database.

4. Refresh-Ahead

Refresh-ahead tries to predict what you'll need before you need it. Before a cached item expires, the cache automatically refreshes it in the background from the database. When the item's TTL is, say, 60 seconds, at 50 seconds the cache proactively fetches the latest value and resets the TTL.

The benefit is zero latency on cache misses for popular items — the refresh happens before expiry, so users never see the miss. The downside is that you're continuously reading data from the database whether or not anyone has requested it. If you refresh items that are never actually read again, it's wasted load on the database.

Refresh-ahead is a good fit for items that you know will be requested continuously — the homepage, a global configuration object, a heavily accessed product catalog. For everything else, it's usually overkill.

Strategy         Write Latency              Read Latency       Consistency           Risk
Cache-Aside      Fast (DB only)             Miss = slow        Small stale window    Race condition on writes
Write-Through    Slower (DB + cache)        Always fresh       Strong                Cache pollution; partial failure
Write-Behind     Very fast (cache only)     Fast               Lag until flush       Data loss on crash
Refresh-Ahead    Same as backing store      No miss latency    Short stale window    Wasted reads on cold items

Cache Invalidation: The Hard Part

Phil Karlton famously said: "There are only two hard things in computer science: cache invalidation and naming things." He was right, but it's worth understanding exactly why cache invalidation is hard — because it's not a caching problem, it's a distributed coordination problem.

When data changes in your database, you need the cache to reflect that change. The cache and the database are two separate systems. There is no shared transaction that spans both. You cannot update the database and update the cache as a single atomic operation.

This means you are always choosing between two evils: tolerate a window during which the cache serves stale data, or pay extra latency and complexity to coordinate every write across both systems.

The Delete-on-Write Pattern

The most common approach is to delete the cache entry whenever the backing data changes. The next reader takes the miss, reads from the database, and repopulates the cache. This is simple, and simplicity is underrated.

The subtle issue is: when exactly do you delete the cache entry?

Option A: Delete before writing to the database
Problem: between the delete and the write completing, another thread can read the old value from the database (if the write isn't yet committed) and re-populate the cache with stale data.
Option B: Delete after writing to the database
Better, but still has a window: after the write commits but before the cache delete, readers see old data. Usually acceptable.
Option C: Delete after the write AND after a short delay (double-delete)
Wait ~500ms and delete again. This catches the case where another thread had already started reading before the first delete and was about to re-populate the cache with the old value. More robust, but adds complexity.
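
Here is a sketch of Option B with Option C's second delete layered on top, using redis-py. The 500 ms delay and the timer thread are illustrative choices.

# Delete-after-write with a delayed second delete (double-delete)
import threading
import redis

r = redis.Redis()

def update_product(product_id, fields, db, delay=0.5):
    key = f"product:{product_id}"
    db.update_product(product_id, fields)            # 1. commit to the database
    r.delete(key)                                    # 2. delete after the write (Option B)
    # 3. Option C: delete again after a short delay. This catches a reader that loaded
    #    the old row before step 1 committed and repopulated the cache after step 2.
    threading.Timer(delay, lambda: r.delete(key)).start()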

Event-Driven Invalidation

Instead of the application code managing invalidation, you can use the database's change event stream to drive cache invalidation. Tools like Debezium can read the database's transaction log (the binary log in MySQL, the WAL in Postgres) and emit an event every time a row changes. A consumer picks up these events and deletes or updates the corresponding cache keys.

// Event-driven cache invalidation via CDC (Change Data Capture)

Application ──writes──▶ Database

CDC reader (reads transaction log)

Event stream (Kafka / queue)

Cache invalidation consumer

Cache.delete(key)

// The cache is invalidated via the database log, not the application layer.
// Application code doesn't need to know about cache invalidation at all.

This approach is elegant because it decouples the application from cache management. The cache is eventually consistent with the database, driven by the actual commits, not by application code that might forget to invalidate. The downside is operational complexity — you're now running a CDC pipeline, an event stream, and a consumer, all of which can fail.
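
A minimal invalidation consumer, sketched with kafka-python and redis-py. The topic name, the Debezium-style event shape (before and after row images), and the cache key scheme are all assumptions about how the pipeline is configured.

# CDC-driven cache invalidation consumer: event shape depends on your Debezium config
import json
import redis
from kafka import KafkaConsumer

r = redis.Redis()
consumer = KafkaConsumer(
    "dbserver.shop.products",                        # illustrative Debezium topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda m: json.loads(m),
)

for event in consumer:
    change = event.value
    row = change.get("after") or change.get("before")    # deletes carry only "before"
    r.delete(f"product:{row['id']}")                     # application code never has to remember this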

The Thundering Herd

The thundering herd is one of the most insidious failure modes in caching, because it turns a cache's most important job — protecting the database — against itself.

Here's how it happens. Your homepage is served by a cache entry with a 5-minute TTL. That entry absorbs 10,000 requests per minute. At minute 5, the entry expires. Until the cache is repopulated, which takes at least one database round trip (longer if the query is expensive), every arriving request is a miss. Within a second or two, hundreds of identical queries hit the database simultaneously. The database, which was handling maybe 100 queries per minute (everything else served from cache), is suddenly running hundreds of copies of its most expensive query at once. It falls over. Your homepage is down.

The cache entry was protecting the database so well that the database forgot how to handle traffic without it.

Key Insight

The thundering herd is not a rare edge case. Any popular cached item with a TTL-based expiry will trigger it when the TTL fires. The more traffic your cache is absorbing, the worse the thundering herd when expiry happens. High cache hit rate is a warning sign that you need to handle expiry carefully, not a reason to feel safe.

Solving the Thundering Herd

There are three main approaches, and they're often used in combination.

1. Cache Locking (Mutex / Semaphore)

When a cache miss occurs, only one request is allowed to hit the database. The rest wait. Once the first request repopulates the cache, the waiting requests get served from the cache.

// Pseudo-code: cache with mutex lock on miss

value = cache.get(key)
if value is null:
  lock = acquire_lock(key, timeout=500ms)
  if lock acquired:
    value = db.read(key)  // only one thread runs this
    cache.set(key, value, ttl=300s)
    release_lock(key)
  else:
    wait briefly, then value = cache.get(key) // another thread is filling it
return value

This works, but introduces complexity. What if the lock-holder crashes before it can repopulate the cache? You need lock expiry. What if the database read takes longer than the lock timeout? You need to handle that. Distributed locks are their own can of worms (covered in Chapter 33).
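
Here is a concrete sketch of the pattern using Redis's atomic SET with NX and a millisecond expiry as the lock. A production version needs an atomic check-and-release (usually a small Lua script) and more careful retry behaviour, which is part of what Chapter 33 covers.

# Cache miss behind a Redis lock: only one caller per key reaches the database
import json
import time
import uuid
import redis

r = redis.Redis()

def get_with_lock(key, load_from_db, ttl=300, lock_ttl_ms=500):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    lock_key, token = f"lock:{key}", str(uuid.uuid4())
    # SET NX PX: take the lock only if nobody holds it; the expiry stops a crashed
    # holder from blocking everyone forever.
    if r.set(lock_key, token, nx=True, px=lock_ttl_ms):
        try:
            value = load_from_db(key)                    # only this caller hits the database
            r.set(key, json.dumps(value), ex=ttl)
            return value
        finally:
            if r.get(lock_key) == token.encode():        # only release a lock we still own
                r.delete(lock_key)                       # (a real release does this atomically in Lua)
    time.sleep(0.05)                                     # someone else is filling the cache
    cached = r.get(key)
    return json.loads(cached) if cached is not None else load_from_db(key)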

2. TTL Jitter

Instead of all items in a category expiring at the same time, add random jitter to each TTL. If you have 1,000 product pages all cached for 5 minutes, they would all expire simultaneously if cached at the same time (e.g., on a deploy). With jitter, each gets a TTL of 5 minutes ± 30 seconds. Expirations are now spread over a 60-second window instead of a single instant. The thundering herd becomes a gentle drizzle.

// Without jitter — all expire at t=300s
item_1: TTL = 300s → expires at t=300
item_2: TTL = 300s → expires at t=300  ← mass expiry, DB spike
item_3: TTL = 300s → expires at t=300

// With jitter — spread over a 60s window
item_1: TTL = 285s → expires at t=285
item_2: TTL = 312s → expires at t=312  ← smooth load
item_3: TTL = 297s → expires at t=297

// Simple implementation:
ttl = base_ttl + random(-jitter, +jitter)

3. Probabilistic Early Expiration (XFetch)

This is the most elegant solution. Instead of waiting for a cache entry to expire before refreshing it, you probabilistically decide to refresh it early — with increasing probability as the expiry time approaches. Items close to expiry get refreshed in the background before they actually expire, so users never see the miss.

The algorithm: when reading a cache entry, compute a "virtual expiration" time that is slightly earlier than the real expiry. If the current time is past the virtual expiration, refresh the cache. The closer to real expiry you are, the higher the probability of triggering a refresh. This means a few background requests do the refresh work, and the thundering herd never materializes.

XFetch in Practice

The XFetch algorithm was formalized in a 2015 paper. In production, a simpler version is common: when remaining TTL drops below 10% of total TTL, some requests (say, 1 in 100) trigger a background refresh. It's simple to implement and eliminates the thundering herd for popular items with no user-visible latency impact.
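
A sketch of that simpler version: it stores the expiry timestamp alongside the value so readers can see how much TTL remains. The 10% threshold and the 1-in-100 probability come from the description above; the background thread is an illustrative choice.

# Probabilistic early refresh: a few readers refresh the key before it ever expires
import json
import random
import threading
import time
import redis

r = redis.Redis()

def _refresh(key, load_from_db, ttl):
    value = load_from_db(key)
    r.set(key, json.dumps([value, time.time() + ttl]), ex=ttl)
    return value

def get_with_early_refresh(key, load_from_db, ttl=300):
    cached = r.get(key)
    if cached is None:
        return _refresh(key, load_from_db, ttl)          # true miss: load synchronously
    value, expires_at = json.loads(cached)
    remaining = expires_at - time.time()
    if remaining < 0.1 * ttl and random.random() < 0.01:
        # Near expiry, roughly 1 in 100 readers refreshes in the background,
        # so the key never actually expires under load.
        threading.Thread(target=_refresh, args=(key, load_from_db, ttl), daemon=True).start()
    return value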

Cache Stampede

The cache stampede is related to the thundering herd but has a different cause. While thundering herd is about TTL expiry, a stampede is about a cache that didn't exist yet — either because the system just started, the cache was flushed, or a new type of data is being requested for the first time.

Imagine your application deploys. The cache is empty. Traffic arrives. Every single request is a cache miss. Every request hits the database. If your system was designed for a 95% cache hit rate, the database is suddenly getting 20x the queries it was sized for. The database becomes the bottleneck and slows down. As it slows, requests pile up. As requests pile up, connections exhaust. Your database falls over. Your application falls over.

This is called the cold start problem. It's the primary reason many post-deploy incidents happen within 5 minutes of deployment, not at some random later time.

Solutions to Cold Start

Cache warming — before a new instance goes live (or before sending it traffic), run a warming script that reads the most popular keys and populates the cache. You know from your logs and metrics which keys are accessed most. Warm those first.

Gradual traffic ramp-up — don't send 100% of traffic to a new instance immediately. Start at 1%, let the cache fill, then increase. This is exactly what canary deploys do (covered in Chapter 29), and warm-up is one of their underappreciated benefits.

Sticky sessions during rollout — if your cache is local to each instance (an in-process cache), warm one instance before migrating that instance's users to it.

Rate limiting cache misses — apply a request rate limit not just on incoming traffic, but on the rate at which cache misses are allowed to hit the database. This prevents the database from being overwhelmed while the cache warms. Requests beyond the limit either wait briefly or get a graceful degraded response.
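
A sketch of that last idea, using a semaphore to cap how many cache misses may be querying the database at once (a concurrency cap standing in for a rate limit). The limit of 50 and the wait-then-degrade behaviour are illustrative.

# Bounding how many cache misses can hit the database concurrently
import json
import threading
import redis

r = redis.Redis()
miss_budget = threading.BoundedSemaphore(50)        # at most 50 concurrent misses reach the DB

def get_product(product_id, db, ttl=300):
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    if not miss_budget.acquire(timeout=0.2):         # cold cache: wait briefly for a slot
        return None                                  # graceful degradation: caller renders a fallback
    try:
        product = db.read_product(product_id)
        r.set(key, json.dumps(product), ex=ttl)
        return product
    finally:
        miss_budget.release()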

Hot Keys

Hot keys are a subtler problem than thundering herd, and they hit systems with distributed caches (like a Redis cluster with multiple shards). The issue is that all requests for a given key go to the same shard — the shard that owns that key. If one key is extremely popular, that one shard becomes a bottleneck, even if all other shards are idle.

A distributed cache solves the problem of total cache capacity. It does not solve the problem of a single key receiving a disproportionate share of traffic. Those are different problems.

Consider a social media platform where a celebrity posts something that goes viral. Every feed refresh checks if that post's data is cached. Suddenly, one Redis shard gets 500,000 requests per second for the same key, while the other 9 shards are handling 50,000 requests per second each. The hot shard becomes the bottleneck for the entire system.

Solving Hot Keys

Local In-Process Cache

Add a small, short-TTL in-process cache in front of your distributed cache. Each application server keeps a local copy of hot keys in memory for 1–5 seconds. Requests never leave the process for those keys. With 100 application servers, you've effectively replicated the hot key 100 times — one copy per server — without any coordination.

// Two-tier cache architecture for hot key mitigation

Request → Local cache (in-process, ~100 items, 2s TTL)
            HIT: return immediately (sub-millisecond)
            MISS:
Redis cluster (distributed, millions of items, 5min TTL)
            HIT: return, populate local cache
            MISS:
Database
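
A sketch of the two tiers in Python, using cachetools.TTLCache for the in-process layer (an assumption; any small local TTL map works) in front of the shared Redis cluster.

# Two-tier lookup: tiny local cache, then Redis, then the database
import json
import cachetools                                   # assumption: pip install cachetools
import redis

r = redis.Redis()
local = cachetools.TTLCache(maxsize=100, ttl=2)     # tiny, short-lived, one per process

def get(key, load_from_db, redis_ttl=300):
    value = local.get(key)                          # tier 1: never leaves the process
    if value is not None:
        return value
    cached = r.get(key)                             # tier 2: shared distributed cache
    if cached is not None:
        value = json.loads(cached)
    else:
        value = load_from_db(key)                   # tier 3: the database
        r.set(key, json.dumps(value), ex=redis_ttl)
    local[key] = value                              # every path back fills the local tier
    return value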

Key Replication with Suffix Sharding

Instead of storing a hot item under a single key, store it under multiple keys — one per shard — by appending a random suffix. When reading, pick a random suffix. Now the load for that "key" is spread across multiple shards.

// Hot key: "celebrity_post_98765"
// Normal: all requests go to shard that owns this key

// With suffix sharding: replicate to N shards
write: store "celebrity_post_98765_0" on shard 0
store "celebrity_post_98765_1" on shard 1
store "celebrity_post_98765_2" on shard 2
... up to N shards

read: key = "celebrity_post_98765_" + random(0, N)
// load is now spread across N shards

The tradeoff: invalidation becomes N deletes instead of one. And you need to know upfront which keys will be hot, or detect them dynamically; some caching layers detect hot keys at runtime and apply suffix sharding automatically.
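
A runnable sketch of the suffix-sharding idea above. The replica count of 8 is illustrative, and deciding which keys deserve this treatment, whether by static configuration or runtime detection, is left out.

# Suffix sharding for a hot key: spread one logical key across N shard-local copies
import json
import random
import redis

r = redis.Redis()
REPLICAS = 8                                        # illustrative replica count

def set_hot(key, value, ttl=300):
    payload = json.dumps(value)
    for i in range(REPLICAS):                       # write every replica
        r.set(f"{key}_{i}", payload, ex=ttl)

def get_hot(key):
    suffix = random.randrange(REPLICAS)             # read any one replica at random
    cached = r.get(f"{key}_{suffix}")
    return json.loads(cached) if cached is not None else None

def invalidate_hot(key):
    r.delete(*[f"{key}_{i}" for i in range(REPLICAS)])   # the cost: N deletes instead of one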

What Caches Hide

A high cache hit rate is wonderful for performance. It also masks problems that will hurt you when the cache isn't there. This section is about the things that only surface when your cache goes cold, gets evicted, or disappears.

Slow Queries You Forgot About

You added a cache in front of a slow database query. The query now runs rarely — only on cache misses. You never notice the 800ms query time because 99% of traffic hits the cache. Then Redis goes down for a maintenance window. Suddenly every request is 800ms. Users see timeouts. Your error budget burns down in minutes.

The fix: treat your caches as voluntary optimizations, not structural requirements. Your system should handle a 0% hit rate — slowly, perhaps, but correctly. Run load tests with an empty cache. Profile and optimize slow queries independently of whether you've added a cache in front of them.

Database Connection Pool Exhaustion

Your database connection pool is sized for your normal traffic profile, which assumes most requests hit the cache. When the cache goes cold, 10x more requests hit the database simultaneously. Even if the database can handle the query load, the connection pool exhausts. Requests queue waiting for a connection. The queue grows. Threads block. Memory fills up. The application server falls over.

Size your connection pool for uncached traffic, not cached traffic. Use connection pool metrics as a leading indicator — if average pool utilization is climbing toward 70%, you're getting closer to the failure threshold.

Stale Data Treated as Truth

In systems with long TTLs, the cache can serve data that is significantly out of date without anyone noticing — because by definition, cache hits don't trigger the code path that reads fresh data. If your business logic has changed, if your data model has evolved, if a bug produced bad data that was then cached — all of that gets served until the TTL expires. And if the TTL is long (days or weeks, which happens in session caches and configuration caches), the stale data can persist for a very long time.

The Long-TTL Trap

Long TTLs are tempting because they reduce cache misses. But they also mean that corrections to bad data take a long time to propagate. If you set a TTL of 24 hours and discover at 9am that you cached incorrect data at 8am, every user who hit that cache entry between 8am and 9am got bad data — and will continue to get bad data until their cached copy expires. The ability to invalidate a cache entry immediately is not optional; it's a production requirement.

Cache Eviction Policies

A cache has limited memory. When it fills up, it must decide what to remove to make room for new items. The eviction policy controls that decision, and the wrong choice can cause surprising behavior under load.

LRU (Least Recently Used) — evicts the item that was accessed longest ago. This is the most common policy and a good default. It preserves recently accessed items, which aligns well with temporal locality. Redis's allkeys-lru mode uses this.

LFU (Least Frequently Used) — evicts the item that has been accessed the fewest times total. Better for access patterns where some items are continuously popular (a top-10 product list) versus just recently popular. But it can keep stale "historically popular" items too long.

TTL-based expiry — items expire at a fixed time regardless of access frequency. Simplest to reason about, and often the right default for application-level caches.

Random eviction — evicts a random item. Surprisingly competitive with LRU in some workloads, and much cheaper to implement. Redis's allkeys-random policy.

No eviction — when the cache fills up, new writes fail. Dangerous for application caches (you'll get errors), but appropriate for caches where losing data is worse than failing writes (session stores, distributed locks).
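
For Redis specifically, the memory ceiling and eviction policy are server configuration. A minimal sketch of setting them from redis-py; the values are illustrative, and the same settings can live in redis.conf instead.

# Redis eviction configuration: make eviction kick in before the node runs out of memory
import redis

r = redis.Redis()
r.config_set("maxmemory", "2gb")                    # illustrative ceiling
r.config_set("maxmemory-policy", "allkeys-lru")     # or allkeys-random, volatile-ttl, noeviction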

Measuring Cache Health

You cannot improve what you cannot measure. These are the metrics that matter for cache health.

Hit rate — percentage of requests served from cache. For most application caches, you want this above 90%. Below 80% means you're not getting much benefit and should question the caching strategy. Track this per key namespace, not just globally — a low hit rate on one type of data can hide a good hit rate elsewhere.

Eviction rate — how many items are being evicted per second. A high eviction rate means your cache is too small for the working set. Either increase cache size or reduce the data you're storing. If eviction rate spikes suddenly, your working set size suddenly grew — investigate why.

Memory usage — monitor this continuously. A cache that grows unbounded in memory eventually causes out-of-memory crashes. Set a memory limit in your cache configuration and make sure your eviction policy activates before you hit it.

Miss latency — the latency experienced when a cache miss triggers a database read. This is the latency your users experience for uncached requests. Alert on this separately from cache hit latency. If miss latency degrades, your database is under stress.

Connection count / pool utilization — how many connections are open to the cache. Near-max utilization is a precursor to connection errors and timeouts.
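
Several of these numbers come straight out of Redis's INFO command. Here is a sketch of pulling hit rate, evictions, and memory usage via redis-py; the field names are Redis's own, and the alerting thresholds are up to you.

# Reading cache health metrics from Redis INFO
import redis

r = redis.Redis()
stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
print(f"hit rate: {hit_rate:.1%}  evicted keys: {stats['evicted_keys']}")

memory = r.info("memory")
print(f"used memory: {memory['used_memory_human']}")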

When Not to Use a Cache

This deserves its own section because the instinct to "add a cache" is often the first response to any performance problem, and it's often the wrong one.

When you need strong consistency — if your application cannot tolerate reading data that is even milliseconds old (financial balances, inventory counts, seat reservations), a TTL-based cache is the wrong tool. The cure is worse than the disease. Fix the query instead.

When the problem is write throughput — caches help with read throughput. If your bottleneck is write throughput, you need write-side solutions: batching, write queues, sharding, or a different data model.

When the problem is a missing index — a 500ms database query that should take 2ms is a missing index problem. Putting a cache in front of it hides the problem without fixing it. Someone will one day remove the cache, and the 500ms query will be there waiting for them.

When your data fits in the database buffer pool — a well-tuned database with enough RAM keeps its hot data in an in-memory buffer pool. For small datasets, the database is already effectively serving from memory. An application-level cache adds a network hop and complexity for minimal gain.

A cache is a compensating mechanism. Before reaching for one, ask why the underlying system is too slow. Sometimes the cache is the right answer. Sometimes it's the thing that lets you avoid fixing the real problem for another two years.