Part II — Scalability

Chapter 9

Load Distribution

One server can only handle so many requests. Load balancing is how you spread that work across many servers. But the algorithm you pick, where you place the balancer, and how you handle overload all have consequences that quietly shape the reliability and latency of your entire system.

What's in this chapter

  • How load balancers work at Layer 4 vs Layer 7 — and when the difference matters
  • Five load balancing algorithms, what each optimises for, and where each breaks
  • Why "smarter" load balancers often cause more problems than they solve
  • Client-side load balancing — how gRPC, service meshes, and internal services skip the middleman
  • Request hedging: how sending a duplicate request can cut your tail latency without increasing load
  • Backpressure: the only correct response to sustained overload, and why most systems don't implement it
  • Health checks and the failure mode nobody talks about — a check that passes on a broken server

Key Learnings at a Glance

Quick reference
01 Round-robin works well when requests are roughly the same size. It falls apart when a few slow requests pile up on one server while others sit idle.
02 Least connections routes to the server with the fewest active requests. Better for variable-length work, but adds overhead in high-throughput systems.
03 Power of two choices — pick two servers at random, send to the less loaded one — gives near-optimal load distribution with almost no coordination cost.
04 Consistent hashing is right when you need request affinity — caching, stateful sessions, sharding. Use it deliberately, not by default.
05 A load balancer that does too much (SSL termination, auth, rate limiting, retries) becomes a bottleneck and a single point of failure.
06 Request hedging — send a second copy of a slow request to another server — can dramatically reduce tail latency on read-heavy workloads at low cost.
07 Backpressure is the correct response to overload: tell callers to slow down or reject early. Queueing everything indefinitely makes the problem invisible until it explodes.
08 A health check that passes on a degraded server is worse than one that fails. Design health checks that test actual capacity, not just "am I running?"

The Problem of Too Many Requests

Imagine you run a small bakery. At first, one person can take orders, bake, and serve customers. Then you get popular. One person can't keep up. So you hire more people. But now you have a new problem: how do you decide which person handles each customer? You don't want three people standing idle while one person has a queue of twenty customers.

Servers have exactly the same problem. A single web server has limits — a finite number of CPU cores, a finite amount of memory, a limit on how many network connections it can hold open at once. When your traffic grows beyond what one machine can handle, you add more machines. And then you need something to decide which machine handles each request. That something is a load balancer.

Load balancing sounds simple. In practice, the algorithm you choose, the layer at which you balance, and how you handle overload all have real consequences for latency, throughput, and what happens when something goes wrong. This chapter goes through each of those decisions carefully.

What a Load Balancer Actually Does

At its core, a load balancer sits between clients and servers. It accepts incoming connections, looks at each request, and forwards it to one of the backend servers. The client usually doesn't know which backend handled its request — it just talks to the load balancer.

Without load balancing:

  Client A ──────────────────────────────→ Server 1 (overwhelmed)
  Client B ──────────────────────────────→ Server 1
  Client C ──────────────────────────────→ Server 1
                                            Server 2 (idle)
                                            Server 3 (idle)

With load balancing:

  Client A ──→ ┌─────────────┐ ──→ Server 1
  Client B ──→ │    Load     │ ──→ Server 2
  Client C ──→ │  Balancer   │ ──→ Server 3
  Client D ──→ └─────────────┘ ──→ Server 1

The load balancer needs to maintain a list of available backend servers and continuously check which ones are healthy. When a request arrives, it picks a server according to an algorithm (more on this shortly), opens a connection to that server, and proxies the request and response back and forth.

Some load balancers are dedicated hardware appliances. Others are software running on commodity machines (HAProxy, Nginx, Envoy). Many cloud providers give you a managed load balancer as a service (AWS ALB/NLB, GCP Cloud Load Balancing). And increasingly, load balancing is done inside the application itself — we'll get to that in the client-side load balancing section.

Layer 4 vs Layer 7 — Why the Distinction Matters

The terms "Layer 4" and "Layer 7" come from the OSI networking model. Layer 4 is the transport layer (TCP, UDP). Layer 7 is the application layer (HTTP, gRPC, WebSocket). The difference matters because it determines how much the load balancer can see about your traffic.

Layer 4 Load Balancing

A Layer 4 load balancer works at the TCP/UDP level. It sees the source IP, destination IP, and ports — but not the contents of the packets. When a TCP connection comes in, it picks a backend and forwards the raw byte stream to that server. The load balancer doesn't know whether you're sending HTTP, gRPC, or something else.

Because it doesn't inspect payloads, Layer 4 load balancing is extremely fast. It can handle millions of connections per second on commodity hardware. AWS NLB (Network Load Balancer) is a Layer 4 product designed to handle millions of requests per second with single-digit millisecond latency.

The limitation is that you can only route based on connection-level information — IP address, port, protocol. If you have two different services on the same port, or you want to route based on URL path or HTTP headers, Layer 4 can't help you.

Layer 7 Load Balancing

A Layer 7 load balancer understands the application protocol. It can read HTTP headers, inspect the URL path, look at cookies, and even read the request body. This lets you do much more intelligent routing:

  • Send requests for API paths to one backend pool and static assets to another
  • Route users with a canary or beta cookie to a new build of the service
  • Keep gRPC, WebSocket, and plain HTTP traffic on separate fleets

The cost is more CPU and latency. The load balancer has to parse each request, which takes time. AWS ALB (Application Load Balancer) is a Layer 7 product — it's more capable than NLB but also higher latency and lower maximum throughput.

Rule of thumb

Use Layer 4 when you need raw throughput and simple routing. Use Layer 7 when you need content-based routing, TLS termination, or protocol-aware features. Many architectures use both: a Layer 4 load balancer on the edge for performance, with Layer 7 routing happening closer to the application.

Load Balancing Algorithms

The algorithm decides which backend server gets the next request. This is not a trivial decision — the wrong algorithm can leave some servers idle while others are crushed, or cause latency spikes under uneven workloads. Here are the five algorithms you'll encounter most often.

Round-Robin

Round-robin is the simplest algorithm. You have a list of servers: Server 1, Server 2, Server 3. The first request goes to Server 1. The second to Server 2. The third to Server 3. The fourth back to Server 1. And so on, cycling through the list.

Round-Robin distribution:

  Request 1 → Server 1
  Request 2 → Server 2
  Request 3 → Server 3
  Request 4 → Server 1   ← cycles back
  Request 5 → Server 2
  ...

  After 9 requests: each server has handled exactly 3 requests. ✓

Round-robin works well when all requests are roughly the same size and take roughly the same time to process. If you're serving small, uniform API calls, round-robin is perfectly good.

But consider what happens when requests have variable processing times. Say some requests take 1 millisecond and some take 10 seconds (a long-running export, for example). Server 1 gets a 10-second request and is busy for a while. Meanwhile, round-robin keeps sending new requests to Server 1 while it's still working on the slow one. Those new requests queue up, latency increases, and Server 2 and Server 3 sit idle.

Where round-robin breaks

Round-robin is a bad choice for workloads with high variance in request processing time. If 1% of your requests take 1000x longer than average, round-robin will create serious load imbalance. It's also a bad choice when servers have different hardware capacities — a smaller server gets the same number of requests as a large one.

Weighted round-robin addresses the heterogeneous hardware problem. You assign a weight to each server — a server with 4x the capacity gets 4x as many requests. It still doesn't solve the variable-request-time problem, but it does let you mix server sizes in your fleet.
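
A minimal sketch of both variants, assuming an in-process picker and illustrative server names; the naive weight expansion here sends a server's share in a burst, where a production implementation would interleave requests more smoothly:

# Round-robin and weighted round-robin selection (illustrative sketch, not thread-safe)
import itertools

class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)     # cycles through the list forever

    def pick(self):
        return next(self._cycle)

class WeightedRoundRobin:
    def __init__(self, weights):
        # Expand each server proportionally to its weight:
        # {"server-1": 4, "server-2": 1} -> server-1 appears four times per cycle
        expanded = [s for s, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)

lb = WeightedRoundRobin({"server-1": 4, "server-2": 1})   # server-1 has 4x the capacity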

Least Connections

Least connections is a smarter algorithm. Instead of cycling through servers blindly, it keeps track of how many active connections (or requests) each server currently has. Each new request goes to the server with the fewest active connections.

Least connections state (after some requests):

  Server 1: ████████████   12 active connections
  Server 2: ███             3 active connections   ← next request goes here
  Server 3: ██████          6 active connections

This naturally handles the variable-request-time problem. If Server 1 is handling a slow request, it accumulates connections faster than Servers 2 and 3. New requests will route away from it automatically.

The downside is that the load balancer needs to track active connection counts for all backends and update them as connections open and close. At very high throughput — say, hundreds of thousands of short requests per second — this tracking adds meaningful overhead. For most applications, this is fine. If you're building a high-frequency trading system or a DNS server handling millions of tiny UDP packets per second, it might matter.
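
A sketch of the bookkeeping, assuming the caller releases the slot once the response completes:

# Least-connections selection (sketch, single-threaded for clarity)
class LeastConnections:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}      # in-flight request count per server

    def pick(self):
        server = min(self.active, key=self.active.get)   # fewest in-flight requests wins
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1                   # call when the response completes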

Power of Two Choices

Power of two choices is an elegant algorithm from academic research that was popularised by systems like Nginx and various cloud load balancers. Here's how it works:

  1. Pick two backend servers at random
  2. Compare their current load (connection count, queue depth, or another metric)
  3. Send the request to the less loaded of the two

This sounds almost too simple. Why not just pick the globally least-loaded server?

The problem with "pick the least loaded server globally" is that in a distributed system with many load balancer instances, each instance has a slightly stale view of the world. If every instance sees Server 3 as the least loaded and all of them send to it simultaneously, Server 3 gets a thundering herd. This is called the herding problem.

Power of two choices sidesteps this elegantly. By randomly sampling just two servers, different load balancer instances naturally pick different pairs. There's no coordination needed. And mathematically, sampling two and picking the better one produces load distribution that is exponentially better than random but nearly as good as perfect global knowledge. The expected maximum load on any server drops from O(log N / log log N) with random routing to O(log log N) — a dramatic improvement.
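
A sketch of the sampling step, reusing the same per-server in-flight counts as the least-connections example and assuming at least two backends:

# Power of two choices (sketch)
import random

class PowerOfTwoChoices:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}      # in-flight request count per server

    def pick(self):
        a, b = random.sample(list(self.active), 2)              # two distinct servers at random
        server = a if self.active[a] <= self.active[b] else b   # keep the less loaded one
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1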

When to use it

Power of two choices is a great default for stateless services with many backend replicas. It's simple, requires no coordination between load balancer instances, and produces near-optimal load distribution. Netflix, Lyft, and many others use this in their service meshes.

Consistent Hashing

All the algorithms above are stateless — each request is independent, and the same client might hit a different server on every request. That's fine for pure stateless services. But sometimes you want the same client — or the same data — to consistently go to the same server. This is called request affinity or sticky routing.

Two common reasons you'd want this:

  • Caching: if the same user's requests always land on the same server, that server's local cache stays hot. Spread them across the fleet and every server ends up caching the same data, so hit rates drop.
  • Stateful sessions and sharding: when a server holds in-memory session state, or owns a particular shard of the data, requests for that session or shard have to reach that specific server.

The naive implementation is a hash map: hash(user_id) % number_of_servers. User 1234's hash maps to Server 2, so all of User 1234's requests go to Server 2. Simple.

The problem appears when you add or remove a server. With 3 servers, user 1234 goes to 1234 % 3 = 2, which is Server 2. If you add a fourth server, user 1234 now goes to 1234 % 4 = 2 — still Server 2 by coincidence, but user 1235 goes from 1235 % 3 = 2 to 1235 % 4 = 3. In fact, going from N to N+1 servers reroutes roughly N/(N+1) of all users, which for any realistic fleet size is almost everyone, invalidating nearly all your caches at once.

Consistent hashing fixes this. The idea is to map both servers and requests onto a circular ring of hash values (imagine a clock face with 2^32 positions). Each server occupies a position on the ring. A request is assigned to the first server you encounter travelling clockwise from the request's hash position.

Consistent hash ring:

                   0
                   │
               Server A
                   ●
                ╱     ╲
               ╱       ╲
    Server C ●           ● Server B
               ╲       ╱
                ╲     ╱
                   ●
               Server D

  Request hash 42 → closest server clockwise → Server B
  Request hash 91 → closest server clockwise → Server C

  When Server B is removed:
    Only requests previously handled by Server B reroute to Server C.
    All other assignments stay the same.

When you add a server, it only takes over some requests from its neighbours on the ring. When you remove a server, only its requests redistribute. On average, adding or removing one server reshuffles only 1/N of requests, where N is the number of servers.

One practical issue: with few servers, the ring positions might not be evenly distributed, causing some servers to get more requests than others. The solution is virtual nodes — each physical server occupies multiple positions on the ring (often 100-200 positions). This makes the distribution much more even.
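
A compact sketch of a ring with virtual nodes, using a sorted list of positions and binary search for the clockwise lookup; the MD5 hash and the 150 virtual nodes per server are illustrative choices, not requirements:

# Consistent hash ring with virtual nodes (sketch)
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=150):
        self._ring = []                            # sorted list of (position, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def pick(self, key: str) -> str:
        # First virtual node clockwise from the key's position, wrapping around the ring
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
ring.pick("user:1234")    # the same key always lands on the same server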

Algorithm              | Best for                         | Weakness                                   | Overhead
Round-Robin            | Uniform, short requests          | Variable request times                     | Minimal
Least Connections      | Variable-length requests         | High-throughput small requests             | Low-medium
Power of Two Choices   | High throughput, many replicas   | Needs connection count metrics             | Low
Consistent Hashing     | Cache affinity, sharding         | Hot keys send all traffic to one server    | Low-medium
Random                 | Very large, uniform fleets       | Poor for small fleets                      | Minimal

The Smart Load Balancer Trap

Modern load balancers can do a lot more than just route requests. They can terminate TLS, handle authentication, rate-limit clients, retry failed requests, collect metrics, do A/B testing, and transform request headers. It's tempting to centralise all this logic in the load balancer. Please resist this temptation.

Here's why. Every piece of logic you add to a load balancer makes it:

  1. Slower — each request has to go through more processing
  2. Harder to scale — you now need to scale the load balancer itself
  3. A bigger single point of failure — when the load balancer has a bug, it affects everything
  4. A deployment bottleneck — changing auth logic means touching the load balancer, which handles all services

Consider a load balancer that does TLS termination, JWT validation, rate limiting, and retries. An engineer wants to change the rate limiting algorithm. To do so safely, they have to deploy a new version of the load balancer. That affects every service in the cluster simultaneously. The blast radius of a bad deployment is enormous.

The fat load balancer antipattern

When your load balancer does auth, rate limiting, request transformation, A/B testing, canary routing, and retries, it has become a distributed systems god object. It's now the hardest component to change and the most dangerous one to have bugs in. Teams fight over it. Deployments become events. This is how load balancers become the most expensive part of your infrastructure in both money and operational pain.

The better pattern is to make your load balancer do one thing extremely well: route requests to healthy backends. Put auth logic in an auth service or sidecar. Put rate limiting in your API gateway, which is itself a small, focused application. Put retry logic in your client libraries (with proper exponential backoff and jitter, as we'll cover in Chapter 11).

Client-Side Load Balancing

Traditional load balancing puts a proxy between the client and the servers. Every request has to pass through that proxy. This adds a network hop and makes the proxy a potential bottleneck.

Client-side load balancing eliminates the proxy entirely. The client itself knows about all the available backend servers and applies a load balancing algorithm before making the connection. There's no middleman — the client connects directly to a backend.

Traditional (proxy) load balancing:

  Client ──→ Load Balancer ──→ Server 1
                             ↘ Server 2
                             ↘ Server 3
  # Every request goes through the LB

Client-side load balancing:

  Client ──────────────────→ Server 1   (picks Server 1 using P2C)
  Client ──────────────────→ Server 3   (picks Server 3 on next request)
  # Client connects directly, no proxy

How does the client know which servers exist? It queries a service registry — a system that keeps an up-to-date list of healthy servers. Consul, Kubernetes DNS, and etcd are common choices. The client polls the registry periodically or subscribes to updates.

gRPC uses client-side load balancing by default in many configurations. Service meshes like Istio and Linkerd use a variation: instead of the client itself doing the balancing, a sidecar proxy (Envoy, in Istio's case) runs alongside each service instance. The sidecar handles all outbound and inbound connections, applying load balancing, retries, and circuit breaking. From the application's perspective, it just talks to localhost; the sidecar does the rest.

Client-side load balancing is faster (no extra network hop), more resilient (no single proxy to fail), and scales naturally (each client instance does its own balancing). The trade-off is complexity: now every client needs to implement load balancing logic, health checking, and connection management. Sidecar proxies solve this by externalising that logic from the application code.
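
A sketch of what this looks like in application code, assuming a hypothetical registry client with a list_healthy() method (that API is invented for illustration) and power of two choices for selection:

# Client-side load balancing sketch: refresh endpoints from a registry, pick with P2C
import random
import threading
import time

class ClientSideBalancer:
    def __init__(self, registry, service_name, refresh_seconds=10):
        self.registry = registry                   # hypothetical service-registry client
        self.service_name = service_name
        self.active = {e: 0 for e in registry.list_healthy(service_name)}   # assumed method
        threading.Thread(target=self._refresh_loop, args=(refresh_seconds,), daemon=True).start()

    def _refresh_loop(self, interval):
        while True:
            time.sleep(interval)
            fresh = self.registry.list_healthy(self.service_name)
            # Keep in-flight counts for endpoints that survived the refresh
            self.active = {e: self.active.get(e, 0) for e in fresh}

    def pick(self):
        a, b = random.sample(list(self.active), 2)      # assumes at least two healthy endpoints
        choice = a if self.active[a] <= self.active[b] else b
        self.active[choice] += 1
        return choice

    def release(self, endpoint):
        self.active[endpoint] = self.active.get(endpoint, 1) - 1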

Tail Latency and Request Hedging

When you look at service latency, the median (p50) might look great — say, 10 milliseconds. But the 99th percentile (p99) might be 2 seconds. This means 1 in 100 requests takes 200x longer than typical. That's what engineers mean by tail latency.

Tail latency is a real user experience problem. Google's research showed that even small increases in page load time cause a measurable drop in user engagement, and Amazon famously found that every 100ms of latency cost them roughly 1% of sales. And because a complex request often fans out to many backend services, the overall request is only as fast as its slowest call, not its average one. If you call 10 services, each of which is slow 1% of the time, then roughly 10% of your overall requests (1 - 0.99^10 ≈ 9.6%) will hit at least one slow call.

Why tail latency happens

Slow individual requests are caused by many things: garbage collection pauses in JVM applications, lock contention, a cold CPU cache, a disk I/O that had to wait, a kernel scheduling delay, or simply a request that happened to hit a code path that does more work. Some of these are random, some are correlated with load.

Request hedging

One powerful technique to reduce tail latency is request hedging (also called speculative execution). The idea is simple:

  1. Send a request to Server A
  2. If no response arrives within a threshold (say, the p95 latency — fast enough that most responses have already arrived), send the same request to Server B
  3. Use whichever response arrives first, and cancel the other

Request hedging in action:

  t=0ms   Client ──→ Server A   (primary request)
  t=50ms  [no response yet, threshold crossed]
  t=50ms  Client ──→ Server B   (hedge request)
  t=60ms  Server B responds ✓
  t=60ms  Client uses Server B's response
  t=62ms  Client cancels request to Server A

  # Without hedging: waited for Server A (which took 200ms due to GC pause)
  # With hedging: got response in 60ms, much better

The beauty of this technique is that it only costs extra traffic for the small fraction of requests that cross the threshold. If you set the threshold at the p95 latency, only 5% of requests will trigger a hedge. If the hedge responds faster, you've converted a slow tail request into a fast one. The total increase in backend traffic is small — around 5% in this example.

Hedging works well for idempotent read operations. You can safely send a read request twice and use either response. For writes, it's dangerous — you might commit the same write twice if both servers respond before you can cancel. Never hedge non-idempotent writes without careful deduplication logic.
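
A sketch of the hedging pattern with asyncio, assuming an async send(server, request) coroutine supplied by your RPC layer; the 50ms threshold stands in for your measured p95:

# Request hedging sketch (reads only): hedge after a threshold, keep the first response
import asyncio

async def hedged_get(send, request, primary, backup, hedge_after=0.050):
    first = asyncio.create_task(send(primary, request))
    try:
        # Most requests finish before the threshold and never trigger a hedge
        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_after)
    except asyncio.TimeoutError:
        pass

    second = asyncio.create_task(send(backup, request))
    done, pending = await asyncio.wait({first, second}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                              # cancel the slower copy
    return done.pop().result()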

Hedging in practice

Google uses hedging extensively in Bigtable, Spanner, and their RPC framework. The Cassandra client driver supports speculative execution (the same idea). If you have an internal read-heavy service with a noticeable p99/p999 gap, hedging is one of the most cost-effective ways to smooth it out without changing your server-side code.

Backpressure — The Correct Response to Overload

Everything we've discussed so far assumes your backend servers can handle the load you're sending them. What happens when they can't?

The naive response is to queue requests. Your load balancer has a pending request queue. When the backends are busy, new requests sit in that queue and wait. Eventually a server frees up and processes the next request. This feels safe — no requests are lost, and eventually everything gets handled.

In practice, this is one of the most common causes of catastrophic failures in distributed systems.

Why queuing under overload is dangerous

Think about what happens as the queue grows. Requests wait longer and longer. From the client's perspective, the service is extremely slow. Many clients have timeouts — if no response arrives within (say) 5 seconds, they give up and retry. Now you have both the original request and the retry sitting in the queue. The client times out again and retries again. The queue grows faster. More retries arrive. The backend is now doing twice the work, half of which is for requests whose clients have already given up.

This self-reinforcing spiral is called a retry storm or metastable failure. The system reaches a state where it's fully loaded processing useless work (retries for dead clients) and can never catch up. The only way out is to restart the service or manually drain the queue.

The retry spiral:

  t=0s  System at 100% capacity. Queue starts growing.
  t=2s  Queued requests exceed client timeout. Clients retry.
  t=4s  Queue now has original + retry copies. System at 150% load.
  t=6s  Second-generation retries arrive. System at 200% load.
  t=8s  System completely unresponsive. Retries dominate all traffic.
  t=?   Cannot recover without external intervention.

Backpressure: fail fast, fail loudly

The correct response to overload is backpressure: tell your callers explicitly that you are overloaded, so they can stop sending. This is the opposite of queuing. Instead of silently absorbing work you can't handle, you refuse it up front.

In HTTP, this is a 503 Service Unavailable response, ideally with a Retry-After header telling the client when to try again. In gRPC, it's a RESOURCE_EXHAUSTED status. The key is that the response comes back immediately — the client gets a fast error and can decide what to do, rather than waiting in a queue until it times out.

From the client's perspective, receiving a 503 with backoff guidance is much better than a 30-second timeout. The client can back off exponentially, try a different region, or surface a clear error to the user immediately.

Implementing backpressure

There are several mechanisms for implementing backpressure. The right choice depends on your system:

Concurrency limits

Each server maintains a count of its currently active requests. If that count exceeds a threshold, new incoming requests are immediately rejected with a 503. This is sometimes called a concurrency limiter or bulkhead (more on bulkheads in Chapter 11). The threshold should be set conservatively — well below what actually causes the server to fail, so there's headroom.

# Simplified concurrency limiter (Response and process stand in for your framework's types)
import threading

class ConcurrencyLimiter:
    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.active = 0
        self.lock = threading.Lock()          # plain += is not atomic across worker threads

    def handle_request(self, request):
        with self.lock:
            if self.active >= self.max_concurrent:
                return Response(503, "Too busy, try again later")   # reject immediately
            self.active += 1
        try:
            return process(request)           # the real request handler
        finally:
            with self.lock:
                self.active -= 1

Queue with size limit

A small, bounded queue is acceptable — it smooths over short bursts. But the queue must have a hard maximum size. When the queue is full, new requests are rejected. The queue should be short enough that the maximum wait time is within your latency SLO. If your SLO is 500ms and each request takes 10ms to process, a queue of 50 items might be reasonable. A queue of 50,000 items is not.
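
A sketch using the standard library's bounded queue, with the size derived from the latency budget just described (Response is a stand-in for your framework's response type, as in the limiter above):

# Bounded admission queue (sketch): reject immediately when the queue is full
import queue

SLO_SECONDS = 0.5
PER_REQUEST_SECONDS = 0.01
MAX_QUEUE = int(SLO_SECONDS / PER_REQUEST_SECONDS)   # 50 items keeps max wait inside the SLO

pending = queue.Queue(maxsize=MAX_QUEUE)

def admit(request):
    try:
        pending.put_nowait(request)                # accepted; a worker thread will pick it up
        return None
    except queue.Full:
        return Response(503, "Too busy, try again later")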

Token bucket rate limiting

A token bucket maintains a bucket of tokens. Tokens refill at a fixed rate (say, 1000 tokens per second). Each incoming request consumes one token. If the bucket is empty, the request is rejected. This limits the rate of incoming work regardless of how long each request takes.

Token bucket:

  Tokens refill at:  1000 / second
  Bucket capacity:   2000 tokens (allows short bursts)

  t=0.0s  Bucket: 2000 tokens
  t=0.5s  500 requests arrive  → bucket: 1500 tokens
  t=1.0s  1200 requests arrive → bucket: 800 tokens   (1500 + 500 refill, capped at 2000, minus 1200)
  t=1.5s  1500 requests arrive → 800 + 500 refill = 1300 available → 1300 accepted, 200 rejected, bucket: 0
  t=2.0s  Bucket refills to 500 tokens

  # Legitimate burst absorbed, sustained overload rejected cleanly
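
A sketch of the same bucket in code; tokens are refilled lazily from the elapsed time rather than by a background timer, and the numbers match the diagram above:

# Token bucket rate limiter (sketch, single-threaded)
import time

class TokenBucket:
    def __init__(self, refill_per_second=1000, capacity=2000):
        self.refill_per_second = refill_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill based on elapsed time, never exceeding the bucket's capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                               # bucket empty: reject up front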

The important insight is: the best time to reject a request is the moment it arrives, not after it has waited in a queue for 30 seconds. A fast rejection uses almost no resources and gives the caller useful information immediately.

The overload principle

A system that queues everything until it collapses is not more reliable than one that rejects cleanly. It just collapses more slowly and more catastrophically. Explicit backpressure with fast rejection is a sign of a well-designed system, not a fragile one.

Health Checks and Their Lies

Load balancers need to know which backend servers are healthy so they can stop routing to broken ones. They do this through health checks — periodic probes sent to each server. If a server fails a check, it's removed from the pool. When it starts passing again, it's added back.

This sounds straightforward. In practice, health checks have subtle failure modes that can make your system behave worse than if you had no health checks at all.

The shallow health check problem

The most common health check is a simple HTTP GET to a /health or /ping endpoint that returns 200 OK. The server is "healthy" if it responds. But what does this actually tell you?

It tells you the server is running and can accept HTTP connections. It says nothing about whether:

  • the database connection pool is exhausted or the database is unreachable
  • a downstream dependency the server needs (a cache, another service) is down
  • the disk is full, so writes and logging are failing
  • the server is so overloaded that real requests time out, even though a trivial /health response still gets through

A server in any of those states will pass a shallow health check and continue to receive traffic — traffic that will mostly fail. This is often worse than the server being marked unhealthy, because the caller expects the server to handle requests but it can't.

Deep health checks

A deep health check tests actual readiness to serve real traffic. At minimum, it should:

  • verify the server can reach its critical dependencies (a cheap database round-trip, a cache ping)
  • check local resources such as free disk space
  • report how close the server is to its concurrency or capacity limit

# Example deep health check response
{
  "status": "degraded",
  "checks": {
    "database":   { "status": "ok",       "latency_ms": 2  },
    "cache":      { "status": "failing", "error": "connection refused" },
    "disk_space": { "status": "ok",       "free_gb": 45    },
    "concurrency":{ "status": "ok",       "active": 12, "max": 200 }
  }
}

The health check thundering herd

Health checks themselves can become a problem. If you have 100 load balancer instances each checking 50 backend servers every second, that's 5,000 health check requests per second hitting your backends. For a small endpoint that just returns 200, this is usually fine. But if your health check queries the database or does expensive work, you may be adding significant load just to check if the server can handle load.

Keep health check endpoints cheap. Do the minimum work needed to give a reliable signal. If checking the database, use a cached result with a short TTL rather than executing a new query on every probe.
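
One way to keep the probe cheap is to cache its result, a sketch assuming a check_database() helper you already have:

# Cached health probe (sketch): run the expensive database check at most once per TTL
import time

TTL_SECONDS = 5
_cache = {"result": True, "checked_at": 0.0}

def database_healthy() -> bool:
    now = time.monotonic()
    if now - _cache["checked_at"] > TTL_SECONDS:
        _cache["result"] = check_database()        # the real (expensive) probe, assumed to exist
        _cache["checked_at"] = now
    return _cache["result"]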

Hysteresis: don't flap

When a server is borderline — sometimes passing health checks, sometimes failing — the load balancer might rapidly add and remove it from the pool. This is called flapping. Requests that land on the server during a brief "healthy" window might fail, and the constant pool changes create noise in your metrics.

Good load balancers implement hysteresis: require a server to fail multiple consecutive checks before marking it unhealthy, and require multiple consecutive successes before marking it healthy again. A common pattern is: fail 3 checks to remove from pool, pass 2 checks to add back. This prevents a flaky server from causing constant churn.
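
A sketch of that rule as a small per-backend state tracker, using the same 3-failure / 2-success thresholds:

# Health-check hysteresis (sketch): fail 3 checks to remove, pass 2 to add back
class BackendHealth:
    def __init__(self, fail_threshold=3, pass_threshold=2):
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.healthy = True
        self.fails = 0
        self.passes = 0

    def record(self, check_passed: bool) -> bool:
        if check_passed:
            self.passes += 1
            self.fails = 0
            if not self.healthy and self.passes >= self.pass_threshold:
                self.healthy = True                # add back to the pool
        else:
            self.fails += 1
            self.passes = 0
            if self.healthy and self.fails >= self.fail_threshold:
                self.healthy = False               # remove from the pool
        return self.healthy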