Part I — Foundations

Chapter 1

The Eight Fallacies Are Still Killing Projects

The most dangerous assumptions aren't the ones you haven't heard of. They're the ones you know, nodded at in a talk once, and then forgot while writing code.

What's in this chapter

In 1994, a group of engineers at Sun Microsystems listed eight assumptions that developers — even experienced ones — quietly make about the network. They called them the Fallacies of Distributed Computing. Thirty years later, these same assumptions are still the root cause of some of the most painful production incidents.

This chapter goes through each fallacy. Not as a list to memorize, but as a practical guide to the specific bugs, failures, and design mistakes each one creates — and what to do about them.

  • Why networks drop messages — and how often
  • The latency numbers that should guide your design
  • Why bandwidth limits surprise you at scale
  • Security holes baked in by ignoring fallacy 4
  • How topology changes break "stable" systems
  • Multi-team ownership and config drift
  • The hidden cost of moving bytes across the network
  • Encoding mismatches that cause silent data corruption

Key Learnings

Short on time? These are the ideas worth carrying out of this chapter.

  • 01 Networks drop, delay, and reorder messages — not rarely, but routinely. Any code path that doesn't handle this has a latent bug.
  • 02 A network hop to a database in the same datacenter takes ~500 microseconds. A cross-region call takes 50–150ms. Treating these as equivalent destroys user experience and throughput.
  • 03 Bandwidth feels infinite until you serialize a large object millions of times a second. Payload size is a design decision, not an afterthought.
  • 04 Trusting internal network traffic is how attackers move laterally after breaching one service. Zero trust means authenticating every call, even internal ones.
  • 05 Hardcoded IPs, service discovery without health checks, and clients that don't reconnect — these break every time a deploy, scaling event, or failover changes the topology.
  • 06 When multiple teams own different parts of your infrastructure, config drift is inevitable. The solution is policy-as-code, not coordination meetings.
  • 07 Cloud providers charge for data leaving a region. Architectures that shuffle data across regions can have infrastructure bills that dwarf compute costs.
  • 08 Encoding mismatches — different OS, different language runtimes, different byte orders — cause bugs that only appear in production, under specific conditions, and are extremely hard to reproduce.

The Gap Between Knowing and Designing

Ask any experienced engineer if the network is reliable. They'll say no. Ask them if latency is zero. Of course not. Ask if bandwidth is infinite. Never.

Then look at their code.

You'll find HTTP calls with no timeout. Retry logic that fires on every error, including ones that can't succeed on retry. Services that expect a downstream call to return in "a few milliseconds" because it always has in development. Config files with hardcoded IP addresses that haven't changed in two years — and will break the moment someone scales the database.

The fallacies are not about ignorance. Most engineers know them. The problem is that knowing something and designing as if it's true are two different skills. This chapter is about closing that gap.

The eight fallacies grew out of Sun Microsystems: Peter Deutsch wrote down the first seven in 1994, building on assumptions colleagues had already been cataloguing, and James Gosling (the creator of Java) added the eighth shortly after. They were written down because Sun engineers kept seeing the same mistakes, project after project, made by smart people who knew better. That is still true today.

01

"The network is reliable."

Reality: Messages get lost. Connections drop. Packets arrive corrupted.

At first glance this feels obvious. Of course networks fail. But the failure modes are more subtle than "connection refused." A TCP connection can appear open — the OS hasn't noticed it's broken yet — while messages sent into it disappear silently. A request can leave your server and never arrive at the destination. A request can arrive, be processed, and the response can be lost. From the caller's perspective, all of these look identical: you sent a message and heard nothing back.

This is the fundamental problem. You cannot, by waiting, tell the difference between "the server is slow," "the message was lost," and "the server processed it but the response was lost." Distributed systems researchers call this the Two Generals Problem — guaranteed agreement across an unreliable channel is provably impossible.

Real Numbers

Google published data showing that in their datacenters, roughly 0.01% of packets are dropped in normal operation. That sounds tiny. But if a user action triggers 100 service calls (not unusual in a microservices architecture), the probability that at least one call has a problem is 1 − (0.9999)^100 ≈ 1%. That's one in every hundred user requests hitting a network issue — before you've written a single bug.

The engineering response to this fallacy is two-pronged. First, every network call needs a timeout — a maximum time you're willing to wait before concluding that something has gone wrong. Second, operations that matter need to be idempotent, meaning you can safely retry them without causing double-effects.

The Common Mistake

Setting no timeout at all, or setting it to a round number like 30 seconds "to be safe." A 30-second timeout means a thread is held for 30 seconds on every failure. Under load, this exhausts your thread pool and brings down the caller, not just the downstream service. Timeouts should be calibrated to the 99th percentile of normal latency, not picked arbitrarily.

There is a third, deeper implication. When you receive no response, you are in a state of genuine uncertainty about what happened on the other side. Your system needs to be designed to handle this uncertainty — not assume success, not assume failure, but be able to reconcile state later when you do get information. This concept runs through much of distributed systems design: act on what you know, record what you don't, reconcile when you can.
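To make the pattern concrete, here is a minimal Go sketch of a call with a calibrated timeout and a retried idempotent operation. The endpoint, header name, deadlines, and retry counts are illustrative assumptions, not a prescription:

    package payments

    import (
        "context"
        "crypto/rand"
        "encoding/hex"
        "fmt"
        "net/http"
        "time"
    )

    // newIdempotencyKey returns a random key the server can use to deduplicate retries.
    func newIdempotencyKey() string {
        b := make([]byte, 16)
        rand.Read(b) // crypto/rand; error ignored for brevity in this sketch
        return hex.EncodeToString(b)
    }

    // chargeCard issues a POST with a hard deadline and retries on timeouts and
    // 5xx responses. Every attempt carries the same Idempotency-Key, so a retry
    // after a lost response cannot double-charge: the server replays the first outcome.
    func chargeCard(client *http.Client, url string) error {
        key := newIdempotencyKey()
        var lastErr error
        for attempt := 0; attempt < 3; attempt++ {
            // Deadline calibrated to ~p99 of normal latency, not an arbitrary 30s.
            ctx, cancel := context.WithTimeout(context.Background(), 800*time.Millisecond)
            req, _ := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
            req.Header.Set("Idempotency-Key", key)

            resp, err := client.Do(req)
            if err != nil {
                cancel()
                lastErr = err // timeout or transport failure: outcome unknown, retry is safe
                time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond) // simple backoff
                continue
            }
            resp.Body.Close()
            cancel()
            switch {
            case resp.StatusCode < 300:
                return nil // confirmed success
            case resp.StatusCode < 500:
                return fmt.Errorf("permanent failure: %s", resp.Status) // a retry cannot fix a 4xx
            default:
                lastErr = fmt.Errorf("server error: %s", resp.Status) // 5xx: retry with the same key
            }
        }
        return lastErr
    }

Note what the code does on a transport error: it does not assume failure. It records that the outcome is unknown and retries with the same key, letting the server reconcile.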

02

"Latency is zero."

Reality: Every hop costs time. That time adds up faster than you think.

Latency is the time it takes for a message to travel from one place to another. In a local function call, this is measured in nanoseconds. Across a network, it's measured in microseconds or milliseconds. Those don't sound far apart, but they're different by a factor of a thousand or more.

The Numbers You Should Know by Heart

  • L1 cache reference: 0.5 ns
  • Main memory reference: 100 ns
  • SSD random read: ~100 μs (100,000 ns)
  • Network round-trip, same datacenter: ~500 μs
  • Network round-trip, same region (different AZ): ~1–2 ms
  • Network round-trip, US East to US West: ~70 ms
  • Network round-trip, US to Europe: ~100–150 ms

The reason these numbers matter is that they shape what is architecturally possible. If your API needs to respond in under 100ms and each database call takes 1–2ms, you have a budget of about 50 database calls before you're in trouble — and that budget assumes nothing else is happening. Add logging, serialization, authentication, and a couple of upstream API calls, and your budget disappears fast.

The mistake that violates this fallacy in code is what's called the N+1 problem: making a database or service call inside a loop. Fetch 100 users, then fetch each user's profile in a loop. That's 101 network calls where 2 would do. In development on your laptop, with a local database, this runs in milliseconds. In production, with a remote database under load, it runs in seconds.
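In code, the two shapes look like this. A Go sketch in which fetchUsers, fetchProfile, and fetchProfiles are hypothetical clients standing in for whatever your service actually calls:

    package profiles

    import "context"

    type User struct{ ID string }
    type Profile struct{ Bio string }

    // Hypothetical network clients; each call is a round-trip.
    func fetchUsers(ctx context.Context) ([]User, error)                     { return nil, nil }
    func fetchProfile(ctx context.Context, id string) (Profile, error)       { return Profile{}, nil }
    func fetchProfiles(ctx context.Context, ids []string) ([]Profile, error) { return nil, nil }

    // loadProfilesNPlusOne makes 1 + N network calls: at ~1–2ms each,
    // 100 users costs 100–200ms of pure latency before any work happens.
    func loadProfilesNPlusOne(ctx context.Context) ([]Profile, error) {
        users, err := fetchUsers(ctx)
        if err != nil {
            return nil, err
        }
        var out []Profile
        for _, u := range users {
            p, err := fetchProfile(ctx, u.ID) // a network round-trip inside a loop
            if err != nil {
                return nil, err
            }
            out = append(out, p)
        }
        return out, nil
    }

    // loadProfilesBatched makes exactly 2 network calls regardless of N.
    func loadProfilesBatched(ctx context.Context) ([]Profile, error) {
        users, err := fetchUsers(ctx)
        if err != nil {
            return nil, err
        }
        ids := make([]string, len(users))
        for i, u := range users {
            ids[i] = u.ID
        }
        return fetchProfiles(ctx, ids) // one batch round-trip for all profiles
    }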

Real-World Example

A team at a large e-commerce company noticed their checkout API was fast on most days but randomly slow on others. The root cause: they were loading the user's cart, then loading each item in the cart one by one from a product service. On Fridays, carts were bigger. More items meant more calls. With a 2ms average per call and 40 items in a cart, that's 80ms just for product lookups — before anything else happened. The fix was a single batch call to fetch all products at once.

Latency also has a tail problem. Your p50 latency (the median) might be 20ms. Your p99 (the 99th percentile) might be 200ms. Your p99.9 might be 2 seconds. If a single user request fans out to 10 services in parallel, the response time is determined by the slowest of those 10 calls. Statistically, the chance that at least one of 10 calls hits the p99 is 1 − (0.99)^10 ≈ 10% — ten times the chance for a single call. This is why high-percentile latency ("tail latency") matters even if most calls are fast.

Design Insight

One technique for fighting tail latency is called request hedging: send the same request to two servers simultaneously and use whichever responds first. You pay a small extra cost in server load, but you cut your tail latency significantly. Google's "The Tail at Scale" paper described this approach for reads — if a replica doesn't respond within a threshold, send the same request to a second replica. The cost is low; the benefit at the tail is large.
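A minimal Go sketch of a hedged read, assuming a caller-supplied hedge delay and placeholder replica query functions; a real threshold would come from your own latency percentiles:

    package reads

    import (
        "context"
        "time"
    )

    // hedged fires the request at the first replica, and if no response arrives
    // within hedgeDelay (set near your p95), fires the same request at the next
    // replica. Whichever responds first wins; the context cancels stragglers.
    func hedged(ctx context.Context, replicas []func(context.Context) (string, error), hedgeDelay time.Duration) (string, error) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // cancel the losers once we have a winner

        type result struct {
            val string
            err error
        }
        results := make(chan result, len(replicas)) // buffered so losers never block

        send := func(query func(context.Context) (string, error)) {
            v, err := query(ctx)
            results <- result{v, err}
        }

        go send(replicas[0])
        timer := time.NewTimer(hedgeDelay)
        defer timer.Stop()

        next := 1
        for {
            select {
            case r := <-results:
                return r.val, r.err // first answer wins (a production version might wait for a success)
            case <-timer.C:
                if next < len(replicas) {
                    go send(replicas[next]) // slow primary: hedge to another replica
                    next++
                    timer.Reset(hedgeDelay)
                }
            case <-ctx.Done():
                return "", ctx.Err()
            }
        }
    }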

03

"Bandwidth is infinite."

Reality: Bandwidth is finite, and serialization is expensive.

Modern datacenter networks can move 10–100 Gbps between servers. That's genuinely fast. It's easy to look at that number and assume bandwidth will never be a constraint. It will.

The problem isn't usually a single large transfer. It's many small objects, serialized and sent millions of times per second. Consider a service that serializes a 10KB JSON object on every request. At 10,000 requests per second, that's 100MB/s of network traffic just for that one object type. Add compression, logging, health checks, and metric scraping, and you're moving significant data — even if each individual message looks small.

More importantly, serialization is CPU-intensive. Converting a Java object to JSON and back requires CPU time on both ends. At low throughput this is invisible. At high throughput it shows up as CPU pressure that looks inexplicable until you profile it.
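It's worth measuring this rather than guessing. Here is a minimal Go benchmark sketch, with a made-up Order payload standing in for a real message type (place it in a _test.go file):

    package payload

    import (
        "encoding/json"
        "testing"
    )

    // Order is a made-up payload standing in for a real message type.
    type Order struct {
        ID       string   `json:"id"`
        UserID   string   `json:"user_id"`
        Items    []string `json:"items"`
        Subtotal float64  `json:"subtotal"`
        Tax      float64  `json:"tax"`
        Total    float64  `json:"total"`
        Currency string   `json:"currency"`
        Region   string   `json:"region"`
    }

    // BenchmarkMarshal measures the per-call CPU cost of JSON encoding.
    // Run with: go test -bench=. -benchmem
    // Multiply ns/op by your request rate to see the total CPU spent on
    // serialization alone; at high throughput it is a visible slice of a core.
    func BenchmarkMarshal(b *testing.B) {
        o := Order{
            ID: "o-123", UserID: "u-456",
            Items:    []string{"sku-1", "sku-2", "sku-3"},
            Subtotal: 99.0, Tax: 8.25, Total: 107.25,
            Currency: "USD", Region: "us-east-1",
        }
        for i := 0; i < b.N; i++ {
            if _, err := json.Marshal(&o); err != nil {
                b.Fatal(err)
            }
        }
    }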

The Fat Response Problem

APIs that return entire objects when callers only need a few fields are a common bandwidth waste. A mobile API that returns a full user profile — 50 fields — when the screen only shows a name and avatar is sending ~10x more data than necessary. At scale, this costs real money in bandwidth and real seconds in user-perceived load time, especially on mobile networks.

The design implication is that payload size is a design decision, not an afterthought. When designing APIs, think about what the caller actually needs. When designing data pipelines, think about what fields are being carried through the system and whether all of them are necessary at each stage.
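A Go sketch of the difference, with made-up FullProfile and CardView types; the point is that a view type, not the storage type, should define the wire payload:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // FullProfile stands in for a 50-field user record (abbreviated here).
    type FullProfile struct {
        ID, Name, Email, Phone, Address, City, Country string
        AvatarURL, Bio, Locale, Timezone, Plan         string
        // ...dozens more fields in the real record
    }

    // CardView carries only what the screen actually renders.
    type CardView struct {
        Name      string `json:"name"`
        AvatarURL string `json:"avatar_url"`
    }

    func main() {
        full := FullProfile{ID: "u1", Name: "Ada", AvatarURL: "https://cdn.example/a.png"}
        view := CardView{Name: full.Name, AvatarURL: full.AvatarURL}

        f, _ := json.Marshal(full)
        v, _ := json.Marshal(view)
        // The ratio below, multiplied by your traffic, is the bandwidth being wasted.
        fmt.Printf("full=%dB view=%dB ratio=%.1fx\n", len(f), len(v), float64(len(f))/float64(len(v)))
    }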

This also affects protocol choice. JSON is human-readable and easy to debug, but it's verbose. Protobuf encodes the same data in 3–10x fewer bytes and is faster to serialize/deserialize. The right choice depends on your traffic patterns, but the choice should be conscious.

04

"The network is secure."

Reality: Internal networks are not trusted. Traffic can be intercepted.

For most of computing history, the security model was simple: put a firewall at the perimeter. Traffic outside the perimeter is untrusted; traffic inside it is trusted. This is called a "castle and moat" model. Get inside the castle walls, and you can walk freely.

This model has a catastrophic flaw: once an attacker gets past the perimeter — through a phishing attack, a compromised credential, a vulnerable public-facing service — they have relatively free movement within the internal network. Every breach investigation at large companies follows a similar pattern: initial foothold in one low-value service, lateral movement to higher-value services, exfiltration. The internal network was "trusted," so the attacker moved freely.

The Target Breach — 2013

Attackers gained access to Target's network via a third-party HVAC vendor that had network access for remote monitoring. Once inside, they moved laterally to the payment systems. The internal network trusted the vendor's access implicitly. 40 million credit card numbers were stolen. The perimeter was breached through a non-obvious entry point, and nothing inside stopped the movement.

The modern answer is zero trust networking: assume the internal network is hostile. Every service, when receiving a request from another internal service, must authenticate and authorize that request. "It's coming from inside the datacenter" is not a reason to trust it.

In practice this means:

  • mTLS (mutual TLS) between services — both sides of a connection prove their identity with certificates
  • Service identity — each service has a cryptographic identity (frameworks like SPIFFE/SPIRE manage this at scale)
  • Encryption in transit — all traffic between services is encrypted, even on internal networks
  • Authorization at every service — "can service A call endpoint X on service B?" is checked, not assumed

This adds complexity and latency (TLS handshakes cost time). But the alternative — trusting internal traffic — means a single compromised service can reach everything.
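Here is a minimal Go sketch of the server half of mTLS, assuming illustrative certificate file paths and port:

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "log"
        "net/http"
        "os"
    )

    // The client must present a certificate signed by our internal CA, or the
    // TLS handshake fails before any handler runs. File paths are assumptions.
    func main() {
        caPEM, err := os.ReadFile("internal-ca.pem")
        if err != nil {
            log.Fatal(err)
        }
        caPool := x509.NewCertPool()
        caPool.AppendCertsFromPEM(caPEM)

        server := &http.Server{
            Addr: ":8443",
            TLSConfig: &tls.Config{
                ClientCAs:  caPool,                         // who we trust to sign client certs
                ClientAuth: tls.RequireAndVerifyClientCert, // "from inside the network" is not enough
                MinVersion: tls.VersionTLS13,
            },
            Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                // r.TLS.PeerCertificates[0] carries the caller's verified identity,
                // which an authorization layer can check against policy.
                w.Write([]byte("ok"))
            }),
        }
        log.Fatal(server.ListenAndServeTLS("server-cert.pem", "server-key.pem"))
    }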

The Blast Radius Principle

When thinking about security in a distributed system, always ask: if this service is compromised, what can an attacker reach? A well-designed zero-trust system minimizes the blast radius of any single compromise. A service that handles email should not be able to call the payment API. Authorization policies make this explicit and enforced, not just hoped for.

05

"Topology doesn't change."

Reality: Nodes come and go. IPs change. Services move.

Topology is the arrangement of your systems — which servers exist, where they are, what IPs they have, which services run where. In a static world, you set this up once and it stays fixed. In practice, topology changes constantly.

Deploys replace old instances with new ones at different IPs. Auto-scaling adds and removes nodes based on load. Failures cause failovers — a primary database goes down and a replica takes over, often with a new address. Cloud providers migrate VMs to different physical hosts during maintenance windows. Kubernetes reschedules pods when nodes go down.

Code that violates this fallacy looks like:

  • Hardcoded IP addresses in config files
  • DNS lookups cached indefinitely (Java's JVM, famously, can cache DNS results for its entire lifetime)
  • Database connection pools that don't handle connection drops and reconnect
  • Service discovery data that isn't health-checked, pointing to instances that no longer exist

The Java DNS Caching Bug

Java's InetAddress class historically cached successful DNS lookups for the lifetime of the JVM, and still does whenever a security manager is installed. This was a security feature — to prevent DNS spoofing attacks. But in cloud environments where IP addresses change, it means a Java service that resolves a database hostname at startup will keep trying to connect to the original IP even after a failover sends that hostname to a new address. The service appears healthy (no errors starting up) but cannot actually reach the database. The fix is setting networkaddress.cache.ttl to a small value like 60 seconds.

The design principle here is to build clients that treat topology as ephemeral. Use DNS or a service registry rather than hardcoded IPs. Use health checks so that stale entries get removed. Build connection management that can detect a dead connection and re-establish it, rather than holding onto a broken socket indefinitely.
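A Go sketch of what treating topology as ephemeral means for an ordinary HTTP client; the specific timeout values are illustrative assumptions:

    package clients

    import (
        "net"
        "net/http"
        "time"
    )

    // newClient builds an HTTP client that does not pin itself to stale topology:
    // connections are dialed fresh (with a fresh DNS lookup per dial) rather than
    // held forever, so a failover that moves the hostname to a new IP is picked
    // up within seconds instead of never.
    func newClient() *http.Client {
        dialer := &net.Dialer{Timeout: 2 * time.Second}
        return &http.Client{
            Timeout: 5 * time.Second,
            Transport: &http.Transport{
                DialContext:         dialer.DialContext, // DNS resolved per dial
                IdleConnTimeout:     30 * time.Second,   // idle keep-alive sockets age out
                MaxIdleConnsPerHost: 10,
            },
        }
    }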

More broadly, think of your infrastructure as a fleet rather than a set of fixed machines. Any individual machine can go away at any time. Your system should remain functional when it does. This is the philosophy behind immutable infrastructure: instead of updating servers in place (patching, upgrading), you replace them with fresh instances. This forces your architecture to handle topology change correctly — if it doesn't, replacements will break things and you'll find out immediately, rather than during an emergency.

06

"There is one administrator."

Reality: Many teams own different parts of your system. Nobody has the full picture.

In a small system, one person can understand and manage the whole thing. They know where every service is, what every config file says, what changes were made last week. As systems grow, ownership fragments. A dozen teams each own a service. Infrastructure is managed by a platform team. Networking is managed by another team. Security policies are set by a security team. Nobody owns the whole picture.

This creates a specific class of problems called configuration drift. Team A upgrades their TLS version to 1.3. Team B's service still only supports TLS 1.2. They talked to each other at architecture review six months ago and agreed on a plan. But the plan wasn't enforced — it was just agreed on. The upgrade rolled out for Team A. Team B got distracted by a product launch. Now there's a mismatch in production that nobody notices until a service-to-service call silently fails.

The Coordination Tax

Every cross-team dependency that relies on verbal agreement or a shared doc rather than enforced configuration is a coordination tax waiting to be collected. The tax comes due at the worst time — during an incident, when two teams are debugging why a call is failing and neither knows the other changed something three weeks ago.

The engineering answer to this fallacy is policy-as-code: the configuration and rules for how your systems operate are written in code, checked into version control, and enforced automatically. Tools like OPA (Open Policy Agent), service meshes (Istio, Linkerd), and infrastructure-as-code (Terraform, Pulumi) exist specifically to make infrastructure state explicit, versioned, and reviewable — like application code.
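Enforcement can start small. Here is a sketch of a policy check written as a Go test, with a made-up config file and shape; the point is that the agreement fails CI instead of drifting silently:

    package policy

    import (
        "encoding/json"
        "os"
        "testing"
    )

    // ServiceConfig is a made-up shape for the fragment of config being policed.
    type ServiceConfig struct {
        Name          string `json:"name"`
        MinTLSVersion string `json:"min_tls_version"`
    }

    // TestMinimumTLSVersion turns the architecture-review agreement ("everyone
    // on TLS 1.3") into an enforced rule: any service config below the floor
    // fails the build, instead of drifting unnoticed for six months.
    func TestMinimumTLSVersion(t *testing.T) {
        raw, err := os.ReadFile("services.json") // illustrative path
        if err != nil {
            t.Fatal(err)
        }
        var configs []ServiceConfig
        if err := json.Unmarshal(raw, &configs); err != nil {
            t.Fatal(err)
        }
        for _, c := range configs {
            if c.MinTLSVersion != "1.3" {
                t.Errorf("%s allows TLS %s; policy floor is 1.3", c.Name, c.MinTLSVersion)
            }
        }
    }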

At the human level, this fallacy is a reminder that coordination between teams is a first-class engineering problem. API contracts between services need to be versioned and tested. Dependencies need to be explicit. Runbooks need to account for the fact that the person running them at 2am may not be the person who wrote the service.

07

"Transport cost is zero."

Reality: Moving bytes across a network costs time, CPU, and real money.

This fallacy has two distinct components that are often conflated: the computational cost of transport (serialization, protocol overhead, encryption) and the financial cost of transport (what cloud providers charge to move data).

The computational cost is real and measurable. Serializing a request, establishing a connection, encrypting with TLS, and deserializing the response all consume CPU cycles. A naive benchmark will show these costs are small per call. They become visible only when you multiply by millions of calls per hour. Teams have profiled services and found that 20–30% of CPU time was spent in serialization and network I/O overhead — not in their business logic.

The AWS Data Transfer Bill

AWS charges $0.09 per GB for data leaving a region (egress pricing as of 2024). This sounds trivial. But a service processing 10TB of data per day that moves that data across regions is paying $900/day — $330,000/year — just in data transfer. Teams building multi-region architectures have been blindsided by this. The fix is to be deliberate about what data crosses region boundaries, and when possible, keep data processing close to where the data lives.
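The arithmetic is worth scripting as a back-of-envelope check before committing to an architecture. A trivial Go sketch using the numbers above (the rate is an assumption; check your provider's current pricing):

    package main

    import "fmt"

    func main() {
        const (
            perGB    = 0.09     // USD per GB leaving a region (list price cited above)
            gbPerDay = 10_000.0 // 10 TB/day crossing a region boundary
        )
        daily := gbPerDay * perGB
        fmt.Printf("daily: $%.0f  yearly: $%.0f\n", daily, daily*365) // daily: $900  yearly: $328500
    }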

The financial cost has an architectural implication that's easy to miss: where you place your services relative to your data matters. An architecture where Service A writes data to a database in Region 1, Service B in Region 2 reads that data, processes it, and writes the result back to Region 1 is paying egress costs twice, for every operation. This isn't just inefficient — at scale, it's prohibitively expensive.

Beyond cloud billing, there's the opportunity cost of transport. Every millisecond spent moving data between services is a millisecond not spent computing. Architectures that co-locate services and data — running processing close to storage — consistently outperform architectures that treat the network as free.

08

"The network is homogeneous."

Reality: Different systems speak different versions of different protocols with different assumptions.

Homogeneous means "all the same." The fallacy is that you can assume all the systems you're talking to share the same platform, OS, hardware, encoding, and protocol version. In any real distributed system, they don't.

Your Go service sends a timestamp as a Unix millisecond integer. The Python service on the other end expects ISO 8601. Your service uses little-endian byte order; the embedded system you're integrating with uses big-endian. Your JSON encoder writes null for missing fields; the receiver's parser treats null and absent as different things. Your service is still on TLS 1.2; the new service you're calling requires TLS 1.3.
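One defense is to make the contract explicit in the type itself. A Go sketch, with illustrative field names and format choice; nothing is left to convention:

    package main

    import (
        "encoding/json"
        "fmt"
        "time"
    )

    // Event pins its timestamp contract down explicitly instead of relying on
    // "everyone knows": RFC 3339, UTC, millisecond precision, named accordingly.
    type Event struct {
        ID string `json:"id"`
        // OccurredAt is RFC 3339 in UTC, e.g. "2024-05-01T12:00:00.000Z".
        OccurredAt string `json:"occurred_at_utc"`
    }

    func NewEvent(id string, t time.Time) Event {
        return Event{
            ID:         id,
            OccurredAt: t.UTC().Format("2006-01-02T15:04:05.000Z07:00"),
        }
    }

    func main() {
        b, _ := json.Marshal(NewEvent("evt-1", time.Now()))
        fmt.Println(string(b)) // the receiver can validate the format instead of guessing
    }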

Silent Failures Are Worse Than Loud Ones

The most dangerous consequence of the heterogeneity fallacy is silent data corruption. If your service serializes a timestamp incorrectly, you may not get an error — you get wrong data stored silently, processed silently, and reported to users silently. This is worse than a crash. A crash creates an alert. Wrong data creates a slow-burning crisis that surfaces weeks later in a user complaint or a data audit.

The practical responses are:

  • Use schema-enforced serialization formats (Protobuf, Avro) rather than free-form JSON wherever possible. Schemas make the contract explicit and machine-checkable.
  • Test integrations, not just units. Your service may work perfectly in isolation. Integration tests with real downstream systems catch the encoding mismatches that unit tests miss.
  • Be explicit about types. When you pass a timestamp, specify the format, timezone, and precision. Don't rely on "everyone knows" conventions.
  • Version your APIs. When your encoding or contract changes, give the old format a version and support both for a migration window rather than assuming all callers update simultaneously.

The Postel Principle — and Its Limits

Jon Postel's robustness principle says: "Be conservative in what you send, be liberal in what you accept." For a long time this was considered good practice for handling heterogeneous systems. The problem is that being too liberal in what you accept creates ambiguity — if you accept both a string and an integer for a field, callers become confused about which is correct, and you end up with mixed data in your store. The modern view is to be strict about your own output, but document clearly what variations you accept and why.

Why Are These Still Killing Projects?

Given that these were written down thirty years ago, why do they still cause problems? There are three reasons.

First, distributed systems are hard to develop locally. On your laptop, everything runs on a single machine. The network hop to a "remote" service is to localhost. Latency is negligible, failures don't happen, topology is fixed, and everyone is using the same OS. The fallacies don't manifest. You write and test code in an environment that makes all eight assumptions true. Then you deploy to production where none of them are.

Second, the cost is deferred. Code that ignores the network being unreliable works fine 99.9% of the time. You ship, it works, it gets used, the team moves on. Six months later, under higher load, the 0.1% case starts happening often enough to notice. By then the original author may be on a different team. The cost of the assumption is real but it arrives late and diffusely, making it hard to connect to the original design decision.

Third, each fallacy has a "tax" for handling it correctly. Handling unreliable networks means writing retry logic, idempotency keys, and timeout code. That's more work than not writing it. Under schedule pressure, the extra work feels optional because the system works without it — in testing. The tax is hidden until it's due.

The Practical Takeaway

Every time you write a network call, ask: what happens if this call takes 10 seconds? What happens if it never returns? What happens if it succeeds but the response is lost? What happens if it's called twice? These four questions surface the implications of the first four fallacies, and answering them — even just in comments — changes how you write the code.
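That can be as lightweight as answering the questions in comments next to the call they govern. A Go sketch, with a hypothetical reserveStock client:

    package checkout

    import (
        "context"
        "time"
    )

    // reserveStock is a hypothetical downstream call.
    func reserveStock(ctx context.Context, orderID, idempotencyKey string) error { return nil }

    func placeOrder(ctx context.Context, orderID, key string) error {
        // 1. What if this takes 10 seconds? The 800ms deadline aborts the wait
        //    instead of pinning a thread; the caller fails fast or degrades.
        // 2. What if it never returns? The same deadline covers it.
        // 3. What if it succeeds but the response is lost? The idempotency key
        //    makes a retry safe; the server replays the first outcome.
        // 4. What if it's called twice? Deduplication on the same key means the
        //    second call cannot reserve the stock again.
        ctx, cancel := context.WithTimeout(ctx, 800*time.Millisecond)
        defer cancel()
        return reserveStock(ctx, orderID, key)
    }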