Chapter 22 Part V: Maintainability

Testing Distributed Systems

Unit tests pass. Production burns. Here's the gap, and how to close it.

What's in this chapter

Testing distributed systems is qualitatively different from testing a single process. The bugs that matter most — race conditions, partial failures, network splits, clock skew — don't show up in unit tests at all. This chapter walks through a complete testing strategy, from the humble unit test to testing production itself with chaos engineering.

  • Why unit tests have a fundamental blind spot when it comes to distributed systems bugs
  • How property-based testing lets you explore a much larger space of inputs than you could write by hand
  • The difference between integration tests and contract tests — and what each one catches
  • Deterministic simulation testing: how FoundationDB found bugs that would take years to appear in production
  • When chaos testing in production is responsible engineering, and when it's just reckless
  • A concrete testing pyramid that maps test types to the bugs they catch

Key Learnings

If you only have five minutes, read these.

Unit tests don't catch distributed bugs

Race conditions, partial failures, and message reordering require multiple processes and time — things unit tests don't have.

🎲
Property-based testing finds the edge cases you forgot to imagine

You describe invariants that must always hold; a framework generates thousands of inputs trying to break them.

🤝
Contract tests let teams move independently without breaking each other

The consumer defines what it needs from a provider. The provider verifies it satisfies those needs. No big-bang integration needed.

🔬
Deterministic simulation is the gold standard

Run your system in a single process with a fake network and fake clock. Replay any failing scenario deterministically. FoundationDB's entire distributed system is tested this way.

💥
Chaos testing verifies your assumptions about failure

You think you handle network partitions gracefully. Chaos testing checks whether that's actually true — in real production conditions, not a staging environment that behaves differently.

🗺️
Map each test type to the bug class it catches

Don't just "add more tests." Know which type of test catches which type of bug. Unit tests for logic, simulation for timing, chaos for operational assumptions.

Why Distributed Systems Are Hard to Test

Here's a situation every engineer working on distributed systems has been in: the test suite is green, the code review is approved, the staging environment looks fine — and then something breaks in production in a way nobody anticipated.

It's not that the tests were badly written. It's that they were testing the wrong things. Most testing practice was developed for single-process programs, where the main sources of bugs are logic errors and bad inputs. In a distributed system, the most damaging bugs come from somewhere else entirely:

  • Two messages arrive in an order that's possible on the network but that you never tested
  • A node crashes after writing to disk but before sending the acknowledgment
  • A network partition isolates two nodes that both believe they are the leader
  • A slow garbage collection pause makes a heartbeat timeout, triggering a failover that didn't need to happen
  • A downstream service, mid-deploy, serves responses from an old version of its code

None of these bugs show up in a unit test. They require multiple processes, real or simulated time, and the ability to introduce failures at exactly the right moment. That's a different testing problem.

The bugs that matter most in distributed systems are the ones that only appear when something goes wrong at exactly the wrong time.

This chapter builds a testing strategy from the ground up, matching each technique to the class of bugs it's actually good at catching. We'll start with what unit tests can and cannot do, build up through property-based and contract testing, and end with the most powerful approach most teams don't use: deterministic simulation.

The Unit Test Gap

Unit tests are valuable. They run fast, they give immediate feedback, they document how a function is supposed to behave. None of that is in question. The problem is that engineers often rely on them as their primary safety net for distributed systems — and they provide very limited safety for that purpose.

What unit tests miss

A unit test runs one function or one class in isolation, with all its dependencies mocked. This approach is fine for testing business logic. It breaks down for distributed systems because the bugs don't live in individual functions — they live in the interactions between processes over time.

Consider a simple leader election system. You might have a well-tested electLeader() function that correctly picks the node with the highest ID. Unit tests pass. But the bug might live elsewhere: what happens when the current leader receives a vote request from a candidate that is actually behind it, and the leader has no way to tell? That question requires two nodes running simultaneously and a specific message ordering that you didn't happen to test. The unit test for electLeader() tells you nothing about this.

Common Mistake

Mocking the network in unit tests gives you a false sense of coverage. A mock that always responds immediately with the right answer is testing a network that doesn't exist. The real network drops messages, reorders them, and delivers them after arbitrary delays.

The specific things unit tests structurally cannot test:

  • Message ordering. Real networks don't guarantee FIFO delivery. Any bug that only appears with out-of-order messages is invisible to unit tests.
  • Partial failure. When a node crashes mid-operation — after writing to disk but before responding — the system is in a partially updated state. Unit tests don't model crashes at all.
  • Concurrency bugs. A race condition between two goroutines or threads requires timing that unit tests can't reliably reproduce.
  • Emergent behavior. Properties that emerge from multiple components interacting — like whether your system is linearizable — can't be verified by testing components individually.
  • Cascading failures. One service timing out causes another to retry, which causes a third to become overloaded. This cascade only exists when the services are actually connected.

This doesn't mean write fewer unit tests. It means understand what they cover and know you need other tools for the rest.

Property-Based Testing

In a normal test, you write an example: "given this specific input, expect this specific output." The problem is that you can only test inputs you thought of. Bugs often live in the inputs you didn't think of.

Property-based testing turns this around. Instead of writing examples, you describe properties that must always be true, and a framework generates hundreds or thousands of random inputs trying to find a case where the property fails.

The classic example: if you're testing a sorting function, instead of writing "sort([3, 1, 2]) should return [1, 2, 3]", you write the property: "for any list of integers, the output should be sorted and have the same elements as the input." The framework then hammers your function with random lists — empty lists, lists with duplicates, lists with negative numbers, very large lists — until it either finds a failure or gives up.
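
Written as a quick sketch using Go's standard testing/quick package (the example that follows uses Python's Hypothesis for a richer, stateful case), those two properties look like this:

import (
    "sort"
    "testing"
    "testing/quick"
)

// Property: for any list of integers, the sorted output is ordered
// and contains exactly the same elements as the input.
func TestSortProperties(t *testing.T) {
    property := func(xs []int) bool {
        out := append([]int(nil), xs...)
        sort.Ints(out)

        // Property 1: non-decreasing order
        for i := 1; i < len(out); i++ {
            if out[i-1] > out[i] {
                return false
            }
        }

        // Property 2: same elements as the input (compared as multisets)
        counts := map[int]int{}
        for _, x := range xs {
            counts[x]++
        }
        for _, x := range out {
            counts[x]--
        }
        for _, c := range counts {
            if c != 0 {
                return false
            }
        }
        return true
    }
    // quick.Check generates random inputs until the property fails or it gives up
    if err := quick.Check(property, nil); err != nil {
        t.Error(err)
    }
}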

# Property-based test for a key-value store using Hypothesis (Python)
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

# KVStore is the system under test; import it from wherever it lives in your codebase

class KVStoreStateMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.model = {}         # ground truth: a simple dict
        self.store = KVStore()  # system under test

    # Rules are operations the framework can apply in any order
    @rule(key=st.text(), value=st.text())
    def put(self, key, value):
        self.model[key] = value
        self.store.put(key, value)

    @rule(key=st.text())
    def delete(self, key):
        self.model.pop(key, None)
        self.store.delete(key)

    # Invariant: checked after every sequence of operations
    @invariant()
    def store_matches_model(self):
        for key, value in self.model.items():
            assert self.store.get(key) == value

# Expose the state machine as a test case so pytest/unittest will pick it up
TestKVStore = KVStoreStateMachine.TestCase
The framework runs thousands of sequences of put and delete operations in random orders, checking after each operation that the real store matches the simple dictionary model. If it ever finds a sequence that causes a mismatch, it shrinks the sequence to the smallest failing example — usually just a few operations — and reports it.

Testing stateful systems

For distributed systems, the most useful flavor of property-based testing is stateful property testing. You define a simple reference model — often just a Python dict or a list — that represents what the system should be doing, and then you compare the real system's behavior against the model under random operation sequences.

This is powerful because the model is simple enough to be obviously correct, but the system under test can be arbitrarily complex. If they ever disagree, you've found a bug.

Good properties to test in distributed systems:

  • Read-your-writes: after a successful write, a read by the same client should always return that value or a more recent one
  • Monotonic reads: if you read value V, a subsequent read should never return a value older than V
  • Idempotency: performing the same operation twice should leave the system in the same state as performing it once (sketched in code after this list)
  • Commutativity: for operations that should be order-independent (like a counter increment), the final state shouldn't depend on the order they were applied
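
To make one of these concrete, here is a sketch of the idempotency property using Go's standard testing/quick package; any property-based framework works the same way. KVStore, NewKVStore, Put, and Get are hypothetical stand-ins for whatever store you are testing, with Get assumed to return the stored string:

import (
    "testing"
    "testing/quick"
)

func TestPutIsIdempotent(t *testing.T) {
    property := func(key, value string) bool {
        once := NewKVStore()
        once.Put(key, value)

        twice := NewKVStore()
        twice.Put(key, value)
        twice.Put(key, value) // the same operation, applied a second time

        // Idempotency: both stores should end up in the same observable state
        return once.Get(key) == twice.Get(key)
    }
    // quick.Check generates random (key, value) pairs until the property fails or it gives up
    if err := quick.Check(property, nil); err != nil {
        t.Error(err)
    }
}
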
Key Insight

The hardest part of property-based testing is not writing the test framework code — it's figuring out what properties your system should have. If you can't articulate the invariants of your system clearly enough to write them as code, you probably don't understand your system well enough yet. Writing properties is a design tool, not just a testing tool.

Popular property-based testing libraries: Hypothesis (Python), QuickCheck (Haskell, the original), fast-check (TypeScript/JavaScript), jqwik (Java).

Integration Tests vs. Contract Tests

Once you have multiple services talking to each other, the natural instinct is to write integration tests: spin up all the services together and test them end-to-end. This works, but it has costs that compound as the system grows.

Integration tests against real services are slow, flaky, and hard to run locally. More importantly, they don't scale well to microservice architectures. If Service A depends on Services B, C, and D, a full integration test for A requires all four services to be healthy. When the test fails, you don't know which service caused it. And when one service is being refactored, all integration tests that touch it break.

Consumer-driven contract testing

Contract testing solves this by separating the question: "Does Service A correctly use Service B's API?" into two smaller questions:

  1. Does Service A correctly call the API it expects? (verified by A's tests)
  2. Does Service B correctly fulfill the expectations that A has? (verified by B's tests)

The "contract" is the shared artifact that connects these two questions. The consumer (Service A) writes tests that record what it expects from the provider (Service B). Those expectations become a contract file. The provider (Service B) then runs tests that verify it satisfies every contract from every consumer.

Consumer (Service A)                            Provider (Service B)
        │                                               │
        │ 1. Write consumer test                        │
        │    "I expect POST /orders to return           │
        │     { id, status } with 201"                  │
        │                                               │
        │ 2. Test runs against a mock provider          │
        │    ┌──────────────────────────────┐           │
        │    │ Mock records the interaction │           │
        │    └──────────────────────────────┘           │
        │                                               │
        │ 3. Interaction saved as contract              │
        │ ──────────────────────────────────────────────►
        │                                               │
        │                 4. Provider test loads contract
        │                    "Can I satisfy all consumers?"
        │                    ┌───────────────────────────────┐
        │                    │ Run each interaction against  │
        │                    │ the real provider code        │
        │                    └───────────────────────────────┘
        │                                               │
        │        ✓ Contract verified   or   ✗ Breaking change detected

Pact in practice

Pact is the most widely used contract testing framework. The workflow looks like this:

// Consumer test (Service A, using Pact-JS)
const { PactV3, MatchersV3 } = require('@pact-foundation/pact');
const { like } = MatchersV3;
// PaymentClient is the consumer's own client for the provider's API (the code under test)
describe('OrderService consumer', () => {
  const provider = new PactV3({
    consumer: 'OrderService',
    provider: 'PaymentService',
  });

  it('charges a valid card', () => {
    provider
      .given('card 4111111111111111 is valid')
      .uponReceiving('a charge request')
      .withRequest({
        method: 'POST',
        path: '/charges',
        body: { cardNumber: '4111111111111111', amount: 5000 }
      })
      .willRespondWith({
        status: 201,
        body: { chargeId: like('ch_abc'), status: 'succeeded' }
        // like() means: match the type/shape, not the exact value
      });

    return provider.executeTest(async (mockServer) => {
      const client = new PaymentClient(mockServer.url);
      const result = await client.charge({ cardNumber: '4111111111111111', amount: 5000 });
      expect(result.status).toEqual('succeeded');
    });
  });
});

This test runs against a mock and produces a contract file. The PaymentService CI then runs this contract file against the real PaymentService code to verify it still satisfies the expectation. If a PaymentService developer renames the status field to chargeStatus, the contract test catches it immediately — before any integration test fails, and before any consumer is deployed.

The key distinction between integration tests and contract tests:

Dimension                      | Integration Tests                          | Contract Tests
What they catch                | End-to-end correctness, emergent behavior  | API compatibility between two specific services
Speed                          | Slow (real services must run)              | Fast (one side uses a mock)
Flakiness                      | High (network, timing, service health)     | Low (deterministic)
Pinpoints the failure          | No — which service broke?                  | Yes — exactly which consumer expectation
Teams can work independently   | No — requires everyone available           | Yes — async via contract broker
Tests real-world failure modes | Partially                                  | No (happy path only)

The right answer is both: contract tests give you fast, reliable API compatibility checks; integration tests give you end-to-end confidence on the critical happy paths. But if you're only doing integration tests, consider whether you'd be better served by more contract tests and fewer, smaller integration test suites.

Deterministic Simulation Testing

This is the most powerful technique in this chapter, and the one most teams don't use. It's worth understanding deeply.

The core problem with testing distributed systems is non-determinism. When you run your system in a real environment, two runs of the same test might behave differently because of timing differences, message reordering, or random failures. This makes bugs hard to reproduce and hard to reason about.

Deterministic simulation testing eliminates this non-determinism entirely. The idea is:

  1. Run your entire distributed system — all nodes — inside a single process
  2. Replace the real network with a fake one you control completely
  3. Replace the real clock with a fake one you can advance manually
  4. Use a seeded random number generator for all "random" decisions
  5. Inject faults at deterministic moments controlled by a test script

With this setup, every test run is perfectly reproducible. If a test fails with seed 12345, it will fail with seed 12345 every time. You can replay the exact sequence of events that caused the failure, add logging, step through it in a debugger.

How FoundationDB built this

FoundationDB, the distributed database acquired by Apple, built their entire testing infrastructure around deterministic simulation. Their engineers have described finding bugs that would appear in production roughly once every thousand years of machine time — and finding them in test runs that take minutes, by simulating years of operation in compressed time.

The secret is that FoundationDB was designed from the start to be simulatable. Their actor-based concurrency model made it possible to swap out the real network and real clock for simulated versions without changing any of the distributed logic code. Crucially, the simulated network could introduce faults — drop messages, reorder them, duplicate them, partition nodes — and the system had to handle all of it correctly.

The Key Design Decision

FoundationDB's simulation works because they treated simulatability as a first-class design requirement, not an afterthought. The network layer is behind an abstraction. The clock is behind an abstraction. Any "system call" that would make the code non-deterministic is behind an abstraction that can be swapped for a fake. If you design your system this way from the beginning, simulation testing is accessible. If you don't, retrofitting it is very hard.

Building simulation capabilities into your system

You don't have to build FoundationDB-grade simulation to benefit from this approach. Even a modest version is valuable. Here's how to think about it:

Step 1: Make the network injectable. Your nodes should communicate through an abstraction — an interface or trait — rather than directly calling socket APIs. In production, this abstraction wraps TCP. In tests, it wraps a fake message bus you control.

// Go: abstract the network transport.
// (NodeID, Message, Envelope, Node, and Clock are assumed to be defined elsewhere.)
type Transport interface {
    Send(to NodeID, msg Message) error
    Receive() <-chan Envelope
}

// Production: real TCP
type TCPTransport struct { /* ... */ }

// Testing: fake network with controllable faults
type SimNetwork struct {
    nodes       map[NodeID]*Node
    partitioned map[NodeID]bool
    latency     time.Duration
    dropRate    float64
    rng         *rand.Rand  // seeded for determinism
    clock       Clock       // injectable simulated clock (see Step 2)
}

func (n *SimNetwork) Send(to NodeID, msg Message) error {
    if n.partitioned[to] {
        return errors.New("partition")
    }
    if n.rng.Float64() < n.dropRate {
        return nil // silently drop
    }
    // schedule is a simulator helper that queues the callback to run
    // once the simulated clock reaches the given time
    n.schedule(n.clock.Now().Add(n.latency), func() {
        n.nodes[to].deliver(msg)
    })
    return nil
}

Step 2: Make the clock injectable. Never call time.Now() or System.currentTimeMillis() directly from distributed logic. Go through an interface. The simulated clock can jump forward, skip ahead to the next scheduled event, or move in lockstep with the simulation.
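
Continuing the sketch above, a minimal version of that clock abstraction might look like the following. The names here (Clock, RealClock, SimClock, Advance) are illustrative rather than taken from any particular library:

import (
    "sync"
    "time"
)

// The interface the rest of the system depends on
type Clock interface {
    Now() time.Time
}

// Production: delegate to the real wall clock
type RealClock struct{}

func (RealClock) Now() time.Time { return time.Now() }

// Testing: time only moves when the simulation advances it
type SimClock struct {
    mu  sync.Mutex
    now time.Time
}

func (c *SimClock) Now() time.Time {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.now
}

// Advance moves simulated time forward. A fuller simulator would also fire
// any timers or scheduled callbacks that fall before the new time.
func (c *SimClock) Advance(d time.Duration) {
    c.mu.Lock()
    c.now = c.now.Add(d)
    c.mu.Unlock()
}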

Step 3: Write scenario scripts. A test scenario is a sequence of events: start three nodes, elect a leader, inject a network partition between nodes 1 and 2, try to write a value, verify the write either succeeds or returns a clear error, heal the partition, verify the system recovers.

// A simulation test scenario for leader election under partition.
// leaderElected, clusterConverged, and electionTimeout are assumed helpers/constants in the test package.
func TestLeaderElectionSurvivesPartition(t *testing.T) {
    sim := NewSimulation(/* seed: */ 42)

    sim.StartCluster(3)
    sim.AdvanceUntil(leaderElected)

    leader := sim.CurrentLeader()
    follower := sim.PickFollower()

    // Isolate leader from one follower
    sim.Partition(leader, follower)
    sim.Advance(2 * electionTimeout)

    // A new leader should have been elected among the majority
    newLeader := sim.CurrentLeader()
    if newLeader == leader {
        t.Fatal("old leader still in charge despite partition")
    }

    // Heal and verify convergence
    sim.HealPartition(leader, follower)
    sim.AdvanceUntil(clusterConverged)

    if !sim.AllAgreeOnLeader() {
        t.Fatal("cluster did not converge after partition healed")
    }
}

Step 4: Run with many seeds. One simulation test with 1000 different seeds explores 1000 different interleavings of messages and failures. This is orders of magnitude more coverage than you could achieve with manual test cases.
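
A sketch of what that can look like, assuming the scenario above has been extracted into a helper that takes a seed (runPartitionScenario is hypothetical):

import (
    "fmt"
    "testing"
)

func TestLeaderElectionSurvivesPartition_ManySeeds(t *testing.T) {
    for seed := int64(1); seed <= 1000; seed++ {
        seed := seed // capture for the subtest closure
        t.Run(fmt.Sprintf("seed=%d", seed), func(t *testing.T) {
            // Each seed produces a different, fully reproducible interleaving.
            // A failing subtest names its seed, which is all you need to replay the run.
            runPartitionScenario(t, seed)
        })
    }
}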

Why This Finds Bugs Others Miss

A simulation can inject a crash at every possible point in a code path. A message drop between steps 3 and 4 of a 10-step protocol. A clock skew that makes a heartbeat arrive 1ms too late. These aren't scenarios you'd think to write by hand — but a simulator running thousands of random variations will find them. The bugs it catches are the ones that appear in production once every six months and cause an incident.

Chaos Testing in Production

Even the best simulation can't fully replicate production. The actual hardware, the actual traffic patterns, the actual load, the interactions between dozens of services you don't control — these can only be tested by running in production. Chaos engineering is the practice of deliberately introducing failures into production to verify that your systems handle them correctly.

The original chaos engineering work came from Netflix's Chaos Monkey, which randomly terminated EC2 instances in production. The insight was simple: if your system is supposed to be resilient to instance failures, you should verify that it actually is — not just in theory, but under real conditions, with real traffic.

When chaos testing is responsible

Chaos engineering is often misunderstood as "break things randomly and see what happens." Done that way, it's reckless. Done correctly, it's a rigorous experimental method.

Before running a chaos experiment, you need four things:

  1. A steady-state hypothesis. What does "normal" look like? Define it in terms of measurable metrics: p99 latency under 200ms, error rate below 0.1%, order completion rate above 99.9%. Write this down before the experiment.
  2. A specific failure to inject. Not "random failures." A specific, controlled failure: "we will terminate one instance in the payments service in us-east-1."
  3. A way to measure the impact. Have your dashboards open before you start. Know which metrics to watch.
  4. A way to stop immediately. If things go wrong faster than expected, you need a kill switch.

The experiment then has a simple structure:

  1. Measure steady-state metrics
  2. Inject the failure
  3. Measure whether metrics deviate from steady state
  4. Restore normal conditions
  5. Document what you learned

If the metrics don't deviate, you've gained confidence that your system handles that failure class correctly. If they do deviate, you've found a real gap in your resilience — before a real failure found it for you.
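
To make that structure concrete, here is a heavily hedged sketch of an experiment harness. Every helper in it (steadyStateHealthy, injectFailure) is a placeholder for your own metrics checks and failure-injection tooling, not the API of any real chaos framework:

import (
    "context"
    "errors"
    "time"
)

func runChaosExperiment(ctx context.Context) error {
    // 1. Verify steady state before touching anything
    if !steadyStateHealthy() {
        return errors.New("abort: system is not in steady state, do not inject failure")
    }

    // 2. Inject one specific, controlled failure
    stop, err := injectFailure("terminate one payments instance in us-east-1")
    if err != nil {
        return err
    }
    defer stop() // 4. kill switch / restore: always runs, even on early return

    // 3. Watch the steady-state metrics for the duration of the experiment
    ticker := time.NewTicker(15 * time.Second)
    defer ticker.Stop()
    deadline := time.After(10 * time.Minute)

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-deadline:
            return nil // steady state held for the whole window: hypothesis confirmed
        case <-ticker.C:
            if !steadyStateHealthy() {
                return errors.New("hypothesis violated: metrics deviated from steady state")
            }
        }
    }
}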

Before You Run Chaos in Production

Start in a non-production environment first, even if it's less realistic. Build your chaos tooling. Practice the experiment. Understand what "normal" looks like. Only move to production when you can answer: what's the worst that can happen, and are we prepared for it? Chaos engineering is not appropriate for a service with no on-call coverage, no dashboards, and no runbooks.

Game days: structured chaos

A game day is a planned event where a team deliberately runs failure scenarios against their system, with the whole team present and watching. It's more structured than continuous chaos testing — it's a periodic drill.

A good game day scenario has a narrative: "An AZ goes down at 2pm on a Tuesday. What happens? Walk through what alerts fire, what the on-call person sees, what they do. Did it go as expected? What surprised you?"

Game days are valuable for testing things that chaos tooling can't easily automate:

  • Human response time — does the alert fire fast enough to be useful?
  • Runbook quality — is the runbook up to date and clear enough to follow under stress?
  • Coordination — do the right people know to talk to each other?
  • Recovery procedures — have you practiced the actual steps to restore service?

The goal is not to find bugs (though you will). The goal is to build the team's operational muscle memory so that when a real incident happens, the response is practiced and calm rather than chaotic and improvised.

Putting It All Together: A Testing Strategy

Each test type catches a different class of bug. The table below maps them explicitly:

Test Type            | Bugs It Catches Well                                                   | Bugs It Misses                                                 | Cost
Unit tests           | Logic errors, bad inputs, edge cases in algorithms                     | Timing, ordering, partial failure, concurrency                 | Very low
Property-based tests | Invariant violations, unexpected input combinations                    | Multi-node interactions, real failure modes                    | Low
Contract tests       | API compatibility breaks between services                              | Business logic, failure handling, latency                      | Low–Medium
Integration tests    | End-to-end happy path, service wiring                                  | Failure edge cases, high load, real timing                     | Medium–High
Simulation tests     | Timing bugs, message reordering, crash recovery, protocol correctness  | Real hardware issues, traffic patterns, external dependencies  | High (setup), Low (per test)
Chaos / game days    | Operational assumptions, recovery procedures, cascading failures       | Logic bugs, API compatibility                                  | High

A practical testing strategy layers these techniques by risk and investment:

Confidence Pyramid

        ▲
        │  Chaos / Game Days (quarterly)
        │    Test operational assumptions
        │    in real production conditions
        │
        │  Simulation Tests (per feature)
        │    Test correctness of distributed
        │    protocols under injected faults
        │
        │  Integration + Contract Tests (per PR)
        │    Test service wiring and
        │    API compatibility
        │
        │  Property-Based + Unit Tests (always, per function)
        │    Test logic, invariants,
        └─────────────────────────────  edge cases

A few practical notes on adopting this:

  • Don't try to add all layers at once. Start with strong unit and property-based tests. Add contract tests when you have more than two services. Add simulation testing when you're building or maintaining a stateful distributed protocol. Add chaos engineering when you have dashboards, runbooks, and on-call coverage in place.
  • Test flakiness is a first-class bug. A test that passes 95% of the time is not passing — it's hiding a non-determinism problem that will eventually cause a real incident. Track flaky tests, fix them, and don't let them accumulate.
  • Your test environments lie to you. Staging environments almost never have realistic traffic patterns, realistic data volumes, or realistic failure rates. Be conscious of what your staging tests are actually proving — and what they can't prove about production behavior.
  • Test the observability too. During a game day or chaos experiment, check that the right alerts fire. A system that handles a failure gracefully but doesn't alert is still dangerous, because no human knows something went wrong.

The best time to find a distributed systems bug is in a simulator at 3pm. The worst time is during an on-call incident at 3am. Your testing strategy is the difference between these two outcomes.

Chapter Summary

The key principle

  • Unit tests verify logic. Simulation tests verify distributed protocol correctness. Chaos tests verify operational assumptions. Use each for what it's actually good at.

The most common mistake

  • Treating a green unit test suite as safety coverage for a distributed system. It is not. The bugs that cause incidents are timing, ordering, and failure bugs that unit tests structurally cannot catch.

Three questions for your next design review

  • Which of our design assumptions would a network partition invalidate — and do we have a test that simulates a partition?
  • If service X deploys a breaking API change today, how long before we know? (Contract tests answer this in minutes.)
  • Have we ever practiced the recovery procedure for our most likely failure mode, end-to-end, with the actual on-call team?