Chapter 20 Part V — Maintainability

Service Boundaries —
The Decision That's Hardest to Undo

Where you draw the lines determines how your system ages.

You can rename a variable in five minutes. You can rewrite a function in an afternoon. But changing where one service ends and another begins — that takes months, breaks teams, and often doesn't get done at all. Service boundaries are the most permanent decisions in a distributed system. This chapter is about how to make them well the first time, and what to do when you didn't.

Key Learnings

Short on time? Read these. Come back for the depth later.

01 A good service boundary means you can change one service without touching another. If changing service A routinely requires changing service B, your boundary is wrong.
02 Conway's Law is not just a warning — it's a design tool. The org structure you have will produce a matching architecture. If you want a different architecture, change the org first.
03 The Bounded Context from Domain-Driven Design is the most useful concept for finding service boundaries. One service = one bounded context. Don't let two contexts share a database.
04 A distributed monolith is worse than a regular monolith. It has the operational complexity of microservices and the tight coupling of a monolith. Teams build it all the time by accident.
05 Data ownership is the real service boundary. If two services can write to the same table, you don't have two services — you have one system with a split codebase.
06 Cross-service transactions don't exist in the ACID sense. Your options are: avoid them by design, use sagas, or accept eventual consistency. There is no fourth option.
07 Most teams with fewer than 50 engineers should not be running microservices. Start with a well-structured monolith. Split when you have a specific reason, not because it "feels right".
08 The modular monolith is an underrated middle ground. Strong internal module boundaries, single deployment unit. You can split later — when you actually know where the seams are.

Why Boundaries Are Hard to Change

Think about the last time a team at your company decided to split a service in two, or merge two services into one. How long did it take? Weeks? Months? Did it finish at all?

Service boundaries are sticky in a way that most other decisions aren't. Once a service is running in production, other services start calling it. Data gets stored in its database. Teams form around it. On-call rotations are set up. Documentation is written. SDK clients get published. And each of these things is a reason why changing the boundary later is expensive.

Compare this to, say, choosing a variable name, or even picking a library. You can change those quietly, in a PR, with no one noticing. Changing a service boundary is a project. Sometimes it's a multi-quarter project with its own design doc, migration plan, and stakeholder sign-off.

This asymmetry — easy to set, hard to change — is the reason service boundaries deserve more thought than most engineering decisions. You will live with this choice for years.

What Makes a Boundary "Good"?

A good service boundary has one defining property: you can change one service without touching another. That's it. If you're routinely opening PRs in two repositories at once to ship a single feature, your boundary is in the wrong place.

There's a more formal way to say this. A good boundary has high cohesion (things that change together are inside the same service) and loose coupling (things that need to change independently are in different services). These two properties pull in opposite directions. Getting both right is the whole challenge.

Insight

High cohesion and loose coupling are not properties of a service in isolation. They're properties of a service in relation to the other services around it. You can't evaluate a boundary without also looking at what's on either side of it.

A secondary property of a good boundary: the service should own its own data. This is so important it gets its own section later. But at a high level — if two services share a database table, they are not truly separate. One can break the other's assumptions silently, through a schema change or a data migration. True independence requires data independence.

Conway's Law: Not a Warning, a Tool

In 1967, Melvin Conway made an observation: "Organizations which design systems are constrained to produce designs which are copies of the communication structures of those organizations."

This became known as Conway's Law. More than fifty years later, it is still the single most accurate prediction you can make about how a system will evolve. Not because it's inevitable, but because communication overhead shapes decisions. Engineers on the same team talk constantly. Engineers on different teams talk less. So systems end up with clean interfaces where team boundaries are, and messy coupling where they aren't.

Conway's Law in practice
  Organization A:                   System A produces:

  ┌─ Team: Auth ────────┐           ┌─ Auth Service ──────┐
  │  Alice, Bob, Carlos │     →     │  Clean API surface  │
  └─────────────────────┘           └─────────────────────┘
           │ (weekly sync)                   │ (well-defined)
  ┌─ Team: Payments ────┐           ┌─ Payments Service ──┐
  │  Dana, Eve          │     →     │  Clean API surface  │
  └─────────────────────┘           └─────────────────────┘

  Organization B:

  ┌─ Team: Backend ─────────────────┐
  │  Alice, Bob, Carlos, Dana, Eve  │    →    ┌─ Monolith ──────────────┐
  │  (all features, all domains)    │         │  Everything coupled to  │
  └─────────────────────────────────┘         │  everything else        │
                                              └─────────────────────────┘
    

The Inverse Conway Maneuver

If Conway's Law says your org structure produces your architecture, then the implication is: if you want a different architecture, change your org structure first. This is called the Inverse Conway Maneuver, and it's one of the most actionable ideas in this book.

If you want an independent Auth service that can deploy on its own, make sure there's a team that owns Auth end-to-end — writing the code, operating it, and responsible for its reliability. If ownership is split (backend team writes the auth logic, platform team runs the infrastructure, security team reviews the policies), you will get a service with three masters and no clear boundary.

This sounds obvious. In practice, engineering leadership often designs the architecture first and the org second — or doesn't change the org at all and just wonders why the services don't behave independently. You cannot architect your way out of an org structure problem.

Common Mistake

Designing the "target architecture" on a whiteboard, then trying to staff teams around it. The architecture will drift toward your existing org structure regardless. If the two don't match, you'll get a compromise that satisfies neither goal.

Domain-Driven Design and the Bounded Context

Domain-Driven Design (DDD) is a large body of work, and most of it is not directly useful for deciding where to draw service lines. But one concept from DDD is genuinely essential: the Bounded Context.

A Bounded Context is a part of the system where a model — a set of concepts and their relationships — has a consistent, unambiguous meaning. Outside that context, the same word may mean something different.

Here's a concrete example. Consider the word "customer" in a large e-commerce company:

  • To the Orders team, a customer is someone with a cart, a billing address, and a payment method.
  • To the Loyalty team, a customer is a points balance, a tier, and a join date.
  • To the Fraud team, a customer is a risk score, a device fingerprint, and a flag history.
  • To the Support team, a customer is a ticket history and a contact record.

These are all "customers", but they are completely different models. If you build a single Customer service that tries to satisfy all four of these, you get a bloated service that's hard to change, because any change might affect any of the four consumers. The four are separate Bounded Contexts, and each deserves its own service (or at least its own module).

The "Customer" word means different things in different contexts
              ┌────────────────────────────────────────────────────┐
              │  The word "Customer" across bounded contexts       │
              └────────────────────────────────────────────────────┘

  ┌─ Order Context ──────┐   ┌─ Loyalty Context ────┐   ┌─ Fraud Context ──────┐
  │ Customer {           │   │ Customer {           │   │ Customer {           │
  │   id                 │   │   id                 │   │   id                 │
  │   billing_address    │   │   points_balance     │   │   risk_score         │
  │   payment_method     │   │   tier: "Gold"       │   │   device_fingerprint │
  │   cart_id            │   │   joined_date        │   │   flagged: bool      │
  │ }                    │   │ }                    │   │ }                    │
  └──────────────────────┘   └──────────────────────┘   └──────────────────────┘

        Same word.        Completely different models.     Different data owners.
    

One Service, One Bounded Context

The practical rule is simple: one service should correspond to one Bounded Context. Not one-to-many (one service trying to serve multiple contexts), and not many-to-one (multiple services sharing a single context and its data store).

How do you find your Bounded Contexts? A few techniques that work:

Event Storming. Gather the team in a room (or a virtual whiteboard). Write down every significant event that happens in the system — things like "order placed", "payment failed", "account suspended". Cluster the events that naturally belong together. The clusters are your Bounded Contexts.

The ubiquitous language test. Talk to people in different parts of the business. When they start using the same word to mean different things, or using different words for the same thing, you've found a context boundary.

The "what breaks when this changes" test. If you change the schema of a data entity, which teams need to know? If the answer is more than one team, that entity is probably spanning multiple contexts and you have a boundary problem.

Practical tip

The right time to identify Bounded Contexts is before you write any code. But if you already have a big system, you can do it retroactively by looking at which parts of the codebase change together. Files that always appear in the same pull request are in the same context, whether or not the architecture reflects that.
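That retroactive analysis is easy to automate. Here's a rough sketch that mines git history for files that tend to change in the same commit; the 500-commit window and the top-20 cutoff are arbitrary choices, not a prescription:

Mining git history for co-changing files (sketch)
  import subprocess
  from collections import Counter
  from itertools import combinations

  # Pull the file lists of the last 500 commits, separated by a marker.
  log = subprocess.run(
      ["git", "log", "-500", "--name-only", "--pretty=format:--commit--"],
      capture_output=True, text=True, check=True,
  ).stdout

  pair_counts = Counter()
  for commit in log.split("--commit--"):
      files = sorted({line for line in commit.splitlines() if line.strip()})
      # Count every pair of files that changed in the same commit.
      for pair in combinations(files, 2):
          pair_counts[pair] += 1

  # Pairs that co-change constantly probably belong to the same context.
  for (a, b), n in pair_counts.most_common(20):
      print(f"{n:4d}  {a}  <->  {b}")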

The Distributed Monolith — The Worst of Both Worlds

A distributed monolith is a system that has been split into multiple services, but where those services are so tightly coupled that they must be deployed together, tested together, and changed together. It looks like microservices from the outside. It behaves like a monolith. It has the operational overhead of microservices and none of the independence benefits.

Teams build distributed monoliths by accident. They start with good intentions — splitting a large system into smaller pieces — but make the split along the wrong lines. Here are the most common ways it happens:

Shared Database

This is the most common cause. Two services write to the same database. They call each other's tables directly, bypassing any API. Service A adds a column; Service B breaks at runtime because it didn't expect that column. The services look separate in code but are actually one system at the data layer.

Distributed monolith via shared database
  ┌─ Service A ──┐    ┌─ Service B ──┐    ┌─ Service C ──┐
  │  (Orders)    │    │  (Inventory) │    │  (Shipping)  │
  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                    ┌────────▼───────┐
                    │   Shared DB    │
                    │  (all tables)  │
                    └────────────────┘

  ← Looks like microservices. Behaves like a monolith.
    Any schema change requires coordinating all three teams.
    

Synchronous Chains

Service A calls Service B calls Service C calls Service D to serve a single user request. This is a synchronous call chain. If Service D is slow, Service A is slow. If Service C is down, the whole request fails. You've distributed failure instead of isolating it.

Deep call chains (more than two or three hops for a synchronous request) are usually a sign that your services are sliced too thin, or that you've created services around technical layers (a "data service", an "orchestration service") rather than around business domains.
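The cost of a chain compounds faster than intuition suggests. A quick back-of-the-envelope calculation:

Availability along a synchronous chain
  # Availability compounds multiplicatively along a synchronous chain.
  # Four services at 99.9% each give roughly 99.6% end to end, and
  # that's before timeouts and retries add latency of their own.
  per_service = 0.999
  chain_length = 4
  print(f"{per_service ** chain_length:.4f}")  # 0.9960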

Shared Library That Carries Business Logic

Teams extract shared code into a common library. So far, so good. But then that library starts growing business logic — validation rules, pricing calculations, domain models. Now, when the business rules change, every service that uses the library needs to update and redeploy together. You've recreated tight coupling, just through a dependency manager instead of a monorepo.

Warning

Shared libraries are fine for purely technical concerns: logging, HTTP clients, config parsing, retry logic. They are dangerous for business logic. If changes to the library require coordinated deploys across services, you've turned a library into a source of coupling.

Chatty Services

Two services that make a high number of synchronous calls to each other are operationally coupled even if they're architecturally separate. They effectively move together — if one goes down or slows, the other degrades. This is usually a sign that the boundary is drawn in the wrong place and these two services should be one.

Data Ownership — The Real Boundary

Here is a simple test for whether a service boundary is real: can the service change its internal data model without asking for permission from any other team?

If the answer is no — if changing a table requires a migration in another service, or a conversation with another team — then the boundary is not real. The services are coupled at the data layer even if they're separated at the code layer.

The Rule

Each piece of data should have exactly one service that owns it. That service is the single writer. It's the source of truth. Other services that need that data have three options:

  1. Ask via API. Call the owning service to read or write. Clean separation, but introduces runtime dependency and latency.
  2. Subscribe to events. The owning service publishes events when data changes. Other services maintain their own local copy of the subset they need. Eventually consistent, but operationally independent.
  3. Accept that you don't own that data. If you find yourself copying the same data into multiple services with multiple writers, that's a sign you need to reconsider the boundaries.
Data ownership: wrong vs. right
  ✗ WRONG — Multiple writers to the same data

  ┌─ Orders ─────┐       ┌─ Inventory ──┐
  │  writes to   │       │  writes to   │
  │  products    │       │  products    │
  │  table       │       │  table       │
  └──────┬───────┘       └──────┬───────┘
         └──────────┬───────────┘
                    ▼
           ┌─ products table ─┐
           │  Who owns this?  │
           │  Anyone? Nobody? │
           └──────────────────┘


  ✓ RIGHT — Single writer, others read via API or events

  ┌─ Catalog Service ────────────────────────┐
  │  owns products table                     │
  │  publishes: ProductUpdated events        │
  └────────────────┬─────────────────────────┘
                   │
         ┌─────────┴──────────┐
         ▼                    ▼
  ┌─ Orders ──────┐    ┌─ Inventory ──┐
  │ reads via API │    │ local cache  │
  │ no direct DB  │    │ from events  │
  └───────────────┘    └──────────────┘
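Here's option 2 from the list above in miniature: a consumer that maintains a local copy from events. The ProductUpdated event shape, its field names, and the in-memory store are illustrative assumptions, not a real API:

Option 2 in miniature: consuming events into a local copy (sketch)
  import json

  # Inventory's own local copy. The Catalog service remains the single
  # writer and source of truth; this copy is read-only and eventually
  # consistent.
  local_products = {}

  def handle_product_updated(raw_event: str) -> None:
      event = json.loads(raw_event)
      # Store only the subset of fields Inventory actually needs.
      local_products[event["product_id"]] = {
          "sku": event["sku"],
          "active": event["active"],
      }

  handle_product_updated('{"product_id": "p1", "sku": "SKU-1", "active": true}')
  print(local_products["p1"])  # reads never cross a service boundary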
    

The Hidden Cost of Shared Data

When services share data, the most dangerous problems are the ones that don't cause immediate errors. They cause subtle correctness issues.

Imagine Service A and Service B both write to a user preferences table. Service A stores the user's notification settings. Service B stores the user's display settings. A developer on Team B "cleans up" a column that they think is unused. It was the notification frequency column that Service A was reading. No compile error. No test failure. Users stop receiving notifications. The on-call alert fires at 2am.

This is not a hypothetical. It happens in every large system that shares databases across teams. The fix is data ownership, not better coordination. Coordination fails at scale. Ownership doesn't.

Cross-Service Transactions: The Ugly Truth

Here's a situation that comes up constantly: you have two services, and you need to perform an operation that must update data in both, atomically. Either both updates happen, or neither does.

For example: when a user places an order, you need to (1) create the order in the Order service and (2) decrement the inventory count in the Inventory service. These two things must happen together. An order without a matching inventory decrement is a problem. An inventory decrement without an order is also a problem.

In a monolith with a single database, this is easy. You open a transaction, do both updates, commit. If anything fails, the transaction rolls back.

In a distributed system with two separate services and two separate databases, you cannot do this. There is no distributed transaction that gives you the same guarantees without serious costs.

Option 1: Two-Phase Commit (2PC)

Two-phase commit is the classical answer. A coordinator tells both services to "prepare" (get ready to commit), then tells them both to "commit". If any participant fails during prepare, the coordinator tells everyone to roll back.
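In outline the protocol is simple, which is part of its appeal. Below is a minimal local sketch; note that the real difficulty is exactly what a local sketch hides: crashes and lost messages between the two phases.

Two-phase commit in outline (local sketch)
  class Participant:
      """A stand-in for a remote resource (database, queue, service)."""
      def prepare(self) -> bool:
          return True   # vote yes: "I can commit if asked"
      def commit(self) -> None: ...
      def rollback(self) -> None: ...

  def two_phase_commit(participants: list) -> bool:
      # Phase 1: everyone votes. Participants hold locks from here on.
      if all(p.prepare() for p in participants):
          # Phase 2: everyone commits. If the coordinator dies right here,
          # participants are stuck holding locks: the classic 2PC failure.
          for p in participants:
              p.commit()
          return True
      for p in participants:
          p.rollback()
      return False

  print(two_phase_commit([Participant(), Participant()]))  # True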

In theory, this gives you atomicity across services. In practice, it has problems that make most teams avoid it:

  • The coordinator is a single point of failure. If it crashes between the prepare and commit phases, participants are left holding locks, unable to either commit or roll back.
  • Participants hold locks for the full duration of the protocol, so throughput degrades as soon as any participant is slow.
  • Every participant must support the protocol, and many modern datastores, message brokers, and third-party APIs don't.
  • Even when it works, it ties the availability of the operation to the availability of every participant at once.

Common Mistake

Reaching for distributed transactions when the real fix is rethinking the service boundary. If you frequently need atomic operations across two services, those two things probably belong in the same service.

Option 2: Sagas

A saga is a sequence of local transactions, where each step publishes an event or message that triggers the next step. If any step fails, you run compensating transactions to undo the previous steps.

For the order example:

  1. Order service creates the order in state "pending".
  2. Order service publishes "OrderCreated" event.
  3. Inventory service receives the event, reserves the inventory, publishes "InventoryReserved".
  4. Order service receives "InventoryReserved", transitions order to "confirmed".
  5. If Inventory service fails to reserve (out of stock), it publishes "InventoryFailed".
  6. Order service receives "InventoryFailed", cancels the order — the compensating transaction.
Saga pattern — choreography style
  ┌─ Order Service ──────────────────────────────────────────────┐
  │  1. Create order (pending)                                    │
  │  2. Publish: OrderCreated ────────────────────────────────►  │
  │                                              ┌─ Inventory ─┐ │
  │                                              │  3. Reserve │ │
  │                                              │  inventory  │ │
  │  ◄────────────────── InventoryReserved ───── │  4. Publish │ │
  │  5. Confirm order                            └─────────────┘ │
  └──────────────────────────────────────────────────────────────┘

  Failure path:
  ┌─ Order Service ──────────────────────────────────────────────┐
  │  1. Create order (pending)                                    │
  │  2. Publish: OrderCreated ────────────────────────────────►  │
  │                                              ┌─ Inventory ─┐ │
  │                                              │  Out of     │ │
  │  ◄───────────────────── InventoryFailed ──── │  stock!     │ │
  │  3. Cancel order  ◄─ compensating txn        └─────────────┘ │
  └──────────────────────────────────────────────────────────────┘
    

Sagas work well when you accept that consistency is eventual, not immediate. Between step 2 and step 4, the system is in a temporarily inconsistent state — the order exists but inventory hasn't been reserved yet. That's usually acceptable for business processes, as long as the eventual outcome is always consistent.
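Here is the choreography compressed into runnable form. Direct function calls stand in for the event bus, and dictionaries stand in for each service's private store; it's an illustrative sketch, not a framework:

Choreography saga in code (sketch)
  import uuid

  # Each service's private store; dictionaries stand in for databases.
  orders = {}
  stock = {"widget": 3}

  def place_order(product: str, qty: int) -> str:
      order_id = str(uuid.uuid4())
      orders[order_id] = {"product": product, "qty": qty, "state": "pending"}
      on_order_created({"order_id": order_id, "product": product, "qty": qty})
      return order_id

  # Inventory service: reacts to OrderCreated.
  def on_order_created(event: dict) -> None:
      if stock.get(event["product"], 0) >= event["qty"]:
          stock[event["product"]] -= event["qty"]
          on_inventory_reserved({"order_id": event["order_id"]})
      else:
          on_inventory_failed({"order_id": event["order_id"]})

  # Order service: the success path and the compensating path.
  def on_inventory_reserved(event: dict) -> None:
      orders[event["order_id"]]["state"] = "confirmed"

  def on_inventory_failed(event: dict) -> None:
      orders[event["order_id"]]["state"] = "cancelled"   # compensating step

  oid = place_order("widget", 2)
  print(orders[oid]["state"])   # confirmed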

Option 3: Redesign the Boundary

If you keep needing cross-service transactions between the same two services, the right answer might be to move those operations into a single service. The need for atomic operations is often a signal that the data you're trying to split actually belongs together.

There's no shame in merging two services. The goal was never "more services." The goal was "independent deployability and clean ownership." If combining two services gives you that better than keeping them separate, combine them.

When Microservices Are the Wrong Choice

There is a certain prestige attached to microservices. The reasoning usually goes: companies with serious engineering organizations — Netflix, Amazon, Uber — run microservices; therefore, running microservices is a sign of a serious engineering organization.

This reasoning is backwards. Those companies run microservices because they had specific problems — hundreds of engineers stepping on each other's code, deployment bottlenecks, the need for independent scaling — that microservices solved. The problems came first. The architecture was the solution.

If you don't have those problems, microservices are not a solution. They're overhead.

The Real Costs That People Underestimate

Distributed tracing is not optional. When a request touches eight services and something goes wrong, you need distributed tracing to understand what happened. Setting up and maintaining this infrastructure takes real engineering time.

Network failures become part of your everyday programming model. Every service call can fail, time out, or return unexpected results. You need retry logic, circuit breakers, and timeout handling everywhere. This is not complex code, but it is a lot of code — and it needs to be right.
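To make "a lot of code" concrete: here is the smallest version of the pattern, a retry wrapper with a timeout and exponential backoff, sketched with the widely used requests library. Real systems layer circuit breakers, jitter, and retry budgets on top:

Retry with timeout and backoff (sketch)
  import time
  import requests

  def call_with_retries(url: str, attempts: int = 3, timeout_s: float = 2.0):
      """GET a JSON endpoint with a timeout and exponential backoff."""
      for attempt in range(attempts):
          try:
              resp = requests.get(url, timeout=timeout_s)
              resp.raise_for_status()
              return resp.json()
          except requests.RequestException:
              if attempt == attempts - 1:
                  raise   # out of retries; let the caller decide what to do
              time.sleep(0.1 * 2 ** attempt)   # 100ms, 200ms, 400ms...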

Local development is painful. Running one service locally is easy. Running ten services locally — each with its own dependencies, its own database, its own config — is a logistics problem. Teams spend significant time on this that they wouldn't spend with a monolith.

End-to-end testing is hard. Testing a single service in isolation is easy. Testing that seven services work together correctly requires a shared test environment, which requires coordination, which creates bottlenecks.

You now operate N databases instead of one. Database backups, migrations, capacity planning, failover — multiply all of that by the number of services.

  Factor                        Monolith                     Microservices
  ──────────────────────────────────────────────────────────────────────────────────────
  Local development             Simple                       Complex, requires orchestration
  Deployment independence       All-or-nothing               Per-service
  Debugging across the system   Stack traces are complete    Need distributed tracing
  Scaling specific components   Scale everything             Scale only what's hot
  Team independence             Coordination required        Deploy without asking
  Operational overhead          Low                          High (N services to operate)
  Cross-cutting changes         Easy — one codebase          Requires coordinating N teams
  Fault isolation               One bug can bring down all   Failures are contained

The Inflection Point

Microservices start paying off when the cost of coordination in a monolith exceeds the operational overhead of distributed services. In practice, this inflection point is usually somewhere around 50-100 engineers actively working in the same codebase, or when specific components have genuinely different scaling requirements (a video encoding service needs very different resources than a user authentication service).

Below that inflection point, the overhead is usually not worth it. A well-structured monolith — where code is organized into clear modules with clean interfaces — can serve a large product for years. Instagram ran as a monolith with a handful of engineers while handling tens of millions of users. Shopify still runs largely as a monolith and processes more transactions than many banks.

The Modular Monolith: An Underrated Middle Ground

A modular monolith is a single deployable unit where the code is organized into clearly separated modules, each with its own public interface. Modules can only communicate through that public interface — not by calling each other's internal functions or accessing each other's tables directly.

This gives you most of the benefits of microservices (clear boundaries, team ownership, separation of concerns) without the operational overhead (separate deployments, network calls, distributed tracing, multiple databases).

Modular monolith structure
  ┌─ Application (single deployment) ─────────────────────────────┐
  │                                                               │
  │  ┌─ Orders Module ─────┐      ┌─ Inventory Module ─────┐      │
  │  │  Internal code      │      │  Internal code         │      │
  │  │  ─────────────      │      │  ─────────────         │      │
  │  │  Public interface:  │ ───► │  Public interface:     │      │
  │  │  createOrder()      │      │  reserveStock()        │      │
  │  │  cancelOrder()      │ ◄─── │  releaseStock()        │      │
  │  │                     │      │  getStockLevel()       │      │
  │  └─────────┬───────────┘      └───────────┬────────────┘      │
  │            │                              │                   │
  │  ┌─────────▼──────────────────────────────▼────────────────┐  │
  │  │  Shared Database (but modules own separate schemas)     │  │
  │  └──────────────────────────────────────────────────────────┘ │
  └───────────────────────────────────────────────────────────────┘

  Rule: Module A cannot access Module B's database tables directly.
        It can only call Module B's public interface functions.
    

The crucial discipline: module boundaries must be enforced, not just suggested. If any code can call any other code, you don't have a modular monolith. You have a monolith with aspirations. Use language-level visibility controls, or linting rules, or automated tests that verify no module imports another module's internal packages.
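Here's what enforcement can look like as an automated test. It assumes a hypothetical layout where each module lives under app/<module>/ and exposes its public interface in api.py; both the layout and the rule are assumptions to adapt:

Enforcing module boundaries with a test (sketch)
  import ast
  import pathlib

  MODULES = {"orders", "inventory"}   # hypothetical layout: app/<module>/

  def test_no_cross_module_internal_imports():
      for path in pathlib.Path("app").rglob("*.py"):
          owner = path.parts[1]   # which module this file belongs to
          tree = ast.parse(path.read_text())
          for node in ast.walk(tree):
              if isinstance(node, ast.ImportFrom) and node.module:
                  parts = node.module.split(".")
                  # app.inventory.api is allowed; app.inventory.db is not.
                  if (len(parts) >= 3 and parts[0] == "app"
                          and parts[1] in MODULES - {owner}
                          and parts[2] != "api"):
                      raise AssertionError(f"{path} imports {node.module}")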

The other benefit of a modular monolith: it's much easier to extract a service from it later. If your modules have clean interfaces and don't share state, extracting one into a standalone service is mostly a matter of putting an HTTP or gRPC boundary where the function call was. The hard work — defining the interface, separating the data — is already done.

Practical tip

Start with a modular monolith. When a specific module needs to deploy independently (because it has a different release cadence, or a different scaling profile, or a different team that owns it), extract it into a service. You'll do it when you have a concrete reason, and you'll do it from a clean starting point.

A Decision Framework

When someone asks "should we use microservices?", here's the set of questions worth working through before answering:

1. How many engineers will be working in this codebase?

Under 20: start with a monolith unless there's a specific technical reason not to. Between 20 and 80: a modular monolith is probably right. Over 80: independent services start to make sense, but only at boundaries where teams are actually independent.

2. Do different parts of the system have genuinely different scaling requirements?

If the answer is no — if the whole system scales together — then having separate services just means you've added network overhead and operational complexity for no scaling benefit.

3. Do different parts have different reliability requirements?

If your video transcoding pipeline going down shouldn't affect your user-facing API, separate services (with the right circuit breakers) give you fault isolation. But if everything needs to be up for anything to work, services don't buy you fault isolation either.

4. Do different teams need to deploy independently?

This is the most compelling operational reason for separate services. If Team A needs to wait for Team B's review to deploy, that's real friction. Services remove that friction — but only if the teams are truly independent (don't share databases, don't need coordinated deploys).

5. Have you already tried a monolith?

If you're designing a new system, start with a monolith and split when you feel the pain. It's much easier to go from monolith to microservices (you can see the seams when they emerge naturally) than from a poorly designed microservice architecture to a better one.

When You Have the Wrong Boundary

Sometimes you realize midway through a project — or midway through a year — that a boundary is wrong. Services that are too chatty, ownership that's unclear, databases that are shared. What do you do?

First, accept that fixing a wrong boundary is a project, not a PR. You're not refactoring code, you're migrating data, changing contracts, and coordinating teams. Give it the planning overhead it deserves.

The Strangler Fig pattern (named after a tree that slowly grows around a host and eventually replaces it) is the standard approach. Instead of rewriting the service and doing a big cutover, you:

  1. Build the new service alongside the existing one.
  2. Route a small percentage of traffic to the new service.
  3. Gradually increase the percentage as confidence builds.
  4. Deprecate the old service once the new one is handling 100% of traffic.

This applies equally to splitting a monolith, merging two services into one, or redefining where the boundary is between existing services.
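The routing step itself is usually a small amount of code at the edge. Here's a sketch that hashes on user ID, so each user consistently lands on the same backend while the percentage ramps up; real setups put this logic in a proxy or service mesh, but the shape is the same:

Strangler Fig routing in miniature (sketch)
  import hashlib

  NEW_SERVICE_PERCENT = 5   # raise gradually as confidence builds

  def pick_backend(user_id: str) -> str:
      # Hash the user ID into a stable bucket from 0 to 99 so each user
      # consistently hits the same backend across requests.
      bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
      return "new-service" if bucket < NEW_SERVICE_PERCENT else "old-service"

  print(pick_backend("user-42"))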

The migration of data is usually the hardest part. If you're moving data ownership from Service A to Service B, you typically need to:

  1. Stand up the new store and have Service B shadow-write (or replicate) every change, while Service A remains the source of truth.
  2. Backfill historical data into the new store.
  3. Continuously verify that the two copies agree.
  4. Cut reads over to Service B, then writes, each behind a reversible switch.
  5. Only then delete the data — and the code paths — in Service A.

This is slow and careful work. Budget for it properly.
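The first steps are worth seeing in miniature. In this sketch, dictionaries stand in for the two data stores; what matters is the shape: the old owner stays the source of truth while the new owner is shadow-written and continuously verified:

Dual-write and verify (sketch)
  old_store = {}   # still the source of truth during migration
  new_store = {}   # the future owner, shadow-written for now
  mismatches = []

  def save_preferences(user_id: str, prefs: dict) -> None:
      old_store[user_id] = prefs               # the write that matters
      try:
          new_store[user_id] = dict(prefs)     # shadow write
      except Exception:
          mismatches.append(user_id)           # log it, don't fail the request

  def verify_sample(user_ids: list) -> list:
      # Run continuously; cut reads over only when this stays empty.
      return [u for u in user_ids if old_store.get(u) != new_store.get(u)]

  save_preferences("u1", {"email_digest": "weekly"})
  print(verify_sample(["u1"]))  # [] means the copies agree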

Chapter Summary

The Core Principle

  • A boundary is good if you can change one service without touching another
  • Data ownership, not code separation, is what makes a boundary real
  • The org structure you have will produce the architecture you get

The Most Common Mistake

  • Building a distributed monolith by splitting along technical layers, not business domains
  • Letting two services share a database table
  • Choosing microservices before you've felt the pain that justifies them

Three Questions for Your Next Design Review

  • If Team A changes their data model tomorrow, which other teams need to know?
  • Can each service deploy independently today, right now — no coordination required?
  • Where in your system are two services making a high number of synchronous calls to each other?