The Problem: You Can't Do This Anymore
Imagine you're building an e-commerce system. When a customer places an order, three things need to happen:
- Create an order record in the Orders database
- Deduct the items from inventory in the Inventory database
- Charge the customer's payment method via the Payment service
If all three succeed, great. But what if inventory deduction succeeds, then the payment fails? You now have inventory reserved for an order that was never paid for. What if the order is created, the payment goes through, but your server crashes before it can tell the inventory service? Now you've charged the customer for items that are still shown as available.
In a single monolithic application with one database, you would wrap all three in a transaction. The database gives you an atomic unit — either all of it happens, or none of it does. It's one of the best features ever invented for writing correct software.
But in a microservices world, these three operations live in separate services, owned by separate teams, backed by separate databases. There is no shared database to wrap them in. The transaction boundary stops at the edge of each service's database.
This is the problem this chapter is about.
Distributed transactions are not a technical problem waiting for a clever solution. They are a fundamental tension between two things we want: services that are independent and autonomous, and operations that span multiple services yet behave atomically. You can't have both fully. Every solution in this chapter is a different way of accepting that trade-off.
A Quick Recap of ACID
Before we look at what breaks, let's be precise about what ACID gives us inside a single database:
- Atomicity — either all operations in the transaction complete, or none do. No partial state.
- Consistency — the database moves from one valid state to another. Constraints are always satisfied.
- Isolation — concurrent transactions don't see each other's partial state. It's as if they ran one after another.
- Durability — once committed, the data survives crashes.
When you move to multiple services, each service can still give you ACID for its own data. But there's no higher-level system giving you ACID across all of them together. That's what we're trying to approximate.
Two-Phase Commit: The Traditional Answer
Two-phase commit (2PC) is the classical protocol for making a transaction atomic across multiple participants. It was invented decades ago, it is still used today, and it comes with well-known problems that are worth understanding in detail.
How it works
The protocol has two roles: a coordinator (usually the service that initiated the transaction) and multiple participants (each service or database involved). It runs in two phases. In the prepare phase, the coordinator asks every participant: "Can you commit this transaction?" Each participant votes Yes or No. In the commit phase, if every vote was Yes, the coordinator tells all participants to commit; if any vote was No or never arrived, it tells them all to abort.
Each participant, when it says "Yes, I'm ready", has to guarantee that it can commit — it writes the pending changes to a durable log and holds locks on the affected data. It cannot release those locks until it hears back from the coordinator.
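A sketch of the coordinator's side, with an invented Participant interface (real participants are databases or transaction managers, and a real coordinator also durably logs its own decision before phase two):

```python
import uuid
from typing import Protocol

class Participant(Protocol):
    """Illustrative interface, not a real library API."""
    def prepare(self, txn_id: str) -> bool: ...  # vote; a Yes means locks are held
    def commit(self, txn_id: str) -> None: ...   # make prepared changes permanent
    def abort(self, txn_id: str) -> None: ...    # discard changes, release locks

def two_phase_commit(participants: list[Participant]) -> bool:
    txn_id = str(uuid.uuid4())

    # Phase 1 (voting): every participant must durably prepare and vote Yes.
    for p in participants:
        try:
            if not p.prepare(txn_id):
                raise RuntimeError("participant voted No")
        except Exception:
            for q in participants:
                q.abort(txn_id)  # any No vote or failure aborts everyone
            return False

    # Phase 2 (completion): all voted Yes, so everyone commits.
    # If the coordinator crashes right here, every participant is stuck
    # holding locks. That gap is the blocking problem described next.
    for p in participants:
        p.commit(txn_id)
    return True
```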
The blocking problem
Here is the fatal issue with 2PC. What happens if the coordinator crashes after it receives all the "Yes" votes but before it sends the "Commit" command?
The participants cannot decide on their own. They can't commit without the coordinator's instruction (because maybe the coordinator decided to abort). They can't abort either (because maybe the coordinator decided to commit and some other participant already did). They are stuck.
This is called the in-doubt period. During this time, rows are locked, other transactions trying to access those rows will block or time out, and your system grinds to a halt.
The blocking problem is not just theoretical. Any system using 2PC at high volume will periodically have a coordinator crash. When this happens, all participants hold locks until the coordinator recovers. For systems that handle thousands of transactions per second, even a 30-second coordinator restart can queue up a backlog that takes minutes to drain. This is why companies like Google, Amazon, and Netflix built distributed systems that avoid 2PC entirely for high-throughput paths.
Three-phase commit: a partial fix
Three-phase commit (3PC) adds a "Pre-commit" phase between the vote and the commit, which allows participants to figure out what to do if the coordinator crashes — but only in networks where you can assume message delivery times are bounded. In practice, real networks are asynchronous and have no such guarantee, which is why 3PC is rarely used in production systems. It solves a theoretical problem while creating a more complex system that fails in different ways.
The Saga Pattern: A Different Philosophy
Rather than trying to make distributed operations look atomic by holding locks, the Saga pattern takes the opposite approach: accept that failures will happen, and plan for how to recover from them.
A Saga breaks a long-running business transaction into a sequence of smaller, local transactions. Each step touches one service. If a step fails, you don't roll back like a database transaction — you run a compensating transaction for each step that already succeeded, undoing its effects.
The key difference from 2PC: there are no global locks. Each step commits immediately and releases its resources. The next step proceeds without waiting for everyone else. This is dramatically more scalable.
What compensating transactions really are
A compensating transaction is not the same as a database rollback. A database rollback is as if the original operation never happened — it's invisible. A compensating transaction is a new operation that logically reverses the effect of a previous one, but the previous one did happen and may have been visible to others in the meantime.
Consider: if you created an order (Step 1) and then need to cancel it (compensating transaction), the order might have been visible to a warehouse worker for 30 seconds. The compensation creates a cancellation record — it doesn't erase the original order from history. This is important to understand when designing your data model.
Compensating transactions are a business-level concept, not a database concept. "Cancel the order" is very different from "pretend the order never existed." This means you need to think carefully about what "undoing" a step means in your domain — and accept that some operations are hard or impossible to fully compensate (for example, sending an email, or printing a shipping label).
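To make that concrete, here is a minimal sketch of a compensating transaction, assuming a psycopg-style database connection and an invented orders schema. The point is that compensation records the cancellation; it does not erase the order:

```python
def compensate_create_order(conn, order_id: str) -> None:
    # Not a rollback: the order row stays, history is preserved.
    conn.execute(
        "UPDATE orders SET status = 'CANCELLED' WHERE id = %s",
        (order_id,),
    )
    # Record WHY the state changed, so the warehouse worker who saw the
    # order for 30 seconds can later see it was cancelled, not deleted.
    conn.execute(
        "INSERT INTO order_events (order_id, event_type) "
        "VALUES (%s, 'ORDER_CANCELLED')",
        (order_id,),
    )
```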
The isolation problem with Sagas
Remember the I in ACID — isolation. Sagas give you atomicity (eventually, through compensation), but they do not give you isolation. While your Saga is running, another request can read the intermediate state of your data.
In our order example: after Step 1 (order created) and Step 2 (inventory reserved), but before Step 3 (payment), another request might read the inventory count and see the items as reserved. If Step 3 then fails and we compensate Step 2 (releasing the reservation), that other request saw a state that no longer exists. This is called a dirty read.
There are patterns to manage this — you can use semantic locks (marking a record as "pending" until the Saga completes), or design your UI to tolerate eventual consistency. But don't kid yourself: this is a real trade-off that you need to design around, not a detail to paper over.
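A semantic lock can be as simple as a status column plus a guarded update. A sketch, again assuming a psycopg-style connection and an invented inventory schema:

```python
def reserve_stock(conn, item_id: str, qty: int, saga_id: str) -> bool:
    # The status column is the semantic lock: readers and other Sagas
    # treat 'PENDING' rows as in-flight rather than as settled truth.
    cur = conn.execute(
        "UPDATE inventory "
        "SET reserved = reserved + %s, status = 'PENDING', pending_saga = %s "
        "WHERE item_id = %s AND on_hand - reserved >= %s",
        (qty, saga_id, item_id, qty),
    )
    return cur.rowcount == 1  # 0 rows updated: not enough stock, step fails

def confirm_stock(conn, item_id: str, saga_id: str) -> None:
    # Saga completed: clear the lock so readers can trust the row again.
    conn.execute(
        "UPDATE inventory SET status = 'SETTLED', pending_saga = NULL "
        "WHERE item_id = %s AND pending_saga = %s",
        (item_id, saga_id),
    )
```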
Choreography vs. Orchestration
Once you decide to use Sagas, you have a choice about how the steps are coordinated. There are two fundamentally different approaches.
Choreography: Each service reacts to events
In choreography, there is no central coordinator. Each service listens for events on a message bus and decides what to do next. When the Orders service creates an order, it publishes an OrderCreated event. The Inventory service listens for this event and reserves stock. When that's done, it publishes an InventoryReserved event. The Payment service listens for that and charges the customer. And so on.
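A sketch of the Inventory service's side of this flow, using an invented in-process MessageBus as a stand-in for a real broker client (Kafka, RabbitMQ, and so on):

```python
from typing import Callable

class MessageBus:
    """In-process stand-in for a broker; the subscribe/publish API is invented."""
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = {}

    def subscribe(self, topic: str):
        def register(handler: Callable[[dict], None]):
            self._handlers.setdefault(topic, []).append(handler)
            return handler
        return register

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._handlers.get(topic, []):
            handler(event)

bus = MessageBus()

@bus.subscribe("OrderCreated")
def on_order_created(event: dict) -> None:
    # Local transaction in the Inventory database, then announce the result.
    # No service knows the overall flow; each only knows its own events.
    print(f"reserving stock for order {event['order_id']}")
    bus.publish("InventoryReserved", {"order_id": event["order_id"]})

bus.publish("OrderCreated", {"order_id": "o-42"})  # kicks off the chain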
Choreography is simple to start with. Each service is autonomous. There's no single point of failure for the coordination itself. Adding a new step is as easy as adding a new listener.
But it has a serious problem as the system grows: the overall flow becomes invisible. No single place in your codebase describes "here is what happens when an order is placed." The logic is distributed across five event handlers in five different services. When something goes wrong — and it will — you'll be tracing events across multiple logs, multiple services, trying to reconstruct what happened.
Orchestration: A central coordinator drives the flow
In orchestration, one service — typically called a Saga orchestrator or process manager — is responsible for driving the entire workflow. It calls each service in turn, tracks progress, and decides what to do if something fails.
Orchestration makes the flow visible and explicit. When something goes wrong, you look at the orchestrator's state to see exactly which step failed. It's much easier to debug, monitor, and reason about.
The downside: the orchestrator is a new moving part in your system. It needs to be reliable and persistent (its state needs to survive crashes). It can also become a coupling point — if your orchestrator needs to know the API of every participant service, changes to those services ripple into the orchestrator.
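A minimal orchestrator sketch. Note the contrast with choreography: the step sequence and the compensation logic live in one visible place. The class and field names are invented, and a production version would persist the history durably (see the failure-handling section):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]
    compensation: Callable[[dict], None]

@dataclass
class SagaOrchestrator:
    steps: list[SagaStep]
    history: list[str] = field(default_factory=list)  # explicit, inspectable state

    def run(self, ctx: dict) -> None:
        completed: list[SagaStep] = []
        for step in self.steps:
            try:
                step.action(ctx)
                completed.append(step)
                self.history.append(f"{step.name}: ok")
            except Exception:
                self.history.append(f"{step.name}: failed, compensating")
                for done in reversed(completed):  # undo in reverse order
                    done.compensation(ctx)
                    self.history.append(f"{done.name}: compensated")
                raise
```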
| Dimension | Choreography | Orchestration |
|---|---|---|
| Flow visibility | Invisible — spread across services (harder) | Explicit — defined in one place (easier) |
| Coupling | Services only know about events (loose) | Orchestrator knows all services (tighter) |
| Debugging | Trace events across many services (hard) | Check orchestrator state (easy) |
| Single point of failure | No coordinator to fail (resilient) | Orchestrator must be highly reliable (requires care) |
| Adding steps | Add a new listener (easy) | Update the orchestrator (centralized change) |
| Best for | Simple, linear flows with few steps | Complex flows, long-running workflows, many failure cases |
In practice, most large systems end up with orchestration for their critical business flows, because being able to see and debug the overall state of a transaction is worth the coupling cost. Choreography tends to work well for simpler, fire-and-forget notification flows.
Idempotency: The Foundation Everything Rests On
Whether you use Sagas, 2PC, or anything else — the entire correctness of your distributed transaction system depends on one property: idempotency.
An operation is idempotent if running it multiple times with the same input has the same effect as running it once. The canonical example is setting a value: SET balance = 100 is idempotent. Running it ten times gives the same result as running it once. In contrast, ADD 10 TO balance is not idempotent — running it ten times gives a very different result.
Why does this matter so much? Because in distributed systems, you cannot reliably know if an operation was executed. A request goes out; the network times out. Did the other service receive it and process it before the timeout? Or did it never arrive? You don't know. So you retry. If the operation is not idempotent, that retry might double-charge the customer, reserve inventory twice, or create two orders.
Idempotency keys
The standard pattern for making operations idempotent is to assign each operation a unique key (usually a UUID generated by the caller) and have the server record which keys it has already processed.
A few practical details to get right:
- The key must be generated by the caller, not the server. If the server generates the key, a timed-out request has no key to retry with.
- Store the result, not just the key. The retry must get back the same response as the original call. Just storing "I've seen this key" is not enough — you need to return the original charge ID, transaction status, etc.
- The deduplication window matters. How long do you keep idempotency records? Forever is expensive. 24 hours might be fine for payment retries but not for a background job that runs weekly. Choose based on your retry pattern.
- The key scope matters. An idempotency key only deduplicates within one service. If you retry an entire Saga, each step needs its own idempotency key — typically derived from the Saga's ID plus the step number.
A clean pattern: generate one UUID for the entire Saga (e.g., saga_id = "abc123"). For each step, derive the idempotency key deterministically: Step 1 uses "abc123-step-1", Step 2 uses "abc123-step-2". This way, if you replay a Saga from the beginning after a crash, each step gets the same key it had before — and won't execute twice.
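Here is a sketch of the server side of the pattern, using SQLite for brevity; the schema and handler are invented. The essential moves: look up the key before doing any work, and record the key plus the full response in the same transaction as the work itself.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE charges (id TEXT PRIMARY KEY, amount_cents INTEGER);
    CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, response TEXT);
""")

def charge(idempotency_key: str, amount_cents: int) -> dict:
    # A retry with the same key gets the ORIGINAL response back,
    # including the original charge ID, not a second charge.
    row = conn.execute(
        "SELECT response FROM idempotency_keys WHERE key = ?",
        (idempotency_key,),
    ).fetchone()
    if row:
        return json.loads(row[0])

    charge_id = str(uuid.uuid4())
    response = {"charge_id": charge_id, "status": "succeeded"}
    with conn:  # one transaction: the charge and the key commit atomically
        conn.execute("INSERT INTO charges VALUES (?, ?)",
                     (charge_id, amount_cents))
        conn.execute("INSERT INTO idempotency_keys VALUES (?, ?)",
                     (idempotency_key, json.dumps(response)))
    return response

saga_id = "abc123"
first = charge(f"{saga_id}-step-3", 4999)
retry = charge(f"{saga_id}-step-3", 4999)
assert first == retry  # the retry is a harmless replay
```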
The Dual-Write Problem
Here's a specific failure mode that trips up most teams when they first build event-driven systems. It's subtle, and the naive solution doesn't work.
Suppose your Orders service needs to do two things when creating an order:
- Save the order to its database
- Publish an OrderCreated event to the message queue (so the Inventory service can react)
These are two different systems. You cannot do both atomically. So what order do you do them in, and what happens when the process crashes between the two?
Neither ordering is safe. Save the order first and crash before publishing, and the order exists but the event never goes out; the Inventory service never learns about it. Publish first and crash before saving, and downstream services react to an order that doesn't exist. This is not a bug you can fix by being clever. It's a fundamental consequence of having two separate systems with no shared transaction boundary.
The Outbox Pattern
The Outbox pattern solves this by turning two operations into one. Instead of writing to the database and separately publishing to the queue, you write to the database and write the event to a special outbox table in the same database transaction. A separate process (the Outbox Relay) then reads from the outbox table and publishes to the message queue.
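A sketch of both halves, assuming a psycopg-style connection, an invented outbox schema, and a bus client with a publish method. The key property: the first function touches only the database; the broker is involved only in the relay.

```python
import json
import uuid

def create_order(conn, customer_id: str, items: list[dict]) -> str:
    order_id = str(uuid.uuid4())
    with conn.transaction():  # ONE local transaction covers both writes
        conn.execute(
            "INSERT INTO orders (id, customer_id) VALUES (%s, %s)",
            (order_id, customer_id),
        )
        conn.execute(
            "INSERT INTO outbox (id, topic, payload) VALUES (%s, %s, %s)",
            (str(uuid.uuid4()), "OrderCreated",
             json.dumps({"order_id": order_id, "items": items})),
        )
    return order_id  # if this committed, the event WILL eventually go out

def relay_once(conn, bus) -> None:
    # Runs in a loop in a separate process. Publishing before deleting means
    # a crash in between re-sends the event on the next pass: at-least-once.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox ORDER BY created_at LIMIT 100"
    ).fetchall()
    for row_id, topic, payload in rows:
        bus.publish(topic, payload)
        conn.execute("DELETE FROM outbox WHERE id = %s", (row_id,))
```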
The Outbox pattern gives you a strong guarantee: if the order is saved, the event will eventually be published. The relay might deliver the event more than once (at-least-once delivery), but never zero times. This is why consumers need to be idempotent — they'll handle the duplicate just fine.
A variation of this is the Transaction Log Tailing approach (also called Change Data Capture, or CDC). Instead of a separate outbox table, you read the database's replication log directly — the stream of all changes made to the database. Tools like Debezium do this for PostgreSQL, MySQL, and others. The advantage is you don't need to add an outbox table or change your application code. The disadvantage is operational complexity — you're now depending on the internals of your database's replication format.
Write the event to your own database first (in the same transaction as your business data), then let a relay process publish it to the external queue asynchronously. You trade immediate publication for guaranteed eventual publication.
Putting It Together: What to Use When
By now you have several tools. The question is how to choose between them.
Use a local transaction (no distributed coordination) when you can
Before reaching for any of the above patterns, ask: can I restructure my data so this operation only touches one database? This is not always possible, but it's more often possible than engineers think. The best distributed transaction is the one you don't need.
For example, if your Orders and Inventory data live in the same database (even if exposed through different services), you can use a local transaction. The services can be separately deployed and owned — they just share a database. This is sometimes called the Shared Database pattern, and it's unfairly stigmatized. For teams that are early in service decomposition, it's often the pragmatic right answer.
Use 2PC for low-volume, high-value operations across few participants
If you're processing hundreds of transactions per day (not millions), the blocking behavior of 2PC is unlikely to matter. Financial settlement systems, inter-bank transfers, and regulatory reporting pipelines often use 2PC or XA transactions because the strong guarantee is worth the overhead. The key condition: few participants (2-3), low volume, and operations where correctness is more important than throughput.
Use Sagas for high-volume, multi-step business workflows
This is the right default for most modern distributed systems. Sagas work at any throughput, don't block on coordinator failures, and map naturally to business workflows that already have compensation logic ("cancel the order", "refund the payment"). Use orchestration when the flow has more than 3-4 steps or has complex failure handling. Use choreography for simpler, linear flows.
Always use the Outbox pattern when publishing events alongside database writes
This is not optional. The dual-write problem will eventually hit you in production if you don't address it, and the consequences (silent data inconsistency) are the worst kind of bug — hard to detect and hard to fix after the fact.
A Real-World Example: Booking a Flight
Let's make this concrete with a worked example. You're building a flight booking system. When a customer books a flight, you need to:
- Reserve the seat (Flights service)
- Create the booking record (Bookings service)
- Charge the customer (Payments service)
- Send a confirmation email (Notifications service)
Step 4 is important: you cannot compensate for sending an email. Once it's sent, it's sent. This is called an irreversible operation, and it shapes the design significantly.
The right approach: put the email notification last, after everything else has committed successfully. Never put irreversible operations in the middle of a Saga. This way, if anything fails before the email step, you can compensate cleanly. You only send the email once you're certain the booking is complete.
1. Reserve seat (reversible). Mark the seat as "pending" (not "booked"). Compensation: mark as available again. Easily reversible.
2. Create booking record (reversible). Create a booking in "pending" state. Compensation: delete or mark as "cancelled". Reversible.
3. Charge payment (semi-reversible). Charge the card. Compensation: issue a refund. Not instant — refunds take days — but logically reversible. This is the hardest step to compensate, so put it last among the reversible operations.
4. Confirm seat + booking (reversible). Flip both from "pending" to "confirmed". Compensation: flip back. Still reversible.
5. Send confirmation email (irreversible — do this last). Send the email only after all prior steps have committed. If this step fails, you don't need to compensate — you just retry until the email goes out. The booking is already confirmed.
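Wired up with the orchestrator sketch from earlier, the whole flow might read like this. The service functions and the retry_until_success helper are placeholders, not a real API:

```python
saga = SagaOrchestrator(steps=[
    SagaStep("reserve-seat",   reserve_seat,   release_seat),
    SagaStep("create-booking", create_booking, cancel_booking),
    SagaStep("charge-payment", charge_payment, refund_payment),  # hardest to undo
    SagaStep("confirm",        confirm_all,    unconfirm_all),
])

saga.run(booking_ctx)  # any failure above compensates cleanly and raises

# Only reached once every reversible step has committed. On failure here we
# retry; there is no compensation for an email that already went out.
retry_until_success(send_confirmation_email, booking_ctx)
```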
"Put irreversible operations last. Never put them in the middle, where a later failure would require you to compensate for something you can't undo."
Failure Handling: The Part Most Teams Skip
Building the happy path of a Saga is not that hard. The hard part is handling every failure scenario explicitly. Here are the failures you must think through:
Step failure
A step returns an error. This is the easy case. Run the compensations. Log the failure. Alert if necessary.
Step timeout
You called the Payments service and it didn't respond within 5 seconds. Do you retry? What if it actually succeeded and the response got lost? This is why idempotency is non-negotiable — retry with the same idempotency key and you'll get the right answer either way.
Orchestrator crash mid-Saga
The orchestrator itself can crash. This means it must persist its state durably. On restart, it must be able to resume a Saga from exactly the step where it left off, or detect that a step was in-flight and retry it. If you use a workflow engine (like Temporal, AWS Step Functions, or Conductor), this is handled for you. If you build it yourself, you need a durable state store and careful recovery logic.
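If you do build it yourself, the recovery loop is conceptually small. A sketch with an invented sagas table; the crucial detail is persisting the step index before executing the step, so a crash mid-step leads to a retry (with the same idempotency key), never a skip:

```python
def resume_running_sagas(db, steps: list) -> None:
    # On orchestrator startup: pick up every Saga that was mid-flight.
    for saga_id, step_index in db.execute(
            "SELECT id, current_step FROM sagas WHERE status = 'RUNNING'"):
        for i in range(step_index, len(steps)):
            # Persist BEFORE executing: if we crash inside steps[i],
            # the next restart retries step i with the same derived key.
            db.execute("UPDATE sagas SET current_step = %s WHERE id = %s",
                       (i, saga_id))
            steps[i](saga_id, idempotency_key=f"{saga_id}-step-{i + 1}")
        db.execute("UPDATE sagas SET status = 'COMPLETED' WHERE id = %s",
                   (saga_id,))
```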
Compensation failure
What if a compensating transaction fails? For example, you're trying to release a seat reservation, but the Flights service is down. Now you have a booking that was cancelled in the Bookings service but still appears reserved in the Flights service. This is called a compensation failure, and it requires manual intervention or a dead-letter queue where failed compensations are retried until they succeed.
You'll retry failed compensations. That means releasing an inventory reservation twice must be safe. Cancelling a booking twice must be safe. Every compensating transaction has the same idempotency requirement as the forward transaction. Build this in from the start, not as an afterthought.
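In practice this usually means a guard inside the compensation itself. A sketch, with an invented reservations schema:

```python
def release_reservation(conn, reservation_id: str) -> None:
    # Safe to run any number of times: after the first run, the WHERE
    # clause matches nothing and the statement is a no-op.
    conn.execute(
        "UPDATE reservations SET status = 'RELEASED' "
        "WHERE id = %s AND status = 'HELD'",
        (reservation_id,),
    )
```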
The Bottom Line
Distributed transactions are fundamentally hard because you're trying to make multiple independent systems behave as one. Every solution is a compromise:
- 2PC gives you strong atomicity but blocks on coordinator failure
- Sagas give you scalable, non-blocking coordination but sacrifice isolation
- Choreography gives you loose coupling but makes the flow invisible
- Orchestration makes the flow visible but introduces a central coordinator
- Outbox solves dual-write but requires a relay process and idempotent consumers
There is no option on this list that has no downsides. The skill is not finding the perfect option — it's being honest about the trade-offs and choosing the one that fits your system's actual requirements.
The most important mindset shift: stop thinking about "how do I make this atomic?" and start thinking about "how do I make this recoverable?" Recoverability — being able to detect and correct inconsistency — is more achievable and more durable than trying to prevent inconsistency from ever occurring in the first place.