Part IV Chapter 17 of 38

Event Sourcing and the Immutable Log

Most systems store the current state of the world. Event sourcing stores what happened to produce that state. That small shift in thinking changes almost everything about how you build, debug, and evolve a system — and it introduces problems that are surprisingly hard to solve.

Part IV Consistency & Correctness

What's in this chapter

  • Why storing state is the wrong default — and what storing events gives you instead
  • The append-only log as the foundation for distributed systems (the Kafka worldview)
  • Event sourcing in depth: how it works, where it breaks down
  • CQRS: why the read model and write model should be separate
  • The projections problem: rebuilding read models from a long event log
  • The dark side: when event sourcing makes things much harder
  • When to use it and — equally important — when not to

Key Learnings — If You Only Read This Section

Events are facts, state is a snapshot. An event says "this happened." State says "this is true right now." Events are immutable; state is derived. If you have the events, you can always reconstruct the state. If you only have the state, the history is gone forever.

The append-only log is the most durable data structure. It's a single, ordered sequence of facts. No updates, no deletes. Kafka, database WALs, and git commits are all the same idea.

Event sourcing gives you a time machine and an audit log for free. Because you keep every event, you can replay history, reproduce bugs, and ask questions about the past that you didn't think to ask when you built the system.

CQRS separates reading from writing. You write events to an append-only log. You read from a separate projection (a view built from those events). This sounds like extra work — and it is — but it lets you optimize reads and writes completely independently.

Rebuilding a projection from millions of events is painfully slow without snapshots. Snapshots are periodic checkpoints of derived state so you don't replay the full history every time. They add operational complexity.

Event schemas are the hardest thing to change in an event-sourced system. You can't update old events. If your event structure was wrong, you're stuck with it forever — or you build a migration path that's more complex than the original system.

Most applications don't need event sourcing. A CRUD app with a PostgreSQL database is fine. Event sourcing is valuable when audit trails, temporal queries, event-driven integration, or complex domain logic genuinely justify the overhead.

The Problem with Storing State

Imagine you're building an online bank. A user's account has a balance. The simplest thing is to store that balance directly — one row in a table, one column called balance. When money comes in, you update it. When money goes out, you update it again.

Now your customer calls and says: "I think there was an unauthorized charge on my account last Tuesday." What do you do? You look at the current balance. But that tells you nothing about last Tuesday. You've overwritten the past. It's gone.

This is the fundamental problem with mutable state. Every time you update a record, you destroy the information about what it was before. The current value is all you have.

Banks have always known this. Their ledger is not a single row that gets updated. It's a list of transactions — an append-only record of every credit and debit. The balance is not stored; it is computed by summing the transactions. The transactions are the truth. The balance is a derived view of that truth.
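The ledger model fits in a few lines of code. Here's a minimal sketch (the `Transaction` type and field names are illustrative, not from any real banking system) showing the balance as a fold over the transaction list rather than a stored value:

```python
# A minimal ledger sketch: transactions are appended, never updated.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a recorded transaction is immutable
class Transaction:
    kind: str     # "credit" or "debit"
    amount: int   # cents, to avoid float rounding

def balance(transactions):
    """The balance is not stored; it is derived from the transactions."""
    total = 0
    for tx in transactions:
        total += tx.amount if tx.kind == "credit" else -tx.amount
    return total

ledger = [
    Transaction("credit", 50_000),  # $500 deposit
    Transaction("debit", 5_000),    # $50 withdrawal
]
print(balance(ledger))  # 45000 cents = $450
```

Note that deleting or editing a `Transaction` is impossible by construction; correcting a mistake means appending a compensating transaction, exactly as accountants do.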

Core Idea

The traditional approach asks: what is the current state? Event sourcing asks: what happened, in order? State is just the current answer to replaying all the events from the beginning.

This idea — recording a sequence of events instead of updating state in place — is the foundation of event sourcing. And once you see it, you'll notice it everywhere: your database's write-ahead log, git commits, accounting ledgers, even the undo history in a text editor.

The Append-Only Log

Before we talk about event sourcing as an application pattern, let's talk about the data structure underneath it: the append-only log.

A log is the simplest possible data structure. You can only do one thing to it: add a new entry at the end. You cannot update an existing entry. You cannot delete one. Entries are ordered and numbered. That's it.

```
Append-Only Log

Offset  Timestamp            Event
──────────────────────────────────────────────────────────────────
0       2024-01-10 09:01:22  AccountOpened   { id: "acc-1", owner: "alice" }
1       2024-01-10 09:03:11  MoneyDeposited  { amount: 500, currency: "USD" }
2       2024-01-11 14:22:08  MoneyWithdrawn  { amount: 50, currency: "USD" }
3       2024-01-12 10:05:44  MoneyDeposited  { amount: 200, currency: "USD" }
4       2024-01-13 16:41:30  MoneyWithdrawn  { amount: 120, currency: "USD" }
──────────────────────────────────────────────────────────────────
        ↑ entries are immutable; new entries are appended at the end

Current balance = 500 − 50 + 200 − 120 = $530
```

This structure is deceptively powerful. Because entries are immutable and ordered, the log has a property that most data structures don't: it is the source of truth, not a reflection of it. Every other view of the data — a balance, a dashboard, a search index — is a derived view of the log.
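A toy in-memory version makes the constraints concrete. This sketch (names are illustrative) exposes only two operations — append and read-forward — mirroring the offset model Kafka uses:

```python
# Toy in-memory append-only log. The only mutation allowed is append();
# entries get a monotonically increasing offset.
class AppendOnlyLog:
    def __init__(self):
        self._entries = []

    def append(self, event) -> int:
        """Add an event at the end; returns its offset."""
        self._entries.append(event)
        return len(self._entries) - 1

    def read_from(self, offset: int):
        """Consumers read forward from an offset; nothing is ever removed."""
        return self._entries[offset:]

log = AppendOnlyLog()
log.append({"type": "MoneyDeposited", "amount": 500})
log.append({"type": "MoneyWithdrawn", "amount": 50})
print(log.read_from(0))
```

There is deliberately no `update()` or `delete()` method: any derived view (a balance, an index) is computed by some consumer calling `read_from` and folding over what it gets back.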

Why Logs Show Up Everywhere

If you look closely, logs are underneath almost every reliable system:

  • Database write-ahead logs (WALs): every change is appended to the log before the tables are touched, so the tables can always be rebuilt from it
  • Git: the commit history is an append-only record; branches and tags are just pointers into it
  • Kafka: topics are partitioned append-only logs that consumers read at their own pace
  • Accounting ledgers: transactions are appended; the balance is derived
  • Undo history in a text editor: a log of edits, replayed backwards

The pattern is the same every time: the log records what happened; everything else is computed from the log. This is not a coincidence. It reflects something true about how reliable systems should be built.

The Key Insight

Jay Kreps (one of Kafka's creators) wrote a now-famous post called "The Log: What every software engineer should know about real-time data's unifying abstraction." His core observation: the log is the canonical record of events in a distributed system. Everything else — databases, caches, search indexes — is a materialized view of the log.

Event Sourcing

Event sourcing takes the log idea and applies it to your application's domain. Instead of storing the current state of your domain objects, you store the sequence of events that led to that state.

Let's make this concrete with an e-commerce order.

Traditional vs. Event-Sourced

In a traditional system, you have an orders table. The row for order #1234 might look like:

```sql
-- Traditional: current state only
SELECT * FROM orders WHERE id = '1234';

   id   | status    | total | updated_at
 '1234' | 'shipped' | 89.99 | '2024-01-15 10:30:00'
```

You can see that the order was shipped. You cannot see that it was placed, then modified, then paid for, then dispatched. All that history is gone — unless you explicitly built an audit log separately, which most teams don't.

In an event-sourced system, you store the events:

```sql
-- Event sourced: full history
SELECT * FROM order_events WHERE order_id = '1234' ORDER BY seq;

 seq | event_type         | data
  1  | OrderPlaced        | { items: [...], total: 89.99 }
  2  | ItemRemoved        | { item_id: "X", new_total: 79.99 }
  3  | ItemAdded          | { item_id: "Y", new_total: 89.99 }
  4  | PaymentReceived    | { amount: 89.99, method: "card" }
  5  | ShipmentDispatched | { tracking: "UPS-XYZ" }
```

To get the current state, you replay these events through a function that accumulates them:

```python
from functools import reduce

def apply(state, event):
    if event.type == "OrderPlaced":
        return Order(id=event.order_id, status="pending", total=event.data.total)
    elif event.type == "PaymentReceived":
        return state.with_status("paid")
    elif event.type == "ShipmentDispatched":
        return state.with_status("shipped").with_tracking(event.data.tracking)
    # ... other cases

current_state = reduce(apply, events, initial_state)
```

The apply function is pure — given the same sequence of events, it always produces the same state. This is important, as we'll see.

What Event Sourcing Gives You

The benefits are real and they're significant in the right context.

Complete audit trail. Because you keep every event, you have a complete, immutable record of everything that ever happened to every entity. This is not a secondary audit table bolted on — it's the primary data. Regulators love this. Security teams love this.

Temporal queries. You can ask "what did this order look like at 2pm on Tuesday?" by replaying events up to that timestamp. With mutable state, this question is unanswerable.
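Replaying up to a timestamp is just a filtered fold. A sketch, with events represented as simple `(timestamp, delta)` tuples for illustration:

```python
# Sketch of a temporal query: replay only the events at or before a cutoff.
from functools import reduce

def state_at(events, cutoff):
    """Rebuild state as of `cutoff` by replaying the event prefix."""
    past = [e for e in events if e[0] <= cutoff]
    return reduce(lambda bal, e: bal + e[1], past, 0)

events = [(1, +500), (2, -50), (5, +200)]  # (timestamp, delta)
print(state_at(events, 2))  # balance as of t=2: 450
```

The same mechanism answers "what was the state just before event N?" — invaluable when reproducing a bug that only manifested partway through a sequence.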

Debugging is fundamentally different. When a bug is reported, you can replay the exact sequence of events that led to it. You're not trying to reconstruct what happened from log messages and intuition — you have the actual events. You can run them through a corrected version of your code and see what the correct output would have been.

New read models from old data. If you decide 18 months into your product that you want a new dashboard or analytics view, you can build a new projection by replaying all your historical events through new code. You're not limited to the views you thought to build when you started.

Integration via events. Other services can subscribe to your event stream and build their own views. They don't need to query your database. They listen to what happened.

CQRS: Separating Reads from Writes

Event sourcing almost always comes packaged with another pattern called CQRS — Command Query Responsibility Segregation. The name is intimidating but the idea is simple: the model you use to change data doesn't have to be the same model you use to read data.

In a traditional system, you read from and write to the same table. This means the schema has to serve both purposes. Sometimes that's fine. But often the queries you want to run don't match the shape of the data you're writing.

With event sourcing, the write side is simple: validate a command, produce one or more events, append them to the log. The read side is a separate concern: take the events and build a projection — a view of the data optimized for querying.

```
CQRS + Event Sourcing — Data Flow

        Command (PlaceOrder, PayOrder)
                    │
                    ▼
          ┌─────────────────┐
          │ Command Handler │ ← validates, applies business rules
          └────────┬────────┘
                   │ produces
                   ▼
        ┌──────────────────────────┐
        │ Event Log (Kafka /       │
        │ EventStore / Postgres)   │
        └────────────┬─────────────┘
                     │
      ┌──────────────┼──────────────┐
      ▼              ▼              ▼
 Projection A   Projection B   Projection C
 (Order status  (Search index  (Analytics
  for users)     for ops)       dashboard)
      │              │              │
      ▼              ▼              ▼
   Read DB     Elasticsearch  Data Warehouse

Write side: simple, event-producing
Read side: multiple, independently optimized projections
```

Each projection is a consumer that reads the event log and builds its own read model — a database table, a search index, a cache, whatever the query pattern requires. If you need a new query, you add a new projection. The write side doesn't change.
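A projection is just a fold from events into a queryable shape. In this sketch the read model is a dict keyed by order ID (in production it would be a database table); event names follow the order example above, and the function names are illustrative:

```python
# Sketch of a projection: a consumer folds events into a read model.
def project(read_model, event):
    order_id = event["order_id"]
    if event["type"] == "OrderPlaced":
        read_model[order_id] = {"status": "pending", "total": event["total"]}
    elif event["type"] == "PaymentReceived":
        read_model[order_id]["status"] = "paid"
    elif event["type"] == "ShipmentDispatched":
        read_model[order_id]["status"] = "shipped"
    return read_model

read_model = {}
for e in [
    {"type": "OrderPlaced", "order_id": "1234", "total": 89.99},
    {"type": "PaymentReceived", "order_id": "1234"},
]:
    project(read_model, e)
print(read_model["1234"]["status"])  # "paid"
```

A second projection for search or analytics would be a different fold over the same events — the write side never knows it exists.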

Why This Is Powerful

The read model and write model can use completely different databases. Your writes go into an event log. One projection builds a relational DB for transactional queries. Another builds an Elasticsearch index for full-text search. Another feeds a data warehouse for analytics. They all come from the same events.

The Consistency Trade-off in CQRS

There's a cost. Because projections are built asynchronously from the event log, they are eventually consistent. If a user places an order and immediately asks "what is my order status?", the projection might not have processed the OrderPlaced event yet.

This is a real problem for user-facing features, and many teams underestimate it. There are workarounds — reading directly from the event log for the most recent state, using a version number to detect stale reads — but they add complexity. The simplicity of "read the same thing you just wrote" is gone.
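The version-number workaround can be sketched in a few lines. After a write, the client remembers the stream version it produced; the read side rejects (or retries) any projection row that hasn't caught up yet. All names here are illustrative:

```python
# Sketch of version-based stale-read detection.
class StaleRead(Exception):
    pass

def read_order(projection, order_id, min_version):
    """Return the row only if the projection has caught up to min_version."""
    row = projection.get(order_id)
    if row is None or row["version"] < min_version:
        raise StaleRead(f"projection behind version {min_version}")
    return row

projection = {"1234": {"status": "pending", "version": 1}}
try:
    read_order(projection, "1234", min_version=2)  # writer saw version 2
except StaleRead:
    print("retry, or fall back to reading the event log directly")
```

The catch is that now every caller has to decide what "too stale" means and what to do about it — complexity that simply doesn't exist in a read-your-writes database.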

The Projections Problem

This is the part that most introductory articles about event sourcing skip. It's where the model gets hard.

A projection is built by reading the event log from the beginning and applying each event to build up the read model. If you have 100 events, this is instant. If you have 100 million events, this takes a long time. As your system ages, rebuilding projections becomes slower and slower.

Why You Need to Rebuild Projections

You will need to rebuild projections more often than you expect. Common reasons:

  • A bug in projection code: the read model is wrong and must be recomputed from the events
  • A new read model: a dashboard or query you didn't anticipate, backfilled from history
  • A change in projection logic: new fields, new aggregations, a restructured view
  • Disaster recovery: the read database is lost or corrupted, and the log is the only source of truth

When your event log has accumulated years of history, "replay from the beginning" is not a fast operation. You might be looking at hours or days of rebuild time.

Snapshots: The Mitigation

The standard solution is snapshotting. Periodically, you save the current state of the projection as a checkpoint. When you need to rebuild, you start from the most recent snapshot rather than from the very beginning of the log.

```
Snapshots Reduce Replay Time

Event log:
│e1│e2│e3│...│e10000│e10001│e10002│...│e50000│e50001│e50002│...│eN│
               ↑                        ↑
      Snapshot at t=10000      Snapshot at t=50000

To rebuild to current state:
  ✗ Without snapshots: replay all N events (could be millions)
  ✓ With snapshots: load snapshot at 50000 + replay only N−50000 events
```

Snapshots work well, but they add operational complexity: you need to store them, version them, and invalidate them when the projection code changes. If you change the projection code, a snapshot built with the old code is now wrong — you have to discard it and replay from the beginning of the log (or from the most recent snapshot, if any, that the change doesn't invalidate).
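The rebuild logic itself is simple — the operational burden is in managing the checkpoints, not the code. A sketch, where the snapshot is a `(state, next_event_index)` pair and `apply` just sums deltas for illustration:

```python
# Sketch of rebuilding from a snapshot: load the checkpoint, then replay
# only the events recorded after it.
def apply(state, event):
    return state + event  # stand-in for the real pure fold function

def rebuild(events, snapshot=None):
    state, start = (0, 0) if snapshot is None else snapshot
    for event in events[start:]:
        state = apply(state, event)
    return state

events = list(range(1, 101))         # 100 events
snapshot = (sum(range(1, 91)), 90)   # checkpoint after the first 90 events

# Same final state either way; the snapshot path replays only 10 events.
assert rebuild(events) == rebuild(events, snapshot)
```

In a real system the snapshot would also record which version of the projection code produced it, so a deploy can detect and discard stale checkpoints automatically.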

Running Old and New Projections in Parallel

When you change projection code, you can't just update the running projection. You need to build the new projection alongside the old one, verify it's correct, and then atomically cut over reads from old to new. This is a deployment challenge that most teams don't plan for until they're doing it for the first time.

Operational Reality

Projection rebuilds in production are stressful. If your event log is in Kafka, rebuilding a large projection hammers the Kafka cluster and can affect the latency of live event processing. You need to rebuild in a separate consumer group, with rate limiting, tested carefully. This is unglamorous work that takes days to get right.

Event Schema Evolution — The Hardest Part

Here's the part that bites almost every team that adopts event sourcing: you cannot change events that already happened.

If you shipped an OrderPlaced event three years ago with a certain schema, those events exist in your log. They are immutable. Your projection code has to handle them. Forever.

When your requirements change — and they always do — you have a few options, none of them free:

Strategies for Evolving Event Schemas

Upcasting. When reading an old event, transform it to the new schema on the way into your projection. The raw event is unchanged; you add a translation layer that knows how to convert old formats to new ones. This works but every old event format adds permanent code complexity.

```python
def upcast(event):
    if event.type == "OrderPlaced" and event.version == 1:
        # v1 didn't have a currency field; default to USD
        event.data["currency"] = "USD"
        event.version = 2
    return event
```

Versioning events. When the schema changes incompatibly, create a new event type. Instead of changing OrderPlaced, introduce OrderPlacedV2. Your projection handles both. Over time you accumulate many versions. This is technically clean but verbose.
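With versioned event types, the projection carries one branch per version. A sketch (event and field names are illustrative), where `OrderPlacedV2` made currency explicit and the V1 branch supplies the historical default:

```python
# Sketch of a projection handling two versions of the same logical event.
def on_event(read_model, event):
    if event["type"] == "OrderPlaced":        # v1: no currency field existed
        read_model[event["order_id"]] = {"total": event["total"],
                                         "currency": "USD"}
    elif event["type"] == "OrderPlacedV2":    # v2: currency is explicit
        read_model[event["order_id"]] = {"total": event["total"],
                                         "currency": event["currency"]}
    return read_model

rm = {}
on_event(rm, {"type": "OrderPlaced", "order_id": "a", "total": 10.0})
on_event(rm, {"type": "OrderPlacedV2", "order_id": "b",
              "total": 12.0, "currency": "EUR"})
```

Every projection that cares about order placement needs both branches, forever — which is exactly the verbosity the text warns about.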

Copy-transform. Write a migration that reads the old event log, transforms the events into the new schema, and writes them to a new log. Then cut over to the new log. Expensive, operationally risky, but produces a clean log. Only practical for breaking changes where you want to start fresh.

Strategy                   Complexity                          Best For                                Watch Out For
Upcasting                  Low upfront, accumulates over time  Additive changes (new optional fields)  Upcast chains get deep over years
Event versioning           Medium                              Significant schema changes              Many version cases in every handler
Copy-transform migration   High                                Breaking changes, full schema rewrites  Risky cutover, expensive storage

The lesson is not that schema evolution is impossible — it's that you need to treat your event schema as a permanent public API, not an internal implementation detail. Before you publish an event, ask: "Would I be comfortable maintaining backward compatibility with this schema for the next five years?" Because that's what you're committing to.

The Dark Side: When Event Sourcing Hurts

Event sourcing has genuine costs that get glossed over in enthusiastic blog posts. Let's be direct about them.

Simple Queries Become Complex

In a regular database: "give me all orders over $100 that are in 'pending' status" is one SQL query.

In an event-sourced system: you need a projection that has already computed this view. If you don't have one, you either build a new projection (wait for backfill) or scan the event log (slow and expensive). The simplicity of ad-hoc queries over relational data is gone.

You Can't Delete Data

"Delete all data for this user" — a routine GDPR request in a traditional system — becomes an architectural crisis in an event-sourced system. The user's data is baked into the event log, which is immutable.

Workarounds exist: crypto-shredding (encrypt user data with a per-user key, then delete the key), separate PII storage referenced by ID in events, explicit erasure events that projections interpret as "forget this data." None of these are simple and all require planning from day one.
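The crypto-shredding mechanic is worth seeing in miniature. This sketch uses a toy SHA-256 keystream XOR purely for illustration — NOT a real cipher; a production system would use an authenticated cipher such as AES-GCM. The point is what deleting the key does:

```python
# Toy crypto-shredding sketch. Delete the per-user key, and every event
# containing that user's encrypted PII becomes permanently unreadable,
# even though the log itself is never modified.
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy symmetric "cipher": XOR against a key-derived stream.
    stream = hashlib.sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(a ^ b for a, b in zip(data, stream))

keys = {"alice": secrets.token_bytes(32)}           # per-user key store
event = {"type": "AccountOpened",
         "pii": keystream_xor(keys["alice"], b"alice@example.com")}

# Normal operation: decrypt the PII with the user's key.
print(keystream_xor(keys["alice"], event["pii"]))   # b'alice@example.com'

# GDPR erasure: delete the key. The event stays in the immutable log,
# but its PII field is now irrecoverable ciphertext.
del keys["alice"]
```

Notice that erasure never touches the log — it only touches the key store, which is a separate, mutable, deletable component you must design for from day one.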

GDPR and Event Sourcing

If your system handles personal data covered by GDPR or similar regulations, design your erasure strategy before you write your first event. Retrofitting it later is extremely painful. The most common mistake is storing PII directly in events — encrypt it or reference it by ID from a separately deletable store.

The Learning Curve Is Steeper Than It Looks

Most developers are comfortable with CRUD. Event sourcing requires thinking in terms of commands, events, aggregates, projections, and sagas. These are learnable concepts, but there's a real ramp-up period. A team adopting event sourcing for the first time will be slower for the first few months, not faster.

The Tooling Gap

Relational databases have 40 years of tooling: ORMs, query builders, migration tools, GUI clients, backup utilities, monitoring integrations. Event stores are newer. EventStoreDB, Axon Server, and Kafka-based approaches all work, but the ecosystem is thinner. You'll encounter rough edges.

When to Use It, When Not To

Event sourcing is a powerful tool for the right problem. It is not a default choice or an architectural upgrade. Here's a practical guide.

Use Event Sourcing When:

Don't Use Event Sourcing When:

A Middle Path

You don't have to go all-in on event sourcing. Many systems benefit from a hybrid: store mutable state in a relational database AND publish domain events to an event bus for integration. You get event-driven integration without the full complexity of event sourcing. This is often the right starting point.

Practical Implementation Notes

Aggregate Boundaries

In event sourcing, an aggregate is the unit of consistency. All events for one aggregate (e.g., one order, one account) are stored together and processed in sequence. Events across different aggregates can be processed in parallel.

Choosing aggregate boundaries is one of the most consequential design decisions. Too large: your aggregates become god objects that hold too much state and contend on writes. Too small: you find yourself needing transactions across aggregates, which is hard to do correctly in an event-sourced system.

A useful rule: an aggregate should enforce all the business invariants that need to be true at the same time. If "an order total must equal the sum of its line items" is a rule, then order and its line items should be in the same aggregate. If "an order total must be less than the customer's credit limit" requires reading the customer's state, you may need to handle that differently — possibly accepting temporary inconsistency and compensating.
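An in-aggregate invariant check looks like this in miniature. The command handler validates the rule before any event is emitted (all names are illustrative, and amounts are in cents):

```python
# Sketch of a command handler enforcing an aggregate invariant: the order
# total must equal the sum of its line items, so both live in one
# aggregate and are checked before the event exists.
def handle_place_order(command):
    items = command["items"]
    total = sum(i["price"] * i["qty"] for i in items)
    if total != command["total"]:
        raise ValueError("order total must equal the sum of line items")
    # Invariant holds: produce the event (appended to the log elsewhere).
    return {"type": "OrderPlaced", "items": items, "total": total}

event = handle_place_order({
    "items": [{"price": 4000, "qty": 2}, {"price": 999, "qty": 1}],
    "total": 8999,  # cents
})
print(event["type"])  # OrderPlaced
```

The credit-limit rule from the paragraph above could not live here, because checking it would require reading another aggregate's state — which is precisely why it needs a different (often compensating) design.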

Choosing an Event Store

You have several options for where to store the event log:

  • A dedicated event store (EventStoreDB, Axon Server) with replay and subscriptions built in
  • A Kafka topic (or one per stream type), if you're already operating Kafka
  • A plain table in a relational database — the simplest option, and a fine starting point

```sql
-- Simple events table in PostgreSQL
CREATE TABLE events (
    id          BIGSERIAL PRIMARY KEY,
    stream_id   UUID NOT NULL,        -- aggregate ID
    stream_type TEXT NOT NULL,        -- e.g. 'Order', 'Account'
    version     INT NOT NULL,         -- sequence within stream
    event_type  TEXT NOT NULL,
    data        JSONB NOT NULL,
    metadata    JSONB NOT NULL DEFAULT '{}',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (stream_id, version)       -- optimistic concurrency control
);

CREATE INDEX idx_events_stream ON events (stream_id, version);
```

The UNIQUE (stream_id, version) constraint is critical. It gives you optimistic concurrency control: if two processes try to append version 5 to the same stream simultaneously, one will get a unique constraint violation. The losing process re-reads the current state and retries. This prevents lost updates without requiring locking.
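The append-side logic can be sketched without a database. Here an in-memory set stands in for the `UNIQUE (stream_id, version)` index; a real implementation would catch the database's unique-violation error instead of checking membership (all names are illustrative):

```python
# Sketch of optimistic concurrency on top of the unique constraint.
class ConcurrencyConflict(Exception):
    pass

store = set()  # holds (stream_id, version) pairs, like the UNIQUE index

def append_event(stream_id, expected_version, event):
    """Append at expected_version + 1; fail if another writer got there first."""
    key = (stream_id, expected_version + 1)
    if key in store:                  # the unique index would reject this
        raise ConcurrencyConflict(key)
    store.add(key)
    return expected_version + 1

v = append_event("order-1", 0, {"type": "OrderPlaced"})  # writes version 1
try:
    append_event("order-1", 0, {"type": "ItemAdded"})    # stale writer loses
except ConcurrencyConflict:
    print("conflict: re-read current state, re-validate, and retry")
```

The losing writer's retry loop is where business rules get re-checked against the new state — which is why command handlers must be cheap to re-run.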

The Key Principle of This Chapter

State is a snapshot; events are the truth. If you store what happened rather than just what is, you gain history, debuggability, and flexibility — but you trade simplicity and must treat your event schema as a permanent contract.

The Most Common Mistake

Treating event sourcing as an architectural upgrade you can apply to any system to make it better. It's not. Applied to a CRUD application without a genuine need for history or event-driven integration, event sourcing adds weeks of complexity and an ongoing operational burden with no offsetting benefit. Match the tool to the problem.

Three Questions for Your Next Design Review

  1. Do we have genuine requirements for audit trails, temporal queries, or event-driven integration — or are we just attracted to the pattern because it sounds sophisticated?
  2. Have we designed our event schemas as permanent contracts, and do we have a strategy for schema evolution before the first event is written?
  3. How will we handle a GDPR deletion request for a user whose data is embedded in 50,000 historical events?