Part IV Chapter 17 of 38

Event Sourcing and the Immutable Log

Most systems store the current state of the world. Event sourcing stores what happened to produce that state. That small shift in thinking changes almost everything about how you build, debug, and evolve a system — and it introduces problems that are surprisingly hard to solve.

Part IV Consistency & Correctness

What's in this chapter

  • Why storing state is the wrong default — and what storing events gives you instead
  • The append-only log as the foundation for distributed systems (the Kafka worldview)
  • Event sourcing in depth: how it works, where it breaks down
  • CQRS: why the read model and write model should be separate
  • The projections problem: rebuilding read models from a long event log
  • The dark side: when event sourcing makes things much harder
  • When to use it and — equally important — when not to

Key Learnings — If You Only Read This Section

Events are facts, state is a snapshot. An event says "this happened." State says "this is true right now." Events are immutable; state is derived. If you have the events, you can always reconstruct the state. If you only have the state, the history is gone forever.

The append-only log is the most durable data structure. It's a single, ordered sequence of facts. No updates, no deletes. Kafka, database WALs, and git commits are all the same idea.

Event sourcing gives you a time machine and an audit log for free. Because you keep every event, you can replay history, reproduce bugs, and ask questions about the past that you didn't think to ask when you built the system.

CQRS separates reading from writing. You write events to an append-only log. You read from a separate projection (a view built from those events). This sounds like extra work — and it is — but it lets you optimize reads and writes completely independently.

Rebuilding a projection from millions of events is painfully slow without snapshots. Snapshots are periodic checkpoints of derived state so you don't replay the full history every time. They add operational complexity.

Event schemas are the hardest thing to change in an event-sourced system. You can't update old events. If your event structure was wrong, you're stuck with it forever — or you build a migration path that's more complex than the original system.

Most applications don't need event sourcing. A CRUD app with a PostgreSQL database is fine. Event sourcing is valuable when audit trails, temporal queries, event-driven integration, or complex domain logic genuinely justify the overhead.

The Problem with Storing State

Imagine you're building an online bank. A user's account has a balance. The simplest thing is to store that balance directly — one row in a table, one column called balance. When money comes in, you update it. When money goes out, you update it again.

Now your customer calls and says: "I think there was an unauthorized charge on my account last Tuesday." What do you do? You look at the current balance. But that tells you nothing about last Tuesday. You've overwritten the past. It's gone.

This is the fundamental problem with mutable state. Every time you update a record, you destroy the information about what it was before. The current value is all you have.

Banks have always known this. Their ledger is not a single row that gets updated. It's a list of transactions — an append-only record of every credit and debit. The balance is not stored; it is computed by summing the transactions. The transactions are the truth. The balance is a derived view of that truth.
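The ledger model fits in a few lines of code. Here's a minimal sketch (the `Transaction` type and field names are illustrative, not from any real banking system) showing the balance as a fold over the transaction list rather than a stored value:

```python
# A minimal ledger sketch: transactions are appended, never updated.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a recorded transaction is immutable
class Transaction:
    kind: str     # "credit" or "debit"
    amount: int   # cents, to avoid float rounding

def balance(transactions):
    """The balance is not stored; it is derived from the transactions."""
    total = 0
    for tx in transactions:
        total += tx.amount if tx.kind == "credit" else -tx.amount
    return total

ledger = [
    Transaction("credit", 50_000),  # $500 deposit
    Transaction("debit", 5_000),    # $50 withdrawal
]
print(balance(ledger))  # 45000 cents = $450
```

Note that deleting or editing a `Transaction` is impossible by construction; correcting a mistake means appending a compensating transaction, exactly as accountants do.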

Core Idea

The traditional approach asks: what is the current state? Event sourcing asks: what happened, in order? State is just the current answer to replaying all the events from the beginning.

This idea — recording a sequence of events instead of updating state in place — is the foundation of event sourcing. And once you see it, you'll notice it everywhere: your database's write-ahead log, git commits, accounting ledgers, even the undo history in a text editor.

The Append-Only Log

Before we talk about event sourcing as an application pattern, let's talk about the data structure underneath it: the append-only log.

A log is the simplest possible data structure. You can only do one thing to it: add a new entry at the end. You cannot update an existing entry. You cannot delete one. Entries are ordered and numbered. That's it.

```
Append-Only Log

Offset  Timestamp            Event
──────────────────────────────────────────────────────────────────
0       2024-01-10 09:01:22  AccountOpened   { id: "acc-1", owner: "alice" }
1       2024-01-10 09:03:11  MoneyDeposited  { amount: 500, currency: "USD" }
2       2024-01-11 14:22:08  MoneyWithdrawn  { amount: 50, currency: "USD" }
3       2024-01-12 10:05:44  MoneyDeposited  { amount: 200, currency: "USD" }
4       2024-01-13 16:41:30  MoneyWithdrawn  { amount: 120, currency: "USD" }
──────────────────────────────────────────────────────────────────
        ↑ entries are immutable; new entries are appended at the end

Current balance = 500 − 50 + 200 − 120 = $530
```

This structure is deceptively powerful. Because entries are immutable and ordered, the log has a property that most data structures don't: it is the source of truth, not a reflection of it. Every other view of the data — a balance, a dashboard, a search index — is a derived view of the log.
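A toy in-memory version makes the constraints concrete. This sketch (names are illustrative) exposes only two operations — append and read-forward — mirroring the offset model Kafka uses:

```python
# Toy in-memory append-only log. The only mutation allowed is append();
# entries get a monotonically increasing offset.
class AppendOnlyLog:
    def __init__(self):
        self._entries = []

    def append(self, event) -> int:
        """Add an event at the end; returns its offset."""
        self._entries.append(event)
        return len(self._entries) - 1

    def read_from(self, offset: int):
        """Consumers read forward from an offset; nothing is ever removed."""
        return self._entries[offset:]

log = AppendOnlyLog()
log.append({"type": "MoneyDeposited", "amount": 500})
log.append({"type": "MoneyWithdrawn", "amount": 50})
print(log.read_from(0))
```

There is deliberately no `update()` or `delete()` method: any derived view (a balance, an index) is computed by some consumer calling `read_from` and folding over what it gets back.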

Why Logs Show Up Everywhere

If you look closely, logs are underneath almost every reliable system:

  • Database write-ahead logs (WALs): every change is appended to the log before the tables are touched, so the tables can always be rebuilt from it
  • Git: the commit history is an append-only record; branches and tags are just pointers into it
  • Kafka: topics are partitioned append-only logs that consumers read at their own pace
  • Accounting ledgers: transactions are appended; the balance is derived
  • Undo history in a text editor: a log of edits, replayed backwards

The pattern is the same every time: the log records what happened; everything else is computed from the log. This is not a coincidence. It reflects something true about how reliable systems should be built.

The Key Insight

Jay Kreps (one of Kafka's creators) wrote a now-famous post called "The Log: What every software engineer should know about real-time data's unifying abstraction." His core observation: the log is the canonical record of events in a distributed system. Everything else — databases, caches, search indexes — is a materialized view of the log.

Event Sourcing

Event sourcing takes the log idea and applies it to your application's domain. Instead of storing the current state of your domain objects, you store the sequence of events that led to that state.

Let's make this concrete with an e-commerce order.

Traditional vs. Event-Sourced

In a traditional system, you have an orders table. The row for order #1234 might look like:

```sql
-- Traditional: current state only
SELECT * FROM orders WHERE id = '1234';

   id   | status    | total | updated_at
 '1234' | 'shipped' | 89.99 | '2024-01-15 10:30:00'
```

You can see that the order was shipped. You cannot see that it was placed, then modified, then paid for, then dispatched. All that history is gone — unless you explicitly built an audit log separately, which most teams don't.

In an event-sourced system, you store the events:

```sql
-- Event sourced: full history
SELECT * FROM order_events WHERE order_id = '1234' ORDER BY seq;

 seq | event_type         | data
  1  | OrderPlaced        | { items: [...], total: 89.99 }
  2  | ItemRemoved        | { item_id: "X", new_total: 79.99 }
  3  | ItemAdded          | { item_id: "Y", new_total: 89.99 }
  4  | PaymentReceived    | { amount: 89.99, method: "card" }
  5  | ShipmentDispatched | { tracking: "UPS-XYZ" }
```

To get the current state, you replay these events through a function that accumulates them:

```python
from functools import reduce

def apply(state, event):
    if event.type == "OrderPlaced":
        return Order(id=event.order_id, status="pending", total=event.data.total)
    elif event.type == "PaymentReceived":
        return state.with_status("paid")
    elif event.type == "ShipmentDispatched":
        return state.with_status("shipped").with_tracking(event.data.tracking)
    # ... other cases

current_state = reduce(apply, events, initial_state)
```

The apply function is pure — given the same sequence of events, it always produces the same state. This is important, as we'll see.

What Event Sourcing Gives You

The benefits are real and they're significant in the right context.

Complete audit trail. Because you keep every event, you have a complete, immutable record of everything that ever happened to every entity. This is not a secondary audit table bolted on — it's the primary data. Regulators love this. Security teams love this.

Temporal queries. You can ask "what did this order look like at 2pm on Tuesday?" by replaying events up to that timestamp. With mutable state, this question is unanswerable.
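Replaying up to a timestamp is just a filtered fold. A sketch, with events represented as simple `(timestamp, delta)` tuples for illustration:

```python
# Sketch of a temporal query: replay only the events at or before a cutoff.
from functools import reduce

def state_at(events, cutoff):
    """Rebuild state as of `cutoff` by replaying the event prefix."""
    past = [e for e in events if e[0] <= cutoff]
    return reduce(lambda bal, e: bal + e[1], past, 0)

events = [(1, +500), (2, -50), (5, +200)]  # (timestamp, delta)
print(state_at(events, 2))  # balance as of t=2: 450
```

The same mechanism answers "what was the state just before event N?" — invaluable when reproducing a bug that only manifested partway through a sequence.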

Debugging is fundamentally different. When a bug is reported, you can replay the exact sequence of events that led to it. You're not trying to reconstruct what happened from log messages and intuition — you have the actual events. You can run them through a corrected version of your code and see what the correct output would have been.

New read models from old data. If you decide 18 months into your product that you want a new dashboard or analytics view, you can build a new projection by replaying all your historical events through new code. You're not limited to the views you thought to build when you started.

Integration via events. Other services can subscribe to your event stream and build their own views. They don't need to query your database. They listen to what happened.

CQRS: Separating Reads from Writes

Event sourcing almost always comes packaged with another pattern called CQRS — Command Query Responsibility Segregation. The name is intimidating but the idea is simple: the model you use to change data doesn't have to be the same model you use to read data.

In a traditional system, you read from and write to the same table. This means the schema has to serve both purposes. Sometimes that's fine. But often the queries you want to run don't match the shape of the data you're writing.

With event sourcing, the write side is simple: validate a command, produce one or more events, append them to the log. The read side is a separate concern: take the events and build a projection — a view of the data optimized for querying.

```
CQRS + Event Sourcing — Data Flow

        Command (PlaceOrder, PayOrder)
                    │
                    ▼
          ┌─────────────────┐
          │ Command Handler │ ← validates, applies business rules
          └────────┬────────┘
                   │ produces
                   ▼
        ┌──────────────────────────┐
        │ Event Log (Kafka /       │
        │ EventStore / Postgres)   │
        └────────────┬─────────────┘
                     │
      ┌──────────────┼──────────────┐
      ▼              ▼              ▼
 Projection A   Projection B   Projection C
 (Order status  (Search index  (Analytics
  for users)     for ops)       dashboard)
      │              │              │
      ▼              ▼              ▼
   Read DB     Elasticsearch  Data Warehouse

Write side: simple, event-producing
Read side: multiple, independently optimized projections
```

Each projection is a consumer that reads the event log and builds its own read model — a database table, a search index, a cache, whatever the query pattern requires. If you need a new query, you add a new projection. The write side doesn't change.
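A projection is just a fold from events into a queryable shape. In this sketch the read model is a dict keyed by order ID (in production it would be a database table); event names follow the order example above, and the function names are illustrative:

```python
# Sketch of a projection: a consumer folds events into a read model.
def project(read_model, event):
    order_id = event["order_id"]
    if event["type"] == "OrderPlaced":
        read_model[order_id] = {"status": "pending", "total": event["total"]}
    elif event["type"] == "PaymentReceived":
        read_model[order_id]["status"] = "paid"
    elif event["type"] == "ShipmentDispatched":
        read_model[order_id]["status"] = "shipped"
    return read_model

read_model = {}
for e in [
    {"type": "OrderPlaced", "order_id": "1234", "total": 89.99},
    {"type": "PaymentReceived", "order_id": "1234"},
]:
    project(read_model, e)
print(read_model["1234"]["status"])  # "paid"
```

A second projection for search or analytics would be a different fold over the same events — the write side never knows it exists.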

Why This Is Powerful

The read model and write model can use completely different databases. Your writes go into an event log. One projection builds a relational DB for transactional queries. Another builds an Elasticsearch index for full-text search. Another feeds a data warehouse for analytics. They all come from the same events.

The Consistency Trade-off in CQRS

There's a cost. Because projections are built asynchronously from the event log, they are eventually consistent. If a user places an order and immediately asks "what is my order status?", the projection might not have processed the OrderPlaced event yet.

This is a real problem for user-facing features, and many teams underestimate it. There are workarounds — reading directly from the event log for the most recent state, using a version number to detect stale reads — but they add complexity. The simplicity of "read the same thing you just wrote" is gone.
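The version-number workaround can be sketched in a few lines. After a write, the client remembers the stream version it produced; the read side rejects (or retries) any projection row that hasn't caught up yet. All names here are illustrative:

```python
# Sketch of version-based stale-read detection.
class StaleRead(Exception):
    pass

def read_order(projection, order_id, min_version):
    """Return the row only if the projection has caught up to min_version."""
    row = projection.get(order_id)
    if row is None or row["version"] < min_version:
        raise StaleRead(f"projection behind version {min_version}")
    return row

projection = {"1234": {"status": "pending", "version": 1}}
try:
    read_order(projection, "1234", min_version=2)  # writer saw version 2
except StaleRead:
    print("retry, or fall back to reading the event log directly")
```

The catch is that now every caller has to decide what "too stale" means and what to do about it — complexity that simply doesn't exist in a read-your-writes database.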

The Projections Problem

This is the part that most introductory articles about event sourcing skip. It's where the model gets hard.

A projection is built by reading the event log from the beginning and applying each event to build up the read model. If you have 100 events, this is instant. If you have 100 million events, this takes a long time. As your system ages, rebuilding projections becomes slower and slower.

Why You Need to Rebuild Projections

You will need to rebuild projections more often than you expect. Common reasons:

  • A bug in projection code: the read model is wrong and must be recomputed from the events
  • A new read model: a dashboard or query you didn't anticipate, backfilled from history
  • A change in projection logic: new fields, new aggregations, a restructured view
  • Disaster recovery: the read database is lost or corrupted, and the log is the only source of truth

When your event log has accumulated years of history, "replay from the beginning" is not a fast operation. You might be looking at hours or days of rebuild time.

Snapshots: The Mitigation

The standard solution is snapshotting. Periodically, you save the current state of the projection as a checkpoint. When you need to rebuild, you start from the most recent snapshot rather than from the very beginning of the log.

```
Snapshots Reduce Replay Time

Event log:
│e1│e2│e3│...│e10000│e10001│e10002│...│e50000│e50001│e50002│...│eN│
               ↑                        ↑
      Snapshot at t=10000      Snapshot at t=50000

To rebuild to current state:
  ✗ Without snapshots: replay all N events (could be millions)
  ✓ With snapshots: load snapshot at 50000 + replay only N−50000 events
```

Snapshots work well, but they add operational complexity: you need to store them, version them, and invalidate them when the projection code changes. If you change the projection code, a snapshot built with the old code is now wrong — you have to discard it and replay from the beginning of the log (or from the most recent snapshot, if any, that the change doesn't invalidate).
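The rebuild logic itself is simple — the operational burden is in managing the checkpoints, not the code. A sketch, where the snapshot is a `(state, next_event_index)` pair and `apply` just sums deltas for illustration:

```python
# Sketch of rebuilding from a snapshot: load the checkpoint, then replay
# only the events recorded after it.
def apply(state, event):
    return state + event  # stand-in for the real pure fold function

def rebuild(events, snapshot=None):
    state, start = (0, 0) if snapshot is None else snapshot
    for event in events[start:]:
        state = apply(state, event)
    return state

events = list(range(1, 101))         # 100 events
snapshot = (sum(range(1, 91)), 90)   # checkpoint after the first 90 events

# Same final state either way; the snapshot path replays only 10 events.
assert rebuild(events) == rebuild(events, snapshot)
```

In a real system the snapshot would also record which version of the projection code produced it, so a deploy can detect and discard stale checkpoints automatically.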

Running Old and New Projections in Parallel

When you change projection code, you can't just update the running projection. You need to build the new projection alongside the old one, verify it's correct, and then atomically cut over reads from old to new. This is a deployment challenge that most teams don't plan for until they're doing it for the first time.

Operational Reality

Projection rebuilds in production are stressful. If your event log is in Kafka, rebuilding a large projection hammers the Kafka cluster and can affect the latency of live event processing. You need to rebuild in a separate consumer group, with rate limiting, tested carefully. This is unglamorous work that takes days to get right.

Event Schema Evolution — The Hardest Part

Here's the part that bites almost every team that adopts event sourcing: you cannot change events that already happened.

If you shipped an OrderPlaced event three years ago with a certain schema, those events exist in your log. They are immutable. Your projection code has to handle them. Forever.

When your requirements change — and they always do — you have a few options, none of them free:

Strategies for Evolving Event Schemas

Upcasting. When reading an old event, transform it to the new schema on the way into your projection. The raw event is unchanged; you add a translation layer that knows how to convert old formats to new ones. This works but every old event format adds permanent code complexity.

```python
def upcast(event):
    if event.type == "OrderPlaced" and event.version == 1:
        # v1 didn't have a currency field; default to USD
        event.data["currency"] = "USD"
        event.version = 2
    return event
```

Versioning events. When the schema changes incompatibly, create a new event type. Instead of changing OrderPlaced, introduce OrderPlacedV2. Your projection handles both. Over time you accumulate many versions. This is technically clean but verbose.
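With versioned event types, the projection carries one branch per version. A sketch (event and field names are illustrative), where `OrderPlacedV2` made currency explicit and the V1 branch supplies the historical default:

```python
# Sketch of a projection handling two versions of the same logical event.
def on_event(read_model, event):
    if event["type"] == "OrderPlaced":        # v1: no currency field existed
        read_model[event["order_id"]] = {"total": event["total"],
                                         "currency": "USD"}
    elif event["type"] == "OrderPlacedV2":    # v2: currency is explicit
        read_model[event["order_id"]] = {"total": event["total"],
                                         "currency": event["currency"]}
    return read_model

rm = {}
on_event(rm, {"type": "OrderPlaced", "order_id": "a", "total": 10.0})
on_event(rm, {"type": "OrderPlacedV2", "order_id": "b",
              "total": 12.0, "currency": "EUR"})
```

Every projection that cares about order placement needs both branches, forever — which is exactly the verbosity the text warns about.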

Copy-transform. Write a migration that reads the old event log, transforms the events into the new schema, and writes them to a new log. Then cut over to the new log. Expensive, operationally risky, but produces a clean log. Only practical for breaking changes where you want to start fresh.

Strategy                   Complexity                          Best For                                Watch Out For
Upcasting                  Low upfront, accumulates over time  Additive changes (new optional fields)  Upcast chains get deep over years
Event versioning           Medium                              Significant schema changes              Many version cases in every handler
Copy-transform migration   High                                Breaking changes, full schema rewrites  Risky cutover, expensive storage

The lesson is not that schema evolution is impossible — it's that you need to treat your event schema as a permanent public API, not an internal implementation detail. Before you publish an event, ask: "Would I be comfortable maintaining backward compatibility with this schema for the next five years?" Because that's what you're committing to.

The Dark Side: When Event Sourcing Hurts

Event sourcing has genuine costs that get glossed over in enthusiastic blog posts. Let's be direct about them.

Simple Queries Become Complex

In a regular database: "give me all orders over $100 that are in 'pending' status" is one SQL query.

In an event-sourced system: you need a projection that has already computed this view. If you don't have one, you either build a new projection (wait for backfill) or scan the event log (slow and expensive). The simplicity of ad-hoc queries over relational data is gone.

You Can't Delete Data

"Delete all data for this user" — a routine GDPR request in a traditional system — becomes an architectural crisis in an event-sourced system. The user's data is baked into the event log, which is immutable.

Workarounds exist: crypto-shredding (encrypt user data with a per-user key, then delete the key), separate PII storage referenced by ID in events, explicit erasure events that projections interpret as "forget this data." None of these are simple and all require planning from day one.
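The crypto-shredding mechanic is worth seeing in miniature. This sketch uses a toy SHA-256 keystream XOR purely for illustration — NOT a real cipher; a production system would use an authenticated cipher such as AES-GCM. The point is what deleting the key does:

```python
# Toy crypto-shredding sketch. Delete the per-user key, and every event
# containing that user's encrypted PII becomes permanently unreadable,
# even though the log itself is never modified.
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy symmetric "cipher": XOR against a key-derived stream.
    stream = hashlib.sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(a ^ b for a, b in zip(data, stream))

keys = {"alice": secrets.token_bytes(32)}           # per-user key store
event = {"type": "AccountOpened",
         "pii": keystream_xor(keys["alice"], b"alice@example.com")}

# Normal operation: decrypt the PII with the user's key.
print(keystream_xor(keys["alice"], event["pii"]))   # b'alice@example.com'

# GDPR erasure: delete the key. The event stays in the immutable log,
# but its PII field is now irrecoverable ciphertext.
del keys["alice"]
```

Notice that erasure never touches the log — it only touches the key store, which is a separate, mutable, deletable component you must design for from day one.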

GDPR and Event Sourcing

If your system handles personal data covered by GDPR or similar regulations, design your erasure strategy before you write your first event. Retrofitting it later is extremely painful. The most common mistake is storing PII directly in events — encrypt it or reference it by ID from a separately deletable store.

The Learning Curve Is Steeper Than It Looks

Most developers are comfortable with CRUD. Event sourcing requires thinking in terms of commands, events, aggregates, projections, and sagas. These are learnable concepts, but there's a real ramp-up period. A team adopting event sourcing for the first time will be slower for the first few months, not faster.

The Tooling Gap

Relational databases have 40 years of tooling: ORMs, query builders, migration tools, GUI clients, backup utilities, monitoring integrations. Event stores are newer. EventStoreDB, Axon Server, and Kafka-based approaches all work, but the ecosystem is thinner. You'll encounter rough edges.

When to Use It, When Not To

Event sourcing is a powerful tool for the right problem. It is not a default choice or an architectural upgrade. Here's a practical guide.

Use Event Sourcing When:

Don't Use Event Sourcing When:

A Middle Path

You don't have to go all-in on event sourcing. Many systems benefit from a hybrid: store mutable state in a relational database AND publish domain events to an event bus for integration. You get event-driven integration without the full complexity of event sourcing. This is often the right starting point.

Practical Implementation Notes

Aggregate Boundaries

In event sourcing, an aggregate is the unit of consistency. All events for one aggregate (e.g., one order, one account) are stored together and processed in sequence. Events across different aggregates can be processed in parallel.

Choosing aggregate boundaries is one of the most consequential design decisions. Too large: your aggregates become god objects that hold too much state and contend on writes. Too small: you find yourself needing transactions across aggregates, which is hard to do correctly in an event-sourced system.

A useful rule: an aggregate should enforce all the business invariants that need to be true at the same time. If "an order total must equal the sum of its line items" is a rule, then order and its line items should be in the same aggregate. If "an order total must be less than the customer's credit limit" requires reading the customer's state, you may need to handle that differently — possibly accepting temporary inconsistency and compensating.
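An in-aggregate invariant check looks like this in miniature. The command handler validates the rule before any event is emitted (all names are illustrative, and amounts are in cents):

```python
# Sketch of a command handler enforcing an aggregate invariant: the order
# total must equal the sum of its line items, so both live in one
# aggregate and are checked before the event exists.
def handle_place_order(command):
    items = command["items"]
    total = sum(i["price"] * i["qty"] for i in items)
    if total != command["total"]:
        raise ValueError("order total must equal the sum of line items")
    # Invariant holds: produce the event (appended to the log elsewhere).
    return {"type": "OrderPlaced", "items": items, "total": total}

event = handle_place_order({
    "items": [{"price": 4000, "qty": 2}, {"price": 999, "qty": 1}],
    "total": 8999,  # cents
})
print(event["type"])  # OrderPlaced
```

The credit-limit rule from the paragraph above could not live here, because checking it would require reading another aggregate's state — which is precisely why it needs a different (often compensating) design.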

Choosing an Event Store

You have several options for where to store the event log:

  • A dedicated event store (EventStoreDB, Axon Server) with replay and subscriptions built in
  • A Kafka topic (or one per stream type), if you're already operating Kafka
  • A plain table in a relational database — the simplest option, and a fine starting point

```sql
-- Simple events table in PostgreSQL
CREATE TABLE events (
    id          BIGSERIAL PRIMARY KEY,
    stream_id   UUID NOT NULL,        -- aggregate ID
    stream_type TEXT NOT NULL,        -- e.g. 'Order', 'Account'
    version     INT NOT NULL,         -- sequence within stream
    event_type  TEXT NOT NULL,
    data        JSONB NOT NULL,
    metadata    JSONB NOT NULL DEFAULT '{}',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (stream_id, version)       -- optimistic concurrency control
);

CREATE INDEX idx_events_stream ON events (stream_id, version);
```

The UNIQUE (stream_id, version) constraint is critical. It gives you optimistic concurrency control: if two processes try to append version 5 to the same stream simultaneously, one will get a unique constraint violation. The losing process re-reads the current state and retries. This prevents lost updates without requiring locking.
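The append-side logic can be sketched without a database. Here an in-memory set stands in for the `UNIQUE (stream_id, version)` index; a real implementation would catch the database's unique-violation error instead of checking membership (all names are illustrative):

```python
# Sketch of optimistic concurrency on top of the unique constraint.
class ConcurrencyConflict(Exception):
    pass

store = set()  # holds (stream_id, version) pairs, like the UNIQUE index

def append_event(stream_id, expected_version, event):
    """Append at expected_version + 1; fail if another writer got there first."""
    key = (stream_id, expected_version + 1)
    if key in store:                  # the unique index would reject this
        raise ConcurrencyConflict(key)
    store.add(key)
    return expected_version + 1

v = append_event("order-1", 0, {"type": "OrderPlaced"})  # writes version 1
try:
    append_event("order-1", 0, {"type": "ItemAdded"})    # stale writer loses
except ConcurrencyConflict:
    print("conflict: re-read current state, re-validate, and retry")
```

The losing writer's retry loop is where business rules get re-checked against the new state — which is why command handlers must be cheap to re-run.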

The Key Principle of This Chapter

State is a snapshot; events are the truth. If you store what happened rather than just what is, you gain history, debuggability, and flexibility — but you trade simplicity and must treat your event schema as a permanent contract.

The Most Common Mistake

Treating event sourcing as an architectural upgrade you can apply to any system to make it better. It's not. Applied to a CRUD application without a genuine need for history or event-driven integration, event sourcing adds weeks of complexity and an ongoing operational burden with no offsetting benefit. Match the tool to the problem.

Three Questions for Your Next Design Review

  1. Do we have genuine requirements for audit trails, temporal queries, or event-driven integration — or are we just attracted to the pattern because it sounds sophisticated?
  2. Have we designed our event schemas as permanent contracts, and do we have a strategy for schema evolution before the first event is written?
  3. How will we handle a GDPR deletion request for a user whose data is embedded in 50,000 historical events?