Part V — Maintainability Chapter 19

Evolvability — Designing for Change You Can't Predict

You can deploy new code any time. You cannot undeploy old data.

What's in this chapter

Software changes constantly. New features get added, requirements shift, bugs get fixed. That's normal. The hard part is that your data doesn't change just because your code does.

In this chapter we look at how to write systems that can evolve over time without breaking. We start with the fundamental problem: old code and new code have to coexist and talk to each other. Then we work through the tools that solve this — encoding formats like Protocol Buffers and Avro, schema registries, API versioning strategies, database migration techniques, and the strangler fig pattern for replacing live systems without stopping the world.

By the end you'll have a clear mental model for what "backward compatible" and "forward compatible" really mean, and a practical toolkit for making changes safely.

Key learnings — the short version

Backward compatibility means new code can read old data. Forward compatibility means old code can read new data. You need both during a rolling deployment.
JSON is deceptively fragile for long-lived data — no schema enforcement, silent type coercions, and nothing to guide evolution.
Protocol Buffers use numbered field tags instead of names. As long as you never reuse a tag number, you can add and remove fields safely.
Avro has no field tags at all. It matches fields by name at read time using a writer's schema and a reader's schema — which means you must always store or transmit the schema alongside the data.
A schema registry solves the "where does the schema live?" problem. Producers register schemas, get back an ID, and embed only the ID in each message.
Every API version you publish is a support contract — plan version deprecation before you ship version one.
The expand-contract pattern is how you change a database schema without downtime: add first, migrate data, remove old column only when no code references it.
The strangler fig pattern is how you replace a whole system: route traffic to the new system incrementally, never stopping the old one until the new one handles everything.

The Core Problem: Your Data Outlives Your Code

Here is a situation every engineer faces eventually. You have a service running in production. It writes records to a database. The records look like this:

Current record format

{
  "user_id": 12345,
  "name": "Alice",
  "email": "alice@example.com"
}

Now you need to add a new field: phone_number. You update your code, test it, and deploy. But here's the thing: you still have millions of old records in the database that don't have the phone_number field. And your deployment is rolling — at any moment, half your servers are running old code and half are running new code.

This is the evolvability problem. It has three dimensions:

Old data, new code: Your new code reads a record that was written before phone_number existed. What does it do with the missing field?
New data, old code: Your old code reads a record that a new server just wrote, complete with phone_number. What does it do with the field it doesn't understand?
Old code, new API contract: A mobile client on version 3 hits your new version 4 API endpoint. Does it crash?

If you handle these three cases well, you can deploy continuously, roll back if needed, and change your data model without coordinated downtime. If you don't, you end up in a situation where every schema change requires a "big bang" deploy where all services update at exactly the same moment — which is stressful, risky, and gets worse as your system grows.

Worth noting

This problem is much older than microservices. Even a single application with a database faces it. The moment you store data on disk, you've created a contract between today's code and tomorrow's code.

Backward and Forward Compatibility

These two terms are used constantly but often confused. Let's define them precisely.

Backward compatibility means that newer code can read data written by older code. This is the direction most engineers think about first, and it's usually easier to achieve. When you add a new optional field, old records that don't have that field are still valid — new code just sees a null or a default.

Forward compatibility means that older code can read data written by newer code. This is the trickier direction. When new code writes a record with a new field, can old code read it without breaking? Usually the answer is "yes, if the old code just ignores fields it doesn't know about." But this requires that your serialization format actually supports ignoring unknown fields — and not all of them do.

Old Code ────────────────────────────────────────────────── writes {user_id, name, email} New Code ────────────────────────────────────────────────── writes {user_id, name, email, phone_number} Backward Compatibility (new code reads old data): New Code reads {user_id, name, email} → phone_number is missing → use default (null / "") ✓ Forward Compatibility (old code reads new data): Old Code reads {user_id, name, email, phone_number} → phone_number is unknown → ignore it ✓ → old code continues working ✓

Fig 19-1. Backward and forward compatibility during a rolling deployment

During a rolling deployment, you need both directions at the same time. New servers are writing new data that old servers might read. Old servers are writing old data that new servers will read. The system works correctly only if both directions hold.

What breaks compatibility?

Some changes are safe. Others are not. Here is a rough guide:

Change	Backward compat?	Forward compat?	Notes
Add an optional field	✓ Yes	✓ Yes (if old code ignores unknowns)	Safe in most cases
Remove an optional field	✓ Yes	✓ Yes (old code writes it, new code ignores it)	Watch for consumers that depend on it
Rename a field	✗ No	✗ No	Equivalent to remove + add. Never do this atomically.
Change a field's type	Maybe	Maybe	int32 → int64 is safe in some formats, not others
Add a required field	✗ No	✗ No	Old data won't have it. Required fields are nearly always a mistake.
Change a field's meaning	✗ No	✗ No	Even if the name and type stay the same, semantics matter

Common mistake

Renaming a field feels harmless in the code editor — it's just a string. But it's the most dangerous schema change you can make. To old readers, the old name is gone. To new readers, the new name is new. No reader can handle both unless you explicitly run both names in parallel for a transition period.

Encoding Formats Matter More Than You Think

The format you use to encode your data determines how much flexibility you have when things change. Let's look at the main options, starting with the most common and working toward the most rigorous.

JSON and XML — Flexible but Fragile

JSON is the default choice for most HTTP APIs and many data pipelines, and for good reason: it's human-readable, every language supports it, and you can open a message in a text editor. But JSON has several properties that make schema evolution harder than it looks.

First, JSON has no schema enforcement. Any key can appear in any document, and nobody checks. This sounds flexible but it means silent corruption is easy — a field gets misspelled, a type gets changed, and nobody finds out until production.

Second, JSON's type system is shallow. There are numbers, strings, booleans, arrays, and objects. There is no distinction between a 32-bit integer and a 64-bit one. JavaScript's JSON.parse will silently lose precision on integers larger than 2^53. This has caused real bugs in financial systems.

Third, JSON has no concept of schema versions. If you want to know what version of your schema a particular document conforms to, you have to add that information yourself, and you have to write code to act on it yourself.

None of this means you shouldn't use JSON. For short-lived data — an HTTP request and its response — these limitations rarely matter. Where they bite you is in long-lived storage or high-volume message passing where you need reliable schema evolution over months and years.

Binary formats — when throughput and structure matter

Binary encoding formats like Protocol Buffers, Apache Avro, and Apache Thrift solve the schema evolution problem more rigorously. They also produce smaller messages (roughly 30–80% smaller than equivalent JSON), which matters at high throughput.

The trade-off is that binary-encoded messages are not human-readable. You can't open them in a text editor. You need the schema to decode them. This is a real operational cost, but the tools around these formats (schema registries, generated code, inspection utilities) make it manageable.

Protocol Buffers — Field Numbers Are Everything

Protocol Buffers (usually called protobuf) was developed at Google and is one of the most widely used binary encoding formats. The key idea is simple but powerful: every field has a number, not just a name.

user.proto — a protobuf schema definition

syntax = "proto3";

message User {
  int64  user_id      = 1;
  string name         = 2;
  string email        = 3;
  string phone_number = 4;  // added in v2
}

When protobuf encodes a message, it writes each field as a tag-value pair, where the tag is the field number. It does not write the field name. The encoded bytes for a User might look like this conceptually:

Encoded representation (conceptual)

// field 1 (user_id), type: varint, value: 12345
0x08 0xB9 0x60

// field 2 (name), type: length-delimited, value: "Alice"
0x12 0x05 0x41 0x6C 0x69 0x63 0x65

// field 3 (email), type: length-delimited, value: "alice@example.com"
0x1A 0x11 ...

Notice: the name user_id is not in there. Only the number 1. This has a profound consequence for schema evolution.

Adding fields safely

When you add phone_number = 4, old code reading the encoded message will see a field with tag 4, not recognize it, and skip over it. This is forward compatibility — old code ignores fields it doesn't know.

New code reading an old message that doesn't have field 4 will see nothing for phone_number and use the default value (empty string in proto3). This is backward compatibility — new code handles missing fields gracefully.

Removing fields safely

You can remove a field from your schema. But you must never reuse its field number. If you remove field 3 and later add a completely different field as field 3, old encoded messages still have the old field 3 bytes. New code will read those bytes as the new field, which is nonsense data.

The convention is to mark removed fields as reserved:

Reserving removed fields to prevent accidental reuse

message User {
  reserved 3;          // field 3 (email) was removed in v3
  reserved "email";   // also reserve the name

  int64  user_id      = 1;
  string name         = 2;
  string phone_number = 4;
}

The protobuf compiler will reject any attempt to reuse field number 3 or the name email. This is a compile-time guard against a class of bugs that would be very hard to debug in production.

The required field trap

Proto2 had required fields. Proto3 removed them. This was a deliberate and correct decision. A required field sounds like a useful safety guarantee — "this field must always be present." But in a distributed system with rolling deployments and long-lived stored data, it's a trap.

If you add a required field, old code that doesn't know about this field will produce messages without it. New code that reads those messages will reject them as invalid. You've just made every old message invalid. Required fields are forever, which means you can never safely add one to a schema that already has data in production.

Key insight

The rule in protobuf is: every field is effectively optional, and defaults must be meaningful. If a missing field would cause incorrect behavior (not just an error you can handle), the real issue is in your application logic, not the schema. Put validation in your application layer, not the encoding layer.

Apache Avro — Schema Resolution at Read Time

Avro takes a fundamentally different approach to schema evolution. Where protobuf embeds a field number in every encoded value, Avro has no field identifiers at all. The encoded data is just values, back to back, in the order the schema defines them.

user.avsc — Avro schema (JSON format)

{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "user_id",  "type": "long" },
    { "name": "name",     "type": "string" },
    { "name": "email",    "type": "string" }
  ]
}

To decode Avro data you need two things: the schema the data was written with (the writer's schema) and the schema your code currently expects (the reader's schema). Avro's library reconciles the two.

Writer's schema (v1) Reader's schema (v2) ┌─────────────────────┐ ┌──────────────────────────────┐ │ user_id: long │──────▶│ user_id: long │ │ name: string │──────▶│ name: string │ │ email: string │──────▶│ email: string │ │ │ │ phone: string = "" ←default│ └─────────────────────┘ └──────────────────────────────┘ Avro matches fields by name. Fields in writer but not reader → ignored. Fields in reader but not writer → filled from default value.

Fig 19-2. Avro schema resolution: the writer's schema and reader's schema are reconciled field by field at read time

This design means Avro's encoded format is more compact than protobuf (no field numbers in the data). It also means schema evolution is very flexible — you can add fields, remove fields, rename fields — as long as you manage the writer/reader schema resolution correctly.

The catch: you must store the schema

Since there are no field numbers in Avro data, you cannot decode it without the writer's schema. This is the fundamental operational challenge with Avro.

There are three common approaches:

Store the schema in the file header — Avro's own file format (OCF) does this. The schema is in the first few bytes of every file. Good for batch data files.
Store a schema version number in the message — look up the full schema from a registry by version ID. Good for streaming data.
Agree on the schema out of band — if all producers and consumers are under your control and always updated together, you can reference a schema by a fixed name in configuration. Fragile in practice.

The second option is the most common in practice, and it leads us directly to schema registries.

The Schema Registry — Where Schemas Live

A schema registry is a service that stores versioned schemas and serves them to producers and consumers. The idea is straightforward, but it solves a problem that becomes genuinely painful without it: "how does the consumer know what schema version the producer used?"

Producer Schema Registry Consumer ──────── ─────────────── ──────── 1. Register schema ──────▶ Store schema v1 gets back ID=42 ◀────── return ID=42 2. Write message: [magic byte][ID=42][avro-encoded-payload] Send to Kafka ─────────────────────────────────────▶ 3. Read message prefix extract ID=42 4. Fetch schema ──▶ Lookup ID=42 ◀────────────── return schema v1 5. Decode payload using schema v1 ✓

Fig 19-3. How a schema registry fits into a Kafka-based pipeline. The producer registers the schema once; each message carries only the schema ID (5 bytes), not the full schema.

The Confluent Schema Registry is the most widely used implementation, and it works with both Avro and protobuf. The producer registers a schema and gets back a numeric ID. Each encoded message starts with a magic byte (0x00), followed by the 4-byte schema ID, followed by the actual encoded data. That's 5 bytes of overhead — very cheap.

Compatibility modes

A good schema registry doesn't just store schemas — it enforces compatibility rules. You configure a compatibility mode per subject (usually per-topic), and the registry rejects any schema registration that would violate the rules.

Mode	What it allows	What it prevents
BACKWARD	New schema can read data written by previous schema	Removing fields without defaults, changing types unsafely
FORWARD	Previous schema can read data written by new schema	Adding required fields, removing fields other code depends on
FULL	Both backward and forward compatible	Most unsafe changes
NONE	Any schema change allowed	Nothing — use only during development

For production systems handling long-lived data, FULL or at minimum BACKWARD is the right default. The registry becomes a safety net — it catches incompatible changes at schema registration time, before they ever reach production data.

Real-world note

The schema registry is also valuable as documentation. When a new engineer joins and wants to understand what data flows through your Kafka topics, the registry gives them a precise, up-to-date, versioned record of every schema in the system. This is much more reliable than a wiki page someone forgot to update six months ago.

API Versioning — Every Version Is a Support Contract

When your system exposes an HTTP API, you face the same evolvability challenge, but with an extra dimension: you don't control the clients. A mobile app on a user's phone might not be updated for months. A third-party integration might never update. You have to maintain backward compatibility for a long time.

The common approaches

URL versioning

GET /v1/users/12345
GET /v2/users/12345

This is the most visible approach. The version is in the URL, so it's obvious and cacheable. Every request tells you exactly what contract the client expects. The downside is that it creates a hard fork — v1 and v2 are separate code paths, and you have to maintain both. Clients don't migrate automatically; you have to deprecate old versions and then enforce the deprecation.

Header versioning

GET /users/12345
Accept: application/vnd.myapi+json; version=2

The URL stays stable and the version moves to a header. Purists prefer this because the URL represents the resource, not a snapshot of your API contract. In practice, it's harder to test (you can't just paste a URL into a browser) and harder for clients to discover.

Query parameter versioning

GET /users/12345?version=2

Easy to test and discover, but query parameters are semantically wrong — they should filter resources, not select an API contract. Also easy to accidentally drop, leading to mysterious behavior changes.

What actually needs a version bump?

Not every change needs a new version. Here's the rule: if old clients will continue working correctly with the change, it doesn't need a version bump.

Adding a new optional field to a response — old clients ignore it. No version bump needed.
Adding a new endpoint — old clients don't call it. No version bump needed.
Removing a field from a response — if any client depends on it, version bump required. If you're certain nobody uses it (check your logs), it might be safe.
Changing the type or semantics of a field — almost always requires a version bump.
Changing error response format — if clients parse errors, this needs a version bump.

The deprecation lifecycle — plan it before you ship v1

This is where most teams fail. They ship v1, then ship v2 because v1 had problems, then spend the next two years keeping v1 alive because some clients never migrated. The fix is to treat deprecation as a first-class part of your versioning strategy:

Set a minimum support window when you publish a version. "This version will be supported for 18 months" is a reasonable commitment.
Add a Deprecation header to responses from deprecated versions. Some HTTP clients can surface this to operators automatically.
Track who is actually calling old versions. Your API gateway logs should tell you the last time any client called /v1/. When that's zero for 30 days, you can remove it confidently.
Give clients an explicit sunset date and stick to it. Hard cutoffs are kinder than infinite maintenance burden.

The hidden cost

Every API version you publish is a maintenance tax. You have to test it, document it, run it, and debug production issues on it — potentially forever. Three major versions running simultaneously is usually the maximum that's operationally sustainable. If you're beyond that, you're accumulating debt faster than you're shipping features.

Database Migrations in Live Systems

Changing a database schema while the application is running is one of the most nerve-wracking things in backend engineering. On a small table it's trivial. On a table with 500 million rows that gets thousands of writes per second, it's a careful, multi-step process that can take weeks.

The naive approach and why it fails

The naive approach is: write a migration script, put the application in maintenance mode, run the script, bring the application back up. This works perfectly at small scale. It stops being acceptable when:

The table is large enough that the migration takes hours
The application has an SLA that doesn't allow hours of downtime
The migration holds locks that block reads and writes

Most ALTER TABLE operations in MySQL and PostgreSQL take a lock on the table for the duration. Adding a column to a 1-billion-row table with a default value may need to rewrite the entire table on disk. That could take hours, and during that time, your application is effectively down.

The expand-contract pattern

The expand-contract pattern (also called parallel change) solves this by splitting every schema change into three phases that happen over multiple deployments.

Phase 1: EXPAND ───────────────────────────────────────────────────── Add the new column (or table) alongside the old one. New column is nullable / has a default. Old code: ignores new column, writes to old column. New code: writes to BOTH old and new column. Phase 2: MIGRATE ───────────────────────────────────────────────────── Backfill the new column for all existing rows. Run in batches (e.g. 10,000 rows at a time) to avoid long-running transactions and lock contention. Do this as a background job, not a blocking migration. Phase 3: CONTRACT ───────────────────────────────────────────────────── All code now reads/writes only the new column. Old column is no longer written. Drop the old column. (Drop is fast — it's just a metadata change in Postgres.)

Fig 19-4. The expand-contract pattern. Each phase is a separate deployment. No deployment requires downtime.

Let's make this concrete with an example. Suppose you want to rename the column user_name to full_name. Here's the timeline:

Deployment 1 — Expand

-- Add new column without removing old one
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- (fast — no data rewrite, column is nullable)

Update application code to write to both user_name and full_name. Read from user_name (the reliable source, since full_name is still mostly empty).

Deployment 2 — Migrate (background job)

-- Backfill in batches, not a single UPDATE
UPDATE users
SET    full_name = user_name
WHERE  full_name IS NULL
  AND  id BETWEEN :start_id AND :end_id;

Run this as a background job in batches of 10,000–50,000 rows with a small sleep between batches to avoid hammering the database. This can take hours or days on a large table, but it doesn't block anything.

Deployment 3 — Contract (switch reads)

Once the backfill is complete and verified, update application code to read from full_name instead of user_name. Keep writing to both for now (in case of rollback).

Deployment 4 — Contract (remove old column)

-- Only do this once you're certain no code reads user_name
ALTER TABLE users DROP COLUMN user_name;

This final step is usually fast in modern databases — it's a metadata change. In PostgreSQL, dropped columns don't immediately reclaim space; a subsequent VACUUM or REPACK handles that.

The NOT NULL problem

Adding a NOT NULL column to an existing table is a common source of pain. In PostgreSQL before version 11, adding a NOT NULL column with a default required rewriting the entire table. In PostgreSQL 11+, adding a column with a constant default is fast (the default is stored as metadata, not written to each row). But if your default is dynamic — like NOW() — you still need the expand-migrate-contract approach.

The safe pattern is always: add the column as nullable first, backfill, then add the NOT NULL constraint once all rows have values.

Safe way to add a NOT NULL column

-- Step 1: add nullable (fast)
ALTER TABLE orders ADD COLUMN region VARCHAR(50);

-- Step 2: backfill (run as background job)
UPDATE orders SET region = 'us-east-1' WHERE region IS NULL;

-- Step 3: add constraint (fast in pg14+ — validates without full lock)
ALTER TABLE orders
  ADD CONSTRAINT orders_region_not_null
  CHECK (region IS NOT NULL) NOT VALID;

-- Step 4: validate (runs concurrently, no lock)
ALTER TABLE orders
  VALIDATE CONSTRAINT orders_region_not_null;

The Strangler Fig Pattern — Replacing a Live System

Sometimes the problem isn't a schema change on a single table. Sometimes you need to replace an entire system — a legacy service with years of accumulated complexity — while it continues handling production traffic. The strangler fig pattern is the right way to do this.

The name comes from a type of tropical tree that grows around a host tree, gradually encasing it, until eventually the host dies and the strangler fig stands on its own. Martin Fowler coined the term in 2004, and it has become one of the most widely applied patterns in large-scale system migrations.

┌────────────────────────────────────────────────────────────┐ │ Router / Proxy │ └────────────────┬───────────────────────────────────────────┘ │ ┌─────────┴──────────┐ ▼ ▼ ┌──────────┐ ┌──────────┐ │ Legacy │ │ New │ │ System │ │ System │ └──────────┘ └──────────┘ Phase 1: 100% → Legacy Phase 2: 90% → Legacy, 10% → New Phase 3: 50% → Legacy, 50% → New Phase 4: 0% → Legacy, 100% → New Phase 5: Decommission Legacy

Fig 19-5. The strangler fig: a proxy intercepts traffic and routes it between old and new system. Traffic shifts incrementally until the old system handles nothing and can be removed.

How it works in practice

Step 1 — Put a proxy in front. Before you write a single line of the new system, insert a routing layer between callers and the legacy system. At this point, 100% of traffic still goes to the old system — nothing has changed. But you now have a control point.

Step 2 — Build the new system in parallel. The new system doesn't have to handle everything from day one. Start with the simplest or most well-understood feature. The proxy routes that feature's traffic to the new system; everything else still hits the old one.

Step 3 — Shift traffic incrementally. Feature by feature, endpoint by endpoint, route more traffic to the new system. At each step you can measure correctness, performance, and error rates before proceeding. The old system is still running and can catch anything that falls through.

Step 4 — Shadow mode for risky parts. For the parts of the system where you can't afford errors (payment processing, inventory deduction), run in shadow mode first: send the request to both old and new systems, use the old system's response, but compare results. When the two systems agree consistently, switch to using the new system's response.

Step 5 — Decommission the legacy system. Once 100% of traffic is handled by the new system and you've run that way for long enough to trust it (typically 2–4 weeks minimum), you can finally shut down the old system. This step often takes much longer than teams expect — someone always discovers a dormant consumer that they forgot about.

The data migration problem

The strangler fig handles the traffic routing problem. But if the old and new systems use different data stores, you have a second problem: keeping data in sync while both systems are live.

The most reliable approach is the dual-write with reconciliation pattern:

All writes go to both old and new data stores (either from the application or via change data capture from the old store).
A reconciliation job periodically compares the two stores and reports discrepancies.
When discrepancies reach zero consistently, you can cut over reads to the new store.
Stop dual-writing only after the old store is fully decommissioned.

The most common failure mode

Teams do the hard part of the strangler fig well — they build the new system, shift traffic, validate it. Then they never complete step 5. The old system just keeps running. They pay for two systems indefinitely, the old codebase accumulates bitrot, and engineers are still afraid to touch it three years later. Set a hard decommission date before you start the migration. Put it in writing. Get stakeholder buy-in. Decommissioning is not optional maintenance; it's the payoff for the entire project.

When not to use the strangler fig

The strangler fig requires that traffic can be routed at a granular enough level to migrate piece by piece. This works well for HTTP APIs (route by path), message queues (route by topic or message type), and batch jobs (migrate job by job). It works less well when:

The old system is deeply stateful and state cannot be migrated incrementally
The old and new systems have fundamentally different data models with no clean mapping
The system is a monolithic binary with no seams to insert a proxy into

In these cases, you may need a more aggressive approach — a fixed cutover date with heavy testing beforehand, and a rapid rollback plan if it fails.

Chapter Summary

Evolvability is the ability of a system to change — schema, API, or implementation — without breaking the things that depend on it. It doesn't happen by accident; it has to be designed in from the beginning.

The core rules

New code must read old data (backward compat)
Old code must survive new data (forward compat)
Never reuse a deleted field's number or name
Required fields in schemas are almost always a mistake
Every API version you ship is a support liability
Plan deprecation before you ship version one

The key techniques

Protobuf field tags for safe binary evolution
Avro schema resolution (writer + reader schema)
Schema registry for versioned schema storage
Expand-contract for database schema changes
Batch backfills for large table migrations
Strangler fig for replacing live systems

Three questions for your next design review

If we deploy new code today and need to roll it back tomorrow, will the data written by new code be readable by the old code?
Which fields in our schema or API response do consumers actually depend on, and what is our plan for removing or changing them?
If we are replacing an existing system, what is the specific, dated plan to decommission the old one?