Evolvability — Designing for Change You Can't Predict
You can deploy new code any time. You cannot undeploy old data.
Software changes constantly. New features get added, requirements shift, bugs get fixed. That's normal. The hard part is that your data doesn't change just because your code does.
In this chapter we look at how to write systems that can evolve over time without breaking. We start with the fundamental problem: old code and new code have to coexist and talk to each other. Then we work through the tools that solve this — encoding formats like Protocol Buffers and Avro, schema registries, API versioning strategies, database migration techniques, and the strangler fig pattern for replacing live systems without stopping the world.
By the end you'll have a clear mental model for what "backward compatible" and "forward compatible" really mean, and a practical toolkit for making changes safely.
- Backward compatibility means new code can read old data. Forward compatibility means old code can read new data. You need both during a rolling deployment.
- JSON is deceptively fragile for long-lived data — no schema enforcement, silent type coercions, and nothing to guide evolution.
- Protocol Buffers use numbered field tags instead of names. As long as you never reuse a tag number, you can add and remove fields safely.
- Avro has no field tags at all. It matches fields by name at read time using a writer's schema and a reader's schema — which means you must always store or transmit the schema alongside the data.
- A schema registry solves the "where does the schema live?" problem. Producers register schemas, get back an ID, and embed only the ID in each message.
- Every API version you publish is a support contract — plan version deprecation before you ship version one.
- The expand-contract pattern is how you change a database schema without downtime: add first, migrate data, remove old column only when no code references it.
- The strangler fig pattern is how you replace a whole system: route traffic to the new system incrementally, never stopping the old one until the new one handles everything.
The Core Problem: Your Data Outlives Your Code
Here is a situation every engineer faces eventually. You have a service running in production. It writes records to a database. The records look like this:
Current record format:

```json
{
  "user_id": 12345,
  "name": "Alice",
  "email": "alice@example.com"
}
```
Now you need to add a new field: phone_number. You update your code, test it, and deploy. But here's the thing: you still have millions of old records in the database that don't have the phone_number field. And your deployment is rolling — at any moment, half your servers are running old code and half are running new code.
This is the evolvability problem. It has three dimensions:
- Old data, new code: Your new code reads a record that was written before phone_number existed. What does it do with the missing field?
- New data, old code: Your old code reads a record that a new server just wrote, complete with phone_number. What does it do with the field it doesn't understand?
- Old code, new API contract: A mobile client on version 3 hits your new version 4 API endpoint. Does it crash?
If you handle these three cases well, you can deploy continuously, roll back if needed, and change your data model without coordinated downtime. If you don't, you end up in a situation where every schema change requires a "big bang" deploy where all services update at exactly the same moment — which is stressful, risky, and gets worse as your system grows.
This problem is much older than microservices. Even a single application with a database faces it. The moment you store data on disk, you've created a contract between today's code and tomorrow's code.
Backward and Forward Compatibility
These two terms are used constantly but often confused. Let's define them precisely.
Backward compatibility means that newer code can read data written by older code. This is the direction most engineers think about first, and it's usually easier to achieve. When you add a new optional field, old records that don't have that field are still valid — new code just sees a null or a default.
Forward compatibility means that older code can read data written by newer code. This is the trickier direction. When new code writes a record with a new field, can old code read it without breaking? Usually the answer is "yes, if the old code just ignores fields it doesn't know about." But this requires that your serialization format actually supports ignoring unknown fields — and not all of them do.
During a rolling deployment, you need both directions at the same time. New servers are writing new data that old servers might read. Old servers are writing old data that new servers will read. The system works correctly only if both directions hold.
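Both directions can be sketched in a few lines. This is a minimal illustration in Python using plain dicts as records; the function names and fields are illustrative, not from any particular library:

```python
import json

# "Old" reader: written before phone_number existed. It achieves forward
# compatibility simply by not looking at keys it doesn't know about.
def old_read(raw: str) -> dict:
    record = json.loads(raw)
    return {"user_id": record["user_id"], "name": record["name"]}

# "New" reader: achieves backward compatibility by falling back to a
# default when a record predates phone_number.
def new_read(raw: str) -> dict:
    record = json.loads(raw)
    return {
        "user_id": record["user_id"],
        "name": record["name"],
        "phone_number": record.get("phone_number", ""),  # default for old data
    }

old_record = '{"user_id": 12345, "name": "Alice"}'
new_record = '{"user_id": 12345, "name": "Alice", "phone_number": "555-0100"}'

assert new_read(old_record)["phone_number"] == ""   # new code, old data
assert old_read(new_record)["name"] == "Alice"      # old code, new data
```

The point of the sketch: both directions are properties of the *reading* code. Tolerant readers are what make rolling deployments safe.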
What breaks compatibility?
Some changes are safe. Others are not. Here is a rough guide:
| Change | Backward compat? | Forward compat? | Notes |
|---|---|---|---|
| Add an optional field | ✓ Yes | ✓ Yes (if old code ignores unknowns) | Safe in most cases |
| Remove an optional field | ✓ Yes (new code ignores it in old data) | ✓ Yes (if old code tolerates its absence) | Watch for consumers that depend on it |
| Rename a field | ✗ No | ✗ No | Equivalent to remove + add. Never do this atomically. |
| Change a field's type | Maybe | Maybe | int32 → int64 is safe in some formats, not others |
| Add a required field | ✗ No | ✗ No | Old data won't have it. Required fields are nearly always a mistake. |
| Change a field's meaning | ✗ No | ✗ No | Even if the name and type stay the same, semantics matter |
Renaming a field feels harmless in the code editor — it's just a string. But it's the most dangerous schema change you can make. To old readers, the old name is gone. To new readers, the new name is new. No reader can handle both unless you explicitly run both names in parallel for a transition period.
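What "running both names in parallel" looks like in practice, sketched in Python with illustrative field names (this is the application-level pattern, not any particular library's API):

```python
# Transition phase of a rename: writers emit BOTH names; readers prefer
# the new name and fall back to the old one. Only when every reader and
# every stored record uses the new name can the old name be dropped.
def write_user(name_value: str) -> dict:
    return {"user_name": name_value,   # old name, for old readers
            "full_name": name_value}   # new name, for new readers

def read_user(record: dict) -> str:
    return record.get("full_name") or record.get("user_name", "")

assert read_user({"user_name": "Alice"}) == "Alice"  # record from an old writer
assert read_user(write_user("Bob")) == "Bob"         # record from a transitional writer
```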
Encoding Formats Matter More Than You Think
The format you use to encode your data determines how much flexibility you have when things change. Let's look at the main options, starting with the most common and working toward the most rigorous.
JSON and XML — Flexible but Fragile
JSON is the default choice for most HTTP APIs and many data pipelines, and for good reason: it's human-readable, every language supports it, and you can open a message in a text editor. But JSON has several properties that make schema evolution harder than it looks.
First, JSON has no schema enforcement. Any key can appear in any document, and nobody checks. This sounds flexible but it means silent corruption is easy — a field gets misspelled, a type gets changed, and nobody finds out until production.
Second, JSON's type system is shallow. There are numbers, strings, booleans, arrays, and objects. There is no distinction between a 32-bit integer and a 64-bit one. JavaScript's JSON.parse will silently lose precision on integers larger than 2^53. This has caused real bugs in financial systems.
Third, JSON has no concept of schema versions. If you want to know what version of your schema a particular document conforms to, you have to add that information yourself, and you have to write code to act on it yourself.
None of this means you shouldn't use JSON. For short-lived data — an HTTP request and its response — these limitations rarely matter. Where they bite you is in long-lived storage or high-volume message passing where you need reliable schema evolution over months and years.
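The integer-precision pitfall is easy to demonstrate. JavaScript's `JSON.parse` stores every number as an IEEE-754 double; we can simulate that behavior in Python by forcing the parser to treat integers as floats:

```python
import json

# Simulate JavaScript-style parsing: every number becomes a double.
def js_style_parse(raw: str) -> dict:
    return json.loads(raw, parse_int=float)

doc = '{"transaction_id": 9007199254740993}'  # 2^53 + 1

# The value silently collapses to the nearest representable double, 2^53.
parsed = js_style_parse(doc)
assert parsed["transaction_id"] == 9007199254740992.0

# Python's default parser keeps arbitrary-precision integers intact:
assert json.loads(doc)["transaction_id"] == 9007199254740993
```

Same bytes on the wire, two different values in memory, and no error anywhere. That is exactly the kind of silent coercion that makes JSON risky for identifiers and monetary amounts.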
Binary formats — when throughput and structure matter
Binary encoding formats like Protocol Buffers, Apache Avro, and Apache Thrift solve the schema evolution problem more rigorously. They also produce smaller messages (roughly 30–80% smaller than equivalent JSON), which matters at high throughput.
The trade-off is that binary-encoded messages are not human-readable. You can't open them in a text editor. You need the schema to decode them. This is a real operational cost, but the tools around these formats (schema registries, generated code, inspection utilities) make it manageable.
Protocol Buffers — Field Numbers Are Everything
Protocol Buffers (usually called protobuf) was developed at Google and is one of the most widely used binary encoding formats. The key idea is simple but powerful: every field has a number, not just a name.
user.proto — a protobuf schema definition:

```protobuf
syntax = "proto3";

message User {
  int64 user_id = 1;
  string name = 2;
  string email = 3;
  string phone_number = 4; // added in v2
}
```
When protobuf encodes a message, it writes each field as a tag-value pair, where the tag is the field number. It does not write the field name. The encoded bytes for a User might look like this conceptually:
Encoded representation (conceptual):

```
// field 1 (user_id), type: varint, value: 12345
0x08 0xB9 0x60
// field 2 (name), type: length-delimited, value: "Alice"
0x12 0x05 0x41 0x6C 0x69 0x63 0x65
// field 3 (email), type: length-delimited, value: "alice@example.com"
0x1A 0x11 ...
```
Notice: the name user_id is not in there. Only the number 1. This has a profound consequence for schema evolution.
Adding fields safely
When you add phone_number = 4, old code reading the encoded message will see a field with tag 4, not recognize it, and skip over it. This is forward compatibility — old code ignores fields it doesn't know.
New code reading an old message that doesn't have field 4 will see nothing for phone_number and use the default value (empty string in proto3). This is backward compatibility — new code handles missing fields gracefully.
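The skip-unknown-fields mechanism is worth seeing concretely. Below is a toy encoder/decoder for protobuf-style varint fields (wire type 0 only), written from scratch in Python to show how a reader can skip tags it has never heard of. It is a teaching sketch, not real protobuf:

```python
# Encode an integer as a protobuf-style varint: 7 bits per byte,
# high bit set on every byte except the last.
def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_field(field_no: int, value: int) -> bytes:
    # Key = (field number << 3) | wire type; wire type 0 = varint.
    return encode_varint(field_no << 3) + encode_varint(value)

def decode_varint(buf: bytes, i: int):
    shift = result = 0
    while True:
        b = buf[i]
        i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def decode(buf: bytes, known_fields: set) -> dict:
    fields, i = {}, 0
    while i < len(buf):
        key, i = decode_varint(buf, i)
        value, i = decode_varint(buf, i)
        field_no = key >> 3
        if field_no in known_fields:  # keep what we know...
            fields[field_no] = value
        # ...and silently skip tags we don't recognize.
    return fields

# A "new" writer emits field 4; an "old" reader only knows field 1.
msg = encode_field(1, 12345) + encode_field(4, 555)
assert decode(msg, known_fields={1}) == {1: 12345}            # forward compat
assert decode(msg, known_fields={1, 4}) == {1: 12345, 4: 555}  # new reader
```

Note that `encode_field(1, 12345)` produces exactly the bytes `0x08 0xB9 0x60` from the conceptual encoding above: the field number is in the key byte, and the name appears nowhere.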
Removing fields safely
You can remove a field from your schema. But you must never reuse its field number. If you remove field 3 and later add a completely different field as field 3, old encoded messages still have the old field 3 bytes. New code will read those bytes as the new field, which is nonsense data.
The convention is to mark removed fields as reserved:
```protobuf
message User {
  reserved 3;       // field 3 (email) was removed in v3
  reserved "email"; // also reserve the name
  int64 user_id = 1;
  string name = 2;
  string phone_number = 4;
}
```
The protobuf compiler will reject any attempt to reuse field number 3 or the name email. This is a compile-time guard against a class of bugs that would be very hard to debug in production.
The required field trap
Proto2 had required fields. Proto3 removed them. This was a deliberate and correct decision. A required field sounds like a useful safety guarantee — "this field must always be present." But in a distributed system with rolling deployments and long-lived stored data, it's a trap.
If you add a required field, old code that doesn't know about this field will produce messages without it. New code that reads those messages will reject them as invalid. You've just made every old message invalid. Required fields are forever, which means you can never safely add one to a schema that already has data in production.
The rule in protobuf is: every field is effectively optional, and defaults must be meaningful. If a missing field would cause incorrect behavior (not just an error you can handle), the real issue is in your application logic, not the schema. Put validation in your application layer, not the encoding layer.
Apache Avro — Schema Resolution at Read Time
Avro takes a fundamentally different approach to schema evolution. Where protobuf embeds a field number in every encoded value, Avro has no field identifiers at all. The encoded data is just values, back to back, in the order the schema defines them.
user.avsc — Avro schema (JSON format):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "user_id", "type": "long" },
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" }
  ]
}
```
To decode Avro data you need two things: the schema the data was written with (the writer's schema) and the schema your code currently expects (the reader's schema). Avro's library reconciles the two.
This design means Avro's encoded format is more compact than protobuf (no field numbers in the data). It also means schema evolution is very flexible — you can add fields, remove fields, rename fields — as long as you manage the writer/reader schema resolution correctly.
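The resolution step can be modeled in a few lines. In this simplified Python sketch, the "encoded data" is just a list of values in writer-schema order, and schemas are reduced to field names with optional defaults; real Avro does the same matching with full type information:

```python
# Writer's schema: field order on the wire. Reader's schema: (name, default)
# pairs; None means "no default". phone_number is new to the reader.
writer_schema = ["user_id", "name", "email"]
reader_schema = [
    ("user_id", None),
    ("name", None),
    ("phone_number", ""),  # field the writer doesn't know about
]

def resolve(writer_fields, reader_fields, values):
    by_name = dict(zip(writer_fields, values))
    record = {}
    for name, default in reader_fields:
        if name in by_name:
            record[name] = by_name[name]  # matched by NAME, not position
        elif default is not None:
            record[name] = default        # missing from writer: use default
        else:
            raise ValueError(f"no value and no default for {name!r}")
    return record
    # Writer-only fields (email here) are simply dropped.

decoded = resolve(writer_schema, reader_schema, [12345, "Alice", "a@example.com"])
assert decoded == {"user_id": 12345, "name": "Alice", "phone_number": ""}
```

This also shows why renames are tolerable in Avro: aliases in the reader's schema can map an old name to a new one during resolution, something tag-based formats can't express.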
The catch: you must store the schema
Since there are no field numbers in Avro data, you cannot decode it without the writer's schema. This is the fundamental operational challenge with Avro.
There are three common approaches:
- Store the schema in the file header — Avro's own file format (OCF) does this. The schema is in the first few bytes of every file. Good for batch data files.
- Store a schema version number in the message — look up the full schema from a registry by version ID. Good for streaming data.
- Agree on the schema out of band — if all producers and consumers are under your control and always updated together, you can reference a schema by a fixed name in configuration. Fragile in practice.
The second option is the most common in practice, and it leads us directly to schema registries.
The Schema Registry — Where Schemas Live
A schema registry is a service that stores versioned schemas and serves them to producers and consumers. The idea is straightforward, but it solves a problem that becomes genuinely painful without it: "how does the consumer know what schema version the producer used?"
The Confluent Schema Registry is the most widely used implementation, and it works with both Avro and protobuf. The producer registers a schema and gets back a numeric ID. Each encoded message starts with a magic byte (0x00), followed by the 4-byte schema ID, followed by the actual encoded data. That's 5 bytes of overhead — very cheap.
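The framing itself is trivial, which is part of why the design works so well. Here is a sketch of the 5-byte envelope in Python (the framing only; talking to an actual registry over its HTTP API is not shown):

```python
import struct

MAGIC_BYTE = 0x00  # Confluent wire-format magic byte

def frame_message(schema_id: int, payload: bytes) -> bytes:
    # magic byte + big-endian 4-byte schema ID + encoded payload
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe_message(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a schema-registry framed message")
    return schema_id, message[5:]

framed = frame_message(42, b"\x08\xb9\x60")
assert len(framed) == 5 + 3  # exactly 5 bytes of overhead
assert unframe_message(framed) == (42, b"\x08\xb9\x60")
```

A consumer reads the ID, fetches (and caches) the corresponding schema from the registry, and decodes the payload, so every message is self-describing at a cost of 5 bytes.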
Compatibility modes
A good schema registry doesn't just store schemas — it enforces compatibility rules. You configure a compatibility mode per subject (usually per-topic), and the registry rejects any schema registration that would violate the rules.
| Mode | What it allows | What it prevents |
|---|---|---|
| BACKWARD | New schema can read data written by previous schema | Removing fields without defaults, changing types unsafely |
| FORWARD | Previous schema can read data written by new schema | Adding required fields, removing fields other code depends on |
| FULL | Both backward and forward compatible | Most unsafe changes |
| NONE | Any schema change allowed | Nothing — use only during development |
For production systems handling long-lived data, FULL or at minimum BACKWARD is the right default. The registry becomes a safety net — it catches incompatible changes at schema registration time, before they ever reach production data.
The schema registry is also valuable as documentation. When a new engineer joins and wants to understand what data flows through your Kafka topics, the registry gives them a precise, up-to-date, versioned record of every schema in the system. This is much more reliable than a wiki page someone forgot to update six months ago.
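The heart of a BACKWARD check is simple enough to sketch. This simplified Python version represents a record schema as a dict of field names to defaults (`None` meaning "no default"), which is a small slice of what a real registry checks, but it captures the core rule, that every field added by the new schema must carry a default:

```python
# BACKWARD check: can the new schema read data written under the old one?
# Any field that is new relative to the old schema must have a default,
# because old data will not contain it.
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    for name, default in new_schema.items():
        if name not in old_schema and default is None:
            return False  # new field with no default: old data can't satisfy it
    return True

old = {"user_id": None, "name": None}
new_ok = {"user_id": None, "name": None, "phone_number": ""}  # has default
new_bad = {"user_id": None, "name": None, "ssn": None}        # no default

assert is_backward_compatible(old, new_ok)
assert not is_backward_compatible(old, new_bad)
```

A real registry also checks type promotions, aliases, and (in the transitive modes) every previously registered version, but the pass/fail logic is the same shape: simulate the read and reject anything that could fail.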
API Versioning — Every Version Is a Support Contract
When your system exposes an HTTP API, you face the same evolvability challenge, but with an extra dimension: you don't control the clients. A mobile app on a user's phone might not be updated for months. A third-party integration might never update. You have to maintain backward compatibility for a long time.
The common approaches
URL versioning
```
GET /v1/users/12345
GET /v2/users/12345
```
This is the most visible approach. The version is in the URL, so it's obvious and cacheable. Every request tells you exactly what contract the client expects. The downside is that it creates a hard fork — v1 and v2 are separate code paths, and you have to maintain both. Clients don't migrate automatically; you have to deprecate old versions and then enforce the deprecation.
Header versioning
```
GET /users/12345
Accept: application/vnd.myapi+json; version=2
```
The URL stays stable and the version moves to a header. Purists prefer this because the URL represents the resource, not a snapshot of your API contract. In practice, it's harder to test (you can't just paste a URL into a browser) and harder for clients to discover.
Query parameter versioning
```
GET /users/12345?version=2
```
Easy to test and discover, but query parameters are semantically wrong — they should filter resources, not select an API contract. Also easy to accidentally drop, leading to mysterious behavior changes.
What actually needs a version bump?
Not every change needs a new version. Here's the rule: if old clients will continue working correctly with the change, it doesn't need a version bump.
- Adding a new optional field to a response — old clients ignore it. No version bump needed.
- Adding a new endpoint — old clients don't call it. No version bump needed.
- Removing a field from a response — if any client depends on it, version bump required. If you're certain nobody uses it (check your logs), it might be safe.
- Changing the type or semantics of a field — almost always requires a version bump.
- Changing error response format — if clients parse errors, this needs a version bump.
The deprecation lifecycle — plan it before you ship v1
This is where most teams fail. They ship v1, then ship v2 because v1 had problems, then spend the next two years keeping v1 alive because some clients never migrated. The fix is to treat deprecation as a first-class part of your versioning strategy:
- Set a minimum support window when you publish a version. "This version will be supported for 18 months" is a reasonable commitment.
- Add a Deprecation header to responses from deprecated versions. Some HTTP clients can surface this to operators automatically.
- Track who is actually calling old versions. Your API gateway logs should tell you the last time any client called /v1/. When that's zero for 30 days, you can remove it confidently.
- Give clients an explicit sunset date and stick to it. Hard cutoffs are kinder than infinite maintenance burden.
Every API version you publish is a maintenance tax. You have to test it, document it, run it, and debug production issues on it — potentially forever. Three major versions running simultaneously is usually the maximum that's operationally sustainable. If you're beyond that, you're accumulating debt faster than you're shipping features.
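The header signaling can be sketched as a small helper. This Python sketch uses the Sunset header (RFC 8594) and the Deprecation header (an IETF draft; `Deprecation: true` is one common convention); the version names, dates, and docs link are hypothetical:

```python
from datetime import datetime, timezone

# Hypothetical sunset schedule, decided when each version shipped.
SUNSET_DATES = {"v1": datetime(2025, 6, 30, tzinfo=timezone.utc)}

def deprecation_headers(api_version: str) -> dict:
    """Headers to attach to every response from a deprecated API version,
    so well-behaved clients can warn their operators ahead of the cutoff."""
    sunset = SUNSET_DATES.get(api_version)
    if sunset is None:
        return {}  # current versions get no extra headers
    return {
        "Deprecation": "true",
        "Sunset": sunset.strftime("%a, %d %b %Y %H:%M:%S GMT"),
        "Link": '</v2/docs>; rel="successor-version"',  # hypothetical docs URL
    }

headers = deprecation_headers("v1")
assert headers["Deprecation"] == "true"
assert "2025" in headers["Sunset"]
assert deprecation_headers("v2") == {}
```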
Database Migrations in Live Systems
Changing a database schema while the application is running is one of the most nerve-wracking things in backend engineering. On a small table it's trivial. On a table with 500 million rows that gets thousands of writes per second, it's a careful, multi-step process that can take weeks.
The naive approach and why it fails
The naive approach is: write a migration script, put the application in maintenance mode, run the script, bring the application back up. This works perfectly at small scale. It stops being acceptable when:
- The table is large enough that the migration takes hours
- The application has an SLA that doesn't allow hours of downtime
- The migration holds locks that block reads and writes
Most ALTER TABLE operations in MySQL and PostgreSQL take a lock on the table for the duration. Adding a column to a 1-billion-row table with a default value may need to rewrite the entire table on disk. That could take hours, and during that time, your application is effectively down.
The expand-contract pattern
The expand-contract pattern (also called parallel change) solves this by splitting every schema change into three phases that happen over multiple deployments.
Let's make this concrete with an example. Suppose you want to rename the column user_name to full_name. Here's the timeline:
Deployment 1 — Expand
```sql
-- Add new column without removing old one
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- (fast — no data rewrite, column is nullable)
```
Update application code to write to both user_name and full_name. Read from user_name (the reliable source, since full_name is still mostly empty).
Deployment 2 — Migrate (background job)
```sql
-- Backfill in batches, not a single UPDATE
UPDATE users
SET full_name = user_name
WHERE full_name IS NULL
  AND id BETWEEN :start_id AND :end_id;
```
Run this as a background job in batches of 10,000–50,000 rows with a small sleep between batches to avoid hammering the database. This can take hours or days on a large table, but it doesn't block anything.
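The driver for such a backfill is a simple loop. This runnable Python sketch demonstrates the pattern against an in-memory SQLite database (SQLite is used only so the example is self-contained; against PostgreSQL or MySQL you would batch by id range as in the SQL above, and use much larger batches):

```python
import sqlite3
import time

def backfill_full_name(conn, batch_size=10_000, sleep_seconds=0.0):
    """Copy user_name into full_name in small batches so no single
    statement holds locks for long. Returns total rows migrated."""
    total = 0
    while True:
        cur = conn.execute(
            """UPDATE users SET full_name = user_name
               WHERE id IN (SELECT id FROM users
                            WHERE full_name IS NULL LIMIT ?)""",
            (batch_size,),
        )
        conn.commit()                # commit each batch independently
        if cur.rowcount == 0:
            return total             # nothing left to migrate
        total += cur.rowcount
        time.sleep(sleep_seconds)    # breathing room between batches

# Demo on an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, "
             "user_name TEXT, full_name TEXT)")
conn.executemany("INSERT INTO users (user_name) VALUES (?)",
                 [(f"user{i}",) for i in range(25)])

assert backfill_full_name(conn, batch_size=10) == 25
assert conn.execute("SELECT COUNT(*) FROM users "
                    "WHERE full_name IS NULL").fetchone()[0] == 0
```

The job is idempotent (the `WHERE full_name IS NULL` guard means it can be safely restarted), which is exactly the property you want in a migration that runs for hours.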
Deployment 3 — Contract (switch reads)
Once the backfill is complete and verified, update application code to read from full_name instead of user_name. Keep writing to both for now (in case of rollback).
Deployment 4 — Contract (remove old column)
```sql
-- Only do this once you're certain no code reads user_name
ALTER TABLE users DROP COLUMN user_name;
```
This final step is usually fast in modern databases — it's a metadata change. In PostgreSQL, dropped columns don't immediately reclaim space; a subsequent VACUUM FULL (or the pg_repack extension) handles that.
The NOT NULL problem
Adding a NOT NULL column to an existing table is a common source of pain. In PostgreSQL before version 11, adding a column with a default required rewriting the entire table. In PostgreSQL 11+, adding a column with a non-volatile default is fast (the default is evaluated once and stored as metadata, not written to each row). But if the default is volatile — like random() or clock_timestamp() — or you need a genuinely per-row value, the fast path doesn't apply and you still need the expand-migrate-contract approach.
The safe pattern is always: add the column as nullable first, backfill, then add the NOT NULL constraint once all rows have values.
Safe way to add a NOT NULL column:

```sql
-- Step 1: add nullable (fast)
ALTER TABLE orders ADD COLUMN region VARCHAR(50);

-- Step 2: backfill (run as background job)
UPDATE orders SET region = 'us-east-1' WHERE region IS NULL;

-- Step 3: add the constraint without checking existing rows (fast)
ALTER TABLE orders
  ADD CONSTRAINT orders_region_not_null
  CHECK (region IS NOT NULL) NOT VALID;

-- Step 4: validate (scans the table, but doesn't block reads or writes)
ALTER TABLE orders
  VALIDATE CONSTRAINT orders_region_not_null;
```
The Strangler Fig Pattern — Replacing a Live System
Sometimes the problem isn't a schema change on a single table. Sometimes you need to replace an entire system — a legacy service with years of accumulated complexity — while it continues handling production traffic. The strangler fig pattern is the right way to do this.
The name comes from a type of tropical tree that grows around a host tree, gradually encasing it, until eventually the host dies and the strangler fig stands on its own. Martin Fowler coined the term in 2004, and it has become one of the most widely applied patterns in large-scale system migrations.
How it works in practice
Step 1 — Put a proxy in front. Before you write a single line of the new system, insert a routing layer between callers and the legacy system. At this point, 100% of traffic still goes to the old system — nothing has changed. But you now have a control point.
Step 2 — Build the new system in parallel. The new system doesn't have to handle everything from day one. Start with the simplest or most well-understood feature. The proxy routes that feature's traffic to the new system; everything else still hits the old one.
Step 3 — Shift traffic incrementally. Feature by feature, endpoint by endpoint, route more traffic to the new system. At each step you can measure correctness, performance, and error rates before proceeding. The old system is still running and can catch anything that falls through.
Step 4 — Shadow mode for risky parts. For the parts of the system where you can't afford errors (payment processing, inventory deduction), run in shadow mode first: send the request to both old and new systems, use the old system's response, but compare results. When the two systems agree consistently, switch to using the new system's response.
Step 5 — Decommission the legacy system. Once 100% of traffic is handled by the new system and you've run that way for long enough to trust it (typically 2–4 weeks minimum), you can finally shut down the old system. This step often takes much longer than teams expect — someone always discovers a dormant consumer that they forgot about.
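Shadow mode (step 4) is worth sketching, because the crucial property is easy to get wrong: the new system must never be able to break production traffic. A minimal Python sketch, with the two systems represented as plain callables:

```python
import logging

logger = logging.getLogger("shadow")

def shadow_call(old_fn, new_fn, request):
    """Send the request to both systems, serve the OLD system's answer,
    and record any disagreement for investigation."""
    old_result = old_fn(request)
    try:
        new_result = new_fn(request)
        if new_result != old_result:
            logger.warning("shadow mismatch for %r: old=%r new=%r",
                           request, old_result, new_result)
    except Exception:
        # A crash in the new system is logged, never surfaced to the caller.
        logger.exception("shadow call failed for %r", request)
    return old_result

legacy = lambda req: req * 2
rewrite_correct = lambda req: req * 2
rewrite_buggy = lambda req: req * 2 + 1

assert shadow_call(legacy, rewrite_correct, 21) == 42
assert shadow_call(legacy, rewrite_buggy, 21) == 42  # still serves old answer
```

In a real deployment you would also sample (shadowing 100% of traffic doubles load on shared dependencies) and take care that the new system's shadow writes never touch production state.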
The data migration problem
The strangler fig handles the traffic routing problem. But if the old and new systems use different data stores, you have a second problem: keeping data in sync while both systems are live.
The most reliable approach is the dual-write with reconciliation pattern:
- All writes go to both old and new data stores (either from the application or via change data capture from the old store).
- A reconciliation job periodically compares the two stores and reports discrepancies.
- When discrepancies reach zero consistently, you can cut over reads to the new store.
- Stop dual-writing only after the old store is fully decommissioned.
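The reconciliation job in the second step is conceptually a keyed diff. A minimal Python sketch, with the two stores modeled as dicts keyed by record id (a real job would page through both stores and compare checksums rather than full records):

```python
def reconcile(old_store: dict, new_store: dict) -> list:
    """Compare two stores keyed by record id and report every discrepancy.
    Cut-over is safe only when this list stays empty across repeated runs."""
    problems = []
    for key in old_store.keys() | new_store.keys():
        if key not in new_store:
            problems.append(("missing_in_new", key))
        elif key not in old_store:
            problems.append(("extra_in_new", key))
        elif old_store[key] != new_store[key]:
            problems.append(("mismatch", key))
    return problems

old = {1: "alice", 2: "bob", 3: "carol"}
new = {1: "alice", 2: "BOB"}  # record 2 differs, record 3 never arrived

report = reconcile(old, new)
assert ("mismatch", 2) in report
assert ("missing_in_new", 3) in report
assert reconcile(old, dict(old)) == []  # identical stores: clean report
```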
Teams do the hard part of the strangler fig well — they build the new system, shift traffic, validate it. Then they never complete step 5. The old system just keeps running. They pay for two systems indefinitely, the old codebase accumulates bitrot, and engineers are still afraid to touch it three years later. Set a hard decommission date before you start the migration. Put it in writing. Get stakeholder buy-in. Decommissioning is not optional maintenance; it's the payoff for the entire project.
When not to use the strangler fig
The strangler fig requires that traffic can be routed at a granular enough level to migrate piece by piece. This works well for HTTP APIs (route by path), message queues (route by topic or message type), and batch jobs (migrate job by job). It works less well when:
- The old system is deeply stateful and state cannot be migrated incrementally
- The old and new systems have fundamentally different data models with no clean mapping
- The system is a monolithic binary with no seams to insert a proxy into
In these cases, you may need a more aggressive approach — a fixed cutover date with heavy testing beforehand, and a rapid rollback plan if it fails.
Chapter Summary
Evolvability is the ability of a system to change — schema, API, or implementation — without breaking the things that depend on it. It doesn't happen by accident; it has to be designed in from the beginning.
The core rules
- New code must read old data (backward compat)
- Old code must survive new data (forward compat)
- Never reuse a deleted field's number or name
- Required fields in schemas are almost always a mistake
- Every API version you ship is a support liability
- Plan deprecation before you ship version one
The key techniques
- Protobuf field tags for safe binary evolution
- Avro schema resolution (writer + reader schema)
- Schema registry for versioned schema storage
- Expand-contract for database schema changes
- Batch backfills for large table migrations
- Strangler fig for replacing live systems
Three questions for your next design review
- If we deploy new code today and need to roll it back tomorrow, will the data written by new code be readable by the old code?
- Which fields in our schema or API response do consumers actually depend on, and what is our plan for removing or changing them?
- If we are replacing an existing system, what is the specific, dated plan to decommission the old one?