Part VI · Chapter 23 · Extensibility

API Design as a Distributed Systems Problem

Your code will change. Your data model will change. Your team will change. But once you publish an API and someone depends on it, that interface becomes one of the hardest things in your system to change. This chapter is about designing interfaces that can evolve — and the mistakes that make them impossible to change without breaking callers.

What's Coming in This Chapter

We start with a fundamental observation: an API is a distributed systems problem, not just a software design problem. The moment a second service calls your API, you have a coordination problem that code alone can't solve.

We'll look at the three main protocol choices — REST, gRPC, and GraphQL — not to pick a winner but to understand what each one costs and what it buys. Then we go deep on the things that actually break APIs in practice: versioning, pagination, error design, and backward compatibility. We close with a concrete checklist for reviewing any API before you publish it.

Key Learnings — Read This Before the Chapter, Review It After

  1. An API is a contract with your callers. Breaking it silently is the same as introducing a bug into every system that depends on you.
  2. REST, gRPC, and GraphQL are not competing philosophies — they solve different problems. Choosing the wrong one for your context costs more than just a migration.
  3. The most dangerous part of an API is not the endpoints you add — it is the fields and behaviors callers start depending on without realizing it.
  4. API versioning always feels manageable when you have one version. It becomes a maintenance burden at three versions, and a graveyard at five.
  5. Pagination is not a detail. A missing or broken pagination design has taken down production systems that are completely unrelated to the paginating service.
  6. "Idempotent by design" is the single most important property for any API that mutates state. It makes retries safe, distributed transactions simpler, and debugging tractable.
  7. Error responses deserve as much design effort as success responses. A vague error message forces callers to guess — and they will guess wrong.
  8. Rate limiting is not just about protecting your service. It is how you communicate to callers what "normal" usage looks like — and punishing legitimate callers who don't know the rules is a failure of API design.
  9. Backward compatibility is binary. Either every existing caller still works without changes, or you broke the API. There is no "mostly compatible."
  10. The best time to think about API evolution is before you publish the API. The second best time is before you add a field — not after it has been in production for a year.

1. The Interface Is the Most Permanent Thing You'll Build

When you change an internal function, you update the callers and move on. When you change a database table, you write a migration. Both of these are local problems — you control both sides of the change.

When you change an API, you have a coordination problem. You don't control your callers. They run on their own deploy cycles. Some of them are mobile apps that users haven't updated in two years. Some are third-party integrations you didn't know existed. Some are internal services whose owners changed jobs and nobody knows who to call.

This is why API design is a distributed systems problem. The interface between two services is not a software abstraction — it is a contract across an organizational and temporal boundary. Breaking it silently is equivalent to introducing a bug in every caller's system at once.

The practical implication: the moment you publish an API and someone depends on it, your ability to change that API freely is gone. You can still change it — but every change now has a cost that is paid by systems you don't fully control.

The Hyrum's Law Problem

Given enough users of an API, it does not matter what you promised in the documentation. All observable behaviors of your system will be depended upon by somebody. If your API returns a list in alphabetical order purely by accident, some caller is sorting by that assumption. If you fix the ordering, their code breaks — even though you never promised an order.

The lesson: every observable behavior of your API is a potential implicit contract. The only way to manage this is to be very explicit about what you guarantee and what you don't.

Code is temporary; interfaces are permanent

Most engineers spend more time thinking about the code behind an endpoint than the shape of the endpoint itself. This is backwards. You can rewrite the code behind an endpoint without touching callers. You cannot change the shape of the endpoint without affecting every caller.

Think of the interface as the decision that is hardest to reverse. It deserves proportionally more design time upfront. A useful heuristic: spend as much time designing the API shape as you would spend writing the initial implementation. Before you write a single line of handler code, ask: If I were a caller of this API, six months from now, would this shape still make sense?

2. Choosing a Protocol: REST, gRPC, or GraphQL

Let's clear something up first: there is no universally best protocol. REST, gRPC, and GraphQL were each built to solve a different problem. Using the wrong one for your context is not just a philosophical mistake — it creates real operational pain.

  • REST — best for public APIs, simple CRUD, and browser-native clients. Main trade-off: over-fetching and multiple round-trips for related data. Typical users: external callers, mobile clients, third parties.
  • gRPC — best for internal service-to-service calls, high throughput, and streaming. Main trade-off: not human-readable, harder to debug, poor browser support. Typical users: internal microservices, backend systems.
  • GraphQL — best for flexible queries and product APIs with diverse clients. Main trade-off: complexity, N+1 queries, hard caching, larger attack surface. Typical users: BFF (backend-for-frontend) layers, product teams with many clients.

Many organizations end up using all three in the right places: gRPC for internal service-to-service calls, REST for external public APIs, and GraphQL for the product-facing layer where different clients (web, iOS, Android) need different shapes of data. This is not inconsistency — it's appropriate tool selection.

3. REST in Depth

REST is the most widely used API style, and also the most widely misunderstood. Most "REST" APIs are actually just HTTP APIs with JSON. True REST — as Roy Fielding defined it — has six architectural constraints, one of which (HATEOAS — Hypermedia As The Engine Of Application State) almost nobody follows.

In practice, when people say "REST API," they mean: resources identified by URLs, HTTP verbs to express intent, JSON bodies, and stateless interactions. This is pragmatic and it works. The important question is not whether you're "truly RESTful" — it's whether you're consistent.

Resource modeling: nouns, not verbs

The most common REST design mistake is using URLs as a place to put verbs. URLs should identify things (resources), and HTTP verbs should express what you want to do with them.

Avoid — verb-based URLs

POST /createUser
GET /getUserById?id=123
POST /disableUserAccount

Prefer — resource-based URLs

POST /users
GET /users/123
PATCH /users/123 with body {"status": "disabled"}

There are cases where the action-based model is genuinely better — particularly for operations that don't map cleanly onto CRUD. Sending a message, running a job, transferring money between accounts — these are events, not resource mutations. In these cases, it is acceptable to model the action as a resource itself:

POST /accounts/123/transfers — creates a transfer resource, rather than pretending it is a simple update to the account.

HTTP verbs and their semantics

HTTP verbs carry semantic meaning that clients and infrastructure depend on. Using them incorrectly causes subtle bugs — particularly around caching and retries.

  • GET — safe and idempotent. Never causes side effects. Responses can be cached. Never use GET for an operation that changes state.
  • POST — not safe, not idempotent by default. Use for creation or actions. The server decides the resource URL.
  • PUT — idempotent. Replaces the entire resource. Sending the same PUT twice must produce the same result.
  • PATCH — partial update. Not required to be idempotent (though it can be, and you should try to make it so).
  • DELETE — idempotent. Deleting something that is already deleted should return 204, not 404 (because the postcondition — the thing doesn't exist — is satisfied).
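
To make these semantics concrete, here is a minimal sketch of a user resource in Python using Flask. The framework and the in-memory store are illustrative choices, not part of any contract; the point is how each verb maps to behavior, in particular that DELETE stays idempotent.

Example — verb semantics in a minimal handler (sketch)
from flask import Flask, jsonify, request

app = Flask(__name__)
users = {}  # id -> user dict (illustrative in-memory store)

@app.route("/users/<user_id>", methods=["GET"])
def get_user(user_id):
    # GET is safe: no side effects, responses are cacheable.
    user = users.get(user_id)
    return (jsonify(user), 200) if user else (jsonify({"error": "not_found"}), 404)

@app.route("/users/<user_id>", methods=["PUT"])
def put_user(user_id):
    # PUT replaces the whole resource; sending it twice leaves the same state.
    users[user_id] = request.get_json()
    return jsonify(users[user_id]), 200

@app.route("/users/<user_id>", methods=["DELETE"])
def delete_user(user_id):
    # DELETE is idempotent: deleting an already-deleted user still returns 204,
    # because the postcondition ("the user does not exist") holds either way.
    users.pop(user_id, None)
    return "", 204
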
The GET-with-side-effects Trap

Using GET for an operation that changes state is more than a design error — it is a reliability hazard. HTTP clients, proxies, and CDNs freely retry and cache GET requests. If your GET endpoint creates a resource or modifies state, those retries will cause duplicate writes. Browser prefetchers will also call GET links without user intent.

Status codes are part of the contract

HTTP status codes are a shared vocabulary between your API and its callers. Using them correctly means callers can handle responses generically, without special-casing every endpoint.

A few that are commonly misused:

  • 200 with an error in the body — don't do this. If the operation failed, use an appropriate 4xx or 5xx. Returning 200 with {"success": false} breaks retry logic, monitoring, and every client that checks the status code before reading the body.
  • 404 vs. 403 — if a resource exists but the caller isn't allowed to see it, returning 404 is sometimes intentional (security through obscurity — don't reveal what exists). But be consistent. If you sometimes return 403 and sometimes 404 for the same scenario, callers can't build reliable logic.
  • 500 for validation errors — a missing required field is a 400, not a 500. 5xx codes tell callers "something went wrong on our side, retry later." 4xx codes tell callers "you sent something wrong, fix your request before retrying." These have completely different retry semantics.
  • 201 for creation — when you create a resource, return 201 (Created) and include a Location header pointing to the new resource. This is the standard, and callers who follow it can automatically discover the resource URL.
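
The 201-plus-Location convention and the 400-versus-500 distinction are both decided at the handler level. A minimal sketch, again using Flask as an illustrative framework (the field names and error shape are assumptions, not a standard):

Example — creation and validation status codes (sketch)
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
users = {}

@app.route("/users", methods=["POST"])
def create_user():
    body = request.get_json(silent=True) or {}
    if "email" not in body:
        # Caller error: 400, not 500. Retrying the same request will not help.
        return jsonify({"error": {"code": "VALIDATION_ERROR",
                                  "message": "email is required"}}), 400
    user_id = str(uuid.uuid4())
    users[user_id] = body
    # Creation succeeded: 201 plus a Location header pointing at the new resource.
    return jsonify({"id": user_id}), 201, {"Location": f"/users/{user_id}"}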

4. gRPC in Depth

gRPC was built at Google to solve a specific problem: fast, efficient service-to-service communication at scale. It uses Protocol Buffers (protobufs) as the serialization format and HTTP/2 as the transport. Both choices matter.

Protobufs are binary — much smaller on the wire than JSON, and much faster to serialize and deserialize. HTTP/2 gives you multiplexing (multiple requests on one connection), header compression, and native streaming. For internal services that call each other thousands of times per second, this is a meaningful difference.

Why gRPC is an underrated API design tool

The thing most people overlook about gRPC is that its schema-first approach — you define your service in a .proto file and generate clients and servers from it — is actually a strong forcing function for good API design.

When you write a proto file, you are forced to think about the interface explicitly, before writing any implementation. The generated client code is strongly typed, which means callers get compile-time errors when the API changes incompatibly. This is the kind of safety net that JSON-over-HTTP rarely provides.

Example — a gRPC service definition
service UserService {
  rpc GetUser (GetUserRequest) returns (User);
  rpc ListUsers (ListUsersRequest) returns (ListUsersResponse);
  rpc UpdateUser (UpdateUserRequest) returns (User);
  rpc WatchUserEvents (WatchRequest) returns (stream UserEvent);
}

message User {
  string id = 1;
  string email = 2;
  string display_name = 3;
  google.protobuf.Timestamp created_at = 4;
}

Notice the last RPC: returns (stream UserEvent). Server-side streaming is a first-class concept in gRPC. This is something REST has no clean equivalent for — you'd reach for WebSockets or Server-Sent Events, both of which have their own complexity.

Where gRPC struggles

gRPC's weaknesses are real. Browsers cannot call gRPC services directly without a proxy (like gRPC-Web). The binary format, while efficient, is not human-readable — debugging a gRPC call is harder than curling a REST endpoint. Schema changes require recompiling clients. And the tooling ecosystem, while improving, is not as mature as the REST ecosystem.

These weaknesses make gRPC a poor choice for public-facing APIs where you don't control the clients. They matter less for internal service meshes where you do.

5. GraphQL in Depth

GraphQL solves a specific problem that REST doesn't solve well: clients that need different shapes of the same data. A mobile app might need a user's name and avatar. A desktop app might need their full profile, settings, and recent activity. With REST, you either have a fat endpoint that returns everything (over-fetching) or you build multiple endpoints (which then diverge over time). With GraphQL, the client specifies exactly what it needs.

This flexibility is genuinely valuable. But it comes with costs that are often underestimated at the start of a project.

The N+1 problem

The most common GraphQL performance problem: a client asks for a list of posts, each with its author's name. A naive implementation fetches the list of posts, then for each post, makes a separate database call to fetch the author. If there are 50 posts, that's 51 database calls for a single API request.

The solution is DataLoader — a batching mechanism that groups individual lookups into a single bulk query. But it requires deliberate implementation. In a REST API, over-fetching is visible (you return more fields than needed). N+1 is invisible until it shows up in a slow query log.
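
The batching idea is independent of any particular GraphQL library. A minimal sketch in Python (the resolver wiring and the fetch_authors_bulk helper are illustrative assumptions): collect the keys first, then issue one bulk lookup instead of one query per post.

Example — batching author lookups instead of N+1 (sketch)
# Naive resolution: one query per post's author (the N+1 pattern).
#   authors = [db.get_author(post["author_id"]) for post in posts]

def load_authors(posts, db):
    author_ids = {post["author_id"] for post in posts}   # dedupe the keys
    rows = db.fetch_authors_bulk(author_ids)              # e.g. WHERE id IN (...)
    by_id = {row["id"]: row for row in rows}
    return [by_id[post["author_id"]] for post in posts]   # preserve request order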

Caching is hard with GraphQL

REST caching is straightforward: a GET to /users/123 can be cached by URL. CDNs, browsers, and HTTP proxies all understand this. With GraphQL, almost everything goes through POST /graphql. The same URL, different body. HTTP caches key on the URL, so they see the same request every time.

You can work around this with persisted queries (pre-register query strings, use their hash as a cache key), but it adds complexity. For APIs where caching is critical — high-traffic, public-facing — this matters a lot.
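
A minimal sketch of the persisted-query idea (the registry, the hash scheme, and the function names are illustrative, not a standard): clients pre-register query text at build time, then send only the hash, which doubles as a stable cache key.

Example — persisted query registry (sketch)
import hashlib

persisted_queries = {}  # hash -> query text, populated at build/deploy time

def register(query_text):
    query_hash = hashlib.sha256(query_text.encode()).hexdigest()
    persisted_queries[query_hash] = query_text
    return query_hash

def resolve(query_hash):
    # At request time the client sends only the hash. Unknown hashes are rejected,
    # and the hash can serve as the HTTP cache key a plain POST /graphql lacks.
    if query_hash not in persisted_queries:
        raise KeyError("unknown persisted query")
    return persisted_queries[query_hash]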

Security surface area

GraphQL lets clients construct arbitrary queries. This means a malicious or careless client can write a deeply nested query that traverses your entire data graph and brings your database to its knees. You need query depth limits, query complexity limits, and query cost analysis before you expose GraphQL to untrusted callers.
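
A real depth limit should walk the parsed query AST; as a crude sketch of the idea (the limit and the brace-counting shortcut are illustrative only), you can approximate nesting depth from the raw query text and reject anything too deep before executing it.

Example — approximate query depth guard (sketch)
MAX_DEPTH = 8  # illustrative limit

def approximate_query_depth(query_text):
    depth = max_depth = 0
    for ch in query_text:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth

def check_depth(query_text):
    if approximate_query_depth(query_text) > MAX_DEPTH:
        raise ValueError("query exceeds maximum allowed depth")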

GraphQL is Not a Silver Bullet

GraphQL is excellent for product APIs consumed by teams that own both the client and the server. It becomes hard to manage when exposed publicly, because every possible query shape is now your API surface — and you can't easily know what shapes callers depend on.

If your use case is "a single team building a product with a web app and a mobile app," GraphQL is a good fit. If your use case is "a public API consumed by hundreds of third-party developers," REST is more predictable to maintain.

6. The API Contract — What You Actually Owe Callers

An API contract is not just the list of endpoints and their fields. It is everything observable about your API that callers might depend on. This is broader than most engineers realize.

The contract includes:

  • Schema — the fields in each request and response, their types, and which ones are required vs. optional.
  • Semantics — what each operation actually does, including side effects.
  • Ordering — if you return a list, is the order guaranteed? Callers will assume yes.
  • Nullability — can this field be null? If you don't say, callers will assume it can't be, and crash when it is.
  • Error conditions — under what circumstances do you return each error code?
  • Performance — if your p99 latency is 20ms and it suddenly becomes 200ms, callers whose timeouts were set to 100ms will start failing. Latency is part of the contract.
  • Rate limits — the maximum call rate callers can rely on.

Most API documentation covers the schema. Very few cover the rest. The gaps in the documentation become the source of most API-related incidents.

The principle of minimal surface area

Every field you add to a response is a field you might eventually need to remove, rename, or change the type of. Every field you expose is a field callers will start depending on. This creates a direct cost: the more fields you expose, the more constrained your future changes become.

The practical implication: don't expose fields you don't have a known use case for. "We might need this later" is not a good reason to put it in the API response today. Adding a field later is easy. Removing one is almost impossible.

Design Principle: Postel's Law, Carefully

Postel's Law says: be conservative in what you send, be liberal in what you accept. For APIs, this means: accept extra fields in requests (ignore what you don't understand), but be precise about what you return. This gives you room to add fields to requests later without breaking callers, while keeping the response surface predictable.

The caveat: being too liberal in what you accept can lead to "accidental dependencies" where callers send malformed data that happens to work, and then you can't tighten the validation without breaking them.

7. Versioning — The Trap That's Easy to Fall Into

Every API versioning strategy has a cost. The question is not "should I version my API?" — you should plan for it — but "what versioning strategy am I signing up to maintain?"

The three common strategies

URL versioning — /v1/users, /v2/users. This is the most common approach and the most visible. Callers can see which version they're using. The cost: you now maintain two (or more) complete implementations simultaneously. When a bug exists in v1, do you fix it there too? What does "sunset" mean — when can you stop serving v1?

Header versioning — Accept: application/vnd.myapi.v2+json. Keeps URLs clean. But it's harder to test in a browser, harder for callers to notice which version they're using, and the caching story is complicated (the URL is the same, but the response is different).

No explicit versioning — backward-compatible evolution. This is the hardest discipline but the cleanest outcome. You never create a v2 — instead, you evolve the API by only making backward-compatible changes. We'll cover what "backward-compatible" means in detail below.

The versioning graveyard

Here is what typically happens in practice: a team ships v1. It works. Two years later, they need a breaking change, so they ship v2. They plan to sunset v1 in six months. But some callers don't upgrade. Those callers turn out to be critical partners, or internal teams whose owners changed, or mobile app versions that are still in use. The six-month sunset becomes eighteen months. Meanwhile, every feature needs to be implemented in both versions. The team now spends 30% of its time maintaining a version it never wanted to keep.

At three versions, this is painful. At five versions, it can paralyze a team.

Sunset Policies Must Be Decided Before You Launch, Not After

Before you publish any versioned API, decide: what is the sunset policy? How long will you support a version after its successor launches? What is the process for notifying callers? Who is responsible for tracking which callers are on which version? These decisions feel premature when you have one version. They are extremely difficult to make after you have callers who depend on the old version.

The discipline of no-version evolution

The most durable approach — and the hardest — is to evolve the API without ever creating a new version. This requires strict backward-compatibility discipline. The rules are:

  • You can add new optional fields to responses. Callers that don't know about them will ignore them.
  • You can add new optional fields to requests. Existing callers won't send them; use sensible defaults.
  • You can add new endpoints. Existing callers won't call them.
  • You cannot remove fields from responses. Callers may depend on them.
  • You cannot change the type of a field (e.g., string to integer). Callers will fail to parse it.
  • You cannot change the meaning of a field. If status: "active" meant "paid and active" and you change it to mean "any active regardless of payment," you have broken callers silently.
  • You cannot make an optional field required. Callers who don't send it will start failing.
  • You cannot change error codes. Callers who catch a specific code for retry logic will break.

This list is restrictive, and that is the point. The discipline of backward-compatible evolution forces you to think hard before you add anything, because you can't easily take it back.
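
In practice, the "add with a default" rules are mechanical. A minimal sketch (the field names are illustrative): the handler reads the new optional request field with a default that matches the old behavior, so callers that predate the field are unaffected.

Example — adding an optional request field with a backward-compatible default (sketch)
def list_users(params, all_users):
    # "include_inactive" was added after launch. Old callers never send it,
    # and the default preserves the behavior they already depend on.
    include_inactive = params.get("include_inactive", False)
    if include_inactive:
        return all_users
    return [u for u in all_users if u.get("status") == "active"]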

8. Pagination, Filtering, and Sorting — The Details That Break Systems

Pagination is boring. It is also the source of a surprising number of production incidents. The reason: a missing or broken pagination design does not hurt you until the data grows. When it does, the caller that was fetching 50 records starts fetching 500,000. The query times out. The service goes down. Other services that share the database start failing. All because nobody thought carefully about pagination when the table had 100 rows.

Offset pagination — the naive approach and why it breaks

The most common pagination style: GET /users?page=3&per_page=50 or GET /users?offset=100&limit=50. The server runs SELECT * FROM users LIMIT 50 OFFSET 100.

This works fine for small datasets and human-facing UIs where "go to page 47" is a valid operation. It breaks in two important ways at scale:

First, performance. OFFSET 1000000 LIMIT 50 forces the database to fetch and discard one million rows before returning your fifty. The query gets slower the deeper into the dataset you go. At millions of rows, deep pagination can take seconds.

Second, consistency. If records are inserted or deleted between page fetches, the page boundaries shift. You may see the same record on two consecutive pages, or miss a record entirely. For background jobs that process all records, this can cause silent data loss.

Cursor-based pagination — the right default

The solution for most API use cases: cursor-based pagination. Instead of a page number or offset, the response includes an opaque cursor — a token that encodes your position in the result set. The next request includes that cursor, and the server returns results starting after that position.

Example — cursor pagination response shape
{
  "data": [
    { "id": "usr_123", "name": "Alice" },
    { "id": "usr_456", "name": "Bob" }
  ],
  "pagination": {
    "next_cursor": "eyJpZCI6InVzcl80NTYifQ==",
    "has_more": true
  }
}

Internally, the cursor typically encodes the ID (or a combination of sort fields) of the last record returned. The next query becomes WHERE id > :cursor_id LIMIT 50 — which is fast regardless of how far into the dataset you are, because it uses an index.

Cursors should be opaque to callers — a base64-encoded blob. This gives you the freedom to change the internal encoding (e.g., switch from ID-based to timestamp-based) without changing the API shape.
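
A minimal sketch of opaque cursor handling (the JSON-in-base64 encoding and the field names are one possible choice, not a standard): encode the last returned ID, decode it on the next request, and return everything after that position.

Example — encoding and consuming an opaque cursor (sketch)
import base64
import json

def encode_cursor(last_id):
    # Opaque to callers: they echo it back, they never parse it.
    return base64.urlsafe_b64encode(json.dumps({"id": last_id}).encode()).decode()

def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["id"]

def next_page(records, cursor=None, limit=50):
    last_id = decode_cursor(cursor) if cursor else ""
    # In-memory stand-in for: WHERE id > :cursor_id ORDER BY id LIMIT :limit
    page = sorted((r for r in records if r["id"] > last_id), key=lambda r: r["id"])[:limit]
    return {
        "data": page,
        "pagination": {
            "next_cursor": encode_cursor(page[-1]["id"]) if page else None,
            "has_more": len(page) == limit,  # approximation, good enough for a sketch
        },
    }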

Filtering and sorting

Filtering and sorting are API surface area, and therefore permanent commitments. Before you expose a filter parameter, ask: can you actually support this at scale? A filter that works on a table with 10,000 rows may not have an efficient index for 100 million rows.

A few things to decide explicitly when designing filters:

  • What fields can be filtered? Only expose filters you have indexes for (or are willing to build and maintain).
  • What operators are supported? Exact match only? Range queries? Prefix search? Full-text search? Each has different backend requirements.
  • Are filter combinations restricted? Some combinations might be fine individually but expensive together. A filter like status=active&created_after=2020-01-01 requires a composite index to be efficient.
  • What is the sort default? Sorting by ID is usually the cheapest. Sorting by created_at requires an index on that field. Make the default explicit — callers who don't specify a sort will depend on whatever you do by default.

The Unbounded Request Trap

Every API that returns a list should have a maximum page size. Without one, a single caller can fetch your entire dataset in one request — bringing down your database and making the endpoint unavailable for everyone else. Enforce a hard cap server-side. Document it. Return an error if the caller asks for more than the cap. This is not a limitation — it is a contract about what "reasonable usage" looks like.
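
Enforcing the cap takes only a few lines in the request-parsing path. A minimal sketch (the numbers are illustrative), rejecting out-of-range values rather than silently clamping them, so callers learn the contract:

Example — enforcing a hard page-size cap (sketch)
DEFAULT_PAGE_SIZE = 50
MAX_PAGE_SIZE = 200  # the hard, documented cap

def parse_page_size(raw_value):
    if raw_value is None:
        return DEFAULT_PAGE_SIZE
    size = int(raw_value)
    if size < 1 or size > MAX_PAGE_SIZE:
        raise ValueError(f"per_page must be between 1 and {MAX_PAGE_SIZE}")
    return size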

9. Error Design — The Half of the API Everyone Ignores

Error responses deserve as much design effort as success responses. In practice, they get a fraction of the attention. The result: callers get a 500 with no body, or a 400 with a message that says "invalid input" and nothing else. The caller has to guess what went wrong.

A well-designed error response answers three questions:

  1. What went wrong? (machine-readable error code)
  2. What does it mean? (human-readable message)
  3. What should I do about it? (is it retryable? should I change the request?)

Example — a well-structured error response
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "The request body contains invalid fields.",
    "details": [
      {
        "field": "email",
        "code": "INVALID_FORMAT",
        "message": "Must be a valid email address."
      },
      {
        "field": "birth_date",
        "code": "VALUE_OUT_OF_RANGE",
        "message": "Birth date cannot be in the future."
      }
    ],
    "request_id": "req_7xK9mP2vQ"
  }
}

The code field is machine-readable — callers can switch on it. The message is human-readable — useful for debugging and user-facing error messages. The details array points to specific fields when there are multiple validation errors. The request_id links to server-side logs — invaluable when a caller reports "it just said error" and you need to find the trace.

The retryability question

Every error response should communicate whether it is retryable. HTTP status codes encode this partially: 5xx errors are usually retryable (server problem, maybe transient), 4xx errors usually are not (client problem, retrying with the same request won't help). But this is not always obvious — a 429 (Too Many Requests) is very much retryable, just after a delay.

If your API has errors where the retryability is nuanced, be explicit:

Example — retryability hints in the error response
{
  "error": {
    "code": "RESOURCE_TEMPORARILY_UNAVAILABLE",
    "message": "The requested resource is locked by another operation.",
    "retryable": true,
    "retry_after_seconds": 5
  }
}

Callers who implement automatic retry logic will use this. The alternative — no guidance — means callers either don't retry at all (missing transient errors) or retry immediately in a tight loop (hammering your service during an outage).

10. Idempotency in APIs — Making Mutations Safe to Retry

An operation is idempotent if calling it multiple times produces the same result as calling it once. GET is naturally idempotent — reading a record doesn't change it. PUT is designed to be idempotent — setting a field to a value is the same whether you do it once or ten times. POST is the hard one, because it typically creates something new, and creating the same thing twice is a problem.

Why does this matter? Because networks are unreliable. A client sends a request. The server processes it and sends a response. The response is lost in transit. The client, seeing no response, times out and retries. If the original request was not idempotent, the server processes it again — and you now have a duplicate charge, a duplicate order, a duplicate record.

Idempotency keys

The standard solution for making POST requests idempotent: idempotency keys. The client generates a unique key (typically a UUID) for each logical operation and sends it in a header. The server stores the response associated with that key. If it sees the same key again, it returns the stored response instead of processing the request again.

Example — sending an idempotency key
POST /payments
Idempotency-Key: 7f3d2a1c-4b8e-4f2d-9a3c-1e5b7d9f2a4c
Content-Type: application/json

{
  "amount": 5000,
  "currency": "USD",
  "account_id": "acct_789"
}

The server's behavior: on the first request with this key, process the payment and store the result associated with the key. On any subsequent request with the same key, return the stored result without processing again.

Important implementation details:

  • The idempotency key must be scoped. A key generated by one user cannot reuse results from another user's request with the same key.
  • The stored result must have a bounded lifetime. Storing results forever is expensive; 24 hours is a common window.
  • If the same key is received with a different request body, that is a client error (the client is misusing the key), not a silent success.
  • The operation must be atomic with the storage of the result. If you process the payment but fail to store the result, the next retry will process it again.
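
A minimal sketch of the server side (the in-memory store, the scoping by account, and the body-hash check are illustrative; a real implementation needs a shared store with a TTL and an atomic process-and-record step):

Example — idempotency key handling on the server (sketch)
import hashlib
import json

idempotency_store = {}  # (account_id, key) -> {"body_hash": ..., "response": ...}

def handle_payment(account_id, idempotency_key, body, process):
    body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    record = idempotency_store.get((account_id, idempotency_key))
    if record is not None:
        if record["body_hash"] != body_hash:
            # Same key, different body: a client error, not a silent replay.
            return {"error": {"code": "IDEMPOTENCY_KEY_REUSED"}}, 422
        # Replay the stored result without processing the payment again.
        return record["response"], 200
    response = process(body)  # must be atomic with storing the result in production
    idempotency_store[(account_id, idempotency_key)] = {
        "body_hash": body_hash,
        "response": response,
    }
    return response, 201
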
Idempotency Keys and Exactly-Once Delivery

Idempotency keys give you at-least-once processing with deduplication — which is operationally equivalent to exactly-once from the caller's perspective. This is the most practical way to achieve "exactly-once" semantics in a distributed system. True exactly-once delivery at the network level is theoretically impossible (see the Two Generals Problem). Idempotent operations with deduplication are the real-world answer.

11. Rate Limiting — Communicating What "Normal" Looks Like

Most engineers think of rate limiting as a defensive mechanism: protect your service from being overwhelmed. That is one purpose. But there is a second purpose that is equally important: rate limiting communicates to callers what normal usage looks like.

If you have no rate limits, callers will develop usage patterns based on whatever works in practice. Some will implement aggressive polling. Some will fan out requests unnecessarily. When you then try to introduce rate limits to manage load, you find that legitimate callers are already exceeding them — and you're breaking their integration.

Define your rate limits before you have callers, not after.

What to communicate when you rate limit

When a caller hits a rate limit, the response should tell them:

  • That they are rate limited (429 Too Many Requests)
  • How long until they can try again (Retry-After header)
  • What their limit is and how much they have remaining (standard: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset)

Example — rate limit response headers
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714392030
Content-Type: application/json

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "You have exceeded your request quota. Please retry after 30 seconds."
  }
}

Rate limit by the right dimension

Rate limits are typically applied per API key, per user, or per IP. The right dimension depends on your API's threat model:

  • Per API key — right for most APIs. Each integration has its own quota. One caller going rogue doesn't affect others.
  • Per user — right when the same user can have multiple API keys or sessions. Prevents one user from bypassing a per-key limit by rotating keys.
  • Per IP — useful for unauthenticated endpoints, but fragile. NAT means many users can share one IP; rate-limiting the IP limits all of them together.
  • Globally — a hard ceiling regardless of the caller. This protects your service in the absolute worst case, but should be set high enough that no legitimate caller hits it.
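
A minimal sketch of a per-key token bucket (the capacity, refill rate, and in-memory state are illustrative; production limiters usually keep this state in a shared store such as Redis):

Example — per-key token bucket (sketch)
import time

RATE = 10.0       # tokens added per second (illustrative)
CAPACITY = 100.0  # burst size (illustrative)

buckets = {}  # api_key -> (tokens, last_refill_timestamp)

def allow_request(api_key):
    now = time.monotonic()
    tokens, last = buckets.get(api_key, (CAPACITY, now))
    tokens = min(CAPACITY, tokens + (now - last) * RATE)  # refill since last call
    if tokens < 1.0:
        return False  # the caller should receive 429 plus Retry-After
    buckets[api_key] = (tokens - 1.0, now)
    return True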

12. The Backward Compatibility Checklist

Before shipping any API change, run through this list. Every "no" on the safe list, or "yes" on the breaking list, means you are making a breaking change and need to version or find an alternative approach.

Safe changes — you can always do these

  • Adding a new optional field to a response
  • Adding a new optional field to a request with a sensible default
  • Adding a new endpoint
  • Adding a new enum value to a field you own (with the caveat that callers that switch exhaustively over the enum's values will need to handle the new one)
  • Making a previously required request field optional (callers that still send it continue to work)
  • Increasing a rate limit
  • Improving response latency

Breaking changes — these require a version or a migration plan

  • Removing a field from a response
  • Renaming a field (same as removing the old and adding the new)
  • Changing the type of a field (string to integer, object to array)
  • Changing the meaning of a field
  • Making a previously optional request field required
  • Changing the URL structure of existing endpoints
  • Changing error codes or error semantics
  • Changing authentication requirements
  • Decreasing a rate limit
  • Changing guaranteed ordering of list results
  • Removing an endpoint

The Expand-Contract Pattern for Migrations

When you must make a breaking change without a version bump, use expand-contract: First expand — add the new field alongside the old one, and populate both. Then wait until all callers have migrated to the new field. Then contract — remove the old field.

This requires patience and tracking which callers are on which field, but it avoids a hard cutover. It works well for renaming fields, changing types (support both for a transition period), and evolving enums.
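
A minimal sketch of the expand phase for a field rename (the field names are illustrative): the response carries the old and the new field side by side, populated from the same value, until no caller reads the old one.

Example — expand phase of a field rename (sketch)
def serialize_user(user):
    display_name = user["display_name"]
    return {
        "id": user["id"],
        "name": display_name,          # legacy field, removed in the contract phase
        "display_name": display_name,  # replacement field callers migrate to
        "email": user["email"],
    }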


Chapter 23 — End Summary

The Key Principle

An API is a contract with every caller, published across a temporal and organizational boundary. The interface you publish today will be harder to change than any code you write — design it like a decision you'll live with for years, because you will.

The Most Common Mistake

Treating API design as a coding problem instead of a coordination problem. Engineers design an API that works for today's requirements, expose it, and then discover — too late — that it cannot evolve without breaking callers they didn't know they had.

Three Questions for Your Next Design Review

  • If we needed to remove this field from the response in six months, what would break?
  • Have we defined explicit pagination limits and cursor semantics, or are we relying on callers to be reasonable?
  • Are all our state-mutating endpoints idempotent? If a caller retries the same request twice, what happens?