Part VI — Extensibility  ·  Chapter 25

Platform Thinking

Building Systems That Other Systems Build On

~45 min read
Platform Engineering
Developer Experience
API Design

📍 What's Coming in This Chapter

  • What makes something a platform instead of just a service — and why that difference changes everything about how you design it
  • The difference between primitives and solutions, and how getting this wrong leads to platforms nobody uses
  • The platform trap: why teams build what they think other teams need instead of what they actually need
  • How to treat developer experience as a real engineering problem, not a nice-to-have
  • The hidden cost of platform ownership — why it's never done and how to staff for it honestly
  • Real patterns for onboarding, versioning, and deprecation that keep a platform healthy as it grows
  • How to measure whether your platform is actually working

⚡ Key Learnings — Read This If You're Short on Time

  1. A platform is a product whose customers are engineers. It's not just a shared service. It's something other teams build on top of. That means their productivity and reliability depend on your design decisions. Treat it with that weight.
  2. Expose primitives, not solutions. A platform that solves one specific use case locks teams in. A platform that exposes the right building blocks lets teams compose solutions you never anticipated. But go too low-level and nobody can use it. The art is finding the right layer of abstraction.
  3. The platform trap is building in isolation. Talk to the teams who will use your platform before you build anything. Not once — continuously. The features you think they need and the features they actually need are usually different things.
  4. Developer experience is a first-class engineering concern. A platform that takes 3 days to onboard onto will not get adopted, no matter how technically excellent it is. Time-to-first-success is a metric. Treat it like p99 latency.
  5. Every API you publish is a long-term contract. The moment another team builds on your API, you own it. Breaking changes have a blast radius proportional to your adoption. Version carefully and deprecate honestly.
  6. Platform ownership never ends. A platform that ships and is declared "done" will slowly rot. Bugs accumulate, security patches get delayed, and teams stop trusting it. Budget for ongoing maintenance from day one — typically 30–40% of the team's capacity.
  7. Adoption is the only real success metric. A platform that nobody uses, no matter how clean its code, is a failed platform. Measure active teams, weekly API calls, time-to-onboard, and support ticket volume. These tell the truth.

What Is a Platform, Really?

The word "platform" gets used loosely. A team builds a shared authentication service and calls it a platform. Someone writes a common database library and puts "platform" in the repo name. But there's a real distinction worth making clearly.

A service does something for you. A platform lets you build something.

When you call a payment service, it processes a payment and returns a result. You're a consumer. When you build on a platform like AWS or Stripe, you're constructing something. The platform gives you raw capabilities — compute, storage, payment primitives — and your product is built on top of those. The platform exists not to serve one use case but to enable a class of use cases that the platform team can't even fully predict.

This distinction changes everything about how you design. A service is optimized for one job. A platform is optimized for enabling others to do jobs you haven't imagined yet. That means your design decisions have much higher leverage — and much higher cost when they're wrong.

💡 Insight

The test of whether you've built a platform vs. a service: can a team use it to solve a problem you didn't anticipate, without changing your code? If yes, it's a platform. If every new use case requires you to add something, it's a service pretending to be a platform.

The Multiplier Effect

Here's why platform thinking matters so much. If you build a feature that helps one team, you've created value for one team. If you build a platform that ten teams build products on, you've created value for ten teams — and also for all of their users. Your engineering leverage is 10x higher.

But this also means your mistakes are 10x more expensive. A bad API design that one team works around is a minor inconvenience. The same bad API design that ten teams have built on is a multi-year migration project. Getting the foundations right matters more when others are building on top of them.

Primitives vs. Solutions: The Most Important Design Choice

When you're building a platform, the most consequential decision is: how much should I solve for the user?

Imagine you're building a platform for sending notifications — emails, SMS, push notifications. You have two options:

Option A — Expose a solution: Build an API that takes a user ID and a message type ("welcome", "password_reset", "order_shipped") and handles everything else. Teams call notify(userId, "order_shipped") and they're done.

Option B — Expose primitives: Build an API that lets teams specify the channel (email/SMS/push), the template, the recipient, the priority, and the scheduling options. Teams compose these to build their own notification flows.

Option A is easier to use. Option B is more powerful. The question is: which is the right level of abstraction for a platform?

The answer is almost always: both, at different layers. The platform exposes primitives. The platform also ships high-level wrappers that implement the most common solutions on top of those primitives. Teams who need simple use cases use the wrappers. Teams who need custom behavior drop down to the primitives.

Notification Platform — Layers of Abstraction
┌──────────────────────────────────────────────┐
│ High-Level Wrappers (pre-built solutions)    │
│   notify(userId, "order_shipped")            │
│   notifyPasswordReset(userId)                │
└─────────────────────┬────────────────────────┘
                      │ built on top of
┌─────────────────────▼────────────────────────┐
│ Platform Primitives                          │
│   send(channel, template, recipient, opts)   │
│   schedule(job, at, priority)                │
│   template.render(id, vars)                  │
└─────────────────────┬────────────────────────┘
                      │ team builds on
┌─────────────────────▼────────────────────────┐
│ Product Team's Code                          │
│   Their custom notification logic,           │
│   batching rules, retry policies             │
└──────────────────────────────────────────────┘

The wrappers serve two purposes. They reduce friction for simple use cases. And they act as documentation — they show teams how to compose the primitives correctly, so the platform team's intent is clear.
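The layering above can be made concrete with a short sketch. The names here (send, notify_order_shipped) are hypothetical stand-ins for the chapter's notification example: a low-level primitive and a high-level wrapper composed entirely from it.

```python
def send(channel: str, template: str, recipient: str, **options) -> dict:
    """Platform primitive: deliver one message on one channel.

    A real implementation would enqueue the message; this sketch just
    returns a description of what would be sent.
    """
    return {"channel": channel, "template": template,
            "recipient": recipient, "options": options}


def notify_order_shipped(user_email: str, order_id: str) -> list:
    """High-level wrapper: the common 'order shipped' flow.

    Note that it contains no delivery logic of its own -- it only
    composes the primitive, which is exactly what a consumer team
    could also do for a flow the platform team never anticipated.
    """
    return [
        send("email", "order_shipped_email", user_email,
             vars={"order_id": order_id}),
        send("push", "order_shipped_push", user_email,
             vars={"order_id": order_id}, priority="low"),
    ]
```

Because the wrapper is ordinary code written against the public primitive, it doubles as executable documentation of how the platform team intends the primitive to be used.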

The Two Failure Modes: Too Low and Too High

Too low-level: You expose raw infrastructure. "Here are the Kafka topics, here's the schema registry, here's the offset management API." This is powerful in theory, but in practice most teams don't want to think at this level. They'll build brittle wrappers themselves, all slightly differently, and you'll end up with ten different in-house "notification clients" across the company, each with different bugs.

Too high-level: You solve every case up front. "We support welcome emails, password resets, and order updates." Now a team that needs to send a custom transactional email can't use your platform at all, or they need to file a ticket and wait for you to add their use case. You become a bottleneck.

⚠️ The Rule

The right abstraction level for a platform is the one where teams can solve their specific problems without your involvement after initial onboarding. If teams regularly file tickets asking you to add capabilities, your abstraction is too high. If teams regularly ask for examples and get confused, your abstraction is too low.

The Platform Trap: Building in Isolation

Here's the most common way platform projects fail. A team is given a mandate: "Build a platform that all our product teams can use for X." They're smart engineers. They think hard about the use cases. They design a clean API. They build it. Then they announce it.

And nobody adopts it.

Or teams try it, hit friction, and quietly build their own solutions. Or two teams adopt it, the other eight don't, and the platform team spends all their time serving the two early adopters.

What went wrong? They built what they thought teams needed, not what teams actually needed. And they found out after building, not before.

Talk to Your Users Before You Build Anything

This sounds obvious, but almost every platform team skips it. Before writing a line of platform code, spend two to four weeks doing this:

  1. Sit down with the five or six teams who will be your first users. Not their managers — the engineers who will actually write the integration code.
  2. Ask them what they're building today. Not what they need from your platform — what they're building. You want to understand their context, their constraints, and the problems they're trying to solve.
  3. Ask them what's painful about their current approach. What do they copy-paste across codebases? What do they wish they didn't have to care about? What keeps breaking?
  4. Show them rough API sketches. Not full designs — rough sketches. Watch where they hesitate. Watch what they say would be annoying to use. Watch what they get excited about.

This process, done honestly, will change your design. Not at the margins — it will change the core assumptions. The problems you thought mattered won't match the problems they actually have. This is not a failure of your thinking. It's just that you can't know what another team's daily work feels like until you ask.

"The features you think they need and the features they actually need are usually different things. Find out which is which before you build."

Your Internal Customers Are Still Customers

One mental shift that helps: treat internal teams as customers with real choices. They can build their own solution instead of using yours. They can use an external tool. They can just copy-paste. They're not forced to use your platform just because it exists.

This means you have to earn adoption the same way an external product earns customers: by being genuinely better than the alternative. The alternative for an internal team is usually "write it yourself." If writing it yourself takes two days and onboarding onto your platform takes three days, teams will write it themselves and you'll wonder why adoption is low.

Developer Experience Is a First-Class Engineering Concern

In the external software world, developer experience (DX) has become a recognized discipline. Companies have DX teams. There are DX-focused roles. The logic is simple: a developer tool that engineers love to use gets adopted. One that frustrates them doesn't.

Internal platform teams almost never apply this logic to themselves. They think: "Our engineers have to use this, they don't have a choice." But as we just established — they do have a choice. And even if they don't, frustrated engineers build on fragile foundations, make mistakes, and blame your platform when things go wrong.

Time-to-First-Success: The Metric That Matters

The single most important DX metric for a platform is time-to-first-success: how long does it take a new engineer, who has never used your platform before, to have their first thing actually working?

Not reading the documentation. Not understanding the architecture. Actually working — real code, doing a real thing, in their real environment.

If this number is over 30 minutes, you have a DX problem. If it's over 2 hours, you have a serious DX problem. If it's over a day, your platform will not get adopted by engineers who have a choice.

Measure this by watching engineers onboard — literally sitting with them while they try to get started. Don't help them. Watch where they get stuck. Every place they get stuck is a DX bug that you should fix with the same urgency as a production bug.

What Good DX Actually Looks Like

A working example before anything else. The first thing a new user should see is a complete, runnable example that does something non-trivial. Not "step 1: install the SDK, step 2: configure credentials, step 3: read the concepts page." One example that works, end to end, immediately.

Error messages that tell you what to do. When something goes wrong, the error message should say what went wrong and what to do about it. "Connection refused" is a bad error message. "Could not connect to the notification service at localhost:8080. Is the service running? See: docs.company.com/notification/local-setup" is a good error message. This takes real engineering effort to do well. Do it.

A local development experience. Engineers should be able to use your platform on their laptop without connecting to production infrastructure. This means local simulators, Docker-compose setups, or fake implementations. A platform that requires a VPN and production credentials to do basic development will see engineers work around it.

SDK over raw API. Always ship a client library in the languages your teams use. Raw HTTP/gRPC APIs are fine as the foundation, but nobody wants to write HTTP clients. Your SDK should handle authentication, retries, serialization, and circuit breaking. The caller should not think about any of this.

💡 Insight

The best platforms feel "obvious" in hindsight. When engineers finish reading your quickstart guide, they should think "of course it works this way." That feeling of obviousness is not accidental — it's the result of obsessive iteration, user testing, and willingness to throw away clever designs in favor of simple ones.

Every API You Publish Is a Long-Term Contract

When you're building an internal service for yourself, you can change the API whenever you want. The code that calls it and the code that implements it are often in the same codebase, and a refactor is a few hours of work.

When other teams build on your API, that changes completely. Every field you add, every parameter you accept, every response shape you return — all of it becomes a contract. Teams write integration code against your API. They build parsing logic, validation, and error handling around your specific response shapes. If you change any of it, their code breaks.

This means two things:

First, think hard before adding something. The moment you publish a field or an endpoint, you own it. Removing it is painful. Changing its semantics is painful. "We can always fix it later" is almost never true for an API with real consumers. The cost of getting it right up front is low. The cost of migration is high.

Second, version from the beginning. Even if you don't think you'll need it. A versioned API lets you ship v2 alongside v1 while teams migrate at their own pace. An unversioned API means you either break everyone simultaneously or you never make breaking changes and the API slowly accumulates cruft.

Versioning: The Practical Approach

There are several versioning strategies, and they have different trade-offs.

URL versioning — /v1/notify, /v2/notify. The version lives in the URL path; old and new endpoints coexist. Good for REST APIs, and easy to debug in logs. Pain point: running two versions in production indefinitely is expensive.

Header versioning — API-Version: 2024-01. The caller specifies the version in a header; same URL, different behavior. Good for clean URLs and easy deprecation of old versions. Pain point: harder to see in logs, and callers must remember to set the header.

Additive-only — Only add fields, never remove or change them; the schema stays backwards compatible forever. Good for event schemas and data pipelines. Pain point: old fields accumulate forever and the schema gets confusing.

Date-based versions — e.g. 2024-01-01. Stripe's model: each version is a snapshot of the API on that date. Good for APIs with frequent but small changes. Pain point: complex to implement, and requires tracking changes per version.

For most internal platforms, URL versioning or additive-only schemas work well. The key rule: never make a breaking change to a version that has active consumers. A breaking change means: removing a field, changing a field's type, changing the semantics of a field (even if the type stays the same), or removing an endpoint.
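The additive-only rule can be checked mechanically. This sketch assumes schemas are represented as plain dicts mapping field name to type name; a real system would run a check like this in CI against the schema registry.

```python
def is_backwards_compatible(old_schema: dict, new_schema: dict) -> bool:
    """True if every old field survives in the new schema with the
    same type. New fields may be added freely; removing a field or
    changing a type is a breaking change."""
    return all(
        field in new_schema and new_schema[field] == old_type
        for field, old_type in old_schema.items()
    )


v1 = {"user_id": "string", "amount": "int"}
v2_ok = {"user_id": "string", "amount": "int", "currency": "string"}  # added only
v2_bad = {"user_id": "string", "amount": "float"}  # changed a type: breaking
```

Note what this check cannot catch: a change in a field's semantics with the type unchanged. That class of breaking change only surfaces through review and communication, which is part of why the governance process later in this chapter matters.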

Deprecation: How to Remove the Old Without Breaking the World

At some point, you'll want to retire an old API version. This is good — unmaintained old versions accumulate security vulnerabilities and bugs. But doing it wrong is painful for everyone.

A good deprecation process looks like this:

  1. Announce early. Give teams at least one full quarter of notice before removing anything. Two quarters is better for large platforms. Post in the appropriate channel, send email, update documentation. Make it hard to miss.
  2. Identify all active consumers. Don't rely on teams to come to you. Look at your API metrics — which teams are still calling the old version? Go to those teams individually. A deprecation notice posted in a Slack channel doesn't reach the team that's on a different continent or has been heads-down on another project.
  3. Provide a migration guide. Don't just say "please upgrade to v2." Give them a step-by-step guide with code examples showing exactly what changes. If the migration is large, offer to help with the first one so the team has a template.
  4. Enforce with metrics, not promises. Track the number of v1 callers over time. Make this visible. Set a hard deadline and stick to it. Soft deadlines ("we'd like to remove this sometime next year") don't drive action.
  5. Remove it on the announced date. This is the part most platform teams skip. If you keep extending the deadline every time a team asks, you'll never remove anything. Keeping your word about removal is what makes future deprecation notices credible.
🚨 Anti-Pattern

The silent deprecation: you stop supporting v1, it starts failing intermittently, and teams find out when their product breaks at 2am. This destroys trust in the platform and is harder to recover from than almost any technical problem. Always deprecate loudly, with plenty of notice, and with active outreach to affected teams.

The Tax of Platform Ownership: It's Never Done

Here's what most platform projects look like in practice. The team gets a mandate, builds the platform, ships it with fanfare, and then considers themselves done. Now they want to move on to the next interesting problem.

Six months later, the platform has accumulated bugs that nobody's fixed. Two security patches haven't been applied because "we're busy." Three teams have encountered edge cases that fall outside what the platform was designed for, and they've built fragile workarounds. The documentation was accurate at launch and is now misleading. Support tickets are piling up.

A platform that ships and is declared "done" will slowly rot. This is not a metaphor. Software actively degrades without maintenance — the world around it changes even when the code doesn't.

The Maintenance Budget

Before you commit to building a platform, you need to have an honest conversation about ongoing staffing. A rough model that works in practice:

  • The initial build requires your full team capacity.
  • After launch, you'll need roughly 30–40% of team capacity permanently dedicated to maintenance: bugs, security patches, dependency upgrades, documentation updates, support tickets, and migration help for consumers.
  • The remaining capacity can be used for new features — but only after the maintenance work is current.

If your platform team is two engineers, that means most of one engineer's time, every week, goes to maintenance. If your manager expects two engineers to be shipping new features at full speed after launch, the platform will rot. This conversation needs to happen before you start building, not after your launch retrospective.

Choosing a Support Model

As your platform grows, support will become a significant time sink. There are a few ways to handle this.

Dedicated support rotation. One engineer per week is the "platform support engineer." They own all inbound questions, bug reports, and onboarding calls. This prevents context-switching for the rest of the team. The downside: this engineer is mostly reactive for the week, which some engineers dislike.

Self-service first. Invest heavily in documentation, runbooks, and error messages that help teams help themselves. Every support ticket that comes in gets answered, and then the platform team asks: "how do we make this question unnecessary?" The answer usually involves better documentation, better error messages, or better default behavior.

Community model. For mature platforms with many consumers, establish a Slack channel where both the platform team and experienced consumers answer questions. Consumer-to-consumer support is highly scalable and also gives the platform team signal about what's confusing without requiring a ticket.

Whatever model you choose, the key principle is: support ticket volume is a signal about DX quality, not just a chore to get through. Every recurring question is documentation that's missing. Every recurring error is a better default that you haven't implemented yet.

Measuring Whether Your Platform Is Working

A common mistake is measuring a platform by its technical properties: uptime, latency, test coverage, number of features. These are necessary but not sufficient. A platform can be perfectly reliable and still fail if nobody uses it.

The metrics that actually matter for a platform fall into three groups.

Adoption Metrics

Number of teams actively using the platform. "Active" means the team called the platform at least once in the last 30 days. This is your headline number. It should grow over time. If it's flat, you have an adoption problem. If it's declining, you have a trust problem.

Time-to-first-success for new teams. From the moment a new team decides to use your platform to the moment they have something working in production — how long does this take? Track it for every new team. The median should be falling over time as you improve DX.

Percentage of new projects using the platform vs. rolling their own. This tells you whether the platform has become the default or whether teams still default to building custom solutions. If 70% of new projects roll their own, there's a reason — find it.

Health Metrics

Support ticket volume per active team. If this is going up, something is getting more confusing. If it's going down, your DX investments are working.

Time to resolve a consumer-reported bug. When a team finds a bug in your platform, how long does it take you to fix it? If this number is weeks or months, teams will lose trust and start working around your platform instead of through it.

Documentation freshness. How many pages of your documentation are more than one release cycle out of date? Stale documentation is actively harmful — it's worse than no documentation because it leads teams in the wrong direction.

Business Impact Metrics

At the end of the day, a platform exists to speed up other teams. Measure this directly where you can.

Estimated engineering time saved. If your notification platform means each team doesn't need to build and maintain their own notification infrastructure, estimate how much time that saves per team per year. Multiply by adoption. This number tells your leadership why the platform team should continue to exist.

Incidents caused by platform vs. prevented by platform. Track both. A mature platform should prevent more incidents than it causes — because it handles things like retry logic, circuit breaking, and graceful degradation that product teams would implement poorly or not at all.

Platform Growth Pains: What Happens When It Works

If your platform is successful, it will eventually be used by more teams than you can personally know. This is when a new set of problems appears.

Governance: How Decisions Get Made

When you have two consumers, API decisions are easy — you talk to both teams and make a call. When you have twenty consumers, this breaks down. Every design decision has more stakeholders. Competing requests come in. Teams start building on undocumented behaviors that you didn't intend to support.

At this scale, you need a lightweight governance model. Not a committee or a bureaucracy. But a clear answer to: how do API changes get proposed, reviewed, and decided?

A model that works at medium scale (10–50 consumer teams):

  • All API changes are proposed as a short written document (not a ticket — a doc with problem statement, proposal, and alternatives considered)
  • The doc is shared with all consumer teams for a fixed review window (5 business days is usually enough)
  • The platform team makes the final call, but any team can block a change if they can show it would break them
  • All decisions, including the reasoning, are logged in a shared decision log

This process looks slow, but it's actually fast compared to the alternative: making changes silently and discovering 3 months later that two teams are broken and didn't say anything because they didn't know where to report it.

Avoiding Feature Capture

As adoption grows, you'll face pressure to add features that serve specific teams rather than the general platform. Team A needs a specialized retry policy. Team B wants a custom serialization format. Team C has a compliance requirement that needs a special audit log endpoint.

Each of these requests is reasonable in isolation. Together, they turn a clean platform into a complicated special-case machine that serves nobody well.

The discipline here is maintaining a clear platform boundary. Before adding something, ask: is this general, or is this a workaround for one team's specific situation? If it's general, build it. If it's specific, help the team build it as a layer on top of your primitives — and let them own it.

⚠️ Watch Out

Feature requests from your biggest or most influential consumers get the most political pressure behind them. The loudest teams are not always the ones with the most representative needs. Be especially skeptical of "urgent" requests that only serve one team — urgency often reflects their planning problems, not your platform's gaps.

Patterns from Real Platforms That Work

Let's look at some concrete patterns from successful internal platforms, and why they work.

The Golden Path

Popularized by Spotify, the "golden path" is a prescriptive, opinionated way to do common tasks using the platform. It's not the only way — teams can deviate — but it's the way the platform team recommends, supports, and optimizes.

The golden path for a new service might be: use this template repository, configure these three things, run this CLI command, and you have a running service that's already connected to your observability platform, your deployment pipeline, and your notification platform. Zero decisions required.

Teams that follow the golden path get full platform support. Teams that deviate are on their own for the deviated parts. This isn't punitive — it's just honest about where the platform team's expertise and bandwidth actually are.

The golden path works because it shifts the mental model from "here are building blocks, figure it out" to "here's the right way, here's why, and here's the escape hatch when you need something different."

Scaffolding and Code Generation

The fastest way to reduce time-to-first-success is to generate the boilerplate. If using your platform requires 50 lines of setup code, teams will type those 50 lines differently every time, make mistakes, and end up with slightly different configurations across codebases.

A better approach: provide a CLI tool that generates the boilerplate. platform init my-service --type=notification-consumer produces a working skeleton with all the right configuration, imports, and example code. The engineer fills in the business logic.

Code generation feels like a nice-to-have, but it has outsized impact on adoption. It's the difference between "I need to read the docs for an hour before I can write any code" and "I'm writing real code in 10 minutes."

Always Provide Escape Hatches

No platform covers every use case. Accepting this is important. If your platform doesn't have an escape hatch for unusual requirements, teams will either use your platform badly (hacking around limitations) or abandon it entirely for their special case — and then maybe for everything.

An escape hatch is a documented, supported way to drop down to a lower level of abstraction when you need to. The notification platform's escape hatch might be: direct access to the underlying message queue, so teams can implement custom delivery logic that the high-level API doesn't support.

The escape hatch should be harder to use than the main API — you want teams to reach for it only when necessary. But it should be well-documented and not embarrassing to use. "We designed this so you'd never need it" is a prediction, and it's usually wrong.

The Chapter Principle

A platform is not done when it ships — it's done when teams can build on it without your involvement. Everything before that point is building toward the platform; everything after is maintaining the trust that makes it worth building on.

The Most Common Mistake

  • Building in isolation — designing the API based on what you think teams need, finding out you were wrong after you've built it, and being too invested to start over.
  • Declaring the platform "done" after launch and moving on, leaving it to slowly rot while consumer teams lose trust.
  • Adding features for specific teams under political pressure, turning a clean platform into a tangled special-case machine.

Three Questions for Your Next Design Review

  1. Can a team use this platform to solve a problem we didn't anticipate, without filing a ticket or waiting for us to add something? If no — are we building a platform or a service?
  2. What is our time-to-first-success target, and have we actually measured it by watching a real engineer try to onboard without help?
  3. What is our maintenance budget after launch, and have we had an explicit conversation with our manager about what percentage of team capacity is permanently reserved for it?