How to ship changes to a live system without waking anyone up at 3 am.
Deploying software sounds simple. You build it, you ship it. But in a system with millions of users, real money moving around, and dozens of engineers pushing code every day, the act of deploying is itself a distributed systems problem. A bad deploy can take down your entire service in seconds. A good deployment system lets you ship dozens of times a day with confidence.
This chapter covers the techniques that make shipping safe:
- Deployment and release are not the same thing. Deploying puts code on servers. Releasing makes it visible to users. Feature flags let you do one without the other.
- A canary deployment is only as good as its success criteria. If you don't define what "good" looks like before you deploy, you're flying blind during the rollout.
- Blue-green is easy for stateless services, hard for databases. The database is almost always the part that makes zero-downtime deployments complicated.
- Never run a raw ALTER TABLE on a large live table. It locks the table. Use the expand-contract pattern or online schema change tools instead.
- The expand-contract pattern is a three-step process. Add the new thing (expand), migrate data, then remove the old thing (contract). Each step is a separate deploy.
- Build artifacts once, promote through environments. The same binary that passes staging should go to production — never build again for production.
- Feature flags accumulate technical debt. Every flag you create is a branch in your code. If you don't clean them up, you end up with a codebase that nobody fully understands.
- Rollback plans need to be practiced, not just documented. A rollback you've never tried is a rollback that will fail exactly when you need it most.
In the early days of a project, deploying is easy. You have one server, a small team, and maybe a few hundred users. You run a script, the new code goes up, and you watch the logs for a minute. If something breaks, you roll back. Simple.
Then the system grows. Now you have fifty servers instead of one. Millions of users who generate revenue every second. A database with two billion rows that can't just be taken offline. Teams in different time zones pushing code to the same codebase. A customer contract that promises 99.99% uptime.
Suddenly, deploying is not simple at all. The "just push the new version" approach becomes dangerous. A bug in your new code won't affect 100 users — it'll affect all of them, simultaneously, the moment you push. A database migration that takes two minutes will cause two minutes of downtime. On a busy system, two minutes of downtime might mean thousands of failed transactions and a very uncomfortable conversation with your VP.
This is the problem that deployment engineering solves: how do you change a running system without breaking it?
Before we talk about fancy techniques like canary deployments, we need to talk about the foundation: the deployment pipeline. This is the set of automated steps that every change goes through before it reaches users.
A good pipeline enforces one rule above all others:
Build the artifact once. Promote it through environments. The exact same binary that runs in your staging environment should be what you deploy to production. Never rebuild for production.
This sounds obvious, but teams violate it all the time. They build the code, run it in staging, see that it looks good, and then build it again for production. The problem: the build process itself can be non-deterministic. A dependency might have a new version. A build flag might be slightly different. The thing you tested is not the thing you shipped.
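Here is a minimal sketch of build-once-promote, assuming a Docker-based build and a hypothetical registry name; promotion only re-tags and re-pushes the image that already passed earlier stages, it never rebuilds:

```python
import subprocess

REGISTRY = "registry.example.com/myservice"  # hypothetical image registry

def git_sha() -> str:
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

def build_once() -> str:
    """Build and push exactly one immutable artifact, identified by the git SHA."""
    image = f"{REGISTRY}:{git_sha()}"
    subprocess.check_call(["docker", "build", "-t", image, "."])
    subprocess.check_call(["docker", "push", image])
    return image  # this exact image is what every later stage deploys

def promote(image: str, environment: str) -> None:
    """Promotion never rebuilds: it re-tags the same image for the next environment."""
    promoted = f"{REGISTRY}:{environment}"
    subprocess.check_call(["docker", "tag", image, promoted])
    subprocess.check_call(["docker", "push", promoted])
```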
A typical pipeline looks like this:
┌─────────┐    ┌──────────┐    ┌──────────────┐    ┌──────────┐    ┌────────────┐
│ Commit  │───▶│  Build   │───▶│    Tests     │───▶│ Staging  │───▶│ Production │
│ to Git  │    │ Artifact │    │ unit/integ/  │    │  Deploy  │    │   Deploy   │
└─────────┘    └──────────┘    │   contract   │    └──────────┘    └────────────┘
                    │          └──────────────┘         │
                    │                                    │
                    ▼                                    ▼
              ┌──────────┐                         ┌──────────┐
              │ Artifact │                         │  Smoke   │
              │ Registry │                         │  Tests   │
              └──────────┘                         └──────────┘
Each stage is a gate. If the artifact fails tests, it doesn't proceed to staging. If it fails smoke tests in staging, it doesn't proceed to production. This sounds slow, but with a well-built pipeline, it can all run in under 15 minutes.
Build — compile the code, run static analysis, check for known vulnerable dependencies, produce an immutable artifact with a unique identifier (usually a git SHA). This identifier follows the artifact everywhere it goes.
Unit and Integration Tests — fast tests that don't need a running production system. Should complete in a few minutes. If they take longer, teams start skipping them.
Contract Tests — verify that your service still speaks the language that its callers expect. These catch the kind of subtle API breaks that unit tests miss. We cover these in depth in Chapter 22.
Staging Deploy — deploy the artifact to an environment that mirrors production as closely as possible. Same configuration, similar data volumes, real downstream service connections where possible.
Smoke Tests — a small set of end-to-end tests that verify the critical paths work. Not comprehensive — just enough to catch "the service won't start" or "the homepage returns a 500" (a minimal sketch follows after this list).
Production Deploy — the artifact reaches users. But how it gets there — that's what the rest of this chapter is about.
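To make the smoke-test stage concrete, here's a minimal sketch; the staging URL and the endpoints are placeholders for your own service's critical paths:

```python
import sys
import requests

BASE_URL = "https://staging.example.com"  # hypothetical staging host

CRITICAL_PATHS = [
    "/healthz",             # does the service start and respond at all?
    "/",                    # does the homepage render?
    "/api/checkout/ping",   # hypothetical critical endpoint
]

def main() -> int:
    for path in CRITICAL_PATHS:
        resp = requests.get(BASE_URL + path, timeout=5)
        if resp.status_code >= 500:
            print(f"smoke test failed: {path} returned {resp.status_code}")
            return 1
        print(f"ok: {path} -> {resp.status_code}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```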
The most important shift in modern deployment thinking is this: deploying code and releasing a feature to users are two separate events.
Without feature flags, they're the same thing. You merge code, deploy it, and users immediately see the new behavior. If something is wrong, your only option is a rollback — which takes time, is stressful, and might cause its own problems.
With feature flags, you decouple them. You can deploy code to all your servers while keeping the new behavior completely invisible to users. You turn it on for a small group first. If that works, you expand. If something is wrong, you turn the flag off — instantly, without a deploy.
A feature flag rollback takes milliseconds. A code rollback takes minutes, sometimes longer. When something breaks in production, those minutes feel like hours.
Not all flags are the same. Treating them all the same way leads to a mess. Here are the four distinct types you'll encounter:
| Type | Purpose | Lifespan | Example |
|---|---|---|---|
| Release flag | Hide incomplete or risky features during development | Days to weeks. Delete after full rollout. | new_checkout_flow_enabled |
| Experiment flag | A/B testing. Show different behavior to different users to measure impact. | Weeks. Delete after experiment concludes. | recommendation_algorithm_v2 |
| Ops flag | Runtime circuit breakers. Turn off expensive features under load. | Long-lived. These are operational levers. | enable_realtime_notifications |
| Permission flag | Enable features for specific users, customers, or tiers. | Long-lived. Part of your authorization model. | beta_access_enabled |
The most important distinction here is between short-lived flags (release, experiment) and long-lived flags (ops, permission). Short-lived flags must be deleted. If you don't have a process for this, they pile up over months until you have a codebase riddled with branches that nobody understands.
The simplest implementation is a key-value store that your application reads at request time. Something like:
```python
# The calling code looks like this
if feature_flags.is_enabled("new_checkout_flow", user_id=current_user.id):
    return render_new_checkout()
else:
    return render_old_checkout()
```
The flag service resolves is_enabled based on rules you configure: percentage rollout, specific user IDs, user attributes like country or account tier, or simply on/off globally.
One important detail: evaluate flags at the start of a request, not mid-way through. If a user's flag evaluation changes during a multi-step transaction, you can end up in a half-new, half-old state. Evaluate once, store the result for the duration of that request.
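Here is a minimal sketch of such a flag service, assuming an in-memory rules dictionary and hypothetical class names; a real system backs this with a database or a vendor SDK, but the two ideas above (stable percentage bucketing and once-per-request evaluation) look the same:

```python
import hashlib

class FeatureFlags:
    """Resolves flags from an in-memory rules dict (a stand-in for your flag store)."""

    def __init__(self, rules: dict):
        # Example rule:
        # {"new_checkout_flow": {"enabled": True, "percentage": 5, "allow_user_ids": {42}}}
        self.rules = rules

    def is_enabled(self, flag: str, user_id: int) -> bool:
        rule = self.rules.get(flag)
        if not rule or not rule.get("enabled", False):
            return False
        if user_id in rule.get("allow_user_ids", set()):
            return True
        # Hash flag name + user ID into a stable bucket in [0, 100); the same user
        # always lands in the same bucket, so a 5% rollout is a stable 5% of users.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < rule.get("percentage", 0)

class RequestFlags:
    """Per-request wrapper: evaluate each flag once and reuse the answer."""

    def __init__(self, flags: FeatureFlags, user_id: int):
        self._flags = flags
        self._user_id = user_id
        self._cache: dict = {}

    def is_enabled(self, flag: str) -> bool:
        if flag not in self._cache:
            self._cache[flag] = self._flags.is_enabled(flag, self._user_id)
        return self._cache[flag]
```

A request handler would construct RequestFlags(feature_flags, current_user.id) once at the top of the request and pass it down; every later check of the same flag during that request returns the same answer.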
One of the most useful things you can do with feature flags is run code in production without showing users the results. You execute the new code path, log the output, compare it to the old code path, but return the old result to the user. This is called a dark launch or shadow mode.
Dark launches let you answer the most important pre-release question: does this code work under real production traffic, with real data, at real scale? No staging environment perfectly replicates this. A dark launch does.
Dark launches are especially powerful for: new search algorithms (compare result quality), new database queries (compare performance and output), and new third-party integrations (check reliability before it affects users).
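Here is a minimal sketch of the pattern, with hypothetical old_search and new_search placeholders: run both paths, log mismatches and the shadow path's latency, and always return the old result to the user.

```python
import logging
import time

log = logging.getLogger("dark_launch")

def old_search(query: str) -> list:
    return []  # placeholder: existing, proven implementation

def new_search(query: str) -> list:
    return []  # placeholder: new implementation being dark-launched

def search(query: str, flags) -> list:
    old_result = old_search(query)

    if flags.is_enabled("shadow_new_search"):
        try:
            start = time.monotonic()
            new_result = new_search(query)
            elapsed_ms = (time.monotonic() - start) * 1000
            if new_result != old_result:
                log.info("shadow mismatch: query=%r old=%d new=%d latency_ms=%.1f",
                         query, len(old_result), len(new_result), elapsed_ms)
        except Exception:
            # A bug in the shadow path must never affect the user.
            log.exception("shadow path failed for query=%r", query)

    return old_result  # users always get the old, proven behavior
```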
Feature flags are not free. Each flag is a branch in your code. Two flags that can both be on or off give you four possible code paths. Ten flags give you 1024. Most of those combinations are never tested.
Teams that don't clean up flags end up with what engineers call flag debt — a codebase where nobody is sure which flags are still active, what the code does with them off, or whether removing one might break something. The fix is simple but requires discipline: every short-lived flag gets a deletion ticket created when the flag is created. Not "we'll clean it up later." A ticket, with a date.
Even with feature flags, there's a class of risk that flags can't protect you from: bugs that only appear under production load, with production data, in the production environment. Your staging environment is never a perfect replica. Sometimes the only way to know if a change is safe is to run it in production — carefully.
A canary deployment does exactly this. You deploy the new version to a small subset of your servers while the rest continue running the old version. Real users get served by both versions. You watch the metrics from the new version carefully. If they look healthy, you gradually increase the percentage. If they look wrong, you pull back immediately.
       Load Balancer
             │
             ├──── 95% ────▶  [ v1.2 servers ]   (old version — stable)
             │
             └────  5% ────▶  [ v1.3 servers ]   (new version — being watched)
                                      │
                                      ▼
                              Metrics collector
                  (error rate, latency, business metrics)
The name comes from the old mining practice of bringing canary birds into coal mines. If the canary died, miners knew there was dangerous gas and got out. A small set of servers plays the canary — they'll show signs of a bad deploy before the whole fleet is affected.
This is the most important piece of advice in this section, and the most commonly ignored: you must define what "healthy" looks like before you start the deployment, not during it.
When you're mid-deployment and seeing some metrics go up, you'll be under pressure. Your manager is watching. The team has been waiting weeks to ship this. In that moment, it is very tempting to say "that error rate increase looks like noise" or "latency went up but it'll probably settle down." Without pre-defined thresholds, you're making emotional decisions instead of data-driven ones.
Before every canary deployment, write down:
- the metrics you'll watch (error rate, latency, and the business metrics that matter for this change)
- the threshold on each metric that halts the rollout
- how long you'll soak at each traffic level before expanding
Then automate the halting. A canary analysis tool that watches these metrics and automatically stops the rollout if thresholds are breached removes human judgment from the danger zone. Netflix's Kayenta, which plugs into the Spinnaker deployment platform, is a well-known example of automated canary analysis. You don't need that sophistication immediately, but you do need automation.
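A minimal sketch of that automation, assuming a metrics() callable that reports the canary's current values and a halt_rollout() hook into your deploy tooling (both placeholders, not any particular tool's API):

```python
import time

# Written down BEFORE the rollout starts, not negotiated mid-deploy.
THRESHOLDS = {
    "error_rate": 0.01,      # halt if more than 1% of canary requests fail
    "p99_latency_ms": 800,   # halt if canary p99 latency exceeds 800 ms
}

def analyze_canary(metrics, halt_rollout, soak_seconds=900, interval=30) -> bool:
    """Watch the canary for soak_seconds; halt the moment any threshold is breached.

    `metrics` returns the canary's current values, e.g. {"error_rate": 0.002, ...};
    `halt_rollout` is whatever hook your deploy tooling exposes to stop and roll back.
    """
    deadline = time.monotonic() + soak_seconds
    while time.monotonic() < deadline:
        current = metrics()
        for name, limit in THRESHOLDS.items():
            if current[name] > limit:
                halt_rollout(reason=f"{name}={current[name]} exceeded {limit}")
                return False
        time.sleep(interval)
    return True  # soak completed within thresholds; safe to expand traffic
```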
A typical canary rollout looks like this:
- deploy the new version to a small slice of the fleet (the 5% in the diagram above) and let it soak while its metrics are compared against your thresholds
- if the thresholds hold, expand in stages (for example to 25%, then 50%, then 100%), soaking at each step
- if any threshold is breached at any point, route all traffic back to the old version and investigate
The time windows matter. A bug that only appears when a cache expires after 20 minutes won't show up in a 5-minute soak. Choose your soak time based on how long a typical user session or transaction cycle takes in your system.
One complication: if your application has any statefulness in the request flow, you need to make sure a single user doesn't bounce between the old and new version mid-session. A user who starts checkout on v1.2 and completes it on v1.3 might encounter state format mismatches.
The fix is sticky routing during canary: route based on user ID (e.g., users whose ID hash ends in 0-4 always go to the canary). This ensures each user sees a consistent version for their entire session, and the canary population is a stable, representative sample of users.
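A minimal sketch of that hash-based assignment; the function and its parameters are illustrative rather than any particular load balancer's API:

```python
import hashlib

def routes_to_canary(user_id: int, canary_percent: int) -> bool:
    """Stable assignment: the same user always lands in the same bucket."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100   # bucket in [0, 100)
    return bucket < canary_percent   # canary_percent=5 -> roughly 5% of users

# Raising canary_percent from 5 to 25 only adds users to the canary;
# nobody who was already on the new version bounces back to the old one.
```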
A blue-green deployment takes a different approach. Instead of sending a fraction of traffic to the new version, you maintain two complete, identical environments — call them blue and green. At any time, one is live (serving all production traffic) and the other is idle (or running the new version under test).
When you're ready to release, you flip the router. All traffic moves from blue to green in one step. Blue becomes the standby.
Before release:

┌──────────┐        ┌──────────────┐        ┌──────────────┐
│  Users   │──────▶ │    Router    │──────▶ │ BLUE (live)  │  v1.2
└──────────┘        └──────────────┘        └──────────────┘
                                            ┌──────────────┐
                                            │ GREEN (idle) │  v1.3 (ready)
                                            └──────────────┘

After release (flip the router):

┌──────────┐        ┌──────────────┐        ┌──────────────┐
│  Users   │──────▶ │    Router    │──────▶ │ GREEN (live) │  v1.3
└──────────┘        └──────────────┘        └──────────────┘
                                            ┌──────────────┐
                                            │ BLUE (idle)  │  v1.2 (rollback ready)
                                            └──────────────┘
The advantage over a canary is simplicity: no gradually increasing percentages, no mixed-version traffic, no sticky session complications. The flip is instant. And rollback is just as instant — flip back.
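A minimal sketch of the flip, assuming the router follows an active-pool key in a small config store (the store client here is hypothetical):

```python
ACTIVE_POOL_KEY = "router/active_pool"

def flip(store) -> str:
    """Flip all traffic to the other pool; returns the new live pool name."""
    current = store.get(ACTIVE_POOL_KEY)           # "blue" or "green"
    target = "green" if current == "blue" else "blue"
    store.set(ACTIVE_POOL_KEY, target)             # the router follows this key
    return target

def rollback(store) -> str:
    return flip(store)  # rolling back is literally the same one-step operation
```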
There's a practical issue that trips up many blue-green implementations: cache warmup. When you flip from blue to green, green's caches are cold. It hasn't served any traffic recently. The first wave of users after the flip may see much higher latencies than normal as caches fill up, even if the code is perfectly healthy.
This can be mitigated by sending a small trickle of traffic to green before the flip — enough to warm the caches but not enough to matter if there are bugs. Some teams run green at 1-2% traffic continuously before a release for this reason.
The obvious downside of blue-green is that you're running two production-sized environments simultaneously. For most application servers, this is fine — compute is cheap and the idle environment doesn't need to handle full load. But if your production environment includes expensive resources like large database instances or high-capacity storage, running two full copies becomes expensive quickly.
The practical answer most teams arrive at: use blue-green for the stateless application tier (cheap), but use canary or rolling deployments for anything that includes stateful infrastructure.
This brings us to the central challenge of blue-green deployments: you can switch application servers in an instant, but you can't have two separate databases. Both blue and green need to talk to the same database — because that's where your data actually lives.
This means your new version (green) must be able to read and write data in a format that the old version (blue) can also read and write. If your deploy includes a database schema change, you have a problem: the new version might write data in a new format that the old version doesn't understand. If you need to roll back, the old version is now looking at data it can't interpret.
This problem — how to change your database schema without breaking a running system — is so important that it deserves its own section.
Application code can be deployed and rolled back in minutes. Database schemas are much stickier. Once you've written data in a new format, rolling back the code is only half the problem — you also need to handle the data that was written by the new format.
Engineers who are new to large-scale systems often treat database migrations as an afterthought: "we'll just run the migration script when we deploy." This works fine when your table has ten thousand rows. It does not work when your table has ten billion rows, is being written to by thousands of requests per second, and cannot be taken offline.
The naive approach is to run something like:
```sql
-- Renaming a column. Seems simple.
ALTER TABLE users RENAME COLUMN full_name TO display_name;
```
On a small table, this runs in milliseconds. On a table with 500 million rows, in most databases, this acquires an exclusive lock that blocks all reads and writes for the duration. Depending on the operation and your database engine, "duration" might mean seconds, minutes, or hours.
Even "simple" changes like adding a non-null column with a default value can be dangerous. The database may need to rewrite every row to set the default. That's a full table scan. On a busy system, this is an outage.
Never run a schema migration that requires a table lock on a large, live, production table without testing it first on a replica with production-scale data. The time it takes on a 1,000-row dev table has almost no relation to how long it takes on a 500 million-row production table.
The expand-contract pattern (also called parallel change) is the standard solution for zero-downtime schema migrations. The key insight is that instead of one atomic schema change, you break the migration into multiple backwards-compatible steps, each shipped as a separate deployment.
Let's walk through a concrete example: you want to rename a column from full_name to display_name.
The wrong way is to rename it in one step. The moment the migration runs, all code using full_name breaks.
The right way has three phases:
Phase 1 — EXPAND (add the new column, keep the old one)
─────────────────────────────────────────────────────────
Deploy:   Add display_name column.
          Write code to write to BOTH full_name AND display_name.
          Read from full_name (old column still authoritative).
Migrate:  Backfill display_name for all existing rows.

Schema:   [ id | full_name | display_name ]   ← both columns exist
Code:     writes to both, reads from full_name

Phase 2 — SWITCH (switch reads to the new column)
─────────────────────────────────────────────────────────
Deploy:   Change code to read from display_name.
          Still write to both columns.
Verify:   All reads now use display_name.
          full_name writes continue for safe rollback.

Schema:   [ id | full_name | display_name ]   ← both columns exist
Code:     writes to both, reads from display_name

Phase 3 — CONTRACT (remove the old column)
─────────────────────────────────────────────────────────
Deploy:   Stop writing to full_name.
Migrate:  DROP COLUMN full_name.

Schema:   [ id | display_name ]   ← clean
Code:     writes to display_name only
Each phase is a separate deploy. Each phase is fully backwards-compatible with the previous one. You can pause between phases, verify things look healthy, and proceed. If something goes wrong during Phase 2, you roll back to the Phase 1 code — and everything still works because both columns still exist and both were being written to.
This is the part teams most often try to shortcut. "Why can't we do all three phases in one deploy? It's just three SQL statements."
The reason is that during a rolling deploy — even a fast one — you will always have a moment where old code and new code are running simultaneously. If you do the migration and the code change in the same deploy, during that transition window:
- the new code (reading display_name) is running on some servers
- the old code (reading full_name) is running on other servers
- full_name was just dropped — so old code is reading a column that no longer exists

Every server still running the old code throws an error. That's a partial outage for the duration of the deploy. Depending on how long your rolling deploy takes, that could be minutes of errors.
Separate phases eliminate this. During Phase 1, both old and new code can coexist safely because both columns exist. During Phase 3, both old and new code can coexist safely because both were stopped from reading full_name in Phase 2.
In Phase 1, after adding the new column, you need to fill it in for all existing rows. For a table with tens of millions of rows, this is not a simple UPDATE users SET display_name = full_name. That update locks rows as it goes. On a busy table, it will cause contention, slow down production queries, and potentially time out or fill up your transaction log.
The safe approach for large backfills is to do it in batches:
```python
# Backfill in batches of 10,000 rows.
# Sleep between batches to give the database breathing room.
# Run this as a background job, not in the deploy script.
# 'db' is whatever database handle your application already uses.
import time

BATCH_SIZE = 10_000
last_id = 0
while True:
    # Grab the next batch of rows that still need the backfill.
    rows = db.execute("""
        SELECT id FROM users
        WHERE id > %(last_id)s AND display_name IS NULL
        ORDER BY id
        LIMIT %(limit)s
    """, {"last_id": last_id, "limit": BATCH_SIZE}).fetchall()
    if not rows:
        break
    ids = [row[0] for row in rows]
    db.execute("""
        UPDATE users
        SET display_name = full_name
        WHERE id = ANY(%(ids)s)
    """, {"ids": ids})
    last_id = ids[-1]    # track progress by the highest id processed so far
    time.sleep(0.1)      # brief pause between batches
```
Running backfills as background jobs rather than inline in deploy scripts has another advantage: you don't need to wait for the backfill to finish before proceeding with the deploy. The deploy completes, the system is healthy, and the backfill runs in the background over minutes or hours. New rows get both columns written by the application code. Old rows get filled in gradually by the background job.
For some types of migrations — particularly changes that MySQL or PostgreSQL would normally do with a full table lock — there are specialized tools that perform the migration online without locking the table.
gh-ost (GitHub's Online Schema Migrator) and pt-online-schema-change (Percona Toolkit) both work similarly: they create a shadow copy of the table with the new schema, replay all writes to both tables, and then atomically swap the tables when the copy is complete. The lock window is only the final swap — milliseconds, not minutes.
These tools are complex and have their own failure modes. They're worth understanding and having in your toolkit, but they're a supplement to the expand-contract pattern, not a replacement for it. gh-ost won't save you from the application-level compatibility problems that expand-contract addresses.
Before running any migration on a production database with significant data:
- test it first on a replica with production-scale data, as described above; how long it takes on a small dev table tells you nothing
- use EXPLAIN or your database's documentation to understand whether the operation requires a lock and at what granularity

For many teams, canary and blue-green are the right tools for big, risky releases. But for everyday deployments — small bug fixes, minor changes, config updates — a simpler approach works well: the rolling deployment.
A rolling deployment replaces servers one at a time (or in small groups), waiting for each replacement to become healthy before continuing. At any point during the rollout, some servers are running the old version and some are running the new version. This is acceptable as long as both versions can coexist — which means both versions must be able to read and write the same data format (back to the expand-contract pattern).
Start:  [v1.2] [v1.2] [v1.2] [v1.2] [v1.2] [v1.2]
Step 1: [v1.3] [v1.2] [v1.2] [v1.2] [v1.2] [v1.2]
        └─ health check passes ─┘
Step 2: [v1.3] [v1.3] [v1.2] [v1.2] [v1.2] [v1.2]
        └─ health check passes ─┘
...
Done:   [v1.3] [v1.3] [v1.3] [v1.3] [v1.3] [v1.3]
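A minimal sketch of that loop, assuming hypothetical deploy_to and is_healthy helpers for your own infrastructure:

```python
import time

def rolling_deploy(servers, version, deploy_to, is_healthy, health_timeout=120):
    """Replace servers one at a time; stop immediately if one fails its health check."""
    for server in servers:
        deploy_to(server, version)
        deadline = time.monotonic() + health_timeout
        while not is_healthy(server):
            if time.monotonic() > deadline:
                print(f"halting rollout: {server} did not become healthy on {version}")
                return False
            time.sleep(5)
    return True  # the whole fleet is on the new version and passing health checks
```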
The rolling deployment's weakness is that if something is wrong with the new version, you might be 50% through the rollout before enough failures accumulate to make the problem obvious. This is why canary (with pre-defined thresholds) is better for riskier changes — it catches problems faster, at lower blast radius.
These techniques are not mutually exclusive. A mature deployment setup uses them in combination:
| Change Type | Recommended Strategy | Why |
|---|---|---|
| Config change | Rolling deploy + feature flag | Low risk, instant rollback via flag |
| Small bug fix | Rolling deploy | Fast, low risk, backwards compatible |
| New feature | Feature flag + canary | Ship the code, release independently, validate in production |
| Database schema change | Expand-contract (multiple deploys) | Safety requires multiple backwards-compatible steps |
| Risky refactor | Dark launch → canary → full rollout | Validate behavior in production before users see it |
| Major version change | Blue-green with soak period | Clean cutover, instant rollback if needed |
One final point that gets skipped in most deployment discussions: a rollback plan you've never tested is not a rollback plan. It's a wish.
Teams write rollback procedures in runbooks and never execute them until the moment they're needed — usually during a high-pressure incident. That's the worst time to discover that the rollback script has a bug, that it takes three times longer than expected, or that rolling back the application code while leaving the new database schema in place causes a different set of errors.
Test your rollback procedures in staging. Time them. Make sure the people on call know how to execute them. The five minutes you spend testing a rollback in a calm environment might save you forty minutes of chaos during an incident.
After every major release that involved a database migration, run a drill: simulate needing to roll back the application code. Does the old code work against the current schema? If it does, great — you've confirmed your expand-contract work held up. If it doesn't, you've found a gap before it became a 3 am problem.