Most engineers treat cost as someone else's problem — ops handles the bill, finance handles the budget. But the decisions that drive cloud spend are made at a keyboard, not in a spreadsheet. The choice of where to store data, how many times a service calls another service, whether you run your own database or use a managed one — these are architectural decisions, and they have price tags.
In this chapter we look at cloud cost from the inside out. We'll go through the hidden data transfer tax that catches most teams off guard, the art of matching storage to access patterns, how to think about the real total cost of open-source software, and how to build a culture where the teams who create cost can also see it.
- Cost is an architecture decision: The biggest cost drivers — how data moves, how storage is used, which services talk to which — are baked in at design time, not at billing time.
- Data transfer is the most commonly overlooked cost: Moving bytes between availability zones, regions, or out to the internet has a price. In data-heavy systems this can dwarf compute costs.
- Storage tiers exist because access patterns differ: Putting everything in hot storage is lazy and expensive. Putting everything in cold storage is cheap and unusable. Match the tier to how often you actually read the data.
- Open-source is never free: The license costs nothing. The engineers who operate it, the incidents they handle, and the on-call rotation that wakes up at 3am — those are the real costs. Managed services often win on total cost of ownership.
- Reserved capacity is an architectural commitment: Committing to reserved instances forces you to understand your baseline load, which makes you a better architect.
- You can't optimize what you can't see: Until cost is attributed to the team and service that caused it, no one has the right incentives to reduce it.
- Performance and cost are not opposites: Caching, batching, and better data models often improve both simultaneously. The cheapest request is the one you never make.
Why Cost Is an Architecture Problem
Here's a common story. A team builds a new feature. It ships. A month later, the cloud bill is up by $40,000. Someone files a ticket with the infrastructure team. The infrastructure team looks at the bill, sees a service that's calling another service thousands of times per second, and sends an email asking what happened. The engineering team says they had no idea it cost that much. Nobody is lying. Nobody was trying to be wasteful. The cost was invisible at design time and only visible after the fact.
This is the normal state in most engineering organizations. And it's a design flaw, not a people flaw.
The decisions that drive cost happen at the keyboard:
- Calling a remote service inside a tight loop instead of batching the calls
- Storing every event in a primary database instead of offloading to cheaper object storage
- Deploying a service to three availability zones when one would do
- Returning full objects when callers only need two fields
- Polling every 5 seconds when a webhook would do
None of these are bad engineering in isolation. But each one has a cost, and if engineers don't know what that cost is — or don't have any visibility into it — they'll make these choices without thinking twice.
The goal of this chapter is to make cost legible. Not to make you obsess over every byte, but to give you enough of a mental model that cost becomes a first-class input to your design decisions, alongside latency, reliability, and maintainability.
The Data Transfer Tax
If there's one cost concept that consistently surprises engineers, it's data transfer. Not storage. Not compute. Moving data.
Cloud providers charge you for bytes that leave their network. The exact numbers vary and change over time, but the structure is consistent across AWS, GCP, and Azure:
| Traffic Type | Approximate Cost | Notes |
|---|---|---|
| Within the same AZ | Free (or near-free) | Same availability zone, same region |
| Cross-AZ within a region | ~$0.01/GB in each direction | Charged on both the sending and receiving side, so effectively $0.02/GB moved |
| Cross-region | ~$0.02–$0.09/GB | Varies by region pair |
| Egress to the internet | ~$0.08–$0.09/GB | First 1–10 GB/month often free |
| CDN egress | ~$0.01–$0.02/GB | Much cheaper — and often faster |
Let's make this concrete. Suppose you have a service that handles user profile lookups. Each profile response is about 8 KB. Your service runs in us-east-1a but your database read replicas are in us-east-1b (a different AZ) for redundancy. You handle 500 requests per second.
Requests per second: 500
Response size: 8 KB
Data per second: 500 × 8 KB = 4 MB/s = ~345 GB/day
Cross-AZ transfer cost: $0.01/GB, charged on both sides
Effective cost: $0.02/GB transferred
Daily cost: 345 GB × $0.02 = $6.90/day
Monthly cost: ~$207/month
Annual cost: ~$2,500/year
This is just for profile lookups.
Add your other services and you can easily reach $50K–$100K/year in data transfer alone.
This isn't a hypothetical. Teams running microservices architectures with 20–30 services often discover that their data transfer costs rival or exceed their compute costs, because every service call crosses an AZ boundary.
The Three Patterns That Drive Transfer Costs
1. Chatty microservices
When a single user request fans out to 8 downstream services, each of which calls 2–3 more, you're sending dozens of small HTTP requests per user-facing request. Each one crosses an AZ boundary. The individual cost is tiny. The aggregate cost at scale is not.
The mitigation is not to abandon microservices — it's to be intentional about what goes where. Services that talk to each other constantly should be in the same AZ. Use service mesh features to enforce locality. Prefer aggregating calls at the edge (BFF pattern) over letting each client call 10 different services.
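To make the edge-aggregation idea concrete, here is a minimal sketch of a BFF endpoint that fans out to two downstream services server-side and returns one lean payload. The service URLs, field names, and the use of httpx are illustrative, not a prescription:

```python
# Hypothetical BFF aggregator: the client makes one call; the BFF, deployed
# alongside the downstream services, fans out internally and returns only the
# fields the client actually renders. Service URLs and fields are illustrative.
import asyncio
import httpx

DOWNSTREAM = {
    "profile": "http://profile-svc.internal/users/{id}",
    "orders": "http://order-svc.internal/users/{id}/orders?limit=5",
}

async def get_user_summary(user_id: str) -> dict:
    async with httpx.AsyncClient(timeout=2.0) as client:
        profile_resp, orders_resp = await asyncio.gather(
            client.get(DOWNSTREAM["profile"].format(id=user_id)),
            client.get(DOWNSTREAM["orders"].format(id=user_id)),
        )
    profile, orders = profile_resp.json(), orders_resp.json()
    # Trim to what the caller needs instead of forwarding full objects.
    return {
        "name": profile["name"],
        "avatar_url": profile["avatar_url"],
        "recent_orders": [o["id"] for o in orders],
    }
```

The client makes one request instead of ten, and the chatty traffic stays inside the cluster, where locality rules can keep it within a single AZ.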
2. Large payloads with small reads
A service returns a 200 KB JSON blob because that's what the schema says it should return. The caller extracts 3 fields and discards the rest. You paid for 200 KB of transfer. You needed about 200 bytes.
GraphQL was partly born from this exact problem — let clients ask for exactly what they need. But you don't need GraphQL to fix it. Sparse fieldsets in REST APIs, projection queries in databases, and deliberate API design that returns lean payloads by default are all effective.
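As a minimal sketch of sparse fieldsets in practice (the `fields` parameter, table, and column names are hypothetical; sqlite3 stands in for whatever database you use):

```python
# Sparse fieldsets: the caller asks for ?fields=id,email and we both project
# the database query and trim the response to those columns.
import sqlite3

ALLOWED_FIELDS = {"id", "email", "name", "created_at"}

def get_user(conn: sqlite3.Connection, user_id: int, fields_param: str | None) -> dict:
    requested = set((fields_param or "id,email,name,created_at").split(","))
    fields = sorted(requested & ALLOWED_FIELDS) or ["id"]
    # Project at the database level too, so we never read columns we won't return.
    row = conn.execute(
        f"SELECT {', '.join(fields)} FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return dict(zip(fields, row)) if row else {}
```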
3. Replicated data between regions
Running a multi-region active-active setup means your writes replicate across regions. If your system writes 1 TB per day and replicates to two other regions, that's 2 TB of cross-region transfer every day, which works out to roughly $1,200–$5,400/month at the per-GB rates above (depending on the region pair), just for replication.
This doesn't mean multi-region is wrong. It means it has a cost you need to plan for, and it should factor into your decision about how much data to replicate and how often.
Storage Tiering — Matching Cost to Access Pattern
Cloud object storage (S3, GCS, Azure Blob) has multiple tiers. Engineers often put everything in the standard/hot tier because it's the default and they don't have to think about it. This is expensive.
The economics of storage tiering work because different data has different access patterns. A user's recent transaction history is read frequently. A user's transaction history from three years ago is almost never read, except during an audit. Storing both at the same price makes no sense.
| Tier | Storage Cost (approx) | Retrieval Cost | Retrieval Latency | Right For |
|---|---|---|---|---|
| Standard (Hot) | ~$0.023/GB/month | Negligible | Milliseconds | Data accessed multiple times per week |
| Infrequent Access | ~$0.0125/GB/month | ~$0.01/GB | Milliseconds | Data accessed once a month or less |
| Glacier Instant | ~$0.004/GB/month | ~$0.03/GB | Milliseconds | Data accessed a few times per year |
| Glacier Deep Archive | ~$0.00099/GB/month | ~$0.02/GB + hours wait | Hours | Compliance archives, almost never accessed |
The math here matters. If you have 100 TB of data that's more than 90 days old and rarely accessed, moving it from Standard to Infrequent Access saves roughly:
100 TB = 100,000 GB
Standard: 100,000 × $0.023 = $2,300/month
Infrequent Access: 100,000 × $0.0125 = $1,250/month
Monthly savings: $1,050
Annual savings: $12,600
And that's just one data set.
Lifecycle Policies Are Not Optional
Every object store supports lifecycle policies — rules that automatically move objects between tiers based on age or last-accessed time. Setting these up is not premature optimization. It's basic hygiene.
A simple starting policy for most systems:
Day 0–30: Standard (hot) — active users, recent events, current state
Day 30–90: Standard-IA — still occasionally queried, e.g. reports
Day 90–365: Glacier Instant — mostly dormant, compliance access only
Day 365+: Glacier Deep Archive — pure compliance, assume you'll never read it
The exact thresholds depend on your access patterns. The point is to have a policy and enforce it automatically, rather than letting data accumulate in hot storage indefinitely.
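Expressed as code, the policy above might look like the following boto3 sketch; the bucket name is hypothetical and the thresholds should be tuned to your own access patterns:

```python
# Lifecycle rule matching the tiers above: Standard -> Standard-IA at 30 days,
# Glacier Instant Retrieval at 90 days, Deep Archive at 365 days.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-event-archive",  # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-by-age",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```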
The Retrieval Trap
Here's where teams get burned: they move data to Glacier to save on storage costs, then run a batch job that reads all of it. The storage savings evaporate and then some.
The retrieval cost structure is designed to make cold storage cheap for data you truly rarely touch. If you're accessing data in "archive" tier more than a few times per year, it's probably in the wrong tier.
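A rough worked example, using the approximate prices in the table above and a hypothetical 100 TB data set:

100 TB in Glacier Instant instead of Standard: saves ~$1,900/month in storage
One batch job that reads all 100 TB: 100,000 GB × $0.03/GB = $3,000 in retrieval
Run that job twice a month and the "savings" are negative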
Storage Tiering for Databases
Tiering isn't just for object storage. It applies to databases too, though the mechanism is different.
The classic problem: you have a Postgres database that holds 5 years of order history. 99% of reads hit orders from the last 90 days. But you're paying for a beefy instance with fast SSD storage to hold all 5 years because they're all in the same table.
The solution is data archival — periodically moving old rows from the primary database to cheaper storage (S3, BigQuery, a read-optimized columnar store). The primary database stays lean and fast. Old data is still queryable, just with higher latency, which is acceptable because nobody's waiting on it interactively.
This pattern has a name: hot-warm-cold data architecture. The key discipline is deciding, upfront, which queries need to hit which tier, and setting service-level agreements accordingly. If someone runs an annual report over 5 years of data and it takes 30 seconds, that's fine. If that query is being run in a user-facing product that expects a 200ms response, you have a mismatch.
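A minimal sketch of the archival step, assuming a Postgres orders table and an S3 archive bucket (names, the 90-day window, and the single-object export are illustrative; a production job would batch by date, verify the archive, and be idempotent):

```python
# Move orders older than 90 days out of the hot Postgres table into S3,
# then delete them from the primary. Illustrative names; no error handling.
import io
import boto3
import psycopg2

ARCHIVE_BUCKET = "example-orders-archive"
CUTOFF_SQL = "now() - interval '90 days'"

def archive_old_orders() -> None:
    s3 = boto3.client("s3")
    conn = psycopg2.connect("dbname=shop")  # connection string is illustrative
    with conn, conn.cursor() as cur:
        # Export the cold rows as CSV straight out of Postgres.
        buf = io.StringIO()
        cur.copy_expert(
            f"COPY (SELECT * FROM orders WHERE created_at < {CUTOFF_SQL}) "
            "TO STDOUT WITH CSV HEADER",
            buf,
        )
        s3.put_object(
            Bucket=ARCHIVE_BUCKET,
            Key="orders/archived-to-90-days.csv",  # a real job would key by date range
            Body=buf.getvalue().encode(),
        )
        # Only after the archive object is written do we trim the hot table.
        cur.execute(f"DELETE FROM orders WHERE created_at < {CUTOFF_SQL}")
    conn.close()
```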
Compute Cost — Matching Capacity to Demand
Compute costs are the most visible line on the cloud bill, but they're often not the biggest lever. Still, there's significant money left on the table in most organizations' compute purchasing strategy.
The Three Purchasing Models
Cloud providers offer the same compute at very different prices depending on how you commit.
| Model | Cost vs. On-Demand | Commitment | Best For | Risk |
|---|---|---|---|---|
| On-Demand | 100% (baseline) | None | Unpredictable or short workloads | None (you just pay more) |
| Reserved (1yr) | ~60% of baseline (~40% cheaper) | 1 year | Stable baseline load | Stranded cost if load drops |
| Reserved (3yr) | ~40% of baseline (~60% cheaper) | 3 years | Core infrastructure | High stranded-cost risk |
| Spot / Preemptible | ~10–30% of baseline (~70–90% cheaper) | None (can be reclaimed) | Batch, ML training, fault-tolerant jobs | Nodes can disappear mid-job |
The right strategy for most production systems is a mix:
- Reserved instances for your predictable baseline — the capacity you're running at 2am on a Tuesday. This is your floor.
- On-demand for bursting above the baseline during peak hours.
- Spot/Preemptible for background batch workloads that can tolerate interruption — nightly ETL jobs, model training, index builds.
The discipline required to use reserved instances well is the same discipline required to understand your traffic patterns. If you don't know your baseline load, you can't commit to reserved capacity responsibly. Committing too much leaves you paying for idle capacity. Committing too little means you're leaving 40–60% savings on the table.
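A hypothetical blend, with illustrative instance counts and an illustrative $0.10/hour on-demand rate, shows how the three models combine:

Baseline (20 instances, 730 hrs/month, reserved at ~60% of on-demand): 20 × 730 × $0.06 = ~$876
Peak burst (15 extra instances, ~6 hrs/day, on-demand): 15 × 180 × $0.10 = ~$270
Nightly ETL (10 instances, 4 hrs/night, spot at ~25% of on-demand): 10 × 120 × $0.025 = ~$30
Blended total: ~$1,176/month, versus ~$1,850/month if everything ran on-demand (roughly 36% saved)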
Right-Sizing — The Unsexy Savings
Most organizations are running instances that are 2–3x larger than they need to be. This happens for understandable reasons:
- An engineer picked the size conservatively during initial deploy and nobody revisited it
- A service was migrated from a physical server to a cloud instance with "at least as much RAM"
- CPU and memory utilization metrics weren't set up, so nobody noticed the machine was 15% utilized
Right-sizing is about looking at actual CPU and memory utilization over time (not just the peak you saw once during an incident) and downsizing or consolidating instances to match real demand. A service running at 10% CPU on a 32-core machine could probably run on a 4-core machine with headroom to spare.
The reason this is "unsexy" is that it requires saying "let's make this smaller" and then waiting nervously to see if it breaks. Nobody gets promoted for right-sizing. But it's one of the most straightforward ways to cut a cloud bill by 20–30%.
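Gathering the utilization evidence can be as simple as the following boto3 sketch (the instance ID and the 14-day window are illustrative; memory metrics need the CloudWatch agent, so only CPU is shown):

```python
# Pull two weeks of average and peak CPU for an instance so the downsizing
# conversation starts from data, not from the size someone picked at launch.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

def cpu_utilization(instance_id: str, days: int = 14) -> dict:
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,  # hourly datapoints
        Statistics=["Average", "Maximum"],
    )
    points = stats["Datapoints"]
    if not points:
        return {}
    return {
        "avg_cpu_pct": sum(p["Average"] for p in points) / len(points),
        "peak_cpu_pct": max(p["Maximum"] for p in points),
    }

print(cpu_utilization("i-0123456789abcdef0"))  # illustrative instance ID
```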
Build vs. Buy — The True Cost of Open-Source
Open-source software has no license fee. This leads to a dangerously incomplete mental model: "Kafka is free, so it costs us nothing to run our own Kafka cluster."
The license is the smallest part of the total cost of ownership (TCO). Let's build a real picture.
A Concrete Example: Self-Managed Kafka vs. Managed Kafka
Suppose your team needs a message queue handling 500 MB/s peak throughput. You're evaluating running your own Apache Kafka cluster versus using a managed service like Confluent Cloud or Amazon MSK.
Self-managed:
Infrastructure:
6 broker nodes (r5.2xlarge, reserved 1yr): ~$800/month each = $4,800/month
3 ZooKeeper nodes: ~$300/month total
Storage (10 TB, gp3): ~$800/month
Cross-AZ replication traffic: ~$400/month
Total infrastructure: ~$6,300/month
Engineering time:
Initial setup + tuning (one-time): ~3 weeks × 2 engineers
Ongoing operations (patching, tuning, capacity planning, incident response): ~0.3 FTE/year
At $200K all-in engineer cost: $60,000/year = $5,000/month
Incidents:
Major Kafka incident per quarter (avg): ~2 days of senior engineer time
4 × 2 days × $1,000/day (opportunity cost): $8,000/year = ~$670/month
Total self-managed per month: ~$12,000/month
Total self-managed per year: ~$144,000/year
Managed service:
Service fee, for 500 MB/s peak, moderate retention: ~$8,000–$12,000/month
Engineering time:
Setup + integration: ~1 week × 1 engineer (one-time)
Ongoing operations: ~0.05 FTE/year = $830/month
Total managed per month: ~$8,800–$12,800/month
Total managed per year: ~$106,000–$154,000/year
The numbers are surprisingly close. Sometimes managed is cheaper all-in. Sometimes self-managed wins. What the comparison makes clear is that engineering time is almost always the dominant cost, and it's the one that's most often left out of the analysis.
When Self-Managed Makes Sense
Self-managing infrastructure is not always wrong. There are legitimate reasons to do it:
- Control requirements: Your data cannot leave your VPC for compliance reasons. You need features not available in the managed offering. You need custom tuning beyond what the managed service exposes.
- Scale economics: At very large scale — tens of TB/day, hundreds of nodes — managed service pricing becomes extreme and the economics flip. Companies like Netflix, Uber, and Airbnb run their own infrastructure partly because they're big enough to amortize the engineering cost.
- Deep expertise already exists: If you already have 3 engineers who are Kafka experts and a proven runbook, the marginal cost of operating Kafka is much lower than if you're starting from scratch.
When Managed Makes Sense
For most teams, most of the time, managed services win:
- You're small: A 5-engineer team should not be running its own database cluster, message queue, and search infrastructure. The ops burden crowds out product work.
- Your expertise is elsewhere: A company whose core competency is payments should not be building deep Kafka expertise unless the queue is genuinely central to their technical edge.
- Reliability requirements are high: Managed services have dedicated reliability teams. Unless you're prepared to match that, the managed service's SLA is probably better than yours.
Cost and Performance Are Not Opposites
A common misconception: making a system faster always makes it more expensive. This is sometimes true, but often the opposite is true. Many optimizations simultaneously reduce latency and reduce cost.
The Cheapest Request Is the One You Don't Make
Caching a database response doesn't just make reads faster — it also means fewer database calls, which means less compute on the database server, which means you can run a smaller database instance, or fewer of them. A cache that handles 80% of reads is not just a latency optimization. It's a cost multiplier on your database fleet.
Similarly, batching writes instead of writing one row at a time reduces database connection overhead, reduces per-request latency, and often allows you to use a smaller database instance.
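A minimal illustration of the batching point, with sqlite3 standing in for any relational database and an illustrative events table:

```python
# One round trip for 500 rows instead of 500 round trips. The same idea applies
# to Postgres/MySQL via executemany or multi-row INSERT statements.
import sqlite3

def write_events_batched(conn: sqlite3.Connection, events: list[tuple]) -> None:
    with conn:  # single transaction, single flush
        conn.executemany(
            "INSERT INTO events (user_id, kind, payload) VALUES (?, ?, ?)", events
        )

def write_events_one_by_one(conn: sqlite3.Connection, events: list[tuple]) -> None:
    for event in events:  # N transactions, N round trips: slower and more expensive
        with conn:
            conn.execute(
                "INSERT INTO events (user_id, kind, payload) VALUES (?, ?, ?)", event
            )
```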
Expensive Fan-Out Patterns
Some architectural patterns are both slow and expensive. The N+1 query is the most famous example: fetch 100 users, then for each user make a separate database call to fetch their profile. That's 101 queries instead of 1. 101 database round-trips instead of 1. 101× the cost.
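The fix is usually one batched query instead of a query per item; a sketch with an illustrative profiles table:

```python
# N+1: one query for users, then one query per user for the profile.
def profiles_n_plus_one(conn, user_ids):
    out = {}
    for uid in user_ids:  # 100 separate round trips
        out[uid] = conn.execute(
            "SELECT bio, avatar_url FROM profiles WHERE user_id = ?", (uid,)
        ).fetchone()
    return out

# Batched: one round trip for all 100 users.
def profiles_batched(conn, user_ids):
    placeholders = ", ".join("?" for _ in user_ids)
    rows = conn.execute(
        f"SELECT user_id, bio, avatar_url FROM profiles WHERE user_id IN ({placeholders})",
        list(user_ids),
    ).fetchall()
    return {user_id: (bio, avatar_url) for user_id, bio, avatar_url in rows}
```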
The same pattern appears at the service level. A service that handles one request by making 20 downstream calls is slow (those calls are often serial or partially serial) and expensive (20× the network and compute cost per user request).
Fixing fan-out patterns — through batching, caching, data denormalization, or rethinking the data model — usually improves both latency and cost at the same time.
Compression — Cheap and Often Forgotten
Enabling HTTP/gRPC response compression costs a modest amount of server CPU and reduces payload size by 60–80% for typical JSON payloads. That means 60–80% less data transfer cost on every API call.
Enabling compression on Kafka messages, S3 objects, and database backups can similarly slash storage and transfer costs. The CPU cost of compression (and decompression) is almost always less than the storage and transfer savings.
Uncompressed JSON response: 12 KB average
After gzip compression: 2 KB average (83% reduction)
At 1,000 requests/second:
Without compression: 12 KB × 1,000 × 86,400s = ~1 TB/day egress
With compression: 2 KB × 1,000 × 86,400s = ~170 GB/day egress
Egress savings per day: ~830 GB × $0.09/GB = $74.70/day = $2,240/month
This is from enabling one flag.
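What the flag looks like depends on the stack. As one example, in a FastAPI/Starlette service it is a single middleware registration (the 1 KB minimum size is an illustrative threshold):

```python
# Compress responses above ~1 KB; very small responses aren't worth the CPU.
from fastapi import FastAPI
from fastapi.middleware.gzip import GZipMiddleware

app = FastAPI()
app.add_middleware(GZipMiddleware, minimum_size=1024)
```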
FinOps — Making Cost Visible
FinOps is the practice of bringing financial accountability to cloud spending. The core idea is simple: the engineers who make architectural decisions should be able to see the cost consequences of those decisions.
In practice, most organizations have a gap between the people who cause cost and the people who see it. Finance sees the bill. Engineering sees the architecture. The bill doesn't say "this is expensive because service X calls service Y 10 million times per day." It says "data transfer: $42,000."
Cost Allocation Tags — The Foundation
Every cloud resource should be tagged with at minimum: the team that owns it, the service it belongs to, and the environment (production, staging, dev). These tags flow into your billing reports and let you answer "how much does the recommendations service cost per month?"
Without tags, you have a total cloud bill. With tags, you have per-service cost attribution. Per-service cost attribution is what gives teams ownership over their spend.
team: payments
service: checkout-api
environment: production
cost-center: eng-checkout
Unit Economics — The Right Level of Abstraction
A raw dollar amount per service is useful. A cost-per-unit metric is more useful because it scales.
Cost per 1,000 API requests. Cost per active user per day. Cost per GB of data processed. These metrics let you answer questions like: "We grew 2× this quarter, but our cloud bill grew 3×. Something in our cost structure isn't scaling linearly." That's a question you can investigate and fix. "Our cloud bill is $500K this month" is a number you can only stare at.
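A hypothetical illustration of why the per-unit view is more actionable than the raw total:

Q1: $250K bill, 500M requests -> $0.50 per 1,000 requests
Q2: $500K bill, 1B requests -> $0.50 per 1,000 requests (healthy: cost scaled with traffic)
Q3: $900K bill, 1.2B requests -> $0.75 per 1,000 requests (investigate: unit cost jumped 50%)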
Cost Anomaly Detection
AWS, GCP, and Azure all offer cost anomaly detection — alerts when a service's spend spikes unexpectedly. Setting these up takes 10 minutes. They catch things like: a misconfigured autoscaler that spun up 500 instances instead of 5; a new feature that does a full table scan on every request; a job that was supposed to run once but got triggered in a loop.
The goal is to catch a $10,000/day mistake on day one rather than day thirty. Without anomaly detection, you discover these problems when the monthly bill arrives.
The Cultural Side
FinOps only works if cost is part of the engineering culture, not a separate finance exercise. A few practices that help:
- Add cost to architecture reviews. When someone proposes a design, ask: "What does this cost at 10× our current load?" It doesn't have to be exact. Back-of-envelope is fine. The goal is to make cost legible before a decision is made, not after.
- Put cost metrics on team dashboards. Latency and error rate aren't the only things worth monitoring. Cost per request, normalized by traffic, should be a line on the team's dashboard. When it moves, someone should be curious why.
- Celebrate cost savings like feature launches. A team that cuts their cloud bill by $30K/month while maintaining SLOs did something genuinely valuable. If the only thing that gets celebrated is shipping features, cost optimization will always be deprioritized.
- Shared savings incentives. Some organizations redirect a portion of documented cloud savings back to the engineering team's budget. This creates a direct incentive that aligns engineering behavior with financial outcomes.
Cost at Design Time — A Mental Checklist
The best time to think about cost is before you build something. Not obsessively — you don't need to calculate the exact bill for every design. But a quick mental pass over the following questions surfaces the expensive surprises before they happen.
- Where does data flow? Does it cross AZ boundaries? Region boundaries? Exit to the internet? Sketch the data flow and count how many hops involve billable transfer.
- How big are the payloads? Are we sending full objects when we only need a few fields? Can we compress? Can we reduce payload size by design?
- What's the read/write ratio and how does it scale with traffic? If user count doubles, does data transfer double? Does storage double? Does compute double? Non-linear scaling is worth catching early.
- What access pattern does this data have? Will it be read frequently, infrequently, or almost never? Is it going into the right storage tier?
- Are we building this or buying it? If building, what's the honest engineering time estimate, not just for initial build but for ongoing operations?
- What happens at 10× scale? Cost structures that are tolerable at current scale can become catastrophic at 10× if they have bad scaling properties.
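To make the first few questions mechanical, here is a back-of-envelope helper; the per-GB rates are the approximate figures used earlier in this chapter and should be replaced with your provider's actual pricing:

```python
# Back-of-envelope data transfer estimate for a design review.
CROSS_AZ_PER_GB = 0.02      # charged on both sides, per the table above
CROSS_REGION_PER_GB = 0.02  # varies widely by region pair
EGRESS_PER_GB = 0.09

def monthly_transfer_cost(
    requests_per_second: float,
    payload_kb: float,
    cross_az_hops: int,
    cross_region_hops: int = 0,
    egress_fraction: float = 0.0,  # share of responses that leave to the internet
) -> float:
    gb_per_month = requests_per_second * payload_kb * 86_400 * 30 / 1_000_000
    cost = gb_per_month * (
        cross_az_hops * CROSS_AZ_PER_GB
        + cross_region_hops * CROSS_REGION_PER_GB
        + egress_fraction * EGRESS_PER_GB
    )
    return round(cost, 2)

# The profile-lookup example from earlier: 500 req/s, 8 KB, one cross-AZ hop (~$207/month).
print(monthly_transfer_cost(500, 8, cross_az_hops=1))
# The same design at 10x traffic (~$2,073/month).
print(monthly_transfer_cost(5000, 8, cross_az_hops=1))
```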
Cost is determined by architecture, not by the billing team. The engineers who decide how data moves, where it's stored, and how services communicate are the people who decide the cloud bill — usually without knowing it. Making cost a first-class design input is the only way to avoid discovering expensive mistakes after they're too late to fix cheaply.
A related pitfall is treating open-source software as free because it has no license fee. The license is the smallest part of the cost. Engineering time to operate, oncall burden, and incident recovery are the real costs — and they're large enough to make managed services cheaper than self-managed ones in most cases, once you account for everything honestly.
- Draw the data flow of this design and mark every boundary where bytes are billed for transfer. What does this cost at 10× current traffic?
- For every data store in this design: what is the realistic access pattern 12 months after launch? Is it in the right storage tier, or are we paying hot-storage prices for cold data?
- If we're self-managing any piece of infrastructure in this design, what is the full TCO including engineering time, oncall burden, and incident cost — and does it still win over the managed alternative?