Observability — You Can't Fix What You Can't See
How to build distributed systems that tell you what's wrong, where it is, and why it happened — before your customers notice.
What's in this chapter
This chapter is about one central idea: a system you cannot observe is a system you cannot trust. You can write the cleanest code in the world, but if you can't tell what it's doing in production, you are flying blind.
We'll start by understanding the difference between monitoring and observability — they are not the same thing, and confusing them is expensive. Then we'll go deep on the three foundational signals: metrics, logs, and traces. We'll cover how each one works, what it's good for, where it falls short, and the common traps teams fall into. We'll add a fourth signal — continuous profiling — that most teams ignore until they wish they hadn't. And we'll close with how to think about observability not as something you add after the system is built, but as a first-class design concern.
Key Learnings
Read this first. Come back to the full chapter for the why.
- → Monitoring asks known questions. Observability lets you ask questions you didn't know you'd need to ask. You need both, but they are different tools with different designs.
- → Metrics are cheap and fast but low resolution. They tell you that something is wrong. They rarely tell you why. A spike in error rate is a symptom — metrics alone won't show you the cause.
- → Cardinality is the silent killer of metric systems. Every unique combination of label values creates a new time series. High-cardinality labels like user IDs or request IDs will bring down your metrics backend. Choose labels carefully.
- → Averages lie. Use percentiles. If p50 latency is 10ms but p99 is 4 seconds, your average might show 50ms and look "fine." The 1% of users experiencing 4s waits are real people with a real problem.
- → Logs must be structured from the start. Free-text log lines are only readable by humans. Structured logs (key-value pairs, JSON) are readable by machines, queryable, aggregatable, and orders of magnitude more useful at scale.
- → A single trace is worth a thousand log lines for diagnosing cross-service latency. Distributed tracing lets you see a request's entire journey — which service was slow, which was down, which made an N+1 call.
- → Naive 1% trace sampling hides the bugs that matter most. Head-based sampling (decide at the start) misses rare slow requests. Tail-based sampling (decide at the end) is harder to build but finds exactly the traces you need.
- → Continuous profiling finds the bottleneck metrics can only hint at. A CPU spike in metrics tells you something is slow. A flame graph tells you it's this specific function in this specific call path.
- → Alert on symptoms, not causes. Alert when users are affected — high error rate, high latency, failed jobs. Don't alert on CPU at 90% if users aren't feeling it yet. Every false-alarm page erodes trust.
- → Observability is a design decision, not a deployment decision. You cannot retrofit good observability onto a system that was built without it. The hooks, the IDs, the structured context — they have to be designed in from day one.
The Problem With "It Was Fine Yesterday"
Here is a scenario that happens at almost every company with a distributed system. It's 2am. Your on-call phone rings. The alert says "Error rate elevated." You log into your dashboards. The error rate is indeed elevated. Now what?
You look at the error rate chart. It's up. You look at latency. Also up. You check the recent deployment. Nothing in the last 6 hours. You check CPU and memory on your services. All normal. You grep through logs, but you're looking at three different services, producing thousands of lines per second, and you're searching for something you can't fully describe yet. Thirty minutes in, you find a suspicious stack trace. Is this the cause or a symptom? You're not sure.
This is what happens when a system is monitored but not observable.
Monitoring is the practice of collecting predefined signals and alerting when those signals cross predefined thresholds. It answers questions you already know to ask: "Is error rate above 1%?" "Is CPU above 80%?"
Observability is the property of a system that lets you understand its internal state by examining its external outputs. It answers questions you didn't know you'd need to ask: "Why did this specific customer's request fail at 14:32:07 on Tuesday?"
Monitoring is about known unknowns. Observability is about unknown unknowns. You need both. But they require very different designs.
A monitored system tells you that something broke. An observable system tells you where, why, and for whom. The difference between these two is often the difference between a 5-minute incident and a 3-hour one.
The Four Signals
Observability in distributed systems is built on four types of data. They are complementary — each answers a different question, and each has failure modes the others compensate for.
In practice, an incident investigation usually follows this path: metrics alert you that something is wrong → traces narrow it to a specific service and request pattern → logs give you the exact error details → profiles (if needed) tell you why the code is behaving that way.
Signal 1 — Metrics
A metric is a number that changes over time. Request count, error count, latency, memory used, queue depth — these are all metrics. They are the cheapest signal to collect and store, and the fastest to query. For this reason, metrics are almost always the first signal you check during an incident.
What a Metric Actually Is
Every metric has three parts: a name, a value, and a set of labels (also called tags or dimensions). Labels are key-value pairs that let you slice the metric. For example:
# name: http_requests_total
# value: 1427
# labels: method=POST, endpoint=/checkout, status=200, region=us-east-1
http_requests_total{method="POST", endpoint="/checkout", status="200", region="us-east-1"} 1427
Labels are what make metrics useful. Instead of one number that says "there were 1427 requests," you can ask "how many POST requests to /checkout succeeded in us-east-1 in the last 5 minutes?" Labels give you that slicing power.
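A minimal sketch of what this looks like in code, using the Python prometheus_client library. The metric name and label values mirror the example above; the checkout handler and the port are hypothetical stand-ins:

from prometheus_client import Counter, start_http_server

# Register once, with a fixed, bounded label set.
HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "endpoint", "status", "region"],
)

def handle_checkout():
    # ... process the request, then record the outcome ...
    HTTP_REQUESTS.labels(
        method="POST", endpoint="/checkout", status="200", region="us-east-1"
    ).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a pull-based scraper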
The Cardinality Problem
Here is the most important constraint in metric systems, and the one that bites teams hardest: every unique combination of label values creates a separate time series.
If you have 3 methods × 50 endpoints × 10 status codes × 5 regions, that's 7,500 time series for one metric. That's fine. Your metrics backend can handle it.
Now imagine a new engineer on your team adds a label for user_id — seems reasonable, right? You have 2 million users. Suddenly that same metric has 2,000,000 × 3 × 50 × 10 × 5 = 15 billion time series. Your Prometheus instance falls over. Your on-call pager goes off for a completely different reason.
Never use these as metric labels: user IDs, request IDs, session IDs, order IDs, trace IDs, IP addresses, email addresses, or any field with unbounded unique values.
High-cardinality data belongs in logs and traces, not metrics. Metrics are for aggregated, bounded dimensions only.
The Four Metric Types
Every metric system distinguishes between a small set of fundamental types. Understanding them prevents the most common mistake: treating a counter like a gauge.
| Type | What it represents | Example | Common mistake |
|---|---|---|---|
| Counter | A value that only goes up. Resets to zero on process restart. | Total requests served, total errors | Graphing the raw counter instead of the rate. Always use rate() or increase(). |
| Gauge | A value that goes up and down freely. | Current memory used, queue depth, active connections | Using a gauge for things that can never go down, where a counter with rate makes more sense. |
| Histogram | Distribution of values across pre-defined buckets. Lets you calculate percentiles. | Request latency, response size | Using too few buckets (can't tell apart p90 and p99) or wrong bucket boundaries. |
| Summary | Pre-calculated percentiles on the client side. | Request latency (p50, p90, p99) | Cannot aggregate across multiple instances — use histograms instead when you have multiple replicas. |
The RED Method — What to Measure for Every Service
There are many frameworks for deciding what to instrument. The most practical one for most services is the RED method, introduced by Tom Wilkie at Weaveworks. Three metrics for every service, every endpoint:
R — Rate: How many requests per second is this service handling?
E — Errors: What fraction of those requests are failing?
D — Duration: How long are requests taking? (as a distribution, not an average)
If all three of these look healthy for a service, the service is probably fine from the user's perspective. If any one of them degrades, something is wrong.
Google calls a similar set the Four Golden Signals: Latency, Traffic, Errors, and Saturation. Saturation is the one RED omits — it measures how "full" a service is (CPU, queue depth, connection pool usage). Both frameworks are good. Pick one and apply it consistently.
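As a concrete sketch of RED instrumentation, again with the Python prometheus_client library. The metric names, bucket boundaries, and the wrapper function are assumptions to adapt to your own service, not a prescribed layout:

import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("checkout_requests_total", "Requests received", ["endpoint"])         # R: rate
ERRORS = Counter("checkout_request_errors_total", "Requests that failed", ["endpoint"])  # E: errors
DURATION = Histogram(                                                                     # D: duration
    "checkout_request_duration_seconds",
    "Request latency distribution",
    ["endpoint"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

def handle(endpoint, handler):
    REQUESTS.labels(endpoint=endpoint).inc()
    start = time.monotonic()
    try:
        return handler()
    except Exception:
        ERRORS.labels(endpoint=endpoint).inc()
        raise
    finally:
        DURATION.labels(endpoint=endpoint).observe(time.monotonic() - start)

Because duration is recorded as a histogram rather than a pre-averaged number, the backend can compute p50, p95, and p99 from the buckets later, which matters for the next point.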
Average latency is almost useless. Consider a service where 99% of requests complete in 5ms and 1% take 5,000ms. The average is about 55ms. That looks acceptable. Your alert doesn't fire. But 1% of your users — potentially millions of requests per day — are waiting 5 seconds.
Always track and alert on percentiles: p50 (median), p95, and p99 at minimum. p99.9 for systems where tail latency is a customer-facing concern (payments, search).
Push vs. Pull
There are two fundamental models for how metrics get from your service to your metrics backend.
In the pull model (used by Prometheus), the metrics backend periodically scrapes an HTTP endpoint that your service exposes. Your service doesn't need to know where the metrics system lives. The scraper discovers services through service discovery (e.g., querying Kubernetes for all pods with a specific label). This model is simple, observable itself (you can curl the endpoint), and works well when the metrics backend controls the collection schedule.
In the push model (used by StatsD, InfluxDB, Datadog), your service sends metrics to a central collector as they happen. This works well for short-lived processes like batch jobs that might finish before a pull-based scraper could reach them. It also works better in environments where services can't expose an HTTP listener (certain function-as-a-service setups, for example).
Most large-scale systems use some combination. The mental model: pull for always-on services, push for ephemeral workloads.
Signal 2 — Logs
A log is a record of something that happened. Every system produces logs. Most of those logs are useless during an incident because they were designed to be read by a human sitting in a terminal, not queried by a machine at 3am.
The single most impactful change you can make to your logging strategy is moving from unstructured to structured logs.
Unstructured vs. Structured
An unstructured log line looks like this:
2024-11-14 14:32:07 ERROR Failed to process payment for order 8841923: timeout after 3002ms (retry 2/3)
A human can read this. But a machine can't reliably parse it. If you want to find all orders that timed out on retry 2, you're writing a regex. If the message format changed three months ago, your regex breaks silently. If you want to graph timeout frequency over time, you're writing a fragile log parser that will break the moment someone changes the log message.
A structured log line looks like this:
{
  "timestamp": "2024-11-14T14:32:07.341Z",
  "level": "error",
  "event": "payment_processing_failed",
  "order_id": "8841923",
  "reason": "timeout",
  "duration_ms": 3002,
  "retry": 2,
  "max_retries": 3,
  "service": "payment-service",
  "trace_id": "4bf92f3577b34da6"
}
Now you can query: event="payment_processing_failed" AND reason="timeout" AND retry=2. You can count events by reason, graph retry rates over time, and — critically — join this log entry to a trace using trace_id to see the full picture of what the request was doing.
The single most valuable thing you can do to connect your signals together is to propagate a trace_id (or request_id, correlation_id) through every log line, every metric label (with care for cardinality), and every trace span. When an incident starts with a metric spike, you can jump to a sample trace ID, and from there jump directly to every log line that was part of that request — across all services.
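A minimal sketch of emitting such a line with Python's standard logging module. The JsonFormatter, the "context" field convention, and the example values are assumptions, not a requirement of any particular log pipeline:

import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "service": "payment-service",
        }
        entry.update(getattr(record, "context", {}))  # structured fields, including trace_id
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment_processing_failed", extra={"context": {
    "order_id": "8841923", "reason": "timeout", "duration_ms": 3002,
    "retry": 2, "max_retries": 3, "trace_id": "4bf92f3577b34da6",
}})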
Log Levels Are a Contract
Log levels are not just decoration. They are a promise to the person who will be woken up at 2am. Think of them this way:
| Level | What it means | Who reads it |
|---|---|---|
| ERROR | Something failed that should not fail. A human should look at this. | On-call engineer, alerting system |
| WARN | Something unexpected happened but the system recovered or degraded gracefully. Worth knowing, not urgent. | On-call engineer (next morning) |
| INFO | Normal operations worth recording — a user logged in, a job started, a payment processed. | Engineers investigating behavior |
| DEBUG | Fine-grained detail useful during development. Should never run in production at scale. | Developers, temporary incident investigation |
The most common mistake: using ERROR for anything the developer finds surprising, including expected errors like a user entering a wrong password. If you log every 401 Unauthorized as an ERROR, your error logs become noise. The on-call engineer learns to ignore the ERROR level. Then a real error happens and they don't notice.
A useful rule: if a human shouldn't be woken up for it, it should not be at the ERROR level.
Sampling Logs at High Volume
At high traffic, logging every event at INFO level is expensive. A service handling 100,000 requests per second that logs one line per request is producing 100,000 log lines per second. That's a lot to store, a lot to ship, and a lot to query.
A few techniques help here. First, log sampling: for high-volume, low-value events (like a health check endpoint being polled every 10 seconds), log 1 in 100 or 1 in 1000. Track the sample rate in the log line so you can extrapolate actual counts.
Second, dynamic log level control: keep your service at INFO level by default, but allow operators to temporarily elevate a specific service to DEBUG during an incident without a code deploy. This is enormously valuable and costs almost nothing to build.
Third, always log ERROR without sampling. The cost of missing an error log is higher than the cost of storing it.
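A rough sketch of the first two techniques, building on a structured logger like the one above. The 1-in-100 rate and the field names are assumptions to tune per event type:

import logging
import random

SAMPLE_RATE = 100  # keep roughly 1 in 100 of this event

def maybe_log_info(logger, event, **fields):
    # Probabilistic sampling for high-volume, low-value events.
    if random.randrange(SAMPLE_RATE) == 0:
        # Record the rate so dashboards can extrapolate true counts.
        logger.info(event, extra={"context": {**fields, "sample_rate": SAMPLE_RATE}})

def set_runtime_level(service_logger: logging.Logger, level_name: str):
    # Dynamic level control: call this from an admin endpoint or a config
    # watcher so operators can flip a service to DEBUG without a redeploy.
    service_logger.setLevel(getattr(logging, level_name.upper()))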
Signal 3 — Distributed Traces
In a monolith, a request lives inside one process. When something goes wrong, you look at that process's logs. In a distributed system, a single user request might touch 10 or 15 services before returning a response. When it fails, which service was responsible? When it's slow, where did the time go?
Metrics can tell you that Service A's latency went up. Logs can tell you what Service A saw. Neither can tell you the full story of the request's journey across the system. Distributed tracing can.
How Tracing Works
When a request enters your system, you generate a unique trace ID — a random identifier for this entire request. Every service that touches the request gets this trace ID (propagated in HTTP headers, gRPC metadata, or message queue headers).
Each service records its own work as one or more spans. A span represents a unit of work: "Service B processed this request from time T1 to time T2." Spans know their parent span (who called them), so the whole thing forms a tree — the trace tree.
Rendered on a shared timeline, this tree becomes a waterfall view (sometimes called a Gantt chart). A view like that immediately shows you several things no metric or log would easily reveal: the product-service is the bottleneck, it's making 4 sequential database queries (an N+1 pattern), and there's an expensive cache miss. You know exactly where to look.
What to Put in a Span
A span captures more than just a start time and an end time. Good spans include the following (a short code sketch follows this list):
- Span name — a human-readable description: "HTTP POST /checkout", "DB query: users.find_by_email"
- Status — OK, error, or timeout
- Attributes — relevant key-value pairs: http.status_code=200, db.statement="SELECT...", user.id (if appropriate for your privacy posture)
- Events — timestamped notes within the span: "cache miss", "retrying", "acquired lock"
- Links — references to other traces (e.g., an async job that was triggered by this request)
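Here is the sketch referenced above, using the OpenTelemetry Python API. The span name, attributes, and the charge() call are illustrative, and exporter/SDK setup is omitted:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")

def checkout(cart):
    with tracer.start_as_current_span("HTTP POST /checkout") as span:
        span.set_attribute("http.method", "POST")
        span.set_attribute("http.route", "/checkout")
        span.add_event("cache miss", {"cache.key": f"cart:{cart.id}"})  # timestamped note
        try:
            charge(cart)  # hypothetical downstream call
            span.set_attribute("http.status_code", 200)
        except TimeoutError:
            span.set_status(Status(StatusCode.ERROR, "payment provider timeout"))
            raise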
Context Propagation — The Hard Part
The technical challenge in distributed tracing is not collecting spans — any library can do that. The hard part is context propagation: making sure the trace ID travels with the request through every hop, including ones you don't control.
HTTP calls are easy — there are standard headers (traceparent from the W3C Trace Context standard, X-B3-TraceId from the older Zipkin standard). Most tracing libraries handle this automatically.
The holes in your trace graph usually come from: message queues (did you put the trace ID in the message metadata?), batch jobs (a nightly job triggered by a previous request — does it carry the original trace ID?), third-party services (they won't propagate your headers), and async callbacks (the trace ID must be stored alongside the callback registration, not just in memory at call time).
OpenTelemetry (OTel) is an open standard and SDK for instrumentation that covers metrics, logs, and traces. It has good language support (Go, Java, Python, Node.js, Ruby, .NET), and it's vendor-neutral — you instrument once and can send to Jaeger, Zipkin, Honeycomb, Datadog, or any OTel-compatible backend.
The main reason to use it: you do not want to re-instrument your entire codebase when you switch observability vendors. OTel gives you that portability.
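To plug the message-queue hole described above, OpenTelemetry's propagation helpers let you carry the context explicitly. In this sketch the queue client, message shape, and handle() function are hypothetical stand-ins for whatever broker and consumer you use:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("order-worker")

def publish(queue, payload):
    headers = {}
    inject(headers)  # writes the traceparent header into the dict
    queue.send({"payload": payload, "headers": headers})

def consume(message):
    ctx = extract(message["headers"])  # rebuilds the upstream trace context
    with tracer.start_as_current_span("process order message", context=ctx):
        handle(message["payload"])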
Sampling — The Decision That Shapes What You See
A large service might handle hundreds of thousands of requests per second. Recording a trace for every request would cost a fortune in storage and processing, and most of those traces would show exactly the same healthy behavior. You need to sample — but how you sample determines which problems you can find.
Head-Based Sampling
In head-based sampling, the decision to trace a request is made at the very start — when the first service receives it. If you sample 1% of requests, you roll a die when the request arrives, and that decision is propagated to all downstream services: they either all trace or all skip.
This is simple and cheap. But there is a fundamental problem: you make the decision before you know anything about how the request will behave. A slow request, an errored request, a request that hits a rare code path — these have the same 1% chance of being captured as a completely normal request. The bugs that are hardest to find are often in the rarest requests, and they are systematically undersampled.
Tail-Based Sampling
In tail-based sampling, you buffer the spans from all services as the request flows through the system. After the request is complete, you look at the full trace and decide whether to keep it: Was it slow? Did it error? Did it take an unusual path? If yes, keep it. If it was completely normal, drop it.
This is how you find the 1% of requests that are failing, the 0.1% that are anomalously slow, and the edge cases you didn't anticipate. The downside is cost and complexity: you need a component (a tail-sampling proxy or collector) that buffers spans in memory, waits for the request to complete, evaluates the sampling rules, and then flushes or drops.
For most teams, a pragmatic middle ground works well: always sample errors and high-latency requests at 100%, and sample normal successful requests at a low rate (0.1%–1%). This gives you full coverage of failures and a statistically valid view of normal behavior.
# Tail-sampling rule example (OpenTelemetry Collector config)
tail_sampling:
  decision_wait: 10s              # wait up to 10s for all spans to arrive
  policies:
    - name: always-sample-errors
      type: status_code
      status_code: { status_codes: [ERROR] }
    - name: always-sample-slow
      type: latency
      latency: { threshold_ms: 1000 }
    - name: sample-normal-at-1-percent
      type: probabilistic
      probabilistic: { sampling_percentage: 1 }
Signal 4 — Continuous Profiling
Metrics tell you that CPU is high. Traces tell you which service is slow. But neither tells you why the code is slow. That's what profiling is for.
Traditional profiling means attaching a profiler to a specific process at a specific moment, running a load test, and analyzing the output. It's a manual, offline activity. Continuous profiling is different: it runs in the background, all the time, in production, with low enough overhead that you don't need to turn it off.
How It Works
A continuous profiler interrupts your running process at a fixed frequency — say, 100 times per second — records the current call stack, and returns control. Over time, these stack samples accumulate into a statistical picture of where your CPU time is actually going. Functions that appear on more stack samples are using more CPU.
The output is a flame graph: a visualization where the y-axis is call-stack depth, the x-axis spans the collected samples (it is not a timeline; stacks are typically sorted alphabetically), and the width of each block is proportional to how much time was spent in that function, including its callees. The wide plateaus at the top of the flame, the functions actually executing on the CPU, are the bottlenecks.
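To make the mechanism concrete, here is a toy illustration in Python of the sample-the-stack idea. It is nowhere near a production profiler, which works at far lower overhead and below the interpreter, and it relies on CPython's sys._current_frames():

import collections
import sys
import threading
import time
import traceback

samples = collections.Counter()

def sample_main_thread(interval=0.01, duration=5.0):
    # Grab the main thread's stack ~100 times per second and count call stacks.
    main_id = threading.main_thread().ident
    deadline = time.time() + duration
    while time.time() < deadline:
        frame = sys._current_frames().get(main_id)
        if frame is not None:
            stack = tuple(f.name for f in traceback.extract_stack(frame))
            samples[stack] += 1
        time.sleep(interval)

threading.Thread(target=sample_main_thread, daemon=True).start()
# ... run the workload; afterwards, samples.most_common() shows where time went,
# and these same stack counts are what a flame graph renders as block widths.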
Imagine a service where Prometheus shows a CPU spike every day between 2am and 3am. No obvious cause. No errors. Just high CPU. An on-call engineer is woken up. They check metrics, check logs — nothing explains it.
A flame graph from continuous profiling shows that during that hour, 40% of CPU is being spent inside JSON.parse in a function called deserialize_config. Further investigation: a background job is reading a configuration file, deserializing it into an object, using one field, and discarding it, once per item it processes, 10,000 times a minute, instead of caching the parsed configuration once at startup.
The fix is 3 lines of code. Without the profile, the team would have been looking for the cause for hours.
Continuous profiling is the least adopted of the four signals, which is a shame because it's often the fastest path from "something is slow" to "here is exactly why." Tools worth knowing: Pyroscope (open source), Parca (open source), Datadog Continuous Profiler, Google Cloud Profiler.
Alerting That Doesn't Cry Wolf
Collecting signals is the first half of observability. The second half is routing the right signals to the right people at the right time. This is alerting, and most teams do it badly.
Alert Fatigue Is a System Property
Alert fatigue is what happens when engineers receive too many alerts that don't require action. They learn to dismiss alerts without reading them. The alert that actually matters gets dismissed too. You've seen this — an on-call rotation where everyone mutes their phone by week two.
Alert fatigue is not a people problem. It is a design problem. The alert system was designed to produce noise, and humans adapted to it the only way they could.
The instinct is: alert on potential problems early, so the on-call has time to react before things get worse. So you alert on CPU > 70%, disk > 80%, memory > 75%, queue depth > 1000.
The problem: most of these thresholds fire regularly during normal traffic patterns. CPU at 72% on a Tuesday morning might be completely fine. The on-call acknowledges it, sees nothing wrong, and learns to ignore CPU alerts. Now CPU at 98% — a real problem — also gets ignored.
Alert on Symptoms, Not Causes
The right mental model for alerting: alert when users are affected, not when resources are stressed.
A user is affected when:
- → Error rate on user-facing endpoints is above your SLO threshold
- → p99 latency on critical paths (checkout, login, search) exceeds your SLO
- → A scheduled job that users depend on has not completed by its deadline
- → Payment processing success rate drops below a threshold
Notice: none of these say "CPU" or "memory" or "disk." Those are causes, not symptoms. High CPU might cause high latency — but you should alert on the latency, not the CPU. If CPU is high but latency is fine, there is no user impact and no reason to wake anyone up.
Causes — CPU, disk, queue depth — are useful for dashboards and postmortem analysis. They are generally not useful as alert triggers.
Burn Rate Alerting — The Right Way to Alert on SLOs
If your SLO says "99.9% of requests complete successfully," you have an error budget of 0.1%. Over a 30-day window, that's about 43 minutes of allowed downtime or 0.1% of your requests.
The naive approach: alert when error rate exceeds 0.1%. The problem: if your error rate is 0.11% for an entire month, you'll burn through your budget slowly and your alert will fire constantly at a low, annoying rate.
A better approach: burn rate alerting. Alert when your error budget is being consumed faster than sustainable. A 1x burn rate means you'd use up the budget in exactly 30 days. A 14x burn rate means you'd use it up in 2 days.
# Alert when burning through error budget 14x faster than sustainable,
# i.e., in roughly 2 days you'd exhaust a 30-day budget
- alert: HighErrorBudgetBurnRate
  expr: |
    (
        rate(http_requests_errors_total[1h])
      /
        rate(http_requests_total[1h])
    ) > (14 * 0.001)              # 14x burn rate × 0.1% target error rate
  for: 5m
  labels:
    severity: critical            # page someone now
Google's SRE workbook describes a multi-window burn rate approach — combining a fast window (for immediate spikes) and a slow window (for gradual degradation) — that reduces both false positives and alert latency simultaneously. It's worth implementing for any service with a formal SLO.
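As a back-of-the-envelope sketch of the multi-window idea, assuming the SRE workbook's example numbers (a 99.9% SLO and a 14.4x threshold paired with 1-hour and 5-minute windows); the error ratios would come from your metrics backend:

SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    # How many times faster than sustainable the budget is being spent.
    return error_ratio / ERROR_BUDGET

def should_page(error_ratio_1h: float, error_ratio_5m: float) -> bool:
    # Long window: the budget really is burning. Short window: it is still burning right now.
    return burn_rate(error_ratio_1h) >= 14.4 and burn_rate(error_ratio_5m) >= 14.4

# A 2% error rate sustained over both windows burns the budget 20x too fast: page.
assert should_page(error_ratio_1h=0.02, error_ratio_5m=0.02)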
Every Alert Should Have a Runbook
An alert without a runbook is an alarm clock without a snooze button. The on-call engineer wakes up, sees the alert, and has to figure out what to do from scratch — every time.
A runbook doesn't have to be long. It needs to answer three questions:
- What does this alert mean? What state is the system in when this fires?
- What is the impact? Which users are affected and how?
- What should I do? A numbered list of steps, starting with diagnosis and ending with mitigation or escalation.
Link the runbook directly in the alert. On-call at 3am, half asleep, does not want to open a wiki and search for the right page.
Connecting the Signals — Making Them Work Together
The real power of observability is not any single signal — it's the ability to move fluidly between them during an investigation. Here's what that looks like in practice.
Each signal hands off to the next. Metrics give you the "what" and the "when." Traces give you the "where" in the system. Logs give you the specific error detail. The trace_id is the key that connects them.
An investigation like this takes minutes. With only logs it would have taken an hour or more, and with only metrics it might never have reached a root cause like an exhausted connection pool.
Observability-Driven Development
Most teams treat observability as something you add after the system is built. You write the service, you deploy it, and then you add some dashboards. This is backwards.
The problem with retrofitting observability is that the information you need is often not there. Log lines don't have the context you need. Spans don't have the right attributes. There's no trace ID in the right places. Fixing this requires touching the code everywhere, and the engineer who built it may not be around.
Observability-driven development means asking, before you ship any feature: "If this breaks in production, how will I know? How will I diagnose it?" If you can't answer both questions, you're not done.
A Practical Pre-Ship Checklist
Before any new service or major feature goes to production, verify:
OBSERVABILITY PRE-SHIP CHECKLIST
Metrics
✓ RED metrics instrumented (rate, errors, duration as histogram)
✓ No high-cardinality labels (no user IDs, request IDs)
✓ Business metrics tracked (orders created, payments processed)
✓ Metrics visible in a dashboard
Logs
✓ Structured JSON logging (not free-text)
✓ trace_id propagated and included in every log line
✓ No PII in logs (email, password, SSN, payment details)
✓ ERROR level reserved for actionable failures
✓ Log levels tunable at runtime without redeploy
Traces
✓ All inbound and outbound calls produce spans
✓ Spans have meaningful names and relevant attributes
✓ Context propagated through async calls and queues
✓ Sampling configured (errors and slow requests at 100%)
Alerts
✓ Alert defined for user-visible error rate
✓ Alert defined for p99 latency
✓ Every alert has a linked runbook
✓ Alerts tested (does the alert actually fire in staging?)
The "Canary Operator" Mindset
When you deploy a new version, you should be watching your observability signals before the deployment team declares it done. Not just checking that error rate is zero — actively watching latency percentiles (p50, p95, p99), comparing them against the previous version, watching business metrics (did conversion rate drop?), and being ready to roll back if anything drifts.
The time from "deploy initiated" to "deployment declared healthy" should have a human watching dashboards for at least 5–10 minutes for any change to a critical path. Automated canary analysis — where a tool watches the signals and makes the rollout/rollback decision — is even better once you have the data quality to trust it.
Common Mistakes and How to Avoid Them
Mistake 1 — Treating the Dashboard as the System's Health
A dashboard that looks green is not the same as a system that is healthy. Dashboards show the metrics you thought to instrument. The problem might be in a metric you didn't think to instrument. "The dashboard looks fine" is not a valid answer to "is the system healthy?" during an incident. Always verify with real user data or synthetic probes.
Mistake 2 — Logging Sensitive Data
Logs, traces, and metrics are often stored in systems with broader access than your production database. Engineers searching logs during an incident don't need to see full credit card numbers, passwords, SSNs, or health information. Beyond the security risk, in most jurisdictions, logging PII creates compliance obligations.
Build log scrubbing into your logging library at the framework level — not as something individual engineers have to remember per log line. Define a list of fields that are always redacted or hashed before they leave the process.
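A sketch of what framework-level scrubbing can look like, extending the structured-logging formatter sketched earlier in the chapter. The field list and the hash-instead-of-redact choice are assumptions to adapt to your own compliance requirements:

import hashlib

ALWAYS_REDACT = {"password", "credit_card", "card_number", "ssn"}
ALWAYS_HASH = {"email"}  # keeps joinability without storing the raw value

def scrub(fields: dict) -> dict:
    # Called by the logging formatter on every structured field, so individual
    # engineers never have to remember to redact per log line.
    clean = {}
    for key, value in fields.items():
        if key in ALWAYS_REDACT:
            clean[key] = "[REDACTED]"
        elif key in ALWAYS_HASH:
            clean[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        else:
            clean[key] = value
    return clean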
Mistake 3 — Too Many Dashboards, No Standard
Over time, teams accumulate dashboards the way code accumulates comments — promiscuously, without deletion. You end up with 200 dashboards and no one knows which one is authoritative. During an incident, people pull up different dashboards and see different pictures of the system.
Designate a small number of canonical dashboards: one overview per service (the RED metrics), one infrastructure dashboard, one business metrics dashboard. Everything else is exploratory and can be deleted when the person who built it leaves.
Mistake 4 — Observability as a Cost Center
Teams cut observability costs during budget pressure and then pay for it during the next major incident. The correlation is direct and well-documented: companies that invest in observability have shorter incident times, lower customer impact, and lower total cost of incidents.
A useful way to think about the budget conversation: how long does a Severity 1 incident cost your company per hour? Whatever that number is, good observability that cuts diagnosis time from 2 hours to 15 minutes pays for itself in the first major incident.
Chapter Summary
Observability is not a feature you add after the system is built — it's a property you design in from the start. A system that can't explain its own behavior will eventually fail in a way you can't diagnose, at the worst possible time.
What We Covered
- Monitoring vs. observability — a critical distinction
- Metrics: types, cardinality trap, RED method
- Why averages lie and percentiles tell the truth
- Structured logging and why it matters
- Distributed tracing: spans, context propagation
- Head-based vs. tail-based sampling trade-offs
- Continuous profiling and flame graphs
- Alerting on symptoms, burn-rate SLO alerting
- Pre-ship observability checklist
The Most Common Mistakes
- Using average latency instead of percentiles
- Adding high-cardinality labels to metrics
- Unstructured logs that can't be queried
- Head-based sampling that hides rare failures
- Alerting on CPU/memory instead of user impact
- Alerts without runbooks
- Skipping observability until after the launch
- Logging PII in plain text
Three Questions for Your Next Design Review
- If this service starts returning errors for 1% of users starting right now, how long before your on-call engineer knows about it, and what will they see?
- Can you answer "which downstream call is responsible for this request being slow" without reading source code or deploying new instrumentation?
- When this alert fires at 3am, can the on-call engineer — who has never seen this service before — diagnose and mitigate the problem using only the runbook?