Part VII  ·  Operational Efficiency  ·  Chapter 26

Observability — You Can't Fix What You Can't See

How to build distributed systems that tell you what's wrong, where it is, and why it happened — before your customers notice.

What's in this chapter

This chapter is about one central idea: a system you cannot observe is a system you cannot trust. You can write the cleanest code in the world, but if you can't tell what it's doing in production, you are flying blind.

We'll start by understanding the difference between monitoring and observability — they are not the same thing, and confusing them is expensive. Then we'll go deep on the three foundational signals: metrics, logs, and traces. We'll cover how each one works, what it's good for, where it falls short, and the common traps teams fall into. We'll add a fourth signal — continuous profiling — that most teams ignore until they wish they hadn't. And we'll close with how to think about observability not as something you add after the system is built, but as a first-class design concern.

Topics: Monitoring vs. Observability · Metrics & Cardinality · The RED Method · Structured Logging · Distributed Tracing · Sampling Strategies · Continuous Profiling · Alerting That Works · Observability-Driven Development


The Problem With "It Was Fine Yesterday"

Here is a scenario that happens at almost every company with a distributed system. It's 2am. Your on-call phone rings. The alert says "Error rate elevated." You log into your dashboards. The error rate is indeed elevated. Now what?

You look at the error rate chart. It's up. You look at latency. Also up. You check the recent deployment. Nothing in the last 6 hours. You check CPU and memory on your services. All normal. You grep through logs, but you're looking at three different services, producing thousands of lines per second, and you're searching for something you can't fully describe yet. Thirty minutes in, you find a suspicious stack trace. Is this the cause or a symptom? You're not sure.

This is what happens when a system is monitored but not observable.

Definition — Monitoring vs. Observability

Monitoring is the practice of collecting predefined signals and alerting when those signals cross predefined thresholds. It answers questions you already know to ask: "Is error rate above 1%?" "Is CPU above 80%?"

Observability is the property of a system that lets you understand its internal state by examining its external outputs. It answers questions you didn't know you'd need to ask: "Why did this specific customer's request fail at 14:32:07 on Tuesday?"

Monitoring is about known unknowns. Observability is about unknown unknowns. You need both. But they require very different designs.

A monitored system tells you that something broke. An observable system tells you where, why, and for whom. The difference between these two is often the difference between a 5-minute incident and a 3-hour one.

The Four Signals

Observability in distributed systems is built on four types of data. They are complementary — each answers a different question, and each has failure modes the others compensate for.

┌─────────────────────────────────────────────────────┐
│                 Your Running System                 │
└──────┬─────────────┬─────────────┬─────────────┬───┘
       │             │             │             │
       ▼             ▼             ▼             ▼
    METRICS        LOGS         TRACES       PROFILES
  "something     "here is      "here is      "here is
   is wrong"       what        what the       where
   (numbers)    happened"      path was"    CPU went"
                 (events)                 (flamegraphs)

In practice, an incident investigation usually follows this path: metrics alert you that something is wrong → traces narrow it to a specific service and request pattern → logs give you the exact error details → profiles (if needed) tell you why the code is behaving that way.

Signal 1 — Metrics

A metric is a number that changes over time. Request count, error count, latency, memory used, queue depth — these are all metrics. They are the cheapest signal to collect and store, and the fastest to query. For this reason, metrics are almost always the first signal you check during an incident.

What a Metric Actually Is

Every metric has three parts: a name, a value, and a set of labels (also called tags or dimensions). Labels are key-value pairs that let you slice the metric. For example:

# name: http_requests_total
# value: 1427
# labels: method=POST, endpoint=/checkout, status=200, region=us-east-1

http_requests_total{
  method="POST",
  endpoint="/checkout",
  status="200",
  region="us-east-1"
} 1427

Labels are what make metrics useful. Instead of one number that says "there were 1427 requests," you can ask "how many POST requests to /checkout succeeded in us-east-1 in the last 5 minutes?" Labels give you that slicing power.

The Cardinality Problem

Here is the most important constraint in metric systems, and the one that bites teams hardest: every unique combination of label values creates a separate time series.

If you have 3 methods × 50 endpoints × 10 status codes × 5 regions, that's 7,500 time series for one metric. That's fine. Your metrics backend can handle it.

Now imagine a new engineer on your team adds a label for user_id — seems reasonable, right? You have 2 million users. Suddenly that same metric has 2,000,000 × 3 × 50 × 10 × 5 = 15 billion time series. Your Prometheus instance falls over. Your on-call pager goes off for a completely different reason.

High Cardinality Labels — Never Do This

Never use these as metric labels: user IDs, request IDs, session IDs, order IDs, trace IDs, IP addresses, email addresses, or any field with unbounded unique values.

High-cardinality data belongs in logs and traces, not metrics. Metrics are for aggregated, bounded dimensions only.
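The series-explosion arithmetic is easy to see in code. Below is a toy in-memory labeled counter, a stdlib-only sketch (not a real metrics client): every distinct combination of label values becomes its own stored series, exactly as in a real backend.

```python
import itertools

class LabeledCounter:
    """Toy metric: one stored value per unique combination of label values."""
    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.series = {}  # label-value tuple -> count

    def inc(self, **labels):
        key = tuple(labels[name] for name in self.label_names)
        self.series[key] = self.series.get(key, 0) + 1

    def series_count(self):
        return len(self.series)

requests = LabeledCounter("http_requests_total",
                          ["method", "endpoint", "status", "region"])

# Bounded labels: worst case 3 × 50 × 10 × 5 = 7,500 series. Fine.
methods = ["GET", "POST", "PUT"]
endpoints = [f"/api/{i}" for i in range(50)]
statuses = [str(200 + i) for i in range(10)]
regions = [f"region-{i}" for i in range(5)]
for m, e, s, r in itertools.product(methods, endpoints, statuses, regions):
    requests.inc(method=m, endpoint=e, status=s, region=r)
print(requests.series_count())  # 7500

# Add a user_id label and every user multiplies the series count.
per_user = LabeledCounter("http_requests_total", ["method", "user_id"])
for user in range(1000):
    for m in methods:
        per_user.inc(method=m, user_id=str(user))
print(per_user.series_count())  # 3000 — with 2M real users, 6,000,000
```

The same mechanism that makes labels powerful for slicing is what makes unbounded labels fatal: the backend cannot tell a useful dimension from an accidental one.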

The Four Metric Types

Every metric system distinguishes between a small set of fundamental types. Understanding them prevents the most common mistake: treating a counter like a gauge.

Counter
  What it represents: a value that only goes up; resets to zero on process restart.
  Example: total requests served, total errors.
  Common mistake: graphing the raw counter instead of the rate. Always use rate() or increase().

Gauge
  What it represents: a value that goes up and down freely.
  Example: current memory used, queue depth, active connections.
  Common mistake: using a gauge for things that can never go down, where a counter with rate makes more sense.

Histogram
  What it represents: a distribution of values across pre-defined buckets; lets you calculate percentiles.
  Example: request latency, response size.
  Common mistake: using too few buckets (can't tell apart p90 and p99) or wrong bucket boundaries.

Summary
  What it represents: pre-calculated percentiles on the client side.
  Example: request latency (p50, p90, p99).
  Common mistake: summaries cannot be aggregated across multiple instances; use histograms instead when you have multiple replicas.

The RED Method — What to Measure for Every Service

There are many frameworks for deciding what to instrument. The most practical one for most services is the RED method, introduced by Tom Wilkie at Weaveworks. Three metrics for every service, every endpoint:

The RED Method

R — Rate: How many requests per second is this service handling?

E — Errors: What fraction of those requests are failing?

D — Duration: How long are requests taking? (as a distribution, not an average)

If all three of these look healthy for a service, the service is probably fine from the user's perspective. If any one of them degrades, something is wrong.

Google calls a similar set the Four Golden Signals: Latency, Traffic, Errors, and Saturation. Saturation is the one RED omits — it measures how "full" a service is (CPU, queue depth, connection pool usage). Both frameworks are good. Pick one and apply it consistently.
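As a sketch of what RED instrumentation looks like at the code level, here is a stdlib-only decorator (the names and the plain-list "histogram" are illustrative; a real service would use a metrics client library):

```python
import time
from collections import defaultdict
from functools import wraps

# Per-endpoint RED state: request count, error count, raw durations.
request_count = defaultdict(int)
error_count = defaultdict(int)
durations_ms = defaultdict(list)   # a real client would use a histogram

def red_instrumented(endpoint):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            request_count[endpoint] += 1          # R: rate (count over time)
            try:
                return fn(*args, **kwargs)
            except Exception:
                error_count[endpoint] += 1        # E: errors
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                durations_ms[endpoint].append(elapsed_ms)  # D: duration
        return wrapper
    return decorator

@red_instrumented("/checkout")
def checkout(ok=True):
    if not ok:
        raise ValueError("payment declined")
    return "ok"
```

Every request increments the rate counter, every exception increments the error counter, and every request (success or failure) records its duration, which is exactly the per-endpoint data the RED dashboards are built from.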

The Averages Trap — The Most Common Metrics Mistake

Average latency is almost useless. Consider a service where 99% of requests complete in 5ms and 1% take 5,000ms. The average is about 55ms. That looks acceptable. Your alert doesn't fire. But 1% of your users — potentially millions of requests per day — are waiting 5 seconds.

Always track and alert on percentiles: p50 (median), p95, and p99 at minimum. p99.9 for systems where tail latency is a customer-facing concern (payments, search).
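The arithmetic from the example above can be checked directly (stdlib `statistics`; the exact p99 value depends on the interpolation method, so treat it as approximate):

```python
import statistics

# 99% of requests at 5ms, 1% at 5,000ms
latencies_ms = [5] * 99 + [5000]

mean = statistics.fmean(latencies_ms)
median = statistics.median(latencies_ms)
# quantiles(n=100) returns 99 cut points; index 98 is the p99 cut
p99 = statistics.quantiles(latencies_ms, n=100)[98]

print(f"mean={mean:.2f}ms median={median}ms p99={p99:.0f}ms")
# The ~55ms mean looks healthy; the p99 exposes the 5-second tail.
```

The mean comes out near 55ms while the p99 is in the thousands of milliseconds, which is the whole argument for alerting on percentiles.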

Push vs. Pull

There are two fundamental models for how metrics get from your service to your metrics backend.

In the pull model (used by Prometheus), the metrics backend periodically scrapes an HTTP endpoint that your service exposes. Your service doesn't need to know where the metrics system lives. The scraper discovers services through service discovery (e.g., querying Kubernetes for all pods with a specific label). This model is simple, observable itself (you can curl the endpoint), and works well when the metrics backend controls the collection schedule.

In the push model (used by StatsD, InfluxDB, Datadog), your service sends metrics to a central collector as they happen. This works well for short-lived processes like batch jobs that might finish before a pull-based scraper could reach them. It also works better in environments where services can't expose an HTTP listener (certain function-as-a-service setups, for example).

Most large-scale systems use some combination. The mental model: pull for always-on services, push for ephemeral workloads.
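The pull model is simple enough to sketch end-to-end with the stdlib: the service exposes a `/metrics` endpoint in the Prometheus text exposition format, and the backend scrapes it on its own schedule. This is an illustrative toy, not a real exporter; the metric names are made up.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

METRICS = {"http_requests_total": 1427, "queue_depth": 3}

class MetricsHandler(BaseHTTPRequestHandler):
    """Expose current metric values in the Prometheus text exposition format."""
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = "".join(f"{name} {value}\n"
                       for name, value in METRICS.items()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # keep the example's output quiet
        pass

# The backend (the "puller") scrapes this endpoint whenever it wants:
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)   # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/metrics"
print(urllib.request.urlopen(url).read().decode())
server.shutdown()
```

Note the properties the chapter calls out: the service never needs to know where the backend lives, and the endpoint is trivially debuggable with curl.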

Signal 2 — Logs

A log is a record of something that happened. Every system produces logs. Most of those logs are useless during an incident because they were designed to be read by a human sitting in a terminal, not queried by a machine at 3am.

The single most impactful change you can make to your logging strategy is moving from unstructured to structured logs.

Unstructured vs. Structured

An unstructured log line looks like this:

2024-11-14 14:32:07 ERROR Failed to process payment for order 8841923: timeout after 3002ms (retry 2/3)

A human can read this. But a machine can't reliably parse it. Finding all orders that timed out on retry 2 means writing a regex; graphing timeout frequency over time means writing a fragile log parser. Both break silently the moment someone rewords the log message.

A structured log line looks like this:

{
  "timestamp":  "2024-11-14T14:32:07.341Z",
  "level":      "error",
  "event":      "payment_processing_failed",
  "order_id":   "8841923",
  "reason":     "timeout",
  "duration_ms": 3002,
  "retry":      2,
  "max_retries": 3,
  "service":    "payment-service",
  "trace_id":   "4bf92f3577b34da6"
}

Now you can query: event="payment_processing_failed" AND reason="timeout" AND retry=2. You can count events by reason, graph retry rates over time, and — critically — join this log entry to a trace using trace_id to see the full picture of what the request was doing.
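Emitting structured logs takes very little machinery. Here is a minimal helper (stdlib `json`; the field names follow the example above, and the function name is illustrative):

```python
import json
import sys
from datetime import datetime, timezone

def log_event(level, event, trace_id=None, stream=sys.stdout, **fields):
    """Emit one JSON log line: machine-parseable and joinable by trace_id."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "event": event,
        **fields,
    }
    if trace_id is not None:
        record["trace_id"] = trace_id
    stream.write(json.dumps(record) + "\n")
    return record

log_event("error", "payment_processing_failed",
          trace_id="4bf92f3577b34da6",
          order_id="8841923", reason="timeout",
          duration_ms=3002, retry=2, max_retries=3,
          service="payment-service")
```

One JSON object per line ("ndjson") is the format most log pipelines ingest natively; the key discipline is that every attribute is a named field, never interpolated into a message string.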

The trace_id field is the connective tissue of observability

The single most valuable thing you can do to connect your signals together is to propagate a trace_id (or request_id, correlation_id) through every log line and every trace span. Keep it out of metric labels: trace IDs have unbounded cardinality, and metric systems that need the link support exemplars, which attach sample trace IDs to data points without creating new series. When an incident starts with a metric spike, you can jump to a sample trace ID, and from there jump directly to every log line that was part of that request — across all services.

Log Levels Are a Contract

Log levels are not just decoration. They are a promise to the person who will be woken up at 2am. Think of them this way:

ERROR
  What it means: something failed that should not fail. A human should look at this.
  Who reads it: on-call engineer, alerting system.

WARN
  What it means: something unexpected happened but the system recovered or degraded gracefully. Worth knowing, not urgent.
  Who reads it: on-call engineer (next morning).

INFO
  What it means: normal operations worth recording — a user logged in, a job started, a payment processed.
  Who reads it: engineers investigating behavior.

DEBUG
  What it means: fine-grained detail useful during development. Should never run in production at scale.
  Who reads it: developers, temporary incident investigation.

The most common mistake: using ERROR for anything the developer finds surprising, including expected errors like a user entering a wrong password. If you log every 401 Unauthorized as an ERROR, your error logs become noise. The on-call engineer learns to ignore the ERROR level. Then a real error happens and they don't notice.

A useful rule: if a human shouldn't be woken up for it, it should not be at the ERROR level.

Sampling Logs at High Volume

At high traffic, logging every event at INFO level is expensive. A service handling 100,000 requests per second that logs one line per request is producing 100,000 log lines per second. That's a lot to store, a lot to ship, and a lot to query.

A few techniques help here. First, log sampling: for high-volume, low-value events (like a health check endpoint being polled every 10 seconds), log 1 in 100 or 1 in 1000. Track the sample rate in the log line so you can extrapolate actual counts.

Second, dynamic log level control: keep your service at INFO level by default, but allow operators to temporarily elevate a specific service to DEBUG during an incident without a code deploy. This is enormously valuable and costs almost nothing to build.

Third, always log ERROR without sampling. The cost of missing an error log is higher than the cost of storing it.
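Counter-based 1-in-N sampling is a few lines of code. A sketch (stdlib only, names illustrative); note that the sample_rate field travels with each line so counts can be multiplied back up at query time, and that unconfigured events default to no sampling at all:

```python
from collections import defaultdict

class SampledLogger:
    """Keep every Nth occurrence of an event, recording the sample rate."""
    def __init__(self, emit):
        self.emit = emit               # callable that ships the log line
        self.counts = defaultdict(int)
        self.rates = {}                # event name -> keep 1 in N

    def set_rate(self, event, n):
        self.rates[event] = n

    def log(self, event, **fields):
        self.counts[event] += 1
        rate = self.rates.get(event, 1)   # default rate 1 = log everything
        if self.counts[event] % rate != 0:
            return False                  # dropped by sampling
        self.emit({"event": event, "sample_rate": rate, **fields})
        return True

emitted = []
logger = SampledLogger(emitted.append)
logger.set_rate("health_check_polled", 100)

for _ in range(10_000):
    logger.log("health_check_polled", status="ok")

print(len(emitted))  # 100 lines stored, each standing in for 100 events
```

Because errors are never given a rate, they fall through to the default of 1 and are always emitted, matching the "never sample ERROR" rule above.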

Signal 3 — Distributed Traces

In a monolith, a request lives inside one process. When something goes wrong, you look at that process's logs. In a distributed system, a single user request might touch 10 or 15 services before returning a response. When it fails, which service was responsible? When it's slow, where did the time go?

Metrics can tell you that Service A's latency went up. Logs can tell you what Service A saw. Neither can tell you the full story of the request's journey across the system. Distributed tracing can.

How Tracing Works

When a request enters your system, you generate a unique trace ID — a random identifier for this entire request. Every service that touches the request gets this trace ID (propagated in HTTP headers, gRPC metadata, or message queue headers).

Each service records its own work as one or more spans. A span represents a unit of work: "Service B processed this request from time T1 to time T2." Spans know their parent span (who called them), so the whole thing forms a tree — the trace tree.

Trace ID: 4bf92f3577b34da6
│
├─ Span: api-gateway         [0ms ──────────────────── 240ms]
│  │
│  ├─ Span: auth-service     [2ms ──── 18ms]
│  │
│  ├─ Span: product-service  [20ms ────────────── 180ms]   ← slow!
│  │  │
│  │  ├─ Span: db-query      [22ms ─── 40ms]
│  │  ├─ Span: db-query      [42ms ─── 60ms]
│  │  ├─ Span: db-query      [62ms ─── 80ms]    ← N+1 query!
│  │  ├─ Span: db-query      [82ms ── 100ms]
│  │  └─ Span: cache-miss    [101ms ──────────── 178ms]
│  │
│  └─ Span: cart-service     [182ms ── 198ms]
│
└─ Total: 240ms (SLA is 200ms)   ← breached

This view — called a Gantt chart or waterfall view — immediately shows you several things that no metric or log would easily reveal: the product-service is the bottleneck, it's making 4 sequential database queries (an N+1 pattern), and there's an expensive cache miss. You know exactly where to look.
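The parent/child mechanics behind that tree can be sketched with a toy tracer using stdlib contextvars (a real system would use OpenTelemetry; every name here is illustrative). The key move is that "current span" lives in context-local state, so each new span can record its parent and inherit the root's trace ID:

```python
import contextvars
import uuid
from contextlib import contextmanager

current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []

@contextmanager
def span(name):
    parent = current_span.get()
    s = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        # Every span in one request shares the root span's trace_id.
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
    }
    token = current_span.set(s)
    try:
        yield s
    finally:
        current_span.reset(token)
        finished_spans.append(s)

# A request shaped like the waterfall above:
with span("api-gateway"):
    with span("auth-service"):
        pass
    with span("product-service"):
        with span("db-query"):
            pass
```

Because spans record only their own ID and their parent's ID, the backend can reassemble the full tree from spans reported independently by many services.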

What to Put in a Span

A span captures more than just start time and end time. Good spans include:

  • An operation name that describes the work ("GET /products/{id}", "db.query"), not a unique string per request
  • The service (and instance) that did the work
  • A status: did this unit of work succeed or fail, and with what error?
  • Attributes relevant to debugging: HTTP method and status code, the (sanitized) database statement, the queue name, retry counts
  • Events: timestamped annotations within the span, such as "cache miss" or "retry scheduled"

Context Propagation — The Hard Part

The technical challenge in distributed tracing is not collecting spans — any library can do that. The hard part is context propagation: making sure the trace ID travels with the request through every hop, including ones you don't control.

HTTP calls are easy — there are standard headers (traceparent from the W3C Trace Context standard, X-B3-TraceId from the older Zipkin standard). Most tracing libraries handle this automatically.

The holes in your trace graph usually come from: message queues (did you put the trace ID in the message metadata?), batch jobs (a nightly job triggered by a previous request — does it carry the original trace ID?), third-party services (they won't propagate your headers), and async callbacks (the trace ID must be stored alongside the callback registration, not just in memory at call time).
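When a hop won't propagate context for you (a message queue, say), the W3C traceparent header is simple enough to build and parse by hand. A sketch, following the published format (version "00", 32 hex chars of trace ID, 16 of parent span ID, 2 of flags, "01" meaning sampled):

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def make_traceparent(trace_id=None, parent_id=None, sampled=True):
    """Build a W3C Trace Context traceparent header value (version 00)."""
    trace_id = trace_id or secrets.token_hex(16)    # 32 hex chars
    parent_id = parent_id or secrets.token_hex(8)   # 16 hex chars
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"

def parse_traceparent(value):
    m = TRACEPARENT_RE.match(value)
    return m.groupdict() if m else None

# E.g. stuffing the header into message-queue metadata before publishing:
message_headers = {"traceparent": make_traceparent()}
print(parse_traceparent(message_headers["traceparent"]))
```

The consumer on the other side of the queue parses the header back out and continues the same trace, which is exactly the hole most queue integrations forget to close.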

OpenTelemetry — The Standard Worth Adopting

OpenTelemetry (OTel) is an open standard and SDK for instrumentation that covers metrics, logs, and traces. It has good language support (Go, Java, Python, Node.js, Ruby, .NET), and it's vendor-neutral — you instrument once and can send to Jaeger, Zipkin, Honeycomb, Datadog, or any OTel-compatible backend.

The main reason to use it: you do not want to re-instrument your entire codebase when you switch observability vendors. OTel gives you that portability.

Sampling — The Decision That Shapes What You See

A large service might handle hundreds of thousands of requests per second. Recording a trace for every request would cost a fortune in storage and processing, and most of those traces would show exactly the same healthy behavior. You need to sample — but how you sample determines which problems you can find.

Head-Based Sampling

In head-based sampling, the decision to trace a request is made at the very start — when the first service receives it. If you sample 1% of requests, you roll a die when the request arrives, and that decision is propagated to all downstream services: they either all trace or all skip.

This is simple and cheap. But there is a fundamental problem: you make the decision before you know anything about how the request will behave. A slow request, an errored request, a request that hits a rare code path — these have the same 1% chance of being captured as a completely normal request. The bugs that are hardest to find are often in the rarest requests, and they are systematically undersampled.

Tail-Based Sampling

In tail-based sampling, you buffer the spans from all services as the request flows through the system. After the request is complete, you look at the full trace and decide whether to keep it: Was it slow? Did it error? Did it take an unusual path? If yes, keep it. If it was completely normal, drop it.

This is how you find the 1% of requests that are failing, the 0.1% that are anomalously slow, and the edge cases you didn't anticipate. The downside is cost and complexity: you need a component (a tail-sampling proxy or collector) that buffers spans in memory, waits for the request to complete, evaluates the sampling rules, and then flushes or drops.

For most teams, a pragmatic middle ground works well: always sample errors and high-latency requests at 100%, and sample normal successful requests at a low rate (0.1%–1%). This gives you full coverage of failures and a statistically valid view of normal behavior.

# Tail-sampling rule example (OpenTelemetry Collector config)
tail_sampling:
  decision_wait: 10s    # wait up to 10s for all spans to arrive
  policies:
    - name: always-sample-errors
      type: status_code
      status_code: { status_codes: [ERROR] }

    - name: always-sample-slow
      type: latency
      latency: { threshold_ms: 1000 }

    - name: sample-normal-at-1-percent
      type: probabilistic
      probabilistic: { sampling_percentage: 1 }
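The decision logic those policies encode can be sketched as a plain function (illustrative only; an OTel Collector evaluates its policies internally, not like this):

```python
import random

def keep_trace(trace, rng=random.random):
    """Decide whether to keep a completed trace, mirroring the policies above."""
    # always-sample-errors
    if any(s.get("status") == "ERROR" for s in trace["spans"]):
        return True
    # always-sample-slow (threshold_ms: 1000)
    if trace["duration_ms"] > 1000:
        return True
    # sample-normal-at-1-percent
    return rng() < 0.01

errored = {"duration_ms": 120,  "spans": [{"status": "ERROR"}]}
slow    = {"duration_ms": 2400, "spans": [{"status": "OK"}]}
normal  = {"duration_ms": 35,   "spans": [{"status": "OK"}]}

print(keep_trace(errored))                  # True: failures always kept
print(keep_trace(slow))                     # True: tail latency always kept
print(keep_trace(normal, rng=lambda: 0.5))  # False: lost the 1% roll
```

Note that this function can only run once the trace is complete, which is precisely why tail-based sampling requires a buffering component.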

Signal 4 — Continuous Profiling

Metrics tell you that CPU is high. Traces tell you which service is slow. But neither tells you why the code is slow. That's what profiling is for.

Traditional profiling means attaching a profiler to a specific process at a specific moment, running a load test, and analyzing the output. It's a manual, offline activity. Continuous profiling is different: it runs in the background, all the time, in production, with low enough overhead that you don't need to turn it off.

How It Works

A continuous profiler interrupts your running process at a fixed frequency — say, 100 times per second — records the current call stack, and returns control. Over time, these stack samples accumulate into a statistical picture of where your CPU time is actually going. Functions that appear on more stack samples are using more CPU.

The output is a flame graph — a visualization where the x-axis is time (or sample count), the y-axis is the call stack depth, and the width of each block is proportional to how much time was spent in that function including its callees. The wide blocks at the top of the flame are the bottlenecks.
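The aggregation step behind a flame graph is just counting identical stacks. The "folded" format below (semicolon-joined frames, then a count) is the input format popularized by Brendan Gregg's flamegraph.pl; the sample data is made up:

```python
from collections import Counter

# Each sample is one captured call stack, root frame first.
samples = [
    ("main", "handle_request", "deserialize_config", "json_parse"),
    ("main", "handle_request", "deserialize_config", "json_parse"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "deserialize_config", "json_parse"),
]

def to_folded(stacks):
    """Collapse stack samples into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in stacks)
    return sorted(f"{stack} {n}" for stack, n in counts.items())

for line in to_folded(samples):
    print(line)
# main;handle_request;deserialize_config;json_parse 3
# main;handle_request;query_db 1
```

A flame-graph renderer turns those counts into block widths, which is why functions that appear in many samples show up as the wide blocks.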

A Real Example of What Profiling Finds

Imagine a service where Prometheus shows a CPU spike every day between 2am and 3am. No obvious cause. No errors. Just high CPU. An on-call engineer is woken up. They check metrics, check logs — nothing explains it.

A flame graph from continuous profiling shows that during that hour, 40% of CPU is being spent inside JSON.parse in a function called deserialize_config. Further investigation: a background job is reading a configuration file, deserializing it into an object, using one field, and discarding it — once per request, 10,000 times a minute, instead of caching it once at startup.

The fix is 3 lines of code. Without the profile, the team would have been looking for the cause for hours.

Continuous profiling is the least adopted of the four signals, which is a shame because it's often the fastest path from "something is slow" to "here is exactly why." Tools worth knowing: Pyroscope (open source), Parca (open source), Datadog Continuous Profiler, Google Cloud Profiler.

Alerting That Doesn't Cry Wolf

Collecting signals is the first half of observability. The second half is routing the right signals to the right people at the right time. This is alerting, and most teams do it badly.

Alert Fatigue Is a System Property

Alert fatigue is what happens when engineers receive too many alerts that don't require action. They learn to dismiss alerts without reading them. The alert that actually matters gets dismissed too. You've seen this — an on-call rotation where everyone mutes their phone by week two.

Alert fatigue is not a people problem. It is a design problem. The alert system was designed to produce noise, and humans adapted to it the only way they could.

The Wrong Mental Model for Alerting

The instinct is: alert on potential problems early, so the on-call has time to react before things get worse. So you alert on CPU > 70%, disk > 80%, memory > 75%, queue depth > 1000.

The problem: most of these thresholds fire regularly during normal traffic patterns. CPU at 72% on a Tuesday morning might be completely fine. The on-call acknowledges it, sees nothing wrong, and learns to ignore CPU alerts. Now CPU at 98% — a real problem — also gets ignored.

Alert on Symptoms, Not Causes

The right mental model for alerting: alert when users are affected, not when resources are stressed.

A user is affected when:

  • Requests are failing (elevated error rate)
  • Requests are slow (latency above the SLO)
  • Requests are returning wrong or stale results
  • A feature they depend on is unavailable

Notice: none of these say "CPU" or "memory" or "disk." Those are causes, not symptoms. High CPU might cause high latency — but you should alert on the latency, not the CPU. If CPU is high but latency is fine, there is no user impact and no reason to wake anyone up.

Causes — CPU, disk, queue depth — are useful for dashboards and postmortem analysis. They are generally not useful as alert triggers.

Burn Rate Alerting — The Right Way to Alert on SLOs

If your SLO says "99.9% of requests complete successfully," you have an error budget of 0.1%. Over a 30-day window, that's about 43 minutes of allowed downtime or 0.1% of your requests.

The naive approach: alert when error rate exceeds 0.1%. The problem: if your error rate is 0.11% for an entire month, you'll burn through your budget slowly and your alert will fire constantly at a low, annoying rate.

A better approach: burn rate alerting. Alert when your error budget is being consumed faster than sustainable. A 1x burn rate means you'd use up the budget in exactly 30 days. A 14x burn rate means you'd use it up in 2 days.

# Alert when burning through error budget 14x faster than sustainable
# i.e., in 2 days you'd exhaust a 30-day budget

alert: HighErrorBudgetBurnRate
expr: (
    rate(http_requests_errors_total[1h])
    /
    rate(http_requests_total[1h])
  ) > (14 * 0.001)   # 14x burn rate × 0.1% target error rate
for: 5m
severity: critical   # page someone now

Google's SRE workbook describes a multi-window burn rate approach — combining a fast window (for immediate spikes) and a slow window (for gradual degradation) — that reduces both false positives and alert latency simultaneously. It's worth implementing for any service with a formal SLO.
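Numerically, the check is small. A sketch (the 14.4x threshold and the 1h/5m window pair follow the SRE Workbook's commonly cited values; treat the exact numbers as tunable, not canonical):

```python
SLO_TARGET = 0.999                 # 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% allowed error rate

def burn_rate(error_rate):
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / ERROR_BUDGET

def should_page(error_rate_1h, error_rate_5m, threshold=14.4):
    """Multi-window check: both windows must be burning hot.

    The long window (1h) proves the problem is real, not a blip; the
    short window (5m) proves it is still happening, so the page stops
    firing once the bleeding stops.
    """
    return (burn_rate(error_rate_1h) > threshold and
            burn_rate(error_rate_5m) > threshold)

print(round(burn_rate(0.014), 2))   # 14.0: a 30-day budget gone in ~2 days
print(should_page(0.02, 0.02))      # True: sustained 20x burn
print(should_page(0.02, 0.0005))    # False: the spike is already over
```

The second case is the one single-window alerts get wrong: the hour-long average is still hot, but the incident has ended and no one needs to be woken up.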

Every Alert Should Have a Runbook

An alert without a runbook is an alarm that says something is wrong but not what to do about it. The on-call engineer wakes up, sees the alert, and has to figure out what to do from scratch — every time.

A runbook doesn't have to be long. It needs to answer three questions:

  1. What does this alert mean? What state is the system in when this fires?
  2. What is the impact? Which users are affected and how?
  3. What should I do? A numbered list of steps, starting with diagnosis and ending with mitigation or escalation.

Link the runbook directly in the alert. On-call at 3am, half asleep, does not want to open a wiki and search for the right page.

Connecting the Signals — Making Them Work Together

The real power of observability is not any single signal — it's the ability to move fluidly between them during an investigation. Here's what that looks like in practice.

1. METRIC ALERT FIRES
   "Error rate on /checkout is 3.2% (SLO: 0.5%)"
   │
   ▼
2. CHECK THE METRICS DASHBOARD
   Rate, Error, Duration — which service?
   → Product service error rate also elevated
   │
   ▼
3. JUMP TO A SAMPLE TRACE
   Find a failed trace for /checkout in the last 5 min
   → Trace shows product-service → DB call timing out
   → DB spans show wait time, not query time
   │
   ▼
4. QUERY LOGS FOR THAT TRACE ID
   trace_id="4bf92f3577b34da6" AND service="product-service"
   → "connection pool exhausted, waited 2900ms"
   │
   ▼
5. CHECK THE DB CONNECTION POOL METRIC
   product_service_db_pool_waiting_count
   → Spike started 22 minutes ago, correlates with a deploy
   │
   ▼
6. ROOT CAUSE: new feature deployed that holds DB connections open
   Fix: roll back deploy, or patch connection release bug
   Time to diagnosis: ~8 minutes

Each signal hands off to the next. Metrics give you the "what" and the "when." Traces give you the "where" in the system. Logs give you the specific error detail. The trace_id is the key that connects them.

This investigation would have taken an hour or more with only logs, and might never have found the connection pool root cause with only metrics.

Observability-Driven Development

Most teams treat observability as something you add after the system is built. You write the service, you deploy it, and then you add some dashboards. This is backwards.

The problem with retrofitting observability is that the information you need is often not there. Log lines don't have the context you need. Spans don't have the right attributes. There's no trace ID in the right places. Fixing this requires touching the code everywhere, and the engineer who built it may not be around.

Observability-driven development means asking, before you ship any feature: "If this breaks in production, how will I know? How will I diagnose it?" If you can't answer both questions, you're not done.

A Practical Pre-Ship Checklist

Before any new service or major feature goes to production, verify:

OBSERVABILITY PRE-SHIP CHECKLIST

Metrics
  ✓  RED metrics instrumented (rate, errors, duration as histogram)
  ✓  No high-cardinality labels (no user IDs, request IDs)
  ✓  Business metrics tracked (orders created, payments processed)
  ✓  Metrics visible in a dashboard

Logs
  ✓  Structured JSON logging (not free-text)
  ✓  trace_id propagated and included in every log line
  ✓  No PII in logs (email, password, SSN, payment details)
  ✓  ERROR level reserved for actionable failures
  ✓  Log levels tunable at runtime without redeploy

Traces
  ✓  All inbound and outbound calls produce spans
  ✓  Spans have meaningful names and relevant attributes
  ✓  Context propagated through async calls and queues
  ✓  Sampling configured (errors and slow requests at 100%)

Alerts
  ✓  Alert defined for user-visible error rate
  ✓  Alert defined for p99 latency
  ✓  Every alert has a linked runbook
  ✓  Alerts tested (does the alert actually fire in staging?)

The "Canary Operator" Mindset

When you deploy a new version, you should be watching your observability signals before the deployment team declares it done. Not just checking that error rate is zero — actively watching latency percentiles (p50, p95, p99), comparing them against the previous version, watching business metrics (did conversion rate drop?), and being ready to roll back if anything drifts.

The time from "deploy initiated" to "deployment declared healthy" should have a human watching dashboards for at least 5–10 minutes for any change to a critical path. Automated canary analysis — where a tool watches the signals and makes the rollout/rollback decision — is even better once you have the data quality to trust it.

Common Mistakes and How to Avoid Them

Mistake 1 — Treating the Dashboard as the System's Health

A dashboard that looks green is not the same as a system that is healthy. Dashboards show the metrics you thought to instrument. The problem might be in a metric you didn't think to instrument. "The dashboard looks fine" is not a valid answer to "is the system healthy?" during an incident. Always verify with real user data or synthetic probes.

Mistake 2 — Logging Sensitive Data

Logs, traces, and metrics are often stored in systems with broader access than your production database. Engineers searching logs during an incident don't need to see full credit card numbers, passwords, SSNs, or health information. Beyond the security risk, in most jurisdictions, logging PII creates compliance obligations.

Build log scrubbing into your logging library at the framework level — not as something individual engineers have to remember per log line. Define a list of fields that are always redacted or hashed before they leave the process.

Mistake 3 — Too Many Dashboards, No Standard

Over time, teams accumulate dashboards the way codebases accumulate dead code: steadily, and without deletion. You end up with 200 dashboards and no one knows which one is authoritative. During an incident, people pull up different dashboards and see different pictures of the system.

Designate a small number of canonical dashboards: one overview per service (the RED metrics), one infrastructure dashboard, one business metrics dashboard. Everything else is exploratory and can be deleted when the person who built it leaves.

Mistake 4 — Observability as a Cost Center

Teams cut observability costs during budget pressure and then pay for it during the next major incident. The correlation is direct and well-documented: companies that invest in observability have shorter incident times, lower customer impact, and lower total cost of incidents.

A useful way to think about the budget conversation: how long does a Severity 1 incident cost your company per hour? Whatever that number is, good observability that cuts diagnosis time from 2 hours to 15 minutes pays for itself in the first major incident.


Chapter Summary

The Key Principle

Observability is not a feature you add after the system is built — it's a property you design in from the start. A system that can't explain its own behavior will eventually fail in a way you can't diagnose, at the worst possible time.

What We Covered

  • Monitoring vs. observability — a critical distinction
  • Metrics: types, cardinality trap, RED method
  • Why averages lie and percentiles tell the truth
  • Structured logging and why it matters
  • Distributed tracing: spans, context propagation
  • Head-based vs. tail-based sampling trade-offs
  • Continuous profiling and flame graphs
  • Alerting on symptoms, burn-rate SLO alerting
  • Pre-ship observability checklist

The Most Common Mistakes

  • Using average latency instead of percentiles
  • Adding high-cardinality labels to metrics
  • Unstructured logs that can't be queried
  • Head-based sampling that hides rare failures
  • Alerting on CPU/memory instead of user impact
  • Alerts without runbooks
  • Skipping observability until after the launch
  • Logging PII in plain text

Three Questions for Your Next Design Review

  1. If this service starts returning errors for 1% of users starting right now, how long before your on-call engineer knows about it, and what will they see?
  2. Can you answer "which downstream call is responsible for this request being slow" without reading source code or deploying new instrumentation?
  3. When this alert fires at 3am, can the on-call engineer — who has never seen this service before — diagnose and mitigate the problem using only the runbook?