Chapter 35  ·  Part IX — Security as a System Property

Trust in Distributed Systems

How services prove who they are, verify who's calling them, and decide what that caller is allowed to do — at scale, without humans in the loop.

What's in this chapter

Key Learnings — Read This First

01 — Zero Trust

Network location proves nothing. A service inside your VPC should be treated with the same suspicion as one on the internet. Every request must carry proof of identity.

02 — mTLS

Regular TLS proves the server's identity to the client. mTLS also proves the client's identity to the server. This is the foundation of service-to-service trust.

03 — SPIFFE

SPIFFE gives every workload a URI-based identity (e.g. spiffe://prod/payments/api) backed by a short-lived X.509 certificate. No humans manage these — the platform rotates them automatically.

04 — Short-lived Certs

Certificate revocation (CRL, OCSP) doesn't work reliably at scale. The practical solution: issue certificates that expire in hours, not years, so they're stale before they can be abused.

05 — AuthN ≠ AuthZ

Authentication answers "who are you?" Authorization answers "what are you allowed to do?" They're separate systems. Mixing them produces code that's hard to audit and easy to get wrong.

06 — Blast Radius

The goal of all of this is containment. If Service A is compromised, it should not automatically give the attacker access to Service B's data. Identity + fine-grained authorization limits the damage.

The Problem: Who Do You Trust?

Imagine you run an online store. Your order service calls your payment service. When the payment service receives a request, how does it know the request actually came from the order service — and not from a script running on a compromised server inside your own data center?

For a long time, the answer was: it trusts anything on the internal network. If a request came from an IP address inside the corporate VPN or private cloud, it was assumed to be legitimate. This is called the castle-and-moat model. Build high walls around your network, and trust everything inside.

This model has one catastrophic flaw. Once an attacker gets inside — through a phishing attack, a compromised dependency, a misconfigured cloud bucket — they can move freely. They don't need to break in again. They're already trusted.

The Target breach in 2013 started through an HVAC contractor's network credentials. The attackers moved laterally across internal systems for weeks before anyone noticed, precisely because once inside the perimeter, everything was trusted.

The answer to this problem is zero trust networking: never trust based on network location alone. Every request must carry proof of identity. Every service must verify that proof. And every verified identity must be checked against what it's actually allowed to do.

Zero Trust: Never Trust, Always Verify

Zero trust is not a product you buy. It's a design philosophy. The core idea is simple: never grant access based on where a request comes from; grant it only after verifying who sent it and checking what that identity is allowed to do — on every single request.

Google's internal implementation of this, called BeyondCorp, is described in a 2014 paper that became influential across the industry. Google moved all of its internal tools off the VPN model and onto certificate-based identity. An employee's network location — whether in the office, at home, or in a coffee shop — stopped mattering. What mattered was whether they had a valid credential.

The same principle applies to services. Your payment service should not trust the order service because they share a private network. It should trust the order service because the order service presents a valid certificate proving its identity.

Practical Framing

Think of zero trust like a building with individual locked rooms instead of just a locked front door. Getting past the front door gets you into the lobby. To get into the server room, you need a separate key. To get into the database room, another key still. A zero trust system is built the same way — each boundary requires fresh proof of identity and permission.

Mutual TLS: The Mechanism Behind Service Identity

You're probably familiar with HTTPS. When your browser connects to https://example.com, the server presents a certificate. Your browser verifies it was signed by a trusted authority (a CA — Certificate Authority). This proves to you that you're talking to the real example.com, not an impostor. This is called one-way TLS or simply TLS.

But this only proves the server's identity to the client. The server has no idea who the client is.

Mutual TLS (mTLS) extends this. The client also presents a certificate. The server verifies it. Now both sides have proven their identity to each other. This is what we use for service-to-service calls in a zero trust system.

TLS vs mTLS — who proves what
One-way TLS (regular HTTPS):

  Client ────────── "I want to connect" ──────────→ Server
  Client ←──────── "Here is my certificate" ─────── Server
  Client verifies server cert against trusted CA

  Server identity: proven ✓      Client identity: unknown ✗

Mutual TLS (mTLS):

  Client ────────── "I want to connect" ──────────→ Server
  Client ←──────── "Here is my certificate" ─────── Server
  Client verifies server cert against trusted CA
  Client ───────── "Here is MY certificate" ──────→ Server
  Server verifies client cert against trusted CA

  Server identity: proven ✓      Client identity: proven ✓

The certificates themselves are just files containing a public key and some metadata (who issued it, when it expires, who it belongs to). The private key never leaves the service. When a service presents its certificate, it also performs a cryptographic proof that it holds the corresponding private key. An attacker who only steals the certificate file — not the private key — cannot impersonate the service.
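In configuration terms, the gap between one-way TLS and mTLS is often a single server-side setting: requiring and verifying a client certificate. A minimal sketch using Python's standard ssl module — the commented-out file paths are placeholders for real certificate and key files, not part of any actual deployment:

```python
import ssl

# Server side: a plain TLS server only proves its own identity.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

# For mTLS, additionally demand and verify a client certificate.
server_ctx.verify_mode = ssl.CERT_REQUIRED
# server_ctx.load_cert_chain("server.pem", "server.key")  # server identity
# server_ctx.load_verify_locations("ca.pem")              # CA that signs client certs

# Client side: present our own certificate during the handshake.
# (PROTOCOL_TLS_CLIENT verifies the server certificate by default.)
client_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
# client_ctx.load_cert_chain("client.pem", "client.key")  # client identity
```

The private key referenced by load_cert_chain never crosses the wire; the TLS library uses it to prove possession during the handshake, which is why stealing only the certificate file is not enough to impersonate a service.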

What mTLS actually proves

It's worth being precise here, because this trips people up.

mTLS proves that the caller holds a specific private key, and that key corresponds to a certificate signed by a CA you trust. It does not prove that the caller is behaving correctly, that its code hasn't been tampered with, or that it's authorized to perform the specific action it's requesting.

Think of it like a government-issued ID. When someone shows you their passport, you know they're who the passport says they are (assuming the passport is real and unforgeable). But their identity doesn't tell you whether they're allowed to enter a specific room in your building. That's a separate decision.

Authentication (mTLS proving identity) and authorization (deciding what that identity can do) are separate concerns. We'll come back to this distinction later in the chapter.

The problem with managing certificates by hand

If you have 5 services, you could manually generate certificates for each one, distribute them, and manage their rotation. Painful, but possible.

If you have 500 services running thousands of instances across multiple data centers, manual certificate management becomes a full-time job — and a source of outages. A certificate expires over a holiday weekend. Someone rotates the wrong cert. A new service gets deployed with a stale certificate. These are real, common failure modes.

This is the problem that SPIFFE and SPIRE were designed to solve.

SPIFFE and SPIRE: Automated Identity for Workloads

SPIFFE stands for Secure Production Identity Framework For Everyone. It's a set of open standards for how services identify themselves. SPIRE is the reference implementation — the software you actually run.

The core idea is: every workload (a service, a job, a container, a function) gets a cryptographic identity automatically when it starts. That identity is:

  • URI-based — a SPIFFE ID such as spiffe://prod/payments/api
  • Cryptographically verifiable — backed by a short-lived X.509 certificate
  • Issued and rotated automatically by the platform

No human generates these certificates. No operator distributes them. The platform handles the whole lifecycle.

How SPIRE works under the hood

SPIRE has two main components: the SPIRE Server (the central authority) and SPIRE Agents (running on every node/host).

SPIRE architecture — how a workload gets its identity
At startup:

  Workload (e.g. payments service) starts on Node A
        │
        ▼
  SPIRE Agent on Node A
        │  "This workload has these properties:
        │   - namespace: payments
        │   - k8s service account: api
        │   - node attestation: verified"
        ▼
  SPIRE Server
        │  Checks: does this match a registered entry?
        │  Entry: spiffe://prod/payments/api
        │         for pods with SA=api in NS=payments
        ▼
  YES — issue certificate
  Agent receives signed X.509 SVID
        │
        ▼
  Workload gets SVID via Workload API

  Certificate valid for: 1 hour
  Agent auto-renews at 30 min remaining

The key concept here is workload attestation. The SPIRE Agent doesn't just hand out certificates to anyone who asks. It looks at the requesting process — its Kubernetes service account, its namespace, its binary path, its parent process — and checks whether those properties match a registered entry. Only if they match does it issue the SVID.

This means an attacker who deploys a rogue container cannot simply claim the identity of your payments service unless they can also replicate the exact Kubernetes metadata that the payments service runs with.

Important Nuance

SPIRE's security ultimately depends on the security of your node attestation. If an attacker has root access to a node, they may be able to impersonate any workload running on that node. This is why defense in depth matters — SPIFFE/SPIRE raises the bar significantly, but it doesn't replace OS-level security and container isolation.

JWT-SVIDs vs X.509-SVIDs

SPIFFE supports two formats for identity documents.

X.509-SVIDs are standard X.509 certificates. They're used directly in mTLS connections. The TLS handshake presents the certificate, and the other side verifies it. No extra code needed in your application — the TLS library handles everything.

JWT-SVIDs are JSON Web Tokens signed by the SPIRE Server. They're useful when you need to pass identity through an HTTP header — for example, when calling an API gateway that terminates TLS and needs to forward identity information to the backend. The token carries the SPIFFE ID and an expiry time, and any service can verify it by checking the signature against the SPIRE Server's public key.

Property                 X.509-SVID                          JWT-SVID
----------------------   ---------------------------------   ------------------------------------
Primary use              mTLS connections                    HTTP headers, REST APIs
Verification mechanism   TLS handshake                       JWT signature check
App code required        Minimal — TLS library handles it    Yes — must extract and verify header
Works through proxies    No — TLS must be terminated         Yes — header passes through
Replay risk              Low — tied to connection            Higher — token can be copied

In practice, most internal service-to-service traffic uses X.509-SVIDs with mTLS. JWT-SVIDs are used at boundaries where TLS is being terminated — like at an API gateway or a load balancer.

The Certificate Lifecycle Problem

Every certificate has an expiry date. When it expires, the connection fails. This seems straightforward — just renew before expiry. But at scale, this becomes a genuine operational problem.

Revocation: the harder problem

Expiry handles the case where a certificate naturally ages out. But what about a certificate that is compromised before it expires? Maybe the private key leaked, or the service it belongs to was decommissioned. You need a way to say "this certificate, which hasn't expired yet, should no longer be trusted."

There are two standard mechanisms for this.

CRL (Certificate Revocation List): The CA maintains a list of revoked certificate serial numbers. Services periodically download this list and check incoming certificates against it. The problem is that lists can be large, they're downloaded on a schedule (not real-time), and services need to decide what to do when they can't fetch an updated list.

OCSP (Online Certificate Status Protocol): Instead of a list, each certificate contains a URL. When you see a certificate, you make an HTTP request to that URL asking "is this cert still valid?" The CA responds in real time. The problem: this adds a network round-trip to every TLS handshake. It also creates availability coupling — if the OCSP server is slow or down, your TLS handshakes get slow or fail.

Why Both Approaches Break at Scale

At high traffic volumes, OCSP creates enormous load on the CA's servers and adds latency to every connection. CRL becomes unwieldy as the list grows. Neither mechanism provides instant revocation — there's always a window (minutes to hours) where a revoked cert might still be accepted. This is why the industry has largely moved toward a different approach.

The practical solution: short-lived certificates

The approach that actually works at scale is elegant in its simplicity: issue certificates that expire quickly. If a certificate is valid for one hour, then even if it's compromised, it's useless within an hour. You don't need revocation if the certificate expires before it can be abused.

This is exactly what SPIRE does by default. It issues SVIDs with a one-hour TTL and starts renewing them when 30 minutes remain. The workload always has a valid, fresh certificate without any human involvement.

Short-lived cert rotation lifecycle
T=0:00   SPIRE issues cert (expires in 60 min)
         Workload uses cert for all mTLS connections

T=0:30   30 min remaining — SPIRE Agent begins renewal
         Requests new cert from SPIRE Server
         Server issues new cert (expires T=1:30)

T=0:31   Workload seamlessly switches to new cert
         Old cert still valid for 29 more minutes
         In-flight connections are not disrupted

T=1:00   Old cert expires — no longer accepted

If cert was compromised at T=0:15, it was usable for at most 45 minutes.
No revocation infrastructure needed.
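The renewal rule in this lifecycle is easy to state precisely: renew once half the certificate's lifetime remains. A small sketch of that decision, with hypothetical names — real agents also add jitter and retry logic:

```python
from datetime import datetime, timedelta, timezone

def should_renew(issued_at: datetime, ttl: timedelta,
                 now: datetime) -> bool:
    """Renew when half or less of the certificate's TTL remains."""
    expires_at = issued_at + ttl
    remaining = expires_at - now
    return remaining <= ttl / 2

issued = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
ttl = timedelta(hours=1)

# 15 minutes in: 45 min left, more than half the TTL — keep the cert.
print(should_renew(issued, ttl, issued + timedelta(minutes=15)))  # → False
# 30 minutes in: exactly half the TTL left — begin renewal.
print(should_renew(issued, ttl, issued + timedelta(minutes=30)))  # → True
```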

The trade-off is that your certificate infrastructure needs to be highly available. If the SPIRE Server is down when a workload tries to renew, the workload will eventually have an expired certificate and start failing connections. This is why SPIRE Servers are deployed in high-availability clusters with replicated state.

The rule of thumb most teams use: the shorter the TTL, the less you need revocation infrastructure, but the higher the availability requirements on your CA. A one-hour TTL is a reasonable balance for most production systems. Some high-security environments use 15-minute TTLs. Certificates valid for years are only appropriate for situations with strong revocation infrastructure and low compromise risk.

Authentication vs. Authorization: The Crucial Distinction

These two words are often used interchangeably. They describe completely different things. Getting the distinction wrong leads to security systems that are either too permissive (anything authenticated can do anything) or too brittle (authorization logic spread across every service, impossible to audit).

Authentication (often shortened to AuthN) answers the question: who are you? It's about establishing identity. mTLS is an authentication mechanism. A JWT is an authentication token. The output of authentication is a verified identity — for example, spiffe://prod/payments/api.

Authorization (AuthZ) answers the question: what are you allowed to do? It takes a verified identity and a requested action and returns allow or deny. The input to authorization is the identity (from authentication) plus the action being attempted. The output is a decision.

The two-step process for every request
Incoming request from Service A to Service B

Step 1 — Authentication: "Does this request carry a valid credential?"
  Service B checks the mTLS certificate.
  Certificate is signed by trusted CA.
  Certificate contains SPIFFE ID: spiffe://prod/order/api
  Result: caller is "order service" ✓

Step 2 — Authorization: "Is the order service allowed to call this endpoint?"
  Check policy: can spiffe://prod/order/api call payments.CreateCharge?
  Result: YES — allow ✓
      OR
  Check policy: can spiffe://prod/order/api call payments.RefundAll?
  Result: NO — deny ✗

The critical point is that a successfully authenticated identity does not automatically mean the caller can do anything. The order service might be legitimately authenticated and still not have permission to trigger a full refund. These are independent checks.
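The separation can be made concrete in a few lines. This is a sketch with a hypothetical in-memory policy table: step 1 (authentication, e.g. via mTLS) has already produced a verified identity, and step 2 only consumes that identity to reach a decision:

```python
# Hypothetical allow-list keyed by (verified identity, action).
POLICY = {
    ("spiffe://prod/order/api", "payments.CreateCharge"): True,
}

def authorize(verified_id: str, action: str) -> bool:
    """AuthZ only: takes an already-authenticated identity plus an
    action, and returns a decision. Unknown pairs are denied."""
    return POLICY.get((verified_id, action), False)   # deny by default

# Authenticated AND authorized: allowed.
assert authorize("spiffe://prod/order/api", "payments.CreateCharge")
# Authenticated but NOT authorized for this action: denied.
assert not authorize("spiffe://prod/order/api", "payments.RefundAll")
```

The deny-by-default lookup is the important design choice: an identity that authentication has verified still gets nothing unless a policy explicitly grants it.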

Where to enforce authorization

There are three places authorization can live, each with different trade-offs.

In application code — each service implements its own authorization checks. This is the most common approach and the most fragile. The logic is scattered across dozens of services, written inconsistently, and nearly impossible to audit. When a security team asks "which services can call the payments refund endpoint?", the answer requires reading the code of every service in the system.

In a shared library — a common authorization library is distributed across services. Better consistency, but you've now coupled every service to the library's release cycle. And the logic is still in application code — it's just shared application code.

In the infrastructure layer — the service mesh or sidecar proxy enforces authorization before the request even reaches the application. This is the most powerful approach. The application doesn't need to know anything about authorization policies — the infrastructure enforces them. A security team can audit and update policies in a central policy store without touching application code.

Service mesh authorization with Istio

A service mesh like Istio deploys a small proxy (called Envoy) alongside every service. All inbound and outbound traffic passes through this proxy. The proxy terminates mTLS, verifies the peer's SPIFFE identity, and checks the request against authorization policies before forwarding it to the application.

Istio AuthorizationPolicy — declarative, infrastructure-level
# Allow the order service to call the payments service,
# but only on specific methods. Deny everything else.

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-authz
  namespace: payments
spec:
  # Apply this policy to the payments service
  selector:
    matchLabels:
      app: payments-api

  rules:
  - from:
    - source:
        # Only requests from the order service
        principals:
        - "cluster.local/ns/orders/sa/order-api"
    to:
    - operation:
        # Only these specific methods
        methods: ["POST"]
        paths:
        - "/v1/charges"
        - "/v1/charges/*/capture"

# Any other caller, or any other path, is denied by default.
# No code changes required in the payments service.

The payments service itself does not check who's calling it. The proxy handles that. The payments engineers can focus on payment logic. The security team owns the authorization policy. These concerns are properly separated.

The Audit Advantage

When all authorization policies are declared as infrastructure resources (like the YAML above), a security audit becomes straightforward. You can answer "which services can call payments.CreateCharge?" by querying your policy store, not by grepping application code across 50 repositories. This alone is a significant operational improvement.
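Once policies are data rather than code, that audit question is a query. A sketch of the idea — the policy structure below loosely mirrors the Istio example above, but the field names are simplified and hypothetical, not the real Istio schema, and matching here is exact-path only (no wildcard handling):

```python
# Simplified policy records: who may call which paths.
policies = [
    {"principals": ["cluster.local/ns/orders/sa/order-api"],
     "methods": ["POST"],
     "paths": ["/v1/charges", "/v1/charges/*/capture"]},
]

def who_can_call(path: str) -> list[str]:
    """Answer the audit question 'which principals can call this path?'
    by scanning the policy store (exact path match only)."""
    callers: list[str] = []
    for p in policies:
        if path in p["paths"]:
            callers.extend(p["principals"])
    return callers

print(who_can_call("/v1/charges"))
# → ['cluster.local/ns/orders/sa/order-api']
print(who_can_call("/v1/refunds"))
# → []
```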

RBAC vs. ABAC: choosing the right model

Role-Based Access Control (RBAC) assigns permissions to roles, then assigns roles to identities. Your order service has the role payments-caller. That role can call specific payment endpoints. It's simple, easy to understand, and works well when identities fall into a manageable number of roles.

Attribute-Based Access Control (ABAC) makes decisions based on arbitrary attributes of the request — the caller's identity, the resource being accessed, the time of day, the requested action, even values in the request payload. It's more flexible but more complex. A policy might say "allow if caller is the order service AND the requested charge amount is under $10,000 AND the request comes during business hours."

For service-to-service authorization, RBAC is usually sufficient. The identities are well-defined (SPIFFE IDs), the actions are well-defined (gRPC methods, HTTP paths), and the rules are relatively stable. ABAC becomes valuable when you need to make decisions that depend on data — for example, "the user can only modify their own records, not someone else's."

A common pattern is to use RBAC at the infrastructure layer (service mesh enforces coarse-grained access: order service can call payments) and ABAC in application code for fine-grained data-level decisions (this user can only view their own invoices).
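The two layers of that pattern look quite different in code. A sketch with entirely hypothetical roles and rules — RBAC is a lookup through role bindings, while ABAC is a predicate over request attributes:

```python
# RBAC (coarse, infrastructure-level): identity → roles → permissions.
ROLE_BINDINGS = {"spiffe://prod/orders/api": {"payments-caller"}}
ROLE_PERMISSIONS = {"payments-caller": {"payments.CreateCharge"}}

def rbac_allows(caller: str, action: str) -> bool:
    roles = ROLE_BINDINGS.get(caller, set())
    return any(action in ROLE_PERMISSIONS.get(r, set()) for r in roles)

# ABAC (fine-grained, application-level): decision depends on data
# in the request, not just the caller's identity.
def abac_allows(user_id: str, invoice_owner: str) -> bool:
    # Attribute rule: a user may only view their own invoices.
    return user_id == invoice_owner

assert rbac_allows("spiffe://prod/orders/api", "payments.CreateCharge")
assert not rbac_allows("spiffe://prod/orders/api", "payments.RefundAll")
assert abac_allows("12345", "12345")       # own invoice: allowed
assert not abac_allows("12345", "99999")   # someone else's: denied
```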

Putting It All Together: Trust End-to-End

Let's walk through a complete example: a user clicks "buy" in your mobile app, which triggers a chain of service calls.

End-to-end trust chain for a purchase request
User → API Gateway
  User sends JWT (from login) in Authorization header
  Gateway validates JWT signature against auth service public key
  AuthN: user is user_id=12345 ✓
  Gateway checks: is user allowed to make purchases?
  AuthZ: user has "buyer" role ✓

API Gateway → Order Service
  mTLS: Gateway presents spiffe://prod/gateway/api
  mTLS: Order svc presents spiffe://prod/orders/api
  Both sides verify each other's certs — mutual AuthN ✓
  Istio policy: gateway is allowed to call orders — AuthZ ✓
  Gateway forwards user context in request header: x-user-id: 12345

Order Service → Payment Service
  mTLS: Order svc presents spiffe://prod/orders/api
  mTLS: Payment svc presents spiffe://prod/payments/api
  Mutual AuthN ✓
  Istio policy: order svc is allowed to call payments.CreateCharge — AuthZ ✓
  Payment svc records: charge created by order svc on behalf of user 12345

If payments tried to call an admin endpoint not in its policy:
  Istio proxy: DENY — no policy allows this
  Request blocked before reaching the application ✗

Notice a few things in this picture. Each hop has its own authentication and authorization. The blast radius of a compromise is limited — a compromised order service can call the payment endpoints it's authorized for, but not admin endpoints, not other services it's not supposed to reach, not sensitive data stores it has no policy for.

Also notice that user identity (user_id: 12345) is propagated as data in the request, separate from service identity (the mTLS certificate). Service identity proves which service is calling. User context tells the service which end-user originated the action. Both matter, for different reasons.

Common Mistakes and How to Avoid Them

Mistake 1: Using network segmentation as a substitute for service identity

Many teams think "we have the payment service in its own subnet with tight firewall rules, so we don't need mTLS." Firewalls reduce the attack surface but don't eliminate the problem. A compromised service inside that subnet can still make calls that look legitimate to a firewall. mTLS ensures that the caller proves their identity regardless of where the call originates.

Mistake 2: Long-lived service account credentials

Static API keys or long-lived tokens handed to services as environment variables are a major vulnerability. They get checked into git repositories. They get leaked in logs. They never expire. Replace them with short-lived credentials issued by SPIRE, or at minimum with a secrets manager that rotates credentials automatically.

Mistake 3: Treating authentication as sufficient

A common architecture has every service accept any authenticated caller and then implement authorization in application code inconsistently. Service A might check permissions carefully. Service B might skip the check for a path added in a hurry during an incident. The infrastructure authorization approach prevents this — the policy is enforced before the request reaches the application, regardless of how carefully each service's application code was written.

Mistake 4: Wide service-to-service permissions

It's tempting to define one broad policy: "backend services can call any other backend service." This feels simple, but it means a compromised backend service has access to your entire internal API surface. Apply the principle of least privilege — each service should only be allowed to call the specific endpoints it actually needs.

The Permission Creep Problem

Service-to-service permissions tend to grow over time and never shrink. A service gets access to an endpoint for a one-off task, the task is done, but the permission remains. Over time, every service accumulates permissions it doesn't need. Schedule a quarterly review of your authorization policies. Anything unused for 90 days is a candidate for removal.
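That quarterly review can be partially automated if you record when each grant was last exercised. A sketch — the grant records and timestamps below are hypothetical, and in practice the last-used data would come from your mesh's access logs:

```python
from datetime import datetime, timedelta, timezone

def stale_grants(grants: dict[str, datetime], now: datetime,
                 max_age: timedelta = timedelta(days=90)) -> list[str]:
    """Flag any permission that hasn't been exercised in max_age."""
    return [g for g, last_used in grants.items()
            if now - last_used > max_age]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
grants = {
    "orders->payments:/v1/charges": now - timedelta(days=3),
    "batch->payments:/v1/exports":  now - timedelta(days=200),
}

print(stale_grants(grants, now))
# → ['batch->payments:/v1/exports']
```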

Mistake 5: No audit log of authentication decisions

When a security incident happens, you need to answer: which services were talking to each other, when, and with what identity? If your service mesh or identity system doesn't produce structured access logs, you're flying blind during an investigation. Ensure every mTLS connection and every authorization decision is logged with the caller's SPIFFE ID, the requested action, and the policy decision.

How to Adopt This Incrementally

You don't need to implement all of this at once. Here's a reasonable progression for a team moving from "trust everything on the network" to a mature zero trust posture.

Stage 1 — Visibility. Before enforcing anything, understand your current call graph. A service mesh in "permissive mode" (allow all, log everything) will show you which services call which other services without breaking anything. This gives you the data you need to write accurate policies later.

Stage 2 — Identity. Deploy SPIRE and assign SPIFFE identities to all workloads. Enable mTLS in the mesh, but still in permissive mode. Services now identify themselves even though nothing is enforced yet. Verify that identities are being issued correctly and that certificates are rotating as expected.

Stage 3 — Coarse enforcement. Switch the mesh to deny-by-default for a subset of sensitive services — your payment service, your user database proxy, your secrets manager. Write policies that match your observed call graph from Stage 1. Monitor for denials and fix legitimate calls that get blocked.

Stage 4 — Broad enforcement. Roll out deny-by-default across all services. This is the disruptive stage — previously unknown or undocumented call paths will break. Have a clear process for teams to get new call paths approved and added to policy.

Stage 5 — Tighten. Gradually narrow permissions. Instead of "order service can call any POST endpoint in payments", specify the exact paths. Remove permissions that haven't been used in 90 days. This is ongoing work, not a one-time task.

Chapter 35 — End Summary

The Key Principle

Network location proves nothing. Every service call must carry cryptographic proof of identity, and every verified identity must be checked against what it's allowed to do — these are two separate systems that must both be in place.

The Most Common Mistake

Treating a successfully authenticated caller as automatically authorized to do anything. mTLS proves identity. It says nothing about permissions. Authentication and authorization must be enforced independently.

Three Questions for Your Next Design Review

  • If Service A is compromised, which other services can it call and what data can it access?
  • Are your service-to-service credentials short-lived and automatically rotated, or static secrets that could be leaked?
  • Can you answer "which services are authorized to call this endpoint?" from a central policy store, or only by reading application code?