Part IX  ·  Chapter 34  ·  Security as a System Property

Security Is Not a Feature,
It's a Constraint

Why bolt-on security always fails, and how to design systems where security is load-bearing from the start

Part IX of X
Chapter 34 of 38
Reading time ~35 min
Difficulty Intermediate

What's in this chapter

Key Learnings

01

Security is a constraint, not a feature. A feature is optional and addable later. A constraint shapes every decision from the start. Systems designed without security as a constraint are not "insecure" — they are fundamentally different systems that happen to look similar.

02

Threat modeling is just asking four questions: What are we building? What can go wrong? What are we going to do about it? Did we do a good job? You don't need a security team to do this — you need 90 minutes and the right questions.

03

Defense in depth means layers that independently block attacks, not multiple copies of the same control. A firewall and a WAF at the same layer is not defense in depth. An auth check at the API gateway AND in the service handler is.

04

The principle of least privilege is not about users. At the system level it means: each service should have exactly the permissions it needs to do its job, and no more. A read-only analytics service that can write to the payments table is a ticking clock.

05

Environment variables are not secrets management. They're slightly better than hardcoding. A real secrets management system has: rotation, audit logs, fine-grained access control, and automatic expiry. SECRET_KEY=abc123 in a .env file has none of these.

06

Blast radius is a design decision. When you design your service boundaries, you are also designing the blast radius of a compromise. Small, well-isolated services with narrow permissions limit what an attacker can do after they get in.

07

The attack surface of a distributed system is larger than a monolith, not smaller. Every network hop, every queue, every config file, every dependency is a potential entry point. Microservices do not improve your security posture by default — they often make it worse.

The Bolt-On Problem

Here's a story that plays out at almost every company.

A team builds a new service. It's a data processing pipeline — nothing fancy. The service reads from a database, does some computation, writes results to a queue. The team is focused on performance and correctness. They ship it. It works. Six months later, a security engineer does an audit and finds that:

the service has full read/write access to every table in the database
the queue accepts messages from any source, with no authentication
the service logs contain customer PII
the service's API key is stored in a config file checked into the git repository

None of this was malicious. Nobody made a decision to be insecure. Security just wasn't part of the design conversation. It was assumed to be "the security team's problem" or "something we'll deal with before GA."

The hard truth is: once a system is built this way, you can't fix it with a security patch. You have to redesign it. The service's access model is baked into its architecture. The logging format is used by three other systems that parse the logs. The API key rotation would require coordinated changes across six services.

The Core Problem

Security added after the fact is expensive not because the security work is hard, but because it requires undoing architectural decisions that other parts of the system have built on top of. The later you add it, the more you're paying to undo.

This is what it means to say security is a constraint, not a feature. A feature can be added to a working system. A constraint shapes the design of the system before a single line of code is written.

Threat Modeling: The Skill You Weren't Taught

Threat modeling sounds intimidating. It sounds like something that requires a team of security specialists, a special framework, and several weeks of workshops. It doesn't. At its core, threat modeling is just four questions asked in order.

The Four Questions

1. What are we building?

Before you can think about what can go wrong, you need a clear picture of what you're building. This is a diagram showing the components of your system and how data moves between them. Every box is an asset. Every arrow is a potential attack vector. You're looking for places where data crosses a trust boundary — from the internet to your network, from one service to another, from a service to a database.

// A simple data flow diagram for threat modeling
// Each arrow crossing a boundary is a potential threat

[Internet]
    │
    │ HTTPS (trust boundary: untrusted → trusted)
    ▼
[API Gateway] ── auth token ──→ [Auth Service]
    │
    │ Internal RPC (trust boundary: gateway → service)
    ▼
[Order Service]
    │              │
    │ SQL          │ Message
    ▼              ▼
[Orders DB]   [Event Queue]
                   │
                   │ Consume (trust boundary: queue → worker)
                   ▼
           [Fulfillment Worker]
                   │
                   │ HTTP
                   ▼
           [3rd Party Shipping API]   (trust boundary: your network → external)

Every arrow on this diagram where data crosses a trust boundary is where you need to think carefully. The internet-to-gateway boundary. The queue-to-worker boundary. The call to the third-party shipping API.

2. What can go wrong?

This is where most people think you need security expertise. You don't. You need a structured way to think about adversarial behavior. The most useful framework is called STRIDE. For each trust boundary in your diagram, ask whether an attacker could:

Spoofing: pretending to be someone or something you're not.
Example: an attacker calls your internal order service pretending to be the API gateway.

Tampering: modifying data in transit or at rest.
Example: a malicious queue message causes the fulfillment worker to ship to the wrong address.

Repudiation: denying an action you actually performed.
Example: a fraudulent order is placed and there's no audit trail to prove it happened.

Information disclosure: accessing data you shouldn't see.
Example: error messages include stack traces that reveal the database schema.

Denial of service: making the system unavailable.
Example: flooding the queue with malformed messages halts the fulfillment worker.

Elevation of privilege: doing more than you're authorized to do.
Example: a regular user triggers an admin action by manipulating a request parameter.

You don't need to find every possible threat. You need to find the plausible ones. An attacker who can modify messages in your internal queue is far more plausible than an attacker who can physically intercept your datacenter cables. Focus on the former.
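The STRIDE pass above is mechanical enough to sketch in code: for every trust boundary in the data flow diagram, walk through one question per category. A minimal sketch follows; the boundary names match the earlier example diagram, and the question wording is illustrative, not a standard.

```python
# A minimal STRIDE worksheet: one question per threat category,
# asked once for every trust boundary in the data flow diagram.

STRIDE = {
    "Spoofing": "Could one side pretend to be something it is not?",
    "Tampering": "Could data crossing this boundary be modified?",
    "Repudiation": "Could an action here be denied later, with no record?",
    "Information disclosure": "Could data leak to the wrong party?",
    "Denial of service": "Could this boundary be flooded or blocked?",
    "Elevation of privilege": "Could a caller do more than it is allowed?",
}

boundaries = [
    "Internet → API Gateway",
    "API Gateway → Order Service",
    "Event Queue → Fulfillment Worker",
    "Order Service → 3rd Party Shipping API",
]

def worksheet(boundaries):
    """One (boundary, threat, question) row per combination."""
    return [(b, threat, q) for b in boundaries for threat, q in STRIDE.items()]

rows = worksheet(boundaries)
# 4 boundaries × 6 STRIDE categories = 24 questions to walk through
```

Twenty-four questions sounds like a lot, but most take seconds to answer; the handful that make you pause are exactly the threats worth writing down.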

3. What are we going to do about it?

For each threat you found, you have four options:

Mitigate: Reduce the risk. Add authentication to the queue consumer. Validate message schemas.
Eliminate: Remove the attack vector entirely. Don't put that data in the message at all — look it up from a trusted source instead.
Transfer: Move the risk to someone better positioned to handle it. Use a managed message queue service that handles authentication for you.
Accept: Decide the risk is low enough that the cost of mitigation isn't worth it. This is a legitimate choice — but it must be a conscious decision, documented and understood.
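The "conscious decision, documented and understood" part of accepting a risk can be enforced structurally. A sketch of a threat-decision record, with illustrative field names, that refuses an "accept" without an owner and a rationale:

```python
# A threat-decision record. "Accept" is a legitimate outcome only when
# it carries an owner and a rationale; this structure enforces that.
from dataclasses import dataclass

ACTIONS = {"mitigate", "eliminate", "transfer", "accept"}

@dataclass
class ThreatDecision:
    threat: str
    action: str
    rationale: str = ""
    owner: str = ""

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action}")
        # Accepting a risk without an owner and a rationale is not a
        # decision; it is an omission.
        if self.action == "accept" and not (self.owner and self.rationale):
            raise ValueError("'accept' requires an owner and a rationale")
```

Whether you keep these records in a spreadsheet, a ticket, or code matters far less than the fact that each one has a name attached to it.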

4. Did we do a good job?

This is the review step. Did you cover all the trust boundaries? Did any "accept" decisions get reviewed by the right people? Are the mitigations actually implemented and not just planned? The answer to this question is never "yes" the first time. It's a forcing function to make sure you revisit.

Practical Advice

Run a threat model during the design phase, not after. A 90-minute session with the tech lead, one engineer, and a data flow diagram will surface 80% of the real risks. The goal is not to find every possible attack — it's to find the ones that would make headlines.

Defense in Depth: What It Actually Means

Defense in depth is one of the most misunderstood concepts in security. People often interpret it as "add more security controls." That's wrong. Defense in depth means having multiple independent layers where each layer can stop an attack that got past the previous one.

The word to focus on is independent. Two controls at the same layer that can both be defeated the same way are not defense in depth. Consider a firewall and a WAF that both rely on IP-based rules: an attacker who can spoof or rotate IPs defeats both simultaneously.

What Independent Layers Look Like in Practice

Consider a request flowing from an external user to a sensitive database:

[User request]
      │
      ▼
┌─────────────────────────────────────────────────────┐
│ Layer 1: Network                                    │
│  - TLS: request is encrypted in transit             │
│  - Firewall: only port 443 is open                  │
│  - DDoS protection: rate limiting at the edge       │
└─────────────────────┬───────────────────────────────┘
                      │ (attacker needs to defeat Layer 1 OR find a way around it)
                      ▼
┌─────────────────────────────────────────────────────┐
│ Layer 2: Application Gateway                        │
│  - Authentication: valid JWT required               │
│  - WAF: blocks known attack patterns                │
│  - Request validation: schema checked here          │
└─────────────────────┬───────────────────────────────┘
                      │ (even if attacker has a valid JWT, next layer applies)
                      ▼
┌─────────────────────────────────────────────────────┐
│ Layer 3: Service                                    │
│  - Authorization: does this user have permission?   │
│  - Input sanitization: re-validate at service level │
│  - Audit logging: record every action               │
└─────────────────────┬───────────────────────────────┘
                      │ (even if attacker reaches the service, next layer applies)
                      ▼
┌─────────────────────────────────────────────────────┐
│ Layer 4: Data                                       │
│  - Database user has minimum required permissions   │
│  - Row-level security where supported               │
│  - Encryption at rest for sensitive fields          │
└─────────────────────────────────────────────────────┘

Notice that each layer can independently stop an attack. An attacker who gets past the firewall still hits authentication. An attacker with a stolen JWT still hits service-level authorization. An attacker who compromises the service still faces a database user with limited permissions.

Common Mistake

Many teams implement authentication at the API gateway and then trust all requests inside the network perimeter. This means any compromised internal service — a low-sensitivity worker, a misconfigured container — can call any other internal service without restriction. The network perimeter is a single point of failure for your entire internal trust model.

The Distinction Between Authentication and Authorization

These two words are often used interchangeably. They are completely different things.

Authentication answers: Who are you? It verifies identity. A valid JWT proves that the token was issued by your auth system, not that the holder is allowed to do what they're trying to do.

Authorization answers: What are you allowed to do? It checks permissions. It happens after authentication and at every layer, not just at the edge.

The mistake is checking authentication at the gateway and then skipping authorization in the service because "the gateway already checked this." The gateway checked that the request is from a real user. It didn't check whether that specific user has permission to access that specific resource.
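The two checks can be made concrete in a few lines. In this sketch the gateway has already verified the JWT (authentication: we know who the caller is), and the service handler still checks permissions (authorization: is this caller allowed to do this?). The permission model and the user IDs are illustrative.

```python
# Authentication happened upstream at the gateway. Authorization
# happens HERE, in the service, against this specific action and
# resource type. The permission table is a toy stand-in.

PERMISSIONS = {  # user id -> set of (action, resource_type) grants
    "usr_alice": {("refund", "payment")},
    "usr_bob": {("read", "order")},
}

def handle_refund(authenticated_user_id: str, payment_id: str) -> str:
    # We already know WHO this is (the gateway verified the JWT).
    # Now check WHAT they may do: does this user hold the refund grant?
    if ("refund", "payment") not in PERMISSIONS.get(authenticated_user_id, set()):
        return "403 Forbidden"  # real, authenticated user; insufficient permission
    return f"refund issued for {payment_id}"
```

Note that "403 Forbidden" here is returned to a fully authenticated user. That is the whole point: a valid identity is not a permission.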

The Principle of Least Privilege at the System Level

Most engineers know about least privilege in the context of user accounts — don't give a user admin rights when they only need read access. The same principle applies to services, and it's even more important, because services operate continuously and automatically.

Service Permissions Are a Design Decision

Every service in your system has a set of things it needs to do. It might need to read from one database table, write to one other table, and publish to one message queue. That's its job.

The temptation — especially early in a project — is to give services broad permissions because it's easier. Give the service a database user with read/write access to the whole database. Give it an IAM role with broad S3 access. It's less work now. You'll "tighten it up later."

Later never comes.

The Cost of Overly Broad Permissions

If the order processing service has write access to the payments table and an attacker finds a SQL injection vulnerability in the order processing service, they can now directly manipulate payment records. The blast radius of the vulnerability just expanded from "the attacker can mess with orders" to "the attacker can mess with payments." Same bug, completely different consequence.

How to Actually Apply Least Privilege

The practical approach is to define permissions from the service's perspective, not from the database's perspective. Ask: "What does this service need to do?" and then create credentials that allow exactly that and nothing else.

-- Instead of one powerful user for all services:
-- CREATE USER app WITH PASSWORD '...';
-- GRANT ALL ON ALL TABLES TO app;

-- Create specific users per service:
CREATE USER order_service WITH PASSWORD '...';
GRANT SELECT, INSERT ON orders TO order_service;
GRANT SELECT ON customers TO order_service;
-- order_service cannot touch payments, cannot delete, cannot access other tables

CREATE USER fulfillment_worker WITH PASSWORD '...';
GRANT SELECT ON orders TO fulfillment_worker;
GRANT UPDATE (status, fulfilled_at) ON orders TO fulfillment_worker;
-- fulfillment_worker can only read orders and update two specific columns

This seems tedious. It is, a little. But the payoff is that when (not if) a service is compromised, the attacker's capability is bounded by that service's permissions, not by the permissions of the most powerful service in your system.

The Same Principle Applies to Cloud IAM

Cloud IAM roles are where least privilege breaks down most visibly in modern systems. It's common to see a service running with an IAM role that has s3:* on * (every action on every bucket) because someone needed to write to one specific bucket and didn't want to figure out the exact permission.

Often seen: s3:* on * (every action on every bucket).
Instead: s3:GetObject and s3:PutObject on arn:aws:s3:::my-specific-bucket/*.

Often seen: iam:* for a deployment role.
Instead: specific permissions to assume a specific role.

Often seen: one IAM role shared by all microservices.
Instead: one IAM role per service, with only what that service needs.

Often seen: admin permissions because "it's just internal".
Instead: read-only for services that only read, and no access to resources they don't touch.
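As a concrete artifact, the scoped alternative from the first row can be written as a standard AWS policy document. Here it is built as a Python dict (the shape matches AWS's policy JSON; the bucket name is a placeholder):

```python
# A least-privilege policy: two specific actions on one specific bucket.
# Contrast with the overly broad version, which would have
# "Action": "s3:*" and "Resource": "*".

scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-specific-bucket/*",
        }
    ],
}
```

Writing the narrow version takes a few minutes of reading the service's actual access pattern. That is the entire cost of the right-hand column.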

Secrets Management: The Problem With Environment Variables

Environment variables are the de facto standard for configuration in containerized systems. DATABASE_URL, API_KEY, JWT_SECRET. They feel secure because they're not hardcoded in source code. But they have a specific set of problems that make them inadequate for serious secrets management.

What Environment Variables Get Wrong

They don't rotate. An environment variable set at deploy time stays the same until the next deploy. If your database password is compromised, you need to redeploy every service that uses it. Real secrets management systems support rotation — the secret changes on a schedule, and the system automatically gets the new value.

They have no audit trail. Who read the database password? When? From which service? Environment variables have no concept of an access log. A secrets manager like Vault or AWS Secrets Manager records every access. You can answer "was this secret accessed between 2am and 4am on Tuesday?" which is exactly the question you need to answer during an incident.

They leak easily. Environment variables show up in crash dumps, in ps aux output, in some container orchestration logs, in debug endpoints, in bug reports. Developers print environment variables to debug configuration issues and forget to remove the log line. A real secrets management system controls precisely when and how a secret is exposed.

They have no fine-grained access control. If a service can read one environment variable, it can typically read all of them. A secrets manager lets you say "the analytics service can read the read-only database password but not the read-write database password."
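Two of these missing properties, per-secret access control and an audit trail on every read, can be shown with a toy in-process model. This is a sketch of the behavior a real secrets manager (Vault, AWS Secrets Manager) provides, not of any real API; real systems add rotation and expiry on top.

```python
# A toy model of what a secrets manager adds over environment variables:
# per-secret access control, and an audit record for EVERY read attempt,
# including denied ones.
import time

class ToySecretsManager:
    def __init__(self):
        self._secrets = {}    # secret name -> value
        self._acl = {}        # secret name -> set of allowed service identities
        self.audit_log = []   # (timestamp, service, secret name, outcome)

    def put(self, name, value, allowed_services):
        self._secrets[name] = value
        self._acl[name] = set(allowed_services)

    def get(self, service, name):
        allowed = service in self._acl.get(name, set())
        self.audit_log.append(
            (time.time(), service, name, "ok" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{service} may not read {name}")
        return self._secrets[name]
```

With environment variables, the "denied" branch and the audit log simply do not exist: any process in the environment reads any variable, silently.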

The Spectrum of Secrets Management

You don't have to go from environment variables to a full Vault deployment overnight. There's a spectrum, and the right place on it depends on your threat model and operational maturity.

Level 1 — Hardcoded (never do this)
────────────────────────────────────
api_key = "sk_live_abc123xyz"
  ← in source code, in git history forever

Level 2 — Environment variables
────────────────────────────────────
api_key = os.environ["API_KEY"]
  ← not in code, but no rotation/audit

Level 3 — Encrypted secrets in config store
────────────────────────────────────
api_key = config_service.get("api_key")
  ← encrypted at rest, centralized
    but still manual rotation, limited audit

Level 4 — Secrets manager with dynamic secrets
────────────────────────────────────
creds = vault.get_database_credentials()
  ← rotated automatically
    audit logs on every access
    fine-grained access control
    short-lived credentials expire if leaked

Level 5 — Short-lived credentials with workload identity
────────────────────────────────────
No secrets stored anywhere
Service proves its identity (SPIFFE/SPIRE or cloud IAM)
Gets credentials on-demand, credentials expire in minutes

Most companies should target Level 4. Level 5 is where you want to end up for the most sensitive credentials, but Level 4 solves 90% of the real problems. Level 2 is acceptable for low-sensitivity configuration (which log level to use, which feature flags are on) but not for credentials that provide access to data.

The Dynamic Credentials Insight

The most powerful feature of a real secrets management system is dynamic secrets. Instead of storing a database password, the system generates a fresh one on demand, grants it access, and revokes it after a short time window. If an attacker steals this credential, it's already expired. This is the difference between a secret that's dangerous if leaked and one that's useless if leaked.
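The expiry mechanics are simple enough to sketch directly. This is not how Vault or any specific system implements it, just the core idea: a credential minted on demand carries an expiry, so its useful life to a thief is bounded. The TTL and token format are illustrative.

```python
# The dynamic-secrets idea in miniature: fresh credential per request,
# with a built-in expiry checked on every use.
import secrets
import time

class DynamicCredential:
    def __init__(self, ttl_seconds: float):
        self.token = secrets.token_urlsafe(24)          # fresh, never reused
        self.expires_at = time.monotonic() + ttl_seconds

    def is_valid(self) -> bool:
        return time.monotonic() < self.expires_at

cred = DynamicCredential(ttl_seconds=0.05)
assert cred.is_valid()       # usable immediately by the legitimate service
time.sleep(0.06)
assert not cred.is_valid()   # worthless to an attacker who exfiltrates it later
```

Real systems pair this with server-side revocation: the database account behind the credential is actually deleted when the lease ends, so expiry is enforced by the backend, not just by the client checking a timestamp.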

The Git Repository Is Not a Secrets Store

This seems obvious but deserves its own section because it keeps happening. Secrets in git are permanent. Even if you delete the file and push the deletion, the secret exists in the git history. Anyone with access to the repository — now or in the future — can find it with git log -p.

If you ever commit a secret to a git repository, even momentarily, treat it as compromised. Revoke it and issue a new one. This is not paranoia. Automated secret scanners run continuously against public and private repositories. GitHub's own secret scanning has found millions of leaked credentials. The time between a push and a credential being found and abused is measured in minutes, not days.

Blast Radius: Security as an Architecture Decision

Here's a way to think about security that connects it directly to architectural decisions you're already making: when you design service boundaries, you are also designing the blast radius of a compromise.

Blast radius is the answer to the question: if this service is fully compromised, what can the attacker do?

The Three Dimensions of Blast Radius

Data access: What data can the attacker read or modify? A service with access to customer PII, payment information, and business metrics has a much larger blast radius than a service that can only access its own configuration data.

Action scope: What actions can the attacker take? Can they issue refunds? Can they provision infrastructure? Can they send emails to users? Can they escalate to other services?

Lateral movement: Can the attacker use this compromised service as a stepping stone to compromise other services? If the service has credentials for other services, the attacker now has a foothold to pivot.

The Lateral Movement Problem

Lateral movement is why credential sharing between services is so dangerous. If Service A has credentials to call Service B, and Service B has credentials to call Service C, a compromise of Service A is effectively a compromise of the entire chain. This is how attackers move from a low-sensitivity service to a high-sensitivity one — by following the credential graph.
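The credential graph can be made concrete: model which services hold credentials for which others, and compute what an attacker can reach from an initial compromise. The service names and edges below are illustrative.

```python
# Reachability over the credential graph: everything an attacker can
# reach from one compromised service by following stored credentials.
from collections import deque

credentials = {  # service -> services it holds credentials for
    "web-frontend": ["order-service"],
    "order-service": ["payments-service", "email-service"],
    "payments-service": [],
    "email-service": [],
    "analytics": [],
}

def blast_radius(compromised: str) -> set:
    """Breadth-first walk of the credential graph from one compromise."""
    reachable, queue = {compromised}, deque([compromised])
    while queue:
        svc = queue.popleft()
        for nxt in credentials.get(svc, []):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)
    return reachable
```

Running this over your real service inventory is a sobering exercise: a compromise of the public frontend here reaches payments in two hops, while the analytics service, holding no onward credentials, is an island.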

Designing Small Blast Radii

The goal is to make the blast radius of any single compromise as small as possible. This doesn't mean making every service tiny for its own sake — it means thinking explicitly about what each service needs access to and whether that access is truly necessary.

Ask these questions for each service:

What is the minimum set of database tables this service needs access to?
Does this service need to call every other service, or only specific ones?
Does this service need write access, or would read-only access be sufficient?
If this service is compromised, what is the worst case? Is that acceptable?
Does this service have credentials that would allow an attacker to reach a higher-sensitivity service?

The Security Properties Distributed Systems Break

Monolithic systems have their own security problems. But distributed systems introduce a specific set of properties that are harder to get right because the system has more moving parts, more trust boundaries, and more ways for an attacker to get in.

The Network Is Now Part of Your Attack Surface

In a monolith, service calls are function calls — they happen in process, in memory. There's nothing to intercept. In a distributed system, every service call crosses a network. That network can be eavesdropped, and messages can be intercepted or replayed.

This is why transport security (TLS) is not optional for internal service communication. It's not just about preventing eavesdropping — it's about preventing replay attacks and man-in-the-middle attacks on calls that your services trust implicitly.

The standard pattern is mutual TLS (mTLS), where both sides of a connection prove their identity. Chapter 35 goes deeper on this. The point here is: the moment services communicate over a network, even an internal one, they are communicating over an attack surface.

Configuration Is Now a Security Surface

In a monolith, configuration is usually a file on disk. In a distributed system, configuration comes from environment variables, config maps, service meshes, feature flag systems, and secrets managers. Each of these is a place where an attacker can potentially inject malicious configuration.

A misconfigured feature flag could enable a code path that bypasses authorization. A tampered config map could redirect service traffic to an attacker-controlled endpoint. Validate configuration at startup and treat configuration changes with the same caution as code changes.
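Startup validation for those two failure modes can be a handful of explicit checks. A sketch with hand-rolled rules (a schema library would work equally well); the config keys, the expected host, and the flag name are all illustrative.

```python
# Fail-fast configuration checks, run once at startup before the
# service accepts traffic. Returns problems instead of raising so
# all of them can be reported at once.
def validate_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config is acceptable."""
    problems = []
    url = cfg.get("shipping_api_url", "")
    if not url.startswith("https://"):
        problems.append("shipping_api_url must use https")
    # Catch a tampered config map redirecting traffic to an attacker host.
    if url and not url.startswith("https://api.shipping.example.com"):
        problems.append("shipping_api_url points at an unexpected host")
    # Catch a flag that silently enables an authorization-bypassing code path.
    if cfg.get("auth_bypass_enabled", False):
        problems.append("auth_bypass_enabled must never be true outside tests")
    return problems
```

A service that refuses to start on a bad config turns a silent security hole into a loud deploy failure, which is exactly the trade you want.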

Logging Is Now a Security Feature

In distributed systems, logs are often the only way to reconstruct what happened during an attack. This makes logging a security control, not just an operational one. But logs can also be a security liability if they contain sensitive data.

The tension is real: you want to log enough to understand what happened, but you don't want to log so much that the logs themselves become a valuable target.

A practical approach:

Do log: authentication events (success and failure), authorization decisions (especially denials), all data mutations with who made them, and errors with their context
Do not log: passwords or tokens (even hashed), full credit card numbers, social security numbers, full request bodies that may contain sensitive user data
Log carefully: user IDs (useful for debugging but PII), IP addresses (useful for security but may require consent in some jurisdictions), query parameters (may contain sensitive data)
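The "do not log" list is easiest to enforce mechanically, before anything reaches the log pipeline. A sketch of a redaction pass over structured log records; the set of sensitive key names is illustrative and should reflect your own data.

```python
# Scrub known-sensitive keys from a structured log record before it is
# emitted, so a debug line cannot leak credentials or card numbers.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "ssn"}

def redact(record: dict) -> dict:
    """Return a copy that is safe to log; nested dicts are scrubbed too."""
    out = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            out[key] = "[REDACTED]"
        elif isinstance(value, dict):
            out[key] = redact(value)
        else:
            out[key] = value
    return out
```

Key-based redaction is a backstop, not a guarantee: it misses sensitive values under unexpected keys, which is why the stronger rule above is to avoid logging full request bodies in the first place.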

Dependency Supply Chain Is a Distributed System Problem

Your system doesn't just include your code. It includes every library you depend on, and every dependency of those libraries. A supply chain attack — where an attacker compromises a popular open source library — affects every system that uses it.

This is not hypothetical. The Log4Shell vulnerability in 2021 affected millions of systems because it lived in Log4j, a ubiquitous Java logging library. The event-stream npm package compromise in 2018 was a targeted attack against a specific cryptocurrency wallet, injected through a dependency of a dependency of a dependency.

Mitigations that are worth doing:

Pin dependency versions in production — package-lock.json, poetry.lock, go.sum. Unpinned dependencies mean a new deploy could pick up a newly-compromised version.
Run automated dependency scanning (Dependabot, Snyk, OWASP Dependency-Check) and act on the alerts. Alerts that nobody acts on are just noise.
Have a process for emergency dependency updates. When a critical vulnerability is announced in a dependency you use, you need to be able to patch and deploy in hours, not weeks.
Be suspicious of new or unusual dependencies. A library you've never heard of that has one contributor and was published last month deserves more scrutiny than a library with ten years of history.

Input Validation: Where Most Attacks Actually Start

The majority of attacks against distributed systems — SQL injection, command injection, server-side request forgery, cross-site scripting — start with malicious input that the system processes without sufficient validation.

The rule is simple: validate at every trust boundary, not just at the edge.

The API gateway validated the input when it came in from the internet. But then that validated data gets put in a message queue. The message queue delivers it to a worker service. The worker service assumes the data is valid because "it was validated earlier." The worker doesn't validate. The worker has a bug.

The Validator Assumption Problem

Every service that assumes upstream validation has happened is a potential vulnerability. Upstream validation might have been bypassed, misconfigured, or simply not have covered your specific use case. Validation at a trust boundary is a defense-in-depth control. If you only do it once, you don't have depth.

The practical rule: any service that receives data from outside its own process should validate that data before using it. "Outside its own process" includes message queues, databases (data written by another service), HTTP calls, and configuration files.
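Applied to the queue scenario above, the fulfillment worker would re-validate every message at its own boundary rather than trusting that the gateway validated it. A sketch; the message schema and limits are illustrative.

```python
# The worker validates at ITS trust boundary, regardless of what any
# upstream component claims to have checked.
def validate_order_message(msg: dict) -> None:
    """Raise ValueError on anything the worker is not prepared to handle."""
    required = {"order_id": str, "address": str, "quantity": int}
    for field, ftype in required.items():
        if not isinstance(msg.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    if not (1 <= msg["quantity"] <= 1000):
        raise ValueError("quantity out of range")
    if len(msg["address"]) > 500:
        raise ValueError("address too long")
```

A few lines like this in every consumer cost almost nothing and mean that a validation gap in one service does not become a validation gap in all of them.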

Audit Logging: The Record That Matters After an Incident

When something goes wrong — a breach, a data leak, an unauthorized action — the first question is always "what happened?" An audit log is the record that lets you answer that question.

An audit log is different from an application log. An application log is for debugging and operations. An audit log is a legal and security record of who did what and when.

What makes a good audit log entry:

{
  "timestamp": "2024-01-15T14:32:07.123Z",  // precise, immutable
  "event_type": "payment.refund.issued",       // what happened
  "actor": {
    "type": "user",
    "id": "usr_7x8k2m",                        // who did it
    "ip": "203.0.113.42",                      // from where
    "session_id": "ses_9p3n4r"                 // which session
  },
  "resource": {
    "type": "payment",
    "id": "pay_2m9k4x",
    "amount": 4999                             // what was affected
  },
  "outcome": "success",                        // did it work
  "request_id": "req_4k2m8p"                  // for correlation
}

Audit logs have specific requirements that differ from regular logs:

They must be tamper-evident. If an attacker can modify the audit log after a breach, the log is worthless. Audit logs should be written to an append-only store that the compromised service cannot modify — a separate logging system, a write-only bucket, or a dedicated audit database.
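One common tamper-evidence technique is a hash chain: each entry commits to the previous one, so editing any past entry breaks verification of everything after it. A minimal sketch of the idea, not a production design; real systems additionally ship entries to a separate append-only store so the chain itself is out of the attacker's reach.

```python
# A hash-chained audit log: each record's hash covers both its own
# payload and the previous record's hash, so silent edits are detectable.
import hashlib
import json

GENESIS = "0" * 64  # stand-in hash for "before the first entry"

def append_entry(log: list, entry: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"entry": entry, "hash": digest, "prev": prev_hash})

def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for item in log:
        payload = json.dumps(item["entry"], sort_keys=True)
        if item["prev"] != prev_hash:
            return False
        if hashlib.sha256((prev_hash + payload).encode()).hexdigest() != item["hash"]:
            return False
        prev_hash = item["hash"]
    return True
```

Note what this gives and what it doesn't: it detects tampering after the fact, but it cannot prevent an attacker from truncating the tail of the log — which is why the append-only external store still matters.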

They must be retained for the right duration. Regulatory requirements vary, but security investigations often need to look back months. A 7-day log retention is fine for application debugging. For audit logs, the minimum is typically 90 days, and often 1-2 years.

They must be queryable. A massive log file that nobody can search is not useful during an incident. Audit logs should be stored in a system where you can answer "show me all actions taken by user X between these two timestamps" quickly.

The One Principle

Security is not a layer you add on top of a system — it is a property of how the system was designed. A system that was not designed with security as a constraint is not a secure system with security gaps; it is a fundamentally different system.

The Most Common Mistake

Treating the network perimeter as the only trust boundary, and then trusting all internal traffic unconditionally. Once an attacker is inside the network — via a compromised container, a misconfigured service, or a supply chain attack — they can call every internal service without restriction. The perimeter is not a wall; it is one layer of many.

Three Questions for Your Next Design Review
  1. Draw the trust boundaries on your architecture diagram. For each boundary, what prevents an attacker who controls one side from impersonating, tampering, or escalating on the other side?
  2. If your most exposed public-facing service is fully compromised right now, what is the maximum data the attacker can access and the maximum actions they can take? Is that blast radius acceptable?
  3. Where are your secrets stored, who has access to them, and when was the last time they were rotated? Could you rotate them all in under two hours if required?