Why bolt-on security always fails, and how to design systems where security is load-bearing from the start
Security is a constraint, not a feature. A feature is optional and addable later. A constraint shapes every decision from the start. Systems designed without security as a constraint are not "insecure" — they are fundamentally different systems that happen to look similar.
Threat modeling is just asking four questions: What are we building? What can go wrong? What are we going to do about it? Did we do a good job? You don't need a security team to do this — you need 90 minutes and the right questions.
Defense in depth means layers that independently block attacks, not multiple copies of the same control. A firewall and a WAF at the same layer is not defense in depth. An auth check at the API gateway AND in the service handler is.
The principle of least privilege is not about users. At the system level it means: each service should have exactly the permissions it needs to do its job, and no more. A read-only analytics service that can write to the payments table is a ticking clock.
Environment variables are not secrets management. They're slightly better than hardcoding. A real secrets management system has: rotation, audit logs, fine-grained access control, and automatic expiry. SECRET_KEY=abc123 in a .env file has none of these.
Blast radius is a design decision. When you design your service boundaries, you are also designing the blast radius of a compromise. Small, well-isolated services with narrow permissions limit what an attacker can do after they get in.
The attack surface of a distributed system is larger than a monolith, not smaller. Every network hop, every queue, every config file, every dependency is a potential entry point. Microservices do not improve your security posture by default — they often make it worse.
Here's a story that plays out at almost every company.
A team builds a new service. It's a data processing pipeline — nothing fancy. The service reads from a database, does some computation, writes results to a queue. The team is focused on performance and correctness. They ship it. It works. Six months later, a security engineer does an audit and finds that the service has full read/write access to every table in the database, the queue accepts messages from any source with no authentication, the service logs contain customer PII, and the service's API key is stored in a config file checked into the git repository.
None of this was malicious. Nobody made a decision to be insecure. Security just wasn't part of the design conversation. It was assumed to be "the security team's problem" or "something we'll deal with before GA."
The hard truth is: once a system is built this way, you can't fix it with a security patch. You have to redesign it. The service's access model is baked into its architecture. The logging format is used by three other systems that parse the logs. The API key rotation would require coordinated changes across six services.
Security added after the fact is expensive not because the security work is hard, but because it requires undoing architectural decisions that other parts of the system have built on top of. The later you add it, the more you're paying to undo.
This is what it means to say security is a constraint, not a feature. A feature can be added to a working system. A constraint shapes the design of the system before a single line of code is written.
Threat modeling sounds intimidating. It sounds like something that requires a team of security specialists, a special framework, and several weeks of workshops. It doesn't. At its core, threat modeling is just four questions asked in order.
1. What are we building?
Before you can think about what can go wrong, you need a clear picture of what you're building: a data flow diagram showing the components of your system and how data moves between them. Every box is an asset. Every arrow is a potential attack vector. You're looking for places where data crosses a trust boundary — from the internet to your network, from one service to another, from a service to a database.
Every arrow where data crosses a trust boundary is a place to think carefully: the internet-to-gateway boundary, the queue-to-worker boundary, the call to the third-party shipping API.
2. What can go wrong?
This is where most people think you need security expertise. You don't. You need a structured way to think about adversarial behavior. The most useful framework is called STRIDE. For each trust boundary in your diagram, ask whether an attacker could:
| Threat | What it means | Simple example |
|---|---|---|
| Spoofing | Pretend to be someone or something they're not | An attacker calls your internal order service pretending to be the API gateway |
| Tampering | Modify data in transit or at rest | A malicious queue message causes the fulfillment worker to ship to the wrong address |
| Repudiation | Deny performing an action when they did | A fraudulent order is placed and there's no audit trail to prove it happened |
| Information disclosure | Access data they shouldn't see | Error messages include stack traces that reveal database schema |
| Denial of service | Make the system unavailable | Flooding the queue with malformed messages halts the fulfillment worker |
| Elevation of privilege | Do more than they're authorized to do | A regular user somehow triggers an admin action by manipulating a request parameter |
You don't need to find every possible threat. You need to find the plausible ones. An attacker who can modify messages in your internal queue is far more plausible than an attacker who can physically intercept your datacenter cables. Focus on the former.
3. What are we going to do about it?
For each threat you found, you have four options: mitigate it (add a control that reduces the likelihood or impact), eliminate it (remove the feature or data flow that creates it), transfer it (push the risk to someone better placed to handle it, such as a payment processor or managed service), or accept it (decide the risk is tolerable and record that decision). Most threats get mitigated. The important part is that "accept" is an explicit, documented decision, not a silent default.
4. Did we do a good job?
This is the review step. Did you cover all the trust boundaries? Did any "accept" decisions get reviewed by the right people? Are the mitigations actually implemented and not just planned? The answer is never "yes" the first time; the question is a forcing function that makes you revisit the model as the system changes.
Run a threat model during the design phase, not after. A 90-minute session with the tech lead, one engineer, and a data flow diagram will surface 80% of the real risks. The goal is not to find every possible attack — it's to find the ones that would make headlines.
Defense in depth is one of the most misunderstood concepts in security. People often interpret it as "add more security controls." That's wrong. Defense in depth means having multiple independent layers where each layer can stop an attack that got past the previous one.
The word to focus on is independent. Two controls at the same layer that can both be defeated the same way are not defense in depth. A firewall and a WAF that both rely on IP-based rules form a single layer: an attacker who can spoof or rotate IPs defeats both simultaneously.
Consider a request flowing from an external user to a sensitive database. It passes through network controls at the firewall, then authentication at the API gateway, then authorization inside the service that owns the data, and finally a database user whose permissions are scoped to that service's job.
Notice that each layer can independently stop an attack. An attacker who gets past the firewall still hits authentication. An attacker with a stolen JWT still hits service-level authorization. An attacker who compromises the service still faces a database user with limited permissions.
Many teams implement authentication at the API gateway and then trust all requests inside the network perimeter. This means any compromised internal service — a low-sensitivity worker, a misconfigured container — can call any other internal service without restriction. The network perimeter is a single point of failure for your entire internal trust model.
Authentication and authorization are often used interchangeably. They are completely different things.
Authentication answers: Who are you? It verifies identity. A valid JWT proves that the token was issued by your auth system, not that the holder is allowed to do what they're trying to do.
Authorization answers: What are you allowed to do? It checks permissions. It happens after authentication and at every layer, not just at the edge.
The mistake is checking authentication at the gateway and then skipping authorization in the service because "the gateway already checked this." The gateway checked that the request is from a real user. It didn't check whether that specific user has permission to access that specific resource.
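To make that concrete, here is a minimal sketch of a service-level authorization check. The handler, the Order type, and the load_order helper are hypothetical; the point is that the service verifies the caller's permission on the specific resource even though the gateway has already validated the token.

```python
# Hypothetical service-side handler. The API gateway has already verified the
# JWT (authentication); the service still decides whether THIS caller may see
# THIS order (authorization).
from dataclasses import dataclass


@dataclass
class Order:
    id: str
    owner_id: str
    status: str


class Forbidden(Exception):
    """Raised when an authenticated caller lacks permission on a resource."""


def load_order(order_id: str) -> Order:
    # Stand-in for a real database lookup.
    return Order(id=order_id, owner_id="usr_7x8k2m", status="shipped")


def get_order(order_id: str, caller_id: str, caller_roles: set) -> Order:
    order = load_order(order_id)

    # "The gateway already checked the token" only proves the caller is a real
    # user. It says nothing about whether this order belongs to them.
    if order.owner_id != caller_id and "support_agent" not in caller_roles:
        raise Forbidden(f"caller {caller_id} may not read order {order_id}")

    return order
```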
Most engineers know about least privilege in the context of user accounts — don't give a user admin rights when they only need read access. The same principle applies to services, and it's even more important, because services operate continuously and automatically.
Every service in your system has a set of things it needs to do. It might need to read from one database table, write to one other table, and publish to one message queue. That's its job.
The temptation — especially early in a project — is to give services broad permissions because it's easier. Give the service a database user with read/write access to the whole database. Give it an IAM role with broad S3 access. It's less work now. You'll "tighten it up later."
Later never comes.
If the order processing service has write access to the payments table and an attacker finds a SQL injection vulnerability in the order processing service, they can now directly manipulate payment records. The blast radius of the vulnerability just expanded from "the attacker can mess with orders" to "the attacker can mess with payments." Same bug, completely different consequence.
The practical approach is to define permissions from the service's perspective, not from the database's perspective. Ask: "What does this service need to do?" and then create credentials that allow exactly that and nothing else.
```sql
-- Instead of one powerful user for all services:
-- CREATE USER app WITH PASSWORD '...';
-- GRANT ALL ON ALL TABLES TO app;

-- Create specific users per service:
CREATE USER order_service WITH PASSWORD '...';
GRANT SELECT, INSERT ON orders TO order_service;
GRANT SELECT ON customers TO order_service;
-- order_service cannot touch payments, cannot delete, cannot access other tables

CREATE USER fulfillment_worker WITH PASSWORD '...';
GRANT SELECT ON orders TO fulfillment_worker;
GRANT UPDATE (status, fulfilled_at) ON orders TO fulfillment_worker;
-- fulfillment_worker can only read orders and update two specific columns
```
This seems tedious. It is, a little. But the payoff is that when (not if) a service is compromised, the attacker's capability is bounded by that service's permissions, not by the permissions of the most powerful service in your system.
Cloud IAM roles are where least privilege breaks down most visibly in modern systems. It's common to see a service running with an IAM role that has s3:* on * (every action on every bucket) because someone needed to write to one specific bucket and didn't want to figure out the exact permission.
| What you often see | What you should have |
|---|---|
| s3:* on * | s3:GetObject, s3:PutObject on arn:aws:s3:::my-specific-bucket/* |
| iam:* for a deployment role | Specific permissions to assume a specific role |
| One IAM role for all microservices | One IAM role per service with only what that service needs |
| Admin permissions because "it's just internal" | Read-only for services that only read, no access to resources they don't touch |
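As a sketch of what the right-hand column looks like in practice, here is a narrowly scoped inline policy attached to one service's role with boto3. The role name, policy name, and bucket ARN are placeholders, and this assumes credentials that are allowed to manage IAM.

```python
# Sketch: attach a narrowly scoped inline policy to one service's IAM role.
# Role name, policy name, and bucket ARN are placeholders.
import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Only the two object-level actions the service actually performs...
            "Action": ["s3:GetObject", "s3:PutObject"],
            # ...and only on the one bucket it actually uses.
            "Resource": "arn:aws:s3:::my-specific-bucket/*",
        }
    ],
}

iam.put_role_policy(
    RoleName="order-service-role",
    PolicyName="order-service-s3-access",
    PolicyDocument=json.dumps(policy),
)
```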
Environment variables are the de facto standard for configuration in containerized systems. DATABASE_URL, API_KEY, JWT_SECRET. They feel secure because they're not hardcoded in source code. But they have a specific set of problems that make them inadequate for serious secrets management.
They don't rotate. An environment variable set at deploy time stays the same until the next deploy. If your database password is compromised, you need to redeploy every service that uses it. Real secrets management systems support rotation — the secret changes on a schedule, and the system automatically gets the new value.
They have no audit trail. Who read the database password? When? From which service? Environment variables have no concept of an access log. A secrets manager like Vault or AWS Secrets Manager records every access. You can answer "was this secret accessed between 2am and 4am on Tuesday?" which is exactly the question you need to answer during an incident.
They leak easily. Environment variables show up in crash dumps, in ps aux output, in some container orchestration logs, in debug endpoints, in bug reports. Developers print environment variables to debug configuration issues and forget to remove the log line. A real secrets management system controls precisely when and how a secret is exposed.
They have no fine-grained access control. If a service can read one environment variable, it can typically read all of them. A secrets manager lets you say "the analytics service can read the read-only database password but not the read-write database password."
You don't have to go from environment variables to a full Vault deployment overnight. There's a spectrum, and the right place on it depends on your threat model and operational maturity.
Most companies should target Level 4. Level 5 is where you want to end up for the most sensitive credentials, but Level 4 solves 90% of the real problems. Level 2 is acceptable for low-sensitivity configuration (which log level to use, which feature flags are on) but not for credentials that provide access to data.
The most powerful feature of a real secrets management system is dynamic secrets. Instead of storing a database password, the system generates a fresh one on demand, grants it access, and revokes it after a short time window. If an attacker steals this credential, it's already expired. This is the difference between a secret that's dangerous if leaked and one that's useless if leaked.
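Here is a minimal sketch of what requesting a dynamic credential looks like against HashiCorp Vault's HTTP API, assuming a database secrets engine mounted at database/ with a role named analytics-read; the address, token source, and role name are placeholders.

```python
# Sketch: fetch a short-lived database credential from Vault's database
# secrets engine. Assumes the engine is mounted at "database/" and a role
# named "analytics-read" exists; VAULT_ADDR and the token are placeholders.
import os

import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]    # e.g. https://vault.internal:8200
VAULT_TOKEN = os.environ["VAULT_TOKEN"]  # the service's own Vault identity

resp = requests.get(
    f"{VAULT_ADDR}/v1/database/creds/analytics-read",
    headers={"X-Vault-Token": VAULT_TOKEN},
    timeout=5,
)
resp.raise_for_status()
body = resp.json()

# Vault generates a fresh username/password pair and revokes it when the
# lease expires. Stealing this credential buys an attacker very little time.
username = body["data"]["username"]
password = body["data"]["password"]
lease_seconds = body["lease_duration"]

print(f"credential for {username} valid for {lease_seconds}s")
```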
This seems obvious but deserves its own section because it keeps happening. Secrets in git are permanent. Even if you delete the file and push the deletion, the secret exists in the git history. Anyone with access to the repository — now or in the future — can find it with git log -p.
If you ever commit a secret to a git repository, even momentarily, treat it as compromised. Revoke it and issue a new one. This is not paranoia. Automated secret scanners run continuously against public and private repositories. GitHub's own secret scanning has found millions of leaked credentials. The time between a push and a credential being found and abused is measured in minutes, not days.
Here's a way to think about security that connects it directly to architectural decisions you're already making: when you design service boundaries, you are also designing the blast radius of a compromise.
Blast radius is the answer to the question: if this service is fully compromised, what can the attacker do?
Data access: What data can the attacker read or modify? A service with access to customer PII, payment information, and business metrics has a much larger blast radius than a service that can only access its own configuration data.
Action scope: What actions can the attacker take? Can they issue refunds? Can they provision infrastructure? Can they send emails to users? Can they escalate to other services?
Lateral movement: Can the attacker use this compromised service as a stepping stone to compromise other services? If the service has credentials for other services, the attacker now has a foothold to pivot.
Lateral movement is why credential sharing between services is so dangerous. If Service A has credentials to call Service B, and Service B has credentials to call Service C, a compromise of Service A is effectively a compromise of the entire chain. This is how attackers move from a low-sensitivity service to a high-sensitivity one — by following the credential graph.
The goal is to make the blast radius of any single compromise as small as possible. This doesn't mean making every service tiny for its own sake — it means thinking explicitly about what each service needs access to and whether that access is truly necessary.
Ask these questions for each service: What data can it read or modify? What actions can it take against other systems? Which other services can it reach, and what credentials does it hold that would let an attacker pivot further?
Monolithic systems have their own security problems. But distributed systems introduce a specific set of properties that are harder to get right because the system has more moving parts, more trust boundaries, and more ways for an attacker to get in.
In a monolith, service calls are function calls — they happen in process, in memory. There's nothing to intercept. In a distributed system, every service call crosses a network. That network can be eavesdropped, and messages can be intercepted or replayed.
This is why transport security (TLS) is not optional for internal service communication. It's not just about preventing eavesdropping — it's about preventing replay attacks and man-in-the-middle attacks on calls that your services trust implicitly.
The standard pattern is mutual TLS (mTLS), where both sides of a connection prove their identity. Chapter 35 goes deeper on this. The point here is: the moment services communicate over a network, even an internal one, they are communicating over an attack surface.
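At the application level, requiring a client certificate is what mTLS boils down to. Here is a sketch using Python's standard library; the certificate paths are placeholders, and in most real deployments this lives in a service mesh or sidecar proxy rather than in application code.

```python
# Sketch: a server-side TLS context that requires a client certificate (mTLS).
# Certificate and key paths are placeholders.
import socket
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

# The server proves its own identity to callers...
context.load_cert_chain(certfile="order-service.crt", keyfile="order-service.key")

# ...and refuses connections from callers that cannot prove theirs.
context.verify_mode = ssl.CERT_REQUIRED
context.load_verify_locations(cafile="internal-ca.pem")

# A listening socket wrapped with this context only completes a handshake
# with peers holding a certificate signed by the internal CA.
with socket.create_server(("0.0.0.0", 8443)) as server:
    with context.wrap_socket(server, server_side=True) as tls_server:
        conn, addr = tls_server.accept()  # handshake fails for untrusted peers
```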
In a monolith, configuration is usually a file on disk. In a distributed system, configuration comes from environment variables, config maps, service meshes, feature flag systems, and secrets managers. Each of these is a place where an attacker can potentially inject malicious configuration.
A misconfigured feature flag could enable a code path that bypasses authorization. A tampered config map could redirect service traffic to an attacker-controlled endpoint. Validate configuration at startup and treat configuration changes with the same caution as code changes.
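A small sketch of what failing fast on configuration looks like; the setting names, the allowed host, and the flag are invented for illustration.

```python
# Sketch: validate configuration at startup and refuse to boot on anything
# suspicious, instead of discovering a bad value at request time.
# The setting names and allowed values are illustrative.
import os
import sys
from urllib.parse import urlparse

ALLOWED_PAYMENT_HOSTS = {"payments.internal.example.com"}


def validate_config() -> list:
    errors = []

    payment_url = os.environ.get("PAYMENT_SERVICE_URL", "")
    host = urlparse(payment_url).hostname
    if host not in ALLOWED_PAYMENT_HOSTS:
        # A tampered config map that points traffic at an attacker-controlled
        # endpoint should fail loudly here, not silently redirect requests.
        errors.append(f"PAYMENT_SERVICE_URL points at unexpected host: {host!r}")

    if os.environ.get("DISABLE_AUTHZ_CHECKS", "false").lower() == "true":
        # A flag like this should never be on outside local development.
        errors.append("DISABLE_AUTHZ_CHECKS is enabled")

    return errors


if __name__ == "__main__":
    problems = validate_config()
    if problems:
        for p in problems:
            print(f"config validation failed: {p}", file=sys.stderr)
        sys.exit(1)
```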
In distributed systems, logs are often the only way to reconstruct what happened during an attack. This makes logging a security control, not just an operational one. But logs can also be a security liability if they contain sensitive data.
The tension is real: you want to log enough to understand what happened, but you don't want to log so much that the logs themselves become a valuable target.
A practical approach: log event metadata and identifiers (request IDs, user IDs, order IDs) rather than full payloads, redact or omit fields that contain PII, and never log credentials or tokens. Treat access to the logs themselves as access to sensitive data.
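One way to enforce part of this mechanically is a redaction step in the logging pipeline. A sketch with the standard library follows; the field names being scrubbed are examples, not an exhaustive list.

```python
# Sketch: a logging filter that scrubs known-sensitive fields before any
# handler writes the record. The field names are examples.
import logging

SENSITIVE_KEYS = {"password", "authorization", "card_number", "ssn", "email"}


class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Fields passed via `extra=` become attributes on the record; scrub
        # the ones we know are risky before the record is formatted.
        for key in list(record.__dict__):
            if key.lower() in SENSITIVE_KEYS:
                record.__dict__[key] = "[REDACTED]"
        return True  # keep the record, just with the fields scrubbed


logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s order_id=%(order_id)s email=%(email)s"))
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Prints: order created order_id=ord_123 email=[REDACTED]
logger.info("order created", extra={"order_id": "ord_123", "email": "a@example.com"})
```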
Your system doesn't just include your code. It includes every library you depend on, and every dependency of those libraries. A supply chain attack — where an attacker compromises a popular open source library — affects every system that uses it.
This is not a hypothetical. The Log4Shell vulnerability in 2021 affected millions of systems because it sat in log4j, a ubiquitous Java logging library. The event-stream npm package compromise in 2018 was a targeted attack against a specific cryptocurrency wallet, injected through a dependency of a dependency of a dependency.
Mitigations that are worth doing:
Pin your dependency versions: package-lock.json, poetry.lock, go.sum. Unpinned dependencies mean a new deploy could pick up a newly-compromised version.

The majority of attacks against distributed systems — SQL injection, command injection, server-side request forgery, cross-site scripting — start with malicious input that the system processes without sufficient validation.
The rule is simple: validate at every trust boundary, not just at the edge.
The API gateway validated the input when it came in from the internet. But then that validated data gets put in a message queue. The message queue delivers it to a worker service. The worker service assumes the data is valid because "it was validated earlier." The worker doesn't validate. The worker has a bug.
Every service that assumes upstream validation has happened is a potential vulnerability. Upstream validation might have been bypassed, misconfigured, or simply not have covered your specific use case. Validation at a trust boundary is a defense-in-depth control. If you only do it once, you don't have depth.
The practical rule: any service that receives data from outside its own process should validate that data before using it. "Outside its own process" includes message queues, databases (data written by another service), HTTP calls, and configuration files.
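Here is a sketch of what that looks like in a worker that consumes from a queue. The message shape, field names, and limits are invented for illustration; the point is that the worker validates before acting, rather than trusting that upstream already did.

```python
# Sketch: a fulfillment worker validating a queue message before acting on it.
# The expected fields and limits are illustrative.
import json


class InvalidMessage(Exception):
    pass


def parse_fulfillment_message(raw: bytes) -> dict:
    try:
        msg = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise InvalidMessage(f"not valid JSON: {exc}") from exc

    if not isinstance(msg, dict):
        raise InvalidMessage("message must be a JSON object")

    order_id = msg.get("order_id")
    if not isinstance(order_id, str) or not order_id.startswith("ord_"):
        raise InvalidMessage("order_id missing or malformed")

    quantity = msg.get("quantity")
    if not isinstance(quantity, int) or not 1 <= quantity <= 100:
        raise InvalidMessage("quantity out of range")

    address = msg.get("shipping_address")
    if not isinstance(address, str) or not 0 < len(address) <= 500:
        raise InvalidMessage("shipping_address missing or too long")

    # Only the validated fields pass through; unexpected keys are dropped.
    return {"order_id": order_id, "quantity": quantity, "shipping_address": address}


def ship(order: dict) -> None:
    print(f"shipping order {order['order_id']}")  # stand-in for the real fulfillment step


def handle(raw: bytes) -> None:
    try:
        order = parse_fulfillment_message(raw)
    except InvalidMessage:
        return  # park the message on a dead-letter queue in a real worker
    ship(order)
```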
When something goes wrong — a breach, a data leak, an unauthorized action — the first question is always "what happened?" An audit log is the record that lets you answer that question.
An audit log is different from an application log. An application log is for debugging and operations. An audit log is a legal and security record of who did what and when.
What makes a good audit log entry:
```json
{
  "timestamp": "2024-01-15T14:32:07.123Z",  // precise, immutable
  "event_type": "payment.refund.issued",    // what happened
  "actor": {
    "type": "user",
    "id": "usr_7x8k2m",                     // who did it
    "ip": "203.0.113.42",                   // from where
    "session_id": "ses_9p3n4r"              // which session
  },
  "resource": {
    "type": "payment",
    "id": "pay_2m9k4x",
    "amount": 4999                          // what was affected
  },
  "outcome": "success",                     // did it work
  "request_id": "req_4k2m8p"                // for correlation
}
```
Audit logs have specific requirements that differ from regular logs:
They must be tamper-evident. If an attacker can modify the audit log after a breach, the log is worthless. Audit logs should be written to an append-only store that the compromised service cannot modify — a separate logging system, a write-only bucket, or a dedicated audit database.
They must be retained for the right duration. Regulatory requirements vary, but security investigations often need to look back months. A 7-day log retention is fine for application debugging. For audit logs, the minimum is typically 90 days, and often 1-2 years.
They must be queryable. A massive log file that nobody can search is not useful during an incident. Audit logs should be stored in a system where you can answer "show me all actions taken by user X between these two timestamps" quickly.
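To make the tamper-evidence requirement concrete, here is a sketch of an audit writer that ships each event to a store the emitting service cannot rewrite, in this case an S3 bucket where the service's role only has s3:PutObject. The bucket name and key scheme are placeholders, and the append-only guarantees (bucket policy, versioning, object lock) live in the bucket's configuration, not in this code.

```python
# Sketch: write each audit event as a separate object to a bucket the service
# can append to but never modify or delete. Bucket name and key scheme are
# placeholders; the write-only policy is enforced by IAM, not by this code.
import datetime
import json
import uuid

import boto3

s3 = boto3.client("s3")
AUDIT_BUCKET = "acme-audit-log"  # placeholder


def emit_audit_event(event_type: str, actor_id: str, resource_id: str, outcome: str) -> None:
    event = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event_type": event_type,
        "actor": {"type": "service", "id": actor_id},
        "resource": {"id": resource_id},
        "outcome": outcome,
    }
    # One object per event: the service can add records but never rewrite history.
    key = f"{event['timestamp'][:10]}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=AUDIT_BUCKET, Key=key, Body=json.dumps(event).encode())


emit_audit_event("payment.refund.issued", "svc_refund_worker", "pay_2m9k4x", "success")
```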
Security is not a layer you add on top of a system — it is a property of how the system was designed. A system that was not designed with security as a constraint is not a secure system with security gaps; it is a fundamentally different system.
The most common failure mode is treating the network perimeter as the only trust boundary and then trusting all internal traffic unconditionally. Once an attacker is inside the network — via a compromised container, a misconfigured service, or a supply chain attack — they can call every internal service without restriction. The perimeter is not a wall; it is one layer of many.