Appendix D

Glossary of Precise Terms

Distributed systems has a vocabulary problem. The same words mean different things to different people, and imprecise language leads to imprecise systems. This glossary defines terms as they are used in this book — and in many cases, explains where common usage diverges from the precise technical meaning.

Why Precision Matters

"Consistency" alone is used to mean at least four different things in distributed systems literature. When a team argues about whether their system is "consistent," they may be using the same word to mean different things — and designing past each other without realizing it. The goal of this glossary is to give you terms precise enough to end those arguments.

A
Availability (also: high availability, HA) [Reliability]
The fraction of time a system is operational and able to serve requests correctly. Expressed as a percentage (99.9%, 99.99%) or in terms of allowed downtime per period.
Precise use: In CAP theorem, availability means "every request receives a response" — it says nothing about the response being correct or up-to-date. In operational SLOs, availability typically means "correct responses within a defined latency budget." These are different things. Specify which you mean.
ACID [Databases]
Atomicity, Consistency, Isolation, Durability — the four properties that database transactions are traditionally said to guarantee. Each letter describes a different guarantee: Atomicity (all-or-nothing), Consistency (valid state transitions only), Isolation (concurrent transactions behave as if sequential), Durability (committed data survives crashes).
Watch out: The "C" in ACID (consistency) means application-level invariants are preserved. This is completely different from the "C" in CAP (consistency), which refers to linearizability. These two uses of the same word in the same field cause enormous confusion.
Anti-entropy [Replication]
A background process that detects and repairs divergence between replicas by comparing data checksums (typically via Merkle trees) and copying missing or differing values. Used in systems like Dynamo/Cassandra to ensure eventual consistency even after network partitions.
B
Backpressure [Flow Control]
A mechanism by which a downstream system signals to an upstream system that it is overloaded, causing the upstream to slow down or stop sending data. The correct response to overload — as opposed to dropping messages silently or crashing.
Common mistake: Most systems implement backpressure too late in the design, after the overload failure mode has already been experienced in production. Designing for backpressure from the start means every queue has a maximum size, and producers handle the "queue full" response explicitly.
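A minimal sketch of that discipline in Python (not from the book's text): a bounded in-process queue whose "full" signal the producer handles explicitly. The queue size and backoff numbers are illustrative assumptions.

import queue
import time

# A bounded queue: when it is full, put_nowait() raises queue.Full,
# which is the backpressure signal the producer must handle.
work_queue = queue.Queue(maxsize=100)  # illustrative limit

def produce(item, max_retries=3):
    # Try to enqueue; on backpressure, back off and retry, then fail loudly.
    for attempt in range(max_retries):
        try:
            work_queue.put_nowait(item)
            return True
        except queue.Full:
            # Overloaded downstream: slow down instead of dropping silently.
            time.sleep(0.05 * (2 ** attempt))
    # Still full: surface the overload to the caller rather than lose the item.
    return False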
Byzantine fault (also: Byzantine failure) [Failures]
A failure mode where a node behaves arbitrarily — it may send different values to different peers, lie about its state, or act in ways designed to subvert the protocol. Named after the Byzantine Generals Problem. More severe than crash faults or omission faults.
Most enterprise distributed systems do not design for Byzantine faults — they assume nodes are honest but may crash. Byzantine fault tolerance is primarily relevant in adversarial environments (blockchain networks, systems where nodes may be under attacker control).
C
CAP Theorem [Fundamentals]
A theorem (proved by Gilbert and Lynch, 2002) stating that a distributed system cannot simultaneously guarantee all three of: Consistency (linearizability), Availability (every request gets a response), and Partition tolerance (the system continues operating despite network partitions).
Commonly misused: CAP is not a design menu — you cannot simply "choose CP or AP." Network partitions happen in any real system, so you must tolerate them. The real choice is: during a partition, do you prioritize consistency (refuse requests that might return stale data) or availability (serve potentially stale data rather than refuse)? Also: CAP says nothing about latency, which is often the more practical constraint. See PACELC.
Causal consistency [Consistency Models]
A consistency model that preserves causally-related operations in order — if operation A happened before operation B (in the happens-before sense), then every process that sees B must also have seen A. Concurrent operations (with no causal relationship) may be seen in any order.
Causal consistency is weaker than linearizability but stronger than eventual consistency. It is the strongest consistency model that can be provided without sacrificing availability during partitions, making it the practical sweet spot for many distributed databases.
Circuit breaker [Resilience]
A design pattern that prevents a service from repeatedly calling a dependency that is failing. Like an electrical circuit breaker, it has three states: Closed (requests flow normally), Open (requests fail immediately without attempting the call), and Half-Open (a probe request is tried to see if the dependency has recovered).
A circuit breaker is a state machine, not just a flag. The transition from Open to Half-Open after a timeout, and the criteria for transitioning back to Closed, are as important as the initial trip condition.
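A minimal sketch of that state machine in Python; the failure threshold and reset timeout are illustrative assumptions, and a production implementation would also need thread safety and richer trip criteria.

import time

class CircuitBreaker:
    # States: "closed" (normal), "open" (fail fast), "half_open" (probe).
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds in Open before probing
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"            # allow a single probe request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"                 # trip, or re-trip after a failed probe
                self.opened_at = time.time()
            raise
        else:
            self.failures = 0
            self.state = "closed"                   # successful call or successful probe
            return result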
Consensus [Distributed Algorithms]
The problem of getting a group of nodes to agree on a single value, even when some nodes may fail. A consensus algorithm must satisfy: Agreement (all non-faulty nodes decide the same value), Validity (the decided value was proposed by some node), and Termination (all non-faulty nodes eventually decide).
The FLP impossibility result (1985) proves that no deterministic consensus algorithm can guarantee termination in an asynchronous system if even one node can fail. In practice, algorithms like Raft solve this by assuming partial synchrony (messages eventually arrive) rather than full asynchrony.
CQRS (Command Query Responsibility Segregation) [Architecture]
An architectural pattern that separates the data model used for writes (commands) from the data model used for reads (queries). Allows each to be independently optimized — the write model for consistency and integrity, the read model for query performance.
CQRS adds operational complexity (two models to maintain, eventual consistency between them). It is worth this cost when read and write patterns are genuinely incompatible — not as a default pattern for all services.
D
Durability [Databases]
The guarantee that once a transaction is committed, it will survive system failures. In practice, this means data is written to non-volatile storage (disk) before the commit is acknowledged to the caller.
Durability has a cost: fsync() is expensive. Many databases offer ways to trade durability for performance (async replication, write-behind). Understanding what durability guarantee you actually need — and what your database actually provides — is critical for data-loss risk analysis.
E
Error budget [SRE / Reliability]
The acceptable amount of unreliability remaining in a given period, derived from an SLO. If the SLO is 99.9% availability, the error budget is 0.1% — approximately 43 minutes of downtime per month. When the budget is consumed, reliability work takes priority over feature work.
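The arithmetic behind that figure, as a quick check (assuming a 30-day month):

slo = 0.999                             # 99.9% availability target
minutes_per_month = 30 * 24 * 60        # 43,200 minutes in a 30-day month
error_budget = (1 - slo) * minutes_per_month
print(round(error_budget, 1))           # 43.2 minutes of allowed downtime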
Event sourcing [Architecture]
An approach to persistence where the system stores a sequence of events (things that happened) rather than the current state. Current state is derived by replaying events. The event log is the source of truth; state is a derived projection.
Eventual consistency [Consistency Models]
A consistency model where replicas are not guaranteed to be consistent at any given moment, but will eventually converge to the same value if no new updates are made. The weakest consistency model that still provides a useful guarantee.
Commonly overstated: "Eventually consistent" does not mean "will converge quickly" or "will converge in a predictable time." It only means convergence happens if updates stop. In practice, eventually consistent systems need careful design of conflict resolution (what happens when two replicas accept concurrent updates to the same key?) — "last write wins" is a common default that silently discards data.
F
Fan-out [Architecture]
The process of propagating one write to many destinations. In social systems, a single post might be fanned out to the feeds of all followers. Fan-out amplifies write QPS: one input write can become thousands of output writes.
Fan-out on write (pre-compute feeds at write time) vs. fan-out on read (compute feeds at read time) is a fundamental trade-off in social feed design. Write-time fan-out favors read performance; read-time fan-out is cheaper for high-follower-count users.
H
Happens-before (→ notation) [Distributed Theory]
A partial ordering relationship between events in a distributed system, introduced by Lamport. Event A happens-before event B (written A → B) if: A and B are on the same process and A occurred before B, or A is the sending of a message and B is the receiving of that message, or there exists C such that A → C and C → B (transitivity).
Happens-before is a partial order, not a total order. Two events may be concurrent (neither happens-before the other), meaning no causal relationship exists between them. This is the foundation for vector clocks and causal consistency.
Hot spot (also: hot partition, hot key) [Scalability]
A partition, shard, or key that receives disproportionately more traffic than others, becoming a bottleneck that limits overall system throughput even if other partitions have spare capacity.
I
Idempotency [Correctness]
A property of an operation where applying it multiple times produces the same result as applying it once. An idempotent operation can be safely retried without risk of unintended side effects.
Exactly-once semantics in messaging systems is not the same as idempotency. Exactly-once delivery means the broker guarantees the message is delivered exactly once. Idempotent consumers mean the consumer can handle duplicates safely. Both are needed for end-to-end exactly-once processing — and the consumer-side idempotency is usually the more practical guarantee to design for.
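A minimal sketch of an idempotent consumer in Python, assuming every message carries a unique id; the in-memory set and the apply_side_effect function are illustrative placeholders, and a real consumer would persist the seen-id record atomically with the side effect.

processed_ids = set()   # illustrative; persist this durably in practice

def apply_side_effect(message):
    # Hypothetical business logic for the example.
    print("processing", message["id"])

def handle(message):
    # Apply a message's effect at most once, even if it is delivered many times.
    msg_id = message["id"]
    if msg_id in processed_ids:
        return                      # duplicate delivery: safe no-op
    apply_side_effect(message)
    processed_ids.add(msg_id)

handle({"id": "order-42"})
handle({"id": "order-42"})          # redelivery is harmless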
Isolation (transaction isolation) [Databases]
The degree to which concurrent transactions are hidden from each other. The SQL standard defines four isolation levels: Read Uncommitted, Read Committed, Repeatable Read, and Serializable — from weakest to strongest guarantee.
Most databases default to Read Committed, not Serializable. This allows anomalies like non-repeatable reads and phantom reads. Serializable isolation is the only level that eliminates all concurrency anomalies, but it has a performance cost. Understand your database's default and whether it is sufficient for your application's correctness requirements.
L
Linearizability (also: atomic consistency, strong consistency) [Consistency Models]
The strongest single-object consistency model. A system is linearizable if every operation appears to take effect instantaneously at some point between its invocation and completion, and the result is consistent with a single global ordering of all operations.
Not the same as serializability. Linearizability applies to single-object operations and requires real-time ordering. Serializability applies to multi-object transactions and only requires that results are equivalent to some serial order — not necessarily the real-time order. Strict serializability combines both.
Log-structured merge-tree (LSM tree) [Storage Engines]
A storage engine design that accumulates writes in memory (memtable), flushes them to immutable sorted files (SSTables) on disk, and periodically merges these files (compaction). Optimized for high write throughput at the cost of read amplification and periodic compaction overhead.
M
Monotonic reads [Consistency Models]
A session guarantee that ensures if a client has read a value at time T, any subsequent reads will not return a value from before T. Prevents the disorienting experience of "time going backwards" when a client's requests hit different replicas at different replication states.
mTLS (mutual TLS) [Security]
A TLS configuration where both the client and server present certificates and authenticate each other, as opposed to standard TLS where only the server is authenticated. Used for service-to-service authentication in zero-trust network architectures.
O
Observability [Operations]
The ability to understand the internal state of a system from its external outputs. A system is highly observable if you can diagnose novel failures — failures you have never seen before — from its telemetry alone, without needing to add new instrumentation first.
Different from monitoring. Monitoring answers known questions ("is the error rate above 1%?"). Observability enables unknown questions ("why is this specific user's requests taking 10× longer than everyone else's?"). Structured logs, distributed traces, and high-cardinality metrics are the building blocks of observability.
P
PACELC [Fundamentals]
An extension of CAP proposed by Daniel Abadi. States that in the case of a network Partition, choose between Availability and Consistency (as in CAP); Else (when no partition), choose between Latency and Consistency. Recognizes that latency, not just partition tolerance, is a fundamental trade-off dimension.
Partition (network partition) [Failures]
A failure scenario where two or more subsets of nodes in a distributed system cannot communicate with each other, even though nodes within each subset can communicate internally. The system is split into isolated islands.
Post-mortem (also: incident review, retrospective) [Operations]
A structured analysis of an incident conducted after it is resolved, aimed at understanding what happened, why it happened, and what systemic changes will prevent recurrence or reduce impact. In a blameless culture, the focus is on systemic causes rather than individual actions.
Q
Queue (message queue, work queue) [Architecture]
A data structure or service that holds messages until they are consumed. Used to decouple producers from consumers, absorb traffic spikes, and enable asynchronous processing. Key properties: ordering guarantees (FIFO vs. best-effort), delivery guarantees (at-most-once, at-least-once, exactly-once), and retention policy.
Quorum [Distributed Algorithms]
In a system with N replicas, a quorum is a subset of replicas that must agree for an operation to be considered complete. With a write quorum W and a read quorum R, every read is guaranteed to overlap the most recent successful write if W + R > N (the write set and read set must share at least one replica).
Quorum does not guarantee strong consistency by itself. Sloppy quorums (allowing writes to any available nodes, not just the designated set) improve availability but can cause reads and writes to go to disjoint sets, violating the W + R > N guarantee.
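A minimal sketch of the overlap rule in Python; the version numbers stand in for whatever mechanism the system uses to order writes, and reading the first R replicas is an illustrative simplification.

def quorum_overlap(n, w, r):
    # True if every read quorum intersects every write quorum.
    return w + r > n

def quorum_read(replicas, r):
    # Read from r replicas and return the value with the highest version.
    responses = replicas[:r]        # in practice: any r reachable replicas
    return max(responses, key=lambda rep: rep["version"])["value"]

# Example: N=3, W=2, R=2 -> any two readers overlap any two writers.
replicas = [{"version": 7, "value": "old"},
            {"version": 8, "value": "new"},
            {"version": 8, "value": "new"}]
assert quorum_overlap(3, 2, 2)
print(quorum_read(replicas, 2))     # "new"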
R
Read-your-writes (also: read-your-own-writes) [Consistency Models]
A session guarantee that ensures a client will always see the results of its own previous writes. Prevents the confusing experience where a user submits data and then immediately queries for it, only to find it absent.
Replication lag [Replication]
The delay between when a write is committed on the primary/leader and when it is applied on a replica. In asynchronous replication, lag can range from milliseconds to minutes. Under load or during network issues, it can grow significantly.
Replication lag is not a bug — it is the expected behavior of asynchronous replication. It becomes a problem when application code assumes replicas are always current. Reading from a replica that is 30 seconds behind in a financial system can produce incorrect results that are hard to detect.
S
Saga [Distributed Transactions]
A pattern for managing long-running transactions across multiple services without distributed locking. A saga is a sequence of local transactions where each step publishes an event or message that triggers the next step. If a step fails, compensating transactions undo the previous steps.
Sagas provide ACD properties (no Isolation) — concurrent sagas can interfere with each other. This is acceptable for many business workflows but requires careful design of compensating transactions, which must be idempotent and cannot always perfectly reverse business state.
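A minimal sketch of a saga executor in Python: run local steps in order and, on failure, run the compensations for the already-completed steps in reverse. The order-workflow step names are illustrative assumptions.

def run_saga(steps):
    # steps: list of (action, compensation) callables. Returns True on success.
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):
                comp()              # compensations should themselves be idempotent
            return False
    return True

# Illustrative workflow: reserve inventory, charge card, schedule shipment.
steps = [
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge card"),       lambda: print("refund card")),
    (lambda: print("schedule shipment"), lambda: print("cancel shipment")),
]
run_saga(steps)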
Serializability [Databases]
The strongest multi-object transaction isolation level. A schedule of concurrent transactions is serializable if the outcome is equivalent to some serial (one-at-a-time) execution of those transactions. Does not require real-time ordering (see: Strict Serializability).
SLA / SLO / SLI [Reliability]
SLI (Service Level Indicator): A specific metric measuring service behavior (e.g., request success rate, p99 latency).
SLO (Service Level Objective): A target for an SLI (e.g., "99.9% of requests must complete in under 200ms").
SLA (Service Level Agreement): A contractual commitment, typically to customers, with financial penalties for violation. An SLA is usually a weaker commitment than internal SLOs — teams set SLOs tighter than SLAs to have a buffer.
Split-brain [Failures]
A failure scenario where a network partition causes two or more nodes to simultaneously believe they are the primary/leader, leading to independent and potentially conflicting writes on each partition. One of the most dangerous failure modes in leader-based replication systems.
T
Tail latency (p99, p999 latency) [Performance]
The latency experienced by the slowest requests in a distribution — typically measured at the 99th or 99.9th percentile. p99 latency of 500ms means 1% of requests take 500ms or longer. Tail latency is more important than average latency for user-facing systems because it affects the slowest (often most important) requests.
In a service that makes 100 parallel downstream calls, each at 1ms avg / 100ms p99, the end-to-end p99 is dominated by the slowest call: with 100 independent calls, roughly 63% of requests (1 - 0.99^100) include at least one call that lands in its slowest 1%. This is the "tail at scale" problem — optimizing average latency does not help when tail latency is the bottleneck.
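The calculation behind that figure, assuming the 100 downstream calls are independent:

# Chance that at least one of 100 independent calls falls in its slowest 1%.
p_hit_tail = 1 - 0.99 ** 100
print(round(p_hit_tail, 2))   # ~0.63: roughly two requests in three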
Toil [Operations / SRE]
Work that is manual, repetitive, automatable, tactical, and produces no lasting value beyond getting you to the next iteration. Defined precisely in Google's SRE framework to distinguish it from overhead (inherent management cost) and engineering work (building value).
Two-phase commit (2PC) [Distributed Transactions]
A distributed commit protocol that coordinates atomic commitment across multiple nodes. Phase 1 (Prepare): the coordinator asks all participants if they can commit. Phase 2 (Commit/Abort): if all say yes, the coordinator sends commit; if any say no, it sends abort. Guarantees atomicity but blocks if the coordinator fails mid-protocol.
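A minimal sketch of the coordinator's side in Python; the participant objects, their prepare/commit/abort methods, and the list standing in for a durable coordinator log are all illustrative assumptions.

def two_phase_commit(coordinator_log, participants):
    # Phase 1 (Prepare): ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]   # each returns True or False
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append(decision)              # must be durable before phase 2
    # Phase 2 (Commit/Abort): broadcast the decision to every participant.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision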
V
Vector clock [Distributed Theory]
A mechanism for tracking causality in distributed systems. Each node maintains a vector of counters, one per node. On each event, the local counter increments. On message send, the current vector is included. On message receive, the receiver takes the element-wise maximum of its vector and the received vector, then increments its own counter.
Vector clocks allow you to determine whether two events are causally related or concurrent — something logical (Lamport) clocks cannot do. The downside is that the vector grows with the number of nodes, which can become expensive at large cluster sizes. Version vectors are a related concept used in databases like Riak.
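A minimal sketch of those rules in Python, with clocks as dictionaries keyed by node id; the comparison function is what distinguishes a causal relationship from concurrency.

def local_event(clock, node):
    clock[node] = clock.get(node, 0) + 1

def on_send(clock, node):
    local_event(clock, node)                 # sending is itself an event
    return dict(clock)                       # attach a copy to the message

def on_receive(clock, node, received):
    for k, v in received.items():
        clock[k] = max(clock.get(k, 0), v)   # element-wise maximum
    clock[node] = clock.get(node, 0) + 1     # then count the receive event

def happened_before(a, b):
    # True if the event with clock a causally precedes the event with clock b.
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)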
W
Write amplification [Storage Engines]
The ratio of data written to storage vs. data logically written by the application. An LSM tree with compaction may write each byte of user data 10–30× due to repeated merging and rewriting of SSTables. High write amplification reduces SSD lifespan and limits write throughput.
Write-ahead log (WAL) [Storage Engines]
A technique for durability where every change is first written to an append-only log on disk before being applied to the main data structure. On crash recovery, the log is replayed to reconstruct the most recent state. Used by virtually every durable database (PostgreSQL, MySQL, SQLite).
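A minimal sketch of the append-then-apply discipline in Python, assuming a tiny key-value store with JSON-encoded log records; a real engine would also checksum records and truncate the log at checkpoints.

import json
import os

class TinyKV:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._recover()

    def set(self, key, value):
        record = json.dumps({"key": key, "value": value})
        with open(self.wal_path, "a") as wal:
            wal.write(record + "\n")
            wal.flush()
            os.fsync(wal.fileno())    # durability point: the record is on disk
        self.data[key] = value        # apply to in-memory state only after logging

    def _recover(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:          # replay the log to rebuild the latest state
                record = json.loads(line)
                self.data[record["key"]] = record["value"]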