Time Is a Lie in Distributed Systems
Every computer has a clock. Every clock drifts. And in a distributed system, two events that appear to happen at the "same time" may have nothing to do with each other — or everything to do with each other. The language we use to talk about time in distributed systems must be precise, because the wrong intuition here causes bugs that are nearly impossible to reproduce.
What's coming in this chapter
- Why you cannot trust physical clocks for event ordering, even with NTP synchronization
- What Lamport timestamps actually give you — and the crucial thing they don't
- How vector clocks capture causality and why that matters more than wall-clock order
- Hybrid Logical Clocks: the practical middle ground most databases use today
- Google Spanner's TrueTime and what it costs to build a globally consistent database on physical time
- A decision guide: which clock mechanism to use for which problem
Key Learnings — Read This If You're Short on Time
The essential takeaways from this chapter at a glance
Physical clocks lie. Every clock drifts — typically 10–200ms per day. NTP corrects this, but never perfectly, and corrections can cause the clock to jump backwards. Never use System.currentTimeMillis() to order events across machines.
The only reliable ordering is causality. "A happened before B" only means something if A could have influenced B — a message was sent, a variable was read, a file was written. Wall-clock order across machines is meaningless without additional coordination.
Lamport timestamps give you a total order, but not causality. If A→B then L(A) < L(B). But the reverse is not true — if L(A) < L(B), you cannot conclude A caused B. This is the most common misconception about Lamport clocks.
Vector clocks give you causality, but are expensive. A vector clock tells you definitively whether two events are causally related or concurrent. The cost is a vector of size N (one entry per node) that grows with your cluster.
Hybrid Logical Clocks (HLC) are the practical choice. They combine physical time (so timestamps look like real time to humans) with a logical counter (so they're monotonic and capture causality). CockroachDB, YugabyteDB, and MongoDB all use HLCs.
TrueTime is extraordinary — and expensive. Google builds atomic clocks and GPS receivers into every datacenter to bound clock uncertainty to <7ms. Then Spanner deliberately waits out that uncertainty before committing. You cannot replicate this without Google-scale infrastructure.
The practical rule: use logical clocks for ordering, physical clocks only for human-readable timestamps and TTLs. Never use physical time to determine which of two concurrent operations "wins."
The Problem With Clocks
Here's a scenario that seems simple. You have two servers, A and B. A user updates their profile on server A at 10:00:00.500, and the same user updates it on server B at 10:00:00.600. Which update should win?
The obvious answer is "the later one — the one from B." But here's the problem: how do you know it was later? You're comparing timestamps from two different clocks. Those clocks have never been perfectly synchronized. One of them might be running 200 milliseconds ahead of the other. The "later" timestamp might actually correspond to the earlier event in real time.
This is not a theoretical problem. It has caused real data loss in production systems at companies you've heard of.
Time in distributed systems is hard for three reasons:
First, clocks drift. A physical clock is a crystal oscillator that vibrates at a known frequency. But crystals are not perfect: temperature, age, and manufacturing variation all cause them to run slightly fast or slow. A typical server clock drifts 10 to 200 milliseconds per day. That sounds small, but in a system processing thousands of events per second, a 200-millisecond skew between two machines puts hundreds of events on the wrong side of any timestamp comparison.
Second, clock correction creates discontinuities. We use NTP (Network Time Protocol) to periodically correct clocks. But NTP correction doesn't smoothly nudge the clock — it can jump it forward or backward. If your clock jumps backward by 100ms while you're assigning timestamps to events, you'll generate timestamps that go back in time. Anything relying on "monotonically increasing timestamps" is now broken.
Third, there is no "global now." Special relativity tells us that simultaneity is relative: two events that are simultaneous in one reference frame are not necessarily simultaneous in another. You don't even need relativity to feel this in practice. A signal takes roughly 1ms to cover 300km at the speed of light, so two events happening "at the same time" in San Francisco and New York are fundamentally impossible to verify: by the time you compare the timestamps, the comparison itself has taken tens of milliseconds.
In a distributed system, a timestamp is an assertion made by the machine that generated it, using a clock that is only approximately correct. You are trusting that machine's clock, and that trust is often misplaced.
Physical Clocks and What Goes Wrong
Modern computers have two clocks. The first is a time-of-day clock (often called a wall clock). This is what you get when you call System.currentTimeMillis() or datetime.now(). It gives you a point in time — a Unix timestamp, milliseconds since 1970. It is subject to NTP corrections, which means it can jump forwards or backwards.
The second is a monotonic clock. This is what you get when you call System.nanoTime() in Java, or time.monotonic() in Python. A monotonic clock never goes backwards. It counts elapsed time since some arbitrary point (usually system boot). It is not subject to NTP adjustments. But it has no meaning across machines — two machines' monotonic clocks start at different boot times and have no relationship to each other.
Wall clock: gives you a real-world timestamp. Subject to NTP corrections — can go backwards. Use for: human-readable logs, TTL expiry, displaying times to users. Never use for: ordering events across machines.
Monotonic clock: always increases. Meaningful only within a single process on a single machine. Use for: measuring elapsed time within one process (e.g., request latency, timeouts). Never use for: comparing times across machines.
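To make the distinction concrete, here's a minimal Python sketch of both measurement styles. The standard library calls are real; do_work() is just a stand-in for whatever operation you're timing:

```python
import time

def do_work() -> None:
    """Stand-in for whatever operation is being timed."""
    time.sleep(0.05)

# Wrong for elapsed time: an NTP correction between the two reads can
# shrink, inflate, or even negate the result.
start_wall = time.time()
do_work()
elapsed_wall = time.time() - start_wall       # can be wrong, even negative

# Right for elapsed time: the monotonic clock never goes backwards
# and is immune to NTP adjustments.
start_mono = time.monotonic()
do_work()
elapsed_mono = time.monotonic() - start_mono  # always >= 0
```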
How NTP Works — and Where It Fails
NTP works by having your machine ask a time server: "what time is it?" The server sends back its current time, and your machine adjusts its clock accordingly. But this is complicated by the fact that the request took some time to travel to the server, and the response took some time to come back. NTP tries to account for network latency in its calculation, but it cannot do so perfectly — especially when the network is slow or asymmetric.
In a well-managed datacenter with a good NTP setup, you can get clock synchronization to within 1–10 milliseconds across machines. In a cloud environment, it's typically 5–50 milliseconds. Across datacenters in different regions, it can be 50–300 milliseconds. Across a poorly configured fleet, it can be seconds.
Machine A                                  NTP Server
    |                                          |
t1: request sent ────────────────────────────> t2: request received
    |                                          |
t4: response received <─────────────────────── t3: response sent
    |                                          |
(t1 and t4 are read on Machine A's clock; t2 and t3 on the server's.)
One-way latency estimate = ((t4 - t1) - (t3 - t2)) / 2
Clock offset estimate    = ((t2 - t1) + (t3 - t4)) / 2
Problem: this assumes symmetric network latency.
In practice, the request and response may take different paths.
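The two estimates translate directly into code. A sketch, with t1..t4 as in the diagram above:

```python
def ntp_estimates(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    # Round trip minus server processing time, halved: only correct
    # if both directions take equally long.
    one_way_latency = ((t4 - t1) - (t3 - t2)) / 2
    # Estimated amount the server's clock is ahead of the client's.
    clock_offset = ((t2 - t1) + (t3 - t4)) / 2
    return one_way_latency, clock_offset

# Symmetric network, server 5ms ahead, 10ms each way (times in ms):
print(ntp_estimates(0, 15, 16, 21))   # (10.0, 5.0): offset recovered exactly
# Asymmetric network (2ms out, 18ms back), same true 5ms offset:
print(ntp_estimates(0, 7, 8, 21))     # (10.0, -3.0): offset wrong by 8ms
```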
There's a subtler failure mode: even if NTP synchronizes your clock correctly right now, your clock will drift again before the next sync. Standard NTP clients poll their servers somewhere between every 64 seconds and every 17 minutes, depending on configuration. Between syncs, you're on your own. And during a network partition, when your machine can't reach its NTP server at all, your clock drifts unchecked for the duration of the partition.
A common production bug: two database replicas both accept writes during a split-brain scenario. When the partition heals, the system uses timestamps to determine which writes win. But if one replica's clock was drifted ahead by 200ms, the "newer" timestamp belongs to the stale data. Real writes from users are silently overwritten by older data with a higher timestamp.
This is why "last-write-wins" based on wall-clock timestamps is dangerous. The write that "wins" might not be the one that happened last in real life.
Logical Clocks — Ordering Without a Wall Clock
In 1978, Leslie Lamport published a paper called "Time, Clocks, and the Ordering of Events in a Distributed System." It's one of the most cited papers in computer science. The key insight was this: in a distributed system, you don't need to know the absolute time of an event. You only need to know the relative order of events — specifically, which events could have influenced which other events.
Lamport introduced the happens-before relation, written as A → B (read: "A happens before B"). A happens before B if any of the following are true:
1. A and B occur on the same process, and A occurs before B in that process's execution.
2. A is the sending of a message and B is the receiving of that same message.
3. There exists some event C such that A → C and C → B (transitivity).
If neither A → B nor B → A, then A and B are concurrent — they happened independently and could not have influenced each other.
Causality, not clock time, is what matters for ordering in distributed systems. "A happened before B" only means something if A could have caused B. If they're concurrent — if no message passed between them — then their relative order is undefined, and it doesn't matter which one you say happened first.
Lamport Timestamps
Lamport timestamps give each event a number that is consistent with the happens-before relation. The rules are simple:
Each process maintains a counter, starting at 0. Every time an event occurs on a process, the process increments its counter by 1 and assigns that number as the event's timestamp. Every time a process sends a message, it includes its current counter value in the message. Every time a process receives a message, it sets its counter to max(local_counter, received_counter) + 1.
Process A Process B Process C
L=1: event a1
L=2: send(B) ──────────────────────> L=3: recv from A
L=4: event b1
L=5: send(C) ───────────────> L=6: recv from B
L=3: event a2 L=7: event c1
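Those rules fit in a few lines of code. A minimal single-threaded sketch (a real implementation would need locking), replaying the start of the trace above:

```python
class LamportClock:
    def __init__(self) -> None:
        self.counter = 0

    def tick(self) -> int:
        """Local events and sends both increment the counter."""
        self.counter += 1
        return self.counter

    def receive(self, msg_timestamp: int) -> int:
        """On receipt, jump past the sender's timestamp."""
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
a.tick()            # a1      -> L=1
msg = a.tick()      # send(B) -> L=2
b.receive(msg)      # recv    -> L=3
b.tick()            # b1      -> L=4
a.tick()            # a2      -> L=3: concurrent with b1, yet L(a2) < L(b1)
```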
The guarantee that Lamport timestamps provide is: if A → B, then L(A) < L(B). If A causally precedes B, A's Lamport timestamp will be strictly less than B's.
But here's what most people get wrong: the converse is not true. If L(A) < L(B), you cannot conclude that A → B. It might be true. Or A and B might be concurrent events that happen to have been assigned different numbers. Lamport timestamps can tell you "A did not happen after B." They cannot tell you "A happened before B" with certainty.
Where Lamport Timestamps Are Used
Lamport timestamps are used in distributed databases to assign a total order to transactions. If two transactions have the same timestamp (entirely possible, since counters on different nodes advance independently), a tiebreaker rule such as comparing node IDs is applied. This gives you a consistent total order across the whole system without requiring a central clock.
They're also used in distributed mutual exclusion algorithms, where processes need to agree on whose turn it is to access a shared resource, and in leader election algorithms.
The limitation is that Lamport timestamps don't let you detect concurrent events. You can't tell from two Lamport timestamps alone whether the events were causally related or happened independently. For that, you need vector clocks.
Vector Clocks — Tracking Causality Precisely
A vector clock is an extension of the Lamport clock idea. Instead of a single counter, each process maintains a vector of counters — one per process in the system. This small change gives you something Lamport timestamps cannot: the ability to detect whether two events are causally related or concurrent.
The rules are similar to Lamport timestamps, but applied per-process in the vector:
Each process i maintains a vector V where V[j] represents the number of events process i knows about from process j. On every local event, process i increments V[i]. When sending a message, process i includes its current vector. When receiving a message with vector W, process i sets V[j] = max(V[j], W[j]) for all j, then increments V[i].
Process A Process B Process C
[A=1,B=0,C=0]: a1
[A=2,B=0,C=0]: send(B) ──> [A=2,B=1,C=0]: recv
[A=2,B=2,C=0]: b1
[A=2,B=3,C=0]: send(C) ──> [A=2,B=3,C=1]: recv
[A=3,B=0,C=0]: a2 [A=2,B=3,C=2]: c1
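A direct translation of those rules, sketched for a fixed cluster of n nodes indexed 0..n-1:

```python
class VectorClock:
    def __init__(self, node_id: int, n_nodes: int) -> None:
        self.node_id = node_id
        self.v = [0] * n_nodes

    def tick(self) -> list[int]:
        """Local events and sends both increment this node's own entry."""
        self.v[self.node_id] += 1
        return self.v.copy()

    def receive(self, w: list[int]) -> list[int]:
        """Merge entry-wise with the incoming vector, then tick."""
        self.v = [max(mine, theirs) for mine, theirs in zip(self.v, w)]
        return self.tick()

a, b = VectorClock(0, 3), VectorClock(1, 3)
a.tick()               # a1      -> [1, 0, 0]
msg = a.tick()         # send(B) -> [2, 0, 0]
print(b.receive(msg))  # recv    -> [2, 1, 0], matching the trace above
```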
Comparing Vector Clocks
Given two vector clocks V and W, you can determine their relationship precisely:
V = W: all entries are equal. The two timestamps belong to the same event; a correct implementation never assigns identical vectors to distinct events, because every event increments its own process's entry.
V < W (V happened before W): Every entry in V is ≤ the corresponding entry in W, and at least one entry is strictly less. Event V causally precedes event W.
V > W: The reverse — W causally precedes V.
Neither V ≤ W nor W ≤ V: The events are concurrent. Neither could have caused the other. This is the case Lamport timestamps cannot distinguish.
# Are these two vector clocks concurrent?
def concurrent(v, w):
    v_less = any(v[i] < w[i] for i in range(len(v)))
    w_less = any(w[i] < v[i] for i in range(len(v)))
    return v_less and w_less  # both have entries where the other is "ahead"

# Example: A=[2,1,0], B=[1,2,0] → concurrent (A has 2>1 in slot 0, B has 2>1 in slot 1)
# Example: A=[1,1,0], B=[2,2,0] → A happened before B (all entries of A ≤ B)
Real-World Use: Amazon Dynamo
Amazon's Dynamo paper (2007) is the canonical real-world use of vector clocks. Dynamo is a key-value store designed for high availability — it accepts writes even during network partitions, which means the same key can be updated on two different nodes at the same time, creating conflicting versions.
When a conflict is detected (two versions with concurrent vector clocks), Dynamo surfaces both versions to the application and asks it to reconcile. A shopping cart, for example, merges the two carts by taking the union of their items. This is a deliberate design choice: the application knows the right merge strategy for its data type; the database does not.
Vector clocks are the right tool when you need to detect concurrent writes to the same piece of data, and you have an application-level merge strategy. If you don't have a merge strategy (or "merge by taking the latest" is acceptable), you can often get away with something simpler.
The Cost of Vector Clocks
The downside of vector clocks is size. Each vector clock has one entry per node in the cluster. For a 3-node cluster, this is trivial — three integers. For a 1,000-node cluster, every event carries a vector of 1,000 integers. This overhead is attached to every message and every stored value.
In practice, most systems that use vector clocks either keep the cluster size small, use version vectors that track only the nodes that have actually touched a particular piece of data (see the sketch below), or use a different approach altogether, which brings us to Hybrid Logical Clocks.
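A version vector can be kept as a dict keyed by node ID, so its size tracks the number of writers rather than the cluster size. A sketch:

```python
def merge_version_vectors(v: dict[str, int], w: dict[str, int]) -> dict[str, int]:
    """Entry-wise max over the union of node IDs present in either vector."""
    return {node: max(v.get(node, 0), w.get(node, 0)) for node in v.keys() | w.keys()}

# In a 1,000-node cluster, a value touched by three nodes carries
# three entries, not 1,000:
print(merge_version_vectors({"node-a": 2, "node-b": 1}, {"node-b": 2, "node-q": 4}))
# {'node-a': 2, 'node-b': 2, 'node-q': 4}  (dict order may vary)
```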
Hybrid Logical Clocks (HLC)
Hybrid Logical Clocks were introduced by Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone in 2014. The idea is elegant: combine a physical clock (so events have real-world timestamps that humans can read) with a logical counter (so the clock is monotonic and captures causality). You get the benefits of both without the worst drawbacks of either.
An HLC timestamp is a pair (physical_time, logical_counter). The rules:
On a local event: if the wall clock is ahead of the current HLC, use the wall clock time and reset the counter to 0. Otherwise, keep the same physical time and increment the counter.
On receiving a message with HLC (pt, lc): set the physical time to max(local_wall_clock, pt, current_hlc_physical). Then set the counter: 0 if the wall clock alone supplied that maximum; max(local_counter, lc) + 1 if the local physical time and the message's pt are tied at the maximum; local_counter + 1 if the local physical time alone is the maximum; lc + 1 if the message's pt alone is the maximum.
Node A: wall=100ms
event a1 → HLC = (100, 0)
event a2 → HLC = (101, 0) ← wall clock advanced
send msg to B, include HLC=(101, 0)
Node B: wall=99ms (drifted behind)
recv msg from A carrying HLC=(101, 0); the receive itself is event b1
→ set physical = max(99, 101) = 101 (taken from the message)
→ counter = 0 + 1 = 1
event b1 → HLC = (101, 1) ← physical time from A, counter incremented
event b2 → HLC = (102, 0) ← wall clock has caught up past 101
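Here is a sketch of the update rules in Python, following the receive cases described above. wall_clock_ms() stands in for reading the local wall clock; a production implementation would also enforce a bound on how far the physical component may run ahead of it:

```python
import time

def wall_clock_ms() -> int:
    return int(time.time() * 1000)

class HLC:
    def __init__(self) -> None:
        self.pt = 0  # physical component, milliseconds
        self.lc = 0  # logical counter

    def local_event(self) -> tuple[int, int]:
        wall = wall_clock_ms()
        if wall > self.pt:
            self.pt, self.lc = wall, 0   # wall clock is ahead: adopt it
        else:
            self.lc += 1                 # wall clock behind: bump counter
        return (self.pt, self.lc)

    def receive(self, msg_pt: int, msg_lc: int) -> tuple[int, int]:
        wall = wall_clock_ms()
        new_pt = max(self.pt, msg_pt, wall)
        if new_pt == self.pt and new_pt == msg_pt:
            self.lc = max(self.lc, msg_lc) + 1   # tied: merge counters
        elif new_pt == self.pt:
            self.lc += 1                         # local physical time wins
        elif new_pt == msg_pt:
            self.lc = msg_lc + 1                 # message physical time wins
        else:
            self.lc = 0                          # wall clock alone advanced us
        self.pt = new_pt
        return (self.pt, self.lc)
```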
Why HLC Is the Practical Choice
HLC timestamps look like real timestamps to a human — you can read them and understand roughly when something happened. They are monotonic — they never go backwards. They capture causality — if A → B, then HLC(A) < HLC(B). And the timestamp size is constant — it's always just one physical timestamp and one small integer counter, regardless of cluster size.
This is why HLC has been adopted widely in production distributed databases:
| System | Clock Mechanism | Notes |
|---|---|---|
| CockroachDB | Hybrid Logical Clocks | HLC used for MVCC transaction timestamps |
| YugabyteDB | Hybrid Logical Clocks | Derived from CockroachDB's approach |
| MongoDB | Hybrid Logical Clocks | Added in MongoDB 3.6 for causal consistency sessions |
| Apache Cassandra | Wall clock + tiebreaker | Uses wall clock with last-write-wins — simpler but more vulnerable to clock skew |
| Google Spanner | TrueTime (bounded physical) | Purpose-built atomic clocks; extraordinary infrastructure cost |
HLC guarantees that the physical component of a timestamp stays within the cluster's maximum clock skew of real wall time. If you configure NTP well and your clock skew is bounded to, say, 250ms, then HLC timestamps are always within 250ms of real time. For most applications, this is more than good enough.
Google Spanner and TrueTime
Google Spanner is a globally distributed relational database that provides external consistency — the strongest possible consistency guarantee. If transaction T1 commits before transaction T2 starts (in real physical time), then T1's commit timestamp is smaller than T2's. This guarantee holds across the entire planet, across thousands of machines.
How? Google built TrueTime.
What TrueTime Is
TrueTime is a time API that, instead of returning a single timestamp, returns an interval: TT.now() = [earliest, latest]. The API guarantees that the true current time falls somewhere within this interval. The width of the interval — called the uncertainty window — is typically less than 7 milliseconds.
To achieve this, Google installs atomic clocks and GPS receivers in every datacenter. Atomic clocks are extremely accurate (drift of about 1 nanosecond per day) but expensive and require calibration. GPS receivers provide an absolute time signal from satellites. Each datacenter has multiple time servers cross-checking each other. Every machine queries these local time masters rather than reaching out to a remote NTP server.
Every Google Datacenter:
┌─────────────────────────────────────┐
│ GPS receiver ──┐ │
│ ├──> Time Master ──────────> TrueTime API
│ Atomic clock ──┘ │ returns [t-ε, t+ε]
└─────────────────────────────────────┘
ε (epsilon) ≈ 1–7ms
Machines query local Time Master,
not remote NTP servers.
Result: bounded uncertainty of < 7ms.
The Commit Wait
Having a bounded uncertainty window is necessary but not sufficient. Spanner uses TrueTime in a clever way to achieve external consistency: commit wait.
When a transaction is ready to commit, Spanner:
1. Calls TT.now() and gets back [t_earliest, t_latest]. It assigns the commit timestamp as s = t_latest.
2. Waits until TT.now().earliest > s. That is, it waits until it's absolutely certain that the real current time is past the commit timestamp.
3. Only then does it release the lock and allow the transaction to be visible.
The wait is at most the uncertainty window — typically 1–7ms. After this wait, Spanner knows for certain that the commit timestamp s is in the past in real physical time. Any transaction that starts after this point will get a timestamp that is provably later.
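In pseudocode the rule is tiny. The sketch below is an illustration, not Spanner's actual code: truetime_now() is a hypothetical stand-in that returns the (earliest, latest) interval, and release_locks is whatever callable makes the transaction visible:

```python
import time

def truetime_now() -> tuple[float, float]:
    """Hypothetical TrueTime stand-in: bounds on the true current time."""
    eps = 0.004                      # pretend uncertainty of ~4ms
    t = time.time()                  # a real TrueTime would not use the wall clock
    return (t - eps, t + eps)

def commit_with_wait(release_locks) -> float:
    _, latest = truetime_now()
    s = latest                       # 1. commit timestamp = latest possible "now"
    while truetime_now()[0] <= s:    # 2. wait until "now" is provably past s
        time.sleep(0.001)
    release_locks()                  # 3. only then make the transaction visible
    return s
```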
Every write transaction in Spanner deliberately waits for the uncertainty window before committing. This adds 1–7ms latency to every write, not as a side effect but as a design decision. Spanner trades a small, bounded latency penalty for a very strong correctness guarantee. This is only acceptable because 7ms is small enough for most use cases, and because Google needs external consistency for its own products (Ads, Shopping, etc.) where stale reads could cause serious financial inconsistencies.
Can You Use TrueTime Without Google's Infrastructure?
Not exactly. But you can approximate it. CockroachDB, for example, uses HLC with a configurable maximum clock offset (default 500ms). Instead of waiting out an uncertainty window, CockroachDB uses that offset when transactions have potentially overlapping timestamps — it either restarts the transaction or adds a short wait. The guarantee is slightly weaker than Spanner's external consistency, but strong enough for most applications.
AWS Time Sync Service, or chrony disciplined by a PPS (pulse-per-second) GPS signal, can get you to sub-millisecond typical accuracy, and Azure and GCP offer similar "precision time" services. These narrow the typical uncertainty dramatically. But commit wait needs a guaranteed bound on uncertainty, not just good typical accuracy, and guaranteeing a bound the way TrueTime does still requires dedicated hardware and careful engineering.
Practical Guidance — Which Clock for Which Problem
With all of this machinery explained, here's the decision guide that actually matters day-to-day.
| Problem | Right tool | Wrong tool |
|---|---|---|
| Display timestamps to users in a UI | Wall clock | — |
| TTL / cache expiry / session expiry | Wall clock (with tolerance for small skew) | — |
| Measure how long an operation took | Monotonic clock within one process | Wall clock (can jump) |
| Order events across multiple machines | Lamport timestamps or HLC | Wall clock (untrustworthy) |
| Detect concurrent writes to same key | Vector clocks | Lamport timestamps (can't detect concurrency) |
| Distributed database MVCC timestamps | HLC | Pure Lamport (no real-time meaning) |
| Global external consistency across regions | TrueTime (if you have the infrastructure) | NTP wall clock |
| "Last write wins" conflict resolution | HLC (with care) | Wall clock (clock skew = silent data loss) |
The Rules in Plain Language
If you're measuring time within one process — use a monotonic clock. It never goes backwards, it's accurate for elapsed time, and it's exactly what you need for timeouts and latency measurement.
If you're stamping events for human consumption — use a wall clock, but don't use those timestamps for ordering decisions. They're for humans to read.
If you need to order events across machines — use Lamport timestamps (for a total order) or HLC (for a total order with real-time proximity). Never use raw wall-clock timestamps across machines.
If you need to detect whether two events are concurrent — use vector clocks. This is the only mechanism that can definitively tell you "these two events happened independently."
If you're building a distributed database that needs MVCC — use HLC. It's what the production systems do.
Ask yourself: "If two machines' clocks disagree by 500ms right now, does my system give the wrong answer?" If yes, you're relying on physical clock synchronization for a correctness guarantee, and you should switch to a logical or hybrid clock mechanism. If no, you're probably fine with wall clocks for your current use case.
One More Thing: Fencing Tokens
There's a specific pattern worth calling out because it comes up whenever you use distributed locks. Say a process acquires a lock and then pauses — garbage collection, a long network call, something. By the time it resumes, the lock may have expired and another process may have taken over. The first process doesn't know this. It still thinks it holds the lock.
The solution is a fencing token — a monotonically increasing number that the lock service issues each time the lock is granted. Any write to the resource includes this token. The resource (a database, a file store) rejects writes with a token lower than the last seen token. So even if the old process resumes and tries to write, its stale token is rejected.
This is a logical clock in disguise — the lock service is acting as a Lamport clock for the specific resource. The fencing token is a Lamport timestamp issued by a single authority (the lock server), which makes it trustworthy in a way that distributed wall-clock timestamps are not.
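A sketch of the resource-side check, with an invented in-memory store standing in for the real database or file store:

```python
class FencedStore:
    """Rejects writes whose fencing token is older than the newest seen."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        self.highest_token: dict[str, int] = {}

    def write(self, key: str, value: str, token: int) -> bool:
        if token < self.highest_token.get(key, -1):
            return False                      # stale lock holder: reject
        self.highest_token[key] = token
        self.data[key] = value
        return True

# A process that paused while holding token 33 resumes after the lock
# service has already granted token 34 to another process:
store = FencedStore()
store.write("config", "fresh", token=34)   # True  -- accepted
store.write("config", "stale", token=33)   # False -- rejected, data preserved
```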
In a distributed system, use physical clocks only for human-readable output and TTLs — for anything that requires correct ordering or conflict detection, use logical or hybrid clocks, because physical clocks lie in ways that are subtle, intermittent, and disastrous.
The classic mistake: using wall-clock timestamps to implement "last write wins" across distributed nodes, then discovering (only in production, only under load, only during an incident) that clock skew has been silently overwriting newer data with older data for months.
Three Questions for Your Next Design Review
- If this service assigns timestamps to events, are those timestamps from a wall clock? If so, what happens if two machines' clocks differ by 500ms — do we get the wrong answer?
- If two clients update the same record simultaneously from different nodes, how does the system determine which update wins? Does that mechanism depend on clock synchronization, and have we bounded our maximum clock skew?
- If we're using timestamps for ordering in a distributed queue or log, are we using a logical clock (Lamport, HLC) or a physical clock? Can we demonstrate that the ordering is correct even if NTP is unavailable for 5 minutes?