Part VIII — Data Systems: Where Everything Gets Hard

Storage Engine Internals
(What You Need to Know)

The storage engine is the most permanent decision in a data-heavy system. You can swap your API framework in a weekend. You cannot swap your storage engine once you have a hundred terabytes of data and a thousand clients depending on it. This chapter gives you the mental model to make that decision well.

Key Learnings — Read this if you're short on time

  1. B-trees update data in place; LSM-trees never overwrite, only append
    This single difference drives almost every other trade-off. B-trees are read-optimized. LSM-trees are write-optimized. Both are correct choices depending on your workload.
  2. Every storage engine fights the same three-way battle: write amplification, read amplification, space amplification
    You cannot minimize all three simultaneously. B-trees trade write amplification for low read amplification. LSM-trees trade read amplification (mitigated by bloom filters) for low write amplification. Understand which one your workload cares about most.
  3. Compaction is not background housekeeping — it is load-bearing infrastructure
    When compaction falls behind in an LSM-tree, read performance degrades, space explodes, and eventually writes stall. Monitoring compaction lag is as important as monitoring query latency.
  4. Column-oriented storage is not "better" — it's a different access pattern
    Row stores are fast when you need all columns of a few rows (OLTP: "get customer #12345"). Column stores are fast when you need a few columns across millions of rows (OLAP: "sum all revenue this quarter"). Using the wrong one for your workload makes queries 10–100x slower.
  5. The storage engine decision is nearly irreversible once you have data
    Migrating a storage engine in production is a multi-quarter project with high risk. Treat this decision with the same weight as choosing your primary database technology — because it is that decision.
  6. RocksDB is the new LEGO brick of storage
    More systems embed RocksDB as their storage layer than any other engine today. TiKV, MyRocks, CockroachDB, Kafka Streams' state stores, LinkedIn's Espresso — knowing RocksDB's behavior explains the behavior of all of them.

Why care about storage engines at all?

Most engineers interact with databases through queries. You write SELECT * FROM orders WHERE customer_id = 42 and data comes back. The engine is invisible. That's the point — good abstractions are invisible.

But the abstraction leaks the moment you have to make architectural decisions. Which database should we use? Can this system handle 100,000 writes per second? Why are our read latencies spiking even though CPU is fine? Why is disk usage growing faster than our data size?

These questions cannot be answered without understanding what the storage engine is doing underneath. Not the full implementation — but the core data structures and the trade-offs they encode.

There are really two dominant approaches to building a storage engine: B-trees and Log-Structured Merge trees (LSM-trees). PostgreSQL, MySQL, SQLite, and Oracle all use B-trees. Cassandra, RocksDB, LevelDB, HBase, and InfluxDB all use LSM-trees. Understanding why each was built tells you when to use each.

B-Trees: The 50-Year-Old Design That's Still Everywhere

B-trees were invented in 1970 by Rudolf Bayer and Ed McCreight. They've been the dominant storage engine design ever since. That's not inertia — it's because the design is genuinely excellent for a wide class of workloads.

The core idea: pages

A B-tree stores data in fixed-size pages, typically 4KB to 16KB (PostgreSQL uses 8KB pages, InnoDB 16KB). Each page is a self-contained unit that can be read from or written to disk independently. This matches how operating systems and SSDs work — they also operate in page-sized units. So one logical read from a B-tree is often one physical disk read.

Pages are organized in a tree. At the top is the root page. Below it are branch pages, and at the bottom are leaf pages which contain the actual data (or pointers to it). Every page contains a range of keys and pointers to child pages.

              ┌─────────────────────────────┐
              │         Root Page            │
              │  keys: [100]  [500]  [900]   │
              └──────┬────────┬────────┬─────┘
                     │        │        │
           ┌─────────┘  ┌─────┘  └────────────┐
           ▼            ▼                     ▼
    ┌──────────┐  ┌──────────┐           ┌──────────┐
    │  Branch  │  │  Branch  │    ...    │  Branch  │
    │ [10,50,  │  │[200,350, │           │[600,700, │
    │  80,99]  │  │ 420,490] │           │ 800,850] │
    └────┬─────┘  └────┬─────┘           └────┬─────┘
         │              │                      │
         ▼              ▼                      ▼
    ┌──────────┐  ┌──────────┐           ┌──────────┐
    │  Leaf    │  │  Leaf    │    ...    │  Leaf    │
    │ (actual  │  │ (actual  │           │ (actual  │
    │  data)   │  │  data)   │           │  data)   │
    └──────────┘  └──────────┘           └──────────┘

  Lookup for key 350: read root → read branch → read leaf = 3 disk reads
  For a tree with branching factor 500 and 3 levels: can hold 500³ = 125 million keys

To look up a key, you start at the root, compare your key against the keys in the page, follow the right pointer down, and repeat until you reach a leaf page. For a tree with a branching factor of 500 (each page has up to 500 child pointers), a 3-level tree can hold up to 125 million keys. A 4-level tree: 62 billion. And all of them are found in at most 4 disk reads.
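
The arithmetic is worth internalizing. Here is a quick back-of-the-envelope sketch (plain Python; the branching factor of 500 is the assumption carried over from the example above):

import math

def capacity(levels, branching_factor=500):
    """Maximum number of keys a tree of this height can address."""
    return branching_factor ** levels

def levels_needed(num_keys, branching_factor=500):
    """Minimum tree height (= worst-case page reads) to index num_keys."""
    return max(1, math.ceil(math.log(num_keys, branching_factor)))

print(capacity(3))                   # 125,000,000 keys in 3 page reads
print(capacity(4))                   # 62,500,000,000 keys in 4 page reads
print(levels_needed(1_000_000_000))  # 4 levels for a billion keys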

How writes work in a B-tree

When you write a key, you find the leaf page where it belongs and modify that page in place. Overwrite the old value with the new one. Write the page back to disk.

This is the core characteristic of B-trees: in-place updates. The data lives in one place and you go update it directly.

But there's a problem: what if the leaf page is full? You need to split it into two pages and update the parent to point to both. If the parent is also full, you split that too — and so on, possibly all the way to the root. Most writes are simple single-page updates, but the worst case is a cascade of splits.

There's another problem: what if the system crashes midway through a write? You've partially updated a page. The database is now corrupted. This is solved with a Write-Ahead Log (WAL) — before touching any page, you write your intended change to an append-only log file. If the system crashes, you replay the log on startup. The WAL is itself an append-only file, so writes to it are sequential and fast.
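
A minimal sketch of the ordering that makes this work — log the intent durably, then touch the page. The file format and names here are illustrative, not any real engine's:

import json, os

class WriteAheadLog:
    def __init__(self, path="wal.log"):
        self.f = open(path, "a+b")

    def log(self, key, value):
        record = json.dumps({"key": key, "value": value}).encode() + b"\n"
        self.f.write(record)       # 1. append the intended change to the log
        self.f.flush()
        os.fsync(self.f.fileno())  # 2. force it to durable storage
        # 3. only now is it safe to modify the B-tree page in place

    def replay(self):
        """On startup after a crash: re-apply every logged change."""
        self.f.seek(0)
        return [json.loads(line) for line in self.f]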

Key Insight

Every B-tree write actually causes at least two writes: one to the WAL, one to the tree page itself. For updates that cause page splits, it could be many more. This is called write amplification — one logical write turns into multiple physical writes. We'll come back to this concept in detail.

Why B-trees are fast for reads

Because data lives in exactly one place. To find a key, you do a tree traversal — a small, bounded number of page reads. The data is where you expect it to be. Range scans (find all users aged 25–30) are efficient too, because keys are stored in sorted order within pages and pages at the leaf level are often linked as a doubly-linked list.

The predictability of B-tree reads is a huge practical advantage. A lookup always takes O(log n) with a very small constant. There are no "slow paths." This is why PostgreSQL and MySQL feel reliable under mixed workloads — the read performance is deterministic.

LSM-Trees: What Happens When You Stop Updating In Place

In 1996, Patrick O'Neil and colleagues published a paper proposing a completely different approach: what if you never overwrote data? What if every write was an append?

The intuition: sequential writes to disk are dramatically faster than random writes. On a spinning disk, random writes require the disk head to physically move — this is slow. Sequential writes just stream data one byte after another. Even on SSDs, sequential writes are faster because they avoid the write amplification baked into flash memory's erase-before-write cycle.

The LSM-tree builds on this insight by making all writes sequential. Here's how:

The write path: MemTable → SSTable

When you write a key-value pair, it goes into an in-memory sorted data structure called a MemTable (usually a red-black tree or skip list). Writes here are extremely fast — it's just a memory operation.

Each write is also appended to a WAL (the same crash-recovery idea as in B-trees). Unlike a B-tree, though, the WAL is the only disk write on the write path — there is no page to update in place — so the entire write path stays sequential.

When the MemTable gets big enough (typically 64MB–256MB), it's flushed to disk as an SSTable — a Sorted String Table. An SSTable is an immutable, sorted, compact file. Once written, it is never modified. This flush is a large sequential write — fast.

Write path:

  Write(k, v)
      │
      ├─► WAL (append-only log, for crash recovery)
      │
      └─► MemTable (in-memory sorted tree)
               │
               │  (when full, typically 64MB–256MB)
               ▼
          Flush to disk
               │
               ▼
        SSTable L0  (immutable, sorted file on disk)
        SSTable L0  (another one, slightly older)
        SSTable L0  (older still...)

               │
               │  (background compaction)
               ▼
        SSTable L1  (merged, larger, still sorted)
        SSTable L2  (merged again, even larger)
        SSTable L3  ...
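
A toy version of that write path, using an in-memory stand-in for SSTables and a count-based flush threshold instead of 64MB–256MB of bytes:

class TinyLSM:
    MEMTABLE_LIMIT = 4                 # real engines flush on bytes, not key count

    def __init__(self):
        self.memtable = {}             # real engines keep this sorted (skip list / RB-tree)
        self.sstables = []             # newest first; each is an immutable sorted list

    def put(self, key, value):
        # a real engine appends to the WAL here before touching the memtable
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # one large sequential write of an immutable, sorted file
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}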

The read path: check multiple places

Reading from an LSM-tree is more complex than a B-tree. When you look up a key, you need to check:

  1. The current MemTable (most recent writes)
  2. Any immutable MemTables being flushed
  3. L0 SSTables (newest on disk)
  4. L1 SSTables
  5. L2, L3... and so on down the levels

You stop as soon as you find the key (at the first, most-recent level that has it). But in the worst case, you check every level before concluding the key doesn't exist.

This is the core read disadvantage of LSM-trees compared to B-trees. The data isn't in one predictable place — it might be in any of several files across multiple levels.

The solution to expensive reads is bloom filters. Each SSTable has an associated bloom filter — a probabilistic data structure that can tell you with certainty "this key is definitely NOT in this file" (no false negatives) but occasionally says "this key might be in this file" when it isn't (small false positive rate). With bloom filters, most lookups skip irrelevant SSTables entirely. This makes point lookups fast in practice, even though the theoretical worst case is slow.
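
A sketch of that read path. A Python set stands in for the bloom filter here — a real filter uses a few bits per key and can return false positives, never false negatives:

def lsm_get(key, memtable, levels):
    """levels: SSTables newest to oldest, each as a (bloom_filter, table) pair."""
    if key in memtable:                 # 1. the most recent writes live in memory
        return memtable[key]
    for bloom, table in levels:         # 2. walk the levels, newest first
        if key not in bloom:
            continue                    #    "definitely not here" — skip the file entirely
        if key in table:                # 3. only now pay for reading the file
            return table[key]
    return None                         # checked every level: the key does not exist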

Real World

RocksDB (Facebook's LSM-tree engine, forked from Google's LevelDB) uses bloom filters aggressively. A point lookup in RocksDB with bloom filters enabled typically reads from only 1–2 SSTables, not every level. The filter block is stored inside each SSTable and cached in memory, and it is very small relative to the data it covers — a few bits per key.

The delete problem

You can't delete a key by removing it — the SSTable is immutable. Instead, you write a special record called a tombstone. When you read and encounter a tombstone, you know the key was deleted. The actual data gets cleaned up during compaction.

This creates a subtle but important behavior: if you delete a key and immediately try to read it back, you'll correctly see it as deleted. But the space it occupied on disk isn't freed until compaction runs. On a heavily write-then-delete workload, you can accumulate a lot of dead data before compaction reclaims it.
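
A sketch of the tombstone mechanic — deletion is just another write, and readers interpret it:

TOMBSTONE = object()                    # sentinel stored in place of a removed key

def delete(memtable, key):
    memtable[key] = TOMBSTONE           # nothing is removed; a marker is written

def get(key, memtable, sstables):
    for source in (memtable, *sstables):                   # newest first
        if key in source:
            value = source[key]
            return None if value is TOMBSTONE else value   # deleted: stop looking
    return None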

Compaction: The Engine of LSM-Tree Health

Compaction is the background process that merges SSTables together. It is not optional cleanup — it is what makes LSM-trees work. Without it, you'd accumulate thousands of SSTables, reads would require checking all of them, and your disk would fill with obsolete data and tombstones.

The basic idea: take several SSTables from a level, merge-sort them (they're already sorted, so this is fast), write the result as a new SSTable to the next level, then delete the input files. Because you're reading from and writing to different files, this doesn't interrupt ongoing reads and writes — it happens truly in the background.
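
A sketch of that merge, assuming each SSTable is a sorted list of (key, value) pairs and the input tables are ordered newest-first, with the TOMBSTONE sentinel from the previous sketch:

import heapq

TOMBSTONE = object()                       # deletion marker (see the tombstone sketch)

def compact(sstables, is_bottom_level=False):
    seen, merged = set(), []
    # Tag entries with their table's age so the k-way merge yields newer versions first.
    tagged = [[(k, age, v) for k, v in table] for age, table in enumerate(sstables)]
    for key, age, value in heapq.merge(*tagged):
        if key in seen:
            continue                       # an older version of a key we already kept
        seen.add(key)
        if is_bottom_level and value is TOMBSTONE:
            continue                       # nothing older can exist below: reclaim it
        merged.append((key, value))
    return merged                          # one larger, still-sorted SSTable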

Compaction strategies

There are two main compaction strategies, and choosing between them matters for your workload:

Size-tiered compaction (Cassandra's default): When you accumulate N SSTables of roughly the same size, merge them into one larger SSTable. Simple, great for write-heavy workloads. The downside: during compaction, you temporarily have the old small files and the new large file on disk at the same time. This doubles the disk usage during compaction and means space amplification can be high.

Leveled compaction (LevelDB and RocksDB's default): Keep data in levels where each level is roughly 10x larger than the previous. L1 might be 10MB, L2 100MB, L3 1GB. Each key can only exist in one SSTable per level (except L0). This means reads are much more predictable — you check at most one SSTable per level. But it requires more compaction work (higher write amplification) to maintain the invariant.

  Strategy      Write Amplification   Read Performance              Space During Compaction     Best For
  Size-tiered   Lower                 Worse (more files to check)   Higher (2x during merge)    Write-heavy workloads, time-series
  Leveled       Higher                Better (bounded per level)    Lower (~10% overhead)       Mixed read-write, key-value lookups

When compaction falls behind — and why it's an emergency

Compaction runs in the background using CPU and disk I/O. If your write rate is high enough that new SSTables are created faster than compaction can merge them, you have a problem.

The failure mode is gradual but accelerating. First, read latency starts climbing as queries must check more and more SSTable files. Then, space usage spikes because dead keys and tombstones aren't being cleaned up. Then, at some point, RocksDB hits a threshold and stalls writes entirely — it refuses new writes until compaction catches up. Your service is now down.

This is not a hypothetical. It has caused production outages at multiple large companies. The insidious part is that it looks fine until it doesn't. Write throughput looks normal, latency looks normal, and then suddenly everything stops.

Production Risk

If you run an LSM-tree-based system (Cassandra, RocksDB, any system embedding RocksDB), you must monitor compaction pending bytes and L0 SSTable count. In RocksDB, rocksdb.estimate-pending-compaction-bytes is the number you want on a dashboard. A steadily growing value is your early warning signal — you have time to act. A sudden spike means you're already in trouble.
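
As an illustration, here is what polling those two signals might look like through the python-rocksdb bindings — an assumption on my part; most systems surface these counters through their own metrics endpoints, and the alert thresholds below are made up:

import rocksdb   # hypothetical setup: in practice you would scrape these via metrics

db = rocksdb.DB("events.db", rocksdb.Options(create_if_missing=True))

pending = int(db.get_property(b"rocksdb.estimate-pending-compaction-bytes"))
l0_files = int(db.get_property(b"rocksdb.num-files-at-level0"))

# The trend matters more than any single value; these thresholds are illustrative.
if pending > 64 * 1024**3 or l0_files > 20:
    print(f"compaction falling behind: pending={pending} bytes, L0 files={l0_files}")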

Compaction and the write cliff

There's a related phenomenon called the write cliff. When you first set up an LSM-tree system, writes are fast because you're filling an empty MemTable and flushing fresh SSTables. But as the database fills up, compaction has more work to do with each flush. Write throughput gradually drops as the system matures. Systems that benchmarked beautifully at 50GB can struggle at 500GB.

This is not a bug — it's the physics of the data structure. But if you do your capacity planning only with a freshly loaded database, you'll be unpleasantly surprised in production. Always benchmark with a realistic data volume.

The Storage Trilemma: Read, Write, Space Amplification

We've been talking about trade-offs informally. Let's make them precise. Every storage engine makes trade-offs across three dimensions:

Write Amplification

How many bytes are written to disk for each byte of data you write logically. A value of 10x means writing 1KB of data causes 10KB of disk writes.

Read Amplification

How many disk reads are needed per logical read. A B-tree lookup touching 4 pages = 4x read amplification. More amplification means higher latency.

Space Amplification

The ratio of physical disk space used to the logical data size. If you store 100GB of data and it uses 130GB on disk, that's 1.3x space amplification.
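
All three are just ratios of measured counters. A small sketch using the example numbers from the definitions above:

def amplification(logical_write_bytes, physical_write_bytes,
                  logical_reads, physical_reads,
                  logical_data_bytes, disk_bytes_used):
    return {
        "write_amp": physical_write_bytes / logical_write_bytes,
        "read_amp":  physical_reads / logical_reads,
        "space_amp": disk_bytes_used / logical_data_bytes,
    }

# 1KB written causing 10KB of disk writes, a 4-page B-tree lookup,
# and 100GB of data occupying 130GB on disk:
print(amplification(1_000, 10_000, 1, 4, 100e9, 130e9))
# {'write_amp': 10.0, 'read_amp': 4.0, 'space_amp': 1.3}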

Mark Callaghan (a storage engine engineer at Facebook) and academic collaborators formalized this as the RUM conjecture: you cannot minimize all three simultaneously. You can minimize two, but the third grows. This is not an engineering limitation — it's a fundamental property of data structures.

  Engine                        Write Amplification                 Read Amplification                      Space Amplification
  B-tree (PostgreSQL)           High (WAL + page writes + splits)   Low (3–5 page reads, predictable)       Medium (~30% fragmentation)
  LSM leveled (RocksDB)         Medium–High (compaction rewrites)   Low–Medium (bloom filters help a lot)   Low (~10% overhead with leveled compaction)
  LSM size-tiered (Cassandra)   Low (less compaction work)          Higher (more files to check)            High (2x during compaction)

The practical takeaway: if your workload is read-heavy and you need predictable latency, a B-tree is the safe, conservative choice. If your workload is write-heavy — think IoT telemetry, event logs, user activity streams — an LSM-tree can handle 10x more writes on the same hardware.

Column-Oriented Storage: When Rows Are the Wrong Shape

Everything we've talked about so far — B-trees, LSM-trees — stores data in rows. All the columns for a single row are stored together. This makes sense for most transactional workloads: when you load a customer profile, you need all their fields (name, email, address, preferences) at once. Fetching one row is one operation.

But consider an analytical query: "What was the total revenue from mobile users in Q1, broken down by country?"

A table with 500 million order records might have 50 columns — order ID, customer ID, product ID, price, quantity, timestamp, device type, country, payment method, shipping address, and 40 more. To answer this question, you only need 4 of those columns: price, quantity, device_type, country. But in a row-oriented store, you have to read all 50 columns to get to those 4. You're paying for 46 columns of I/O you don't need.

With 500 million rows and ~200 bytes per row, that's reading 100GB of data to answer a question whose answer lives in ~16GB. That's the problem column stores solve.

How column storage works

Instead of storing all columns of each row together, a column store keeps each column in its own file (or set of files). All the price values live together. All the country values live together. To answer the revenue query, you read only the price, quantity, device_type, and country files — skipping the other 46 entirely.

Row-oriented storage (one row = one record):

Row 1: [order_id=1] [customer_id=42] [price=29.99] [country=US] [device=mobile] [...]
Row 2: [order_id=2] [customer_id=17] [price=14.50] [country=UK] [device=web]    [...]
Row 3: [order_id=3] [customer_id=99] [price=89.00] [country=US] [device=mobile] [...]

→ To get all prices: must read every row, skip 46 other columns each time


Column-oriented storage (one column = one file):

prices.col:   [29.99] [14.50] [89.00] [...]   ← read only this for revenue
country.col:  [US]    [UK]    [US]    [...]   ← read only this for filtering
device.col:   [mobile][web]   [mobile][...]   ← read only this for device filter
customer.col: [42]    [17]    [99]    [...]   ← not needed, not read at all
order_id.col: [1]     [2]     [3]     [...]   ← not needed, not read at all

→ Query reads only 4 files instead of 50 columns × 500M rows

Compression: the free lunch of column storage

Column stores compress dramatically better than row stores. Why? Because a column contains values of the same type, and in practice the same few values repeat constantly. A country column might have 200 distinct values across 500 million rows. A device_type column might have 5 values.

This is perfect for dictionary encoding: instead of storing "United States" 100 million times, you store the number 1 (which maps to "United States" in a dictionary). Or run-length encoding: instead of [US, US, US, US, US, UK, UK, UK], store [(US, 5), (UK, 3)]. These encodings are far less effective in row storage, where values from different columns — with different types and value distributions — are interleaved.

In practice, columnar formats like Parquet or ORC often achieve 5–10x compression compared to raw row storage. For a 100TB dataset, that's the difference between 100TB and 10–20TB of storage. At cloud storage prices, this is not a trivial saving.
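
Both encodings fit in a few lines. A sketch over the country column from the example above:

def dictionary_encode(column):
    """Replace repeated values with small integer codes plus a lookup table."""
    dictionary, codes = {}, []
    for value in column:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

def run_length_encode(column):
    """Collapse runs of identical adjacent values into (value, count) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

country = ["US", "US", "US", "US", "US", "UK", "UK", "UK"]
print(dictionary_encode(country))  # ({'US': 0, 'UK': 1}, [0, 0, 0, 0, 0, 1, 1, 1])
print(run_length_encode(country))  # [['US', 5], ['UK', 3]]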

Vectorized execution: making CPUs go fast

There's another advantage that's less obvious. Modern CPUs have SIMD instructions — Single Instruction, Multiple Data. A single CPU instruction can add 8 integers simultaneously. But this only works if those integers are laid out sequentially in memory.

With column storage, all the price values are sequential in memory. A "sum all prices" query can use SIMD to process 8 prices in the time it takes to process 1. With row storage, prices are scattered across rows with other columns in between — SIMD can't help.

This is why analytics databases like ClickHouse, DuckDB, and BigQuery can scan billions of rows per second per CPU core. It's not magic — it's column layout enabling vectorized CPU execution.
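
You can feel this effect without writing C++. A rough sketch using NumPy as a stand-in for a vectorized execution engine — the pure-Python loop exaggerates the gap, but the direction is the point:

import time
import numpy as np

prices = np.random.rand(10_000_000)        # one column, contiguous in memory

start = time.perf_counter()
vectorized_total = prices.sum()            # vectorized kernel over contiguous values
vectorized_s = time.perf_counter() - start

start = time.perf_counter()
scalar_total = 0.0
for p in prices:                           # one value at a time, like a row-by-row scan
    scalar_total += p
scalar_s = time.perf_counter() - start

print(f"vectorized: {vectorized_s:.3f}s   scalar loop: {scalar_s:.3f}s")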

The write problem with column storage

Column stores are not good for transactional writes. When you insert a row, you have to write to every column file — 50 small writes scattered across 50 files instead of 1 write to 1 location. This is the inverse of the read advantage. It makes column stores terrible for high-frequency OLTP workloads where you're inserting or updating individual rows constantly.

This is why we have two categories of databases that rarely overlap: OLTP (Online Transaction Processing — lots of small reads and writes, row store) and OLAP (Online Analytical Processing — large analytical scans, column store). Most large companies run both and move data between them via ETL pipelines or change data capture.

Real World

This separation is why Airbnb (and most companies) run MySQL or PostgreSQL for transactional data and then replicate into BigQuery or Redshift for analytics. They are solving genuinely different problems and the storage engine requirements are genuinely incompatible for extreme workloads. Hybrid HTAP databases (TiDB, SingleStore) try to bridge this — but they come with their own complexity and trade-offs.

Practical Decision Guide: Which Storage Engine and When

With all of this context, here is a direct framework for making the decision:

Use a B-tree engine (PostgreSQL, MySQL InnoDB) when:

  1. The workload is read-heavy or a balanced mix of reads and writes (typical OLTP).
  2. You need predictable, low read latency with no slow paths.
  3. Range scans and point lookups over indexed keys are central to your queries.

Use an LSM-tree engine (RocksDB, Cassandra, ScyllaDB) when:

  1. The workload is write-heavy — IoT telemetry, event logs, user activity streams.
  2. Write throughput matters more than worst-case read latency.
  3. You can monitor compaction and provision enough headroom for it to keep up at peak load.

Use column-oriented storage (Parquet + query engine, ClickHouse, BigQuery, Redshift) when:

  1. Queries scan a few columns across millions or billions of rows (analytics, aggregation).
  2. Data arrives in batches or streams rather than as high-frequency single-row updates.
  3. Compression and scan speed matter more than point lookups.

Common Mistake

Teams often start with a row-oriented database for convenience, grow to where analytics queries take minutes instead of seconds, and then try to add columnar indexes or materialized views to "fix" it. This usually works for a while and then hits a wall. Column-oriented storage isn't an optimization of row storage — it's a fundamentally different data layout. If your primary use case is analytics at scale, choose a column store from the start.

RocksDB: The Engine Inside the Engines

A brief but important note on RocksDB, because it appears everywhere and its behavior propagates into every system that uses it.

RocksDB is an embeddable LSM-tree storage engine originally developed at Facebook (forked from Google's LevelDB). It's not a database — it's a library that databases embed. The list of systems that use (or have used) RocksDB as their storage layer includes: TiKV (the storage layer of TiDB), MyRocks (MySQL with RocksDB instead of InnoDB), CockroachDB (which later replaced it with Pebble, its own RocksDB-compatible engine), YugabyteDB, Kafka Streams (its local state stores), LinkedIn's Espresso, and dozens more.

Understanding RocksDB means understanding the behavior of all of these systems. When TiKV shows high write amplification, it's RocksDB's leveled compaction running. When a system embedding RocksDB stalls writes under load, that's the engine's L0 threshold kicking in. The abstraction doesn't hide the engine's characteristics — it inherits them.

RocksDB is also famously tunable — it has hundreds of configuration knobs. The default settings are conservative and may not match your workload. For any system embedding RocksDB that you run at scale, it's worth understanding at least: the block cache size (how much memory the block cache gets), write_buffer_size and max_write_buffer_number (memtable size, and how many memtables can accumulate), level0_slowdown_writes_trigger (when writes start slowing), and per-level compression settings (a significant cost/storage trade-off).
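
As an illustration, here is what touching those knobs might look like through the python-rocksdb bindings. The option names mirror the C++ engine, but treat the exact attribute names and the values as assumptions to verify against your version:

import rocksdb

opts = rocksdb.Options()
opts.create_if_missing = True
opts.write_buffer_size = 64 * 1024**2            # memtable size before a flush
opts.max_write_buffer_number = 4                 # memtables allowed to accumulate
opts.level0_slowdown_writes_trigger = 20         # L0 file count that throttles writes
opts.compression = rocksdb.CompressionType.snappy_compression  # per-level is also configurable
opts.table_factory = rocksdb.BlockBasedTableFactory(
    filter_policy=rocksdb.BloomFilterPolicy(10),  # ~10 bits per key
    block_cache=rocksdb.LRUCache(2 * 1024**3))    # 2GB block cache

db = rocksdb.DB("my_service.db", opts)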

The Migration Problem: Why This Decision Is Hard to Change

Let's end with the most practically important point. You can change almost any decision in a software system without a major production incident. You can rewrite services, swap libraries, refactor schemas (carefully). But changing a storage engine with data already in it is genuinely painful.

The reason: you typically cannot read data in the old engine's format with the new engine. You need to migrate every record. For a system with 10TB of data, that migration takes hours and requires careful dual-write logic, validation, and a cutover. For 100TB, it's a multi-week project. And you have to do it while serving live traffic.
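
The dual-write phase usually looks like a thin shim in front of both engines. A sketch with hypothetical old/new store objects (backfill of historical data, error handling, and the final cutover are deliberately omitted):

class DualWriteStore:
    """Writes go to both engines; reads are served from the old one until
    the shadow comparisons come back clean and the backfill has finished."""

    def __init__(self, old_store, new_store):
        self.old, self.new = old_store, new_store
        self.mismatches = []

    def put(self, key, value):
        self.old.put(key, value)      # old engine stays the source of truth
        self.new.put(key, value)      # shadow write into the target engine

    def get(self, key):
        value = self.old.get(key)
        if self.new.get(key) != value:            # validate silently on every read
            self.mismatches.append(key)
        return value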

This is why the storage engine choice deserves the same weight as "which cloud provider do we use." Both are foundational and both are expensive to change. Unlike many architectural decisions where you can defer and decide with more information — the "design late, decide with more data" advice from earlier in this book — the storage engine decision should be made carefully upfront, because the cost of reversing it grows linearly with your data volume.

The Reversibility Test

Before picking a storage engine, ask: "At 100x our current data volume, would this engine still be the right choice?" If the answer is "probably not," either choose the engine that works at 100x now, or plan explicitly for the migration when you'll need it. Don't discover the need for migration while you're already in a performance crisis.

The Principle

"The storage engine is the most load-bearing architectural decision in a data-heavy system — it encodes your trade-offs between reads, writes, and space at a level that no abstraction can hide, and the cost of changing it compounds with every gigabyte you accumulate."

The Most Common Mistake

Choosing a storage engine based on what the team already knows or what the framework defaults to, rather than what the workload actually needs. PostgreSQL is a spectacular database — but running a 200,000 writes/sec event pipeline through it because "we already use it for user accounts" is burning money and engineering time. Match the engine to the workload. It's not one-size-fits-all.

Three Questions for Your Next Design Review

  1. Is our workload primarily reads, primarily writes, or primarily analytical scans — and does the storage engine we're choosing match that?
  2. Have we benchmarked at realistic data volume, not just with a fresh empty database? Have we verified compaction keeps up under peak write load?
  3. If we're wrong about this choice and need to migrate at 10x our current data volume, what does that migration look like, and are we comfortable with that plan?