How you split data across machines, and why getting it wrong is one of the most expensive mistakes in distributed systems.
Partitioning is the only way to scale beyond one machine's limits. Replication gives you more readers, but you still have one machine's worth of write capacity and storage. Partitioning gives you both.
The partition key is not a detail, it is the architecture. Everything downstream — query performance, hot spots, cross-partition complexity — follows from this one decision. Choose it before you choose anything else.
Range partitioning enables range scans but creates hot spots. Hash partitioning prevents hot spots but destroys range scans. There is no free lunch. Your access patterns decide which problem is worse to live with.
Hot partitions are the silent killer. One node at 100% CPU while others idle at 10% means your entire cluster is bottlenecked at one machine. Adding more machines doesn't help, because the load doesn't move with them.
Consistent hashing solves the rebalancing problem, not the hot-spot problem. When a node joins or leaves, only a fraction of keys move — not everything. But a key that is inherently hot is still hot even with consistent hashing.
Cross-partition operations are expensive by definition. Every join, aggregation, or transaction that spans partitions requires network round-trips, coordination, and careful handling of partial failures. Design your data model to make the common case single-partition.
Secondary indexes do not respect partition boundaries. A write goes to one partition; a secondary index query may touch all of them. Know which type of secondary index your system uses — document-partitioned (scatter-gather) or term-partitioned (eventual consistency trade-off).
Repartitioning mid-flight is painful but not catastrophic if planned. The systems that handle it gracefully are the ones that built rebalancing and routing into the design from the start, not bolted it on when the system was already under pressure.
Imagine you have a table with ten billion rows. Even on the fastest machine money can buy, there is a limit to how fast you can write to it, how much of it you can keep in memory, and how quickly you can scan through it. At some point, a single machine is simply not enough.
You could try to solve this with replication — make multiple copies of the data and spread reads across them. That helps with reads. But every copy still needs to accept every write. The write bottleneck and the storage limit remain. You've added redundancy, not capacity.
Partitioning (also called sharding) takes a different approach. Instead of putting all the data on one machine and making copies, you split the data. Each partition holds a subset of the data, and different partitions live on different machines. Now each machine handles a fraction of the writes and stores a fraction of the total data. If you add more machines, you can handle more load.
This sounds simple. It isn't. The moment you split data across machines, a question arises that you can't avoid: when a query comes in, which machine does it go to? And behind that question lurks a hundred more — about how you split the data, how you route requests, how you handle queries that need data from multiple machines, what happens when a machine goes down, and what happens when you need to add more machines later.
The answers to all these questions flow from one decision made early in the design: the partition key. Let's understand the strategies available to you first, and then talk about how to choose.
The most intuitive way to partition data is by range. You sort your data by some key, then divide it into consecutive ranges — like alphabetical sections in an encyclopedia.
All records sorted by last_name:
Partition 1 Partition 2 Partition 3 Partition 4
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ A ... F │ │ G ... M │ │ N ... S │ │ T ... Z │
│ │ │ │ │ │ │ │
│ Node 1 │ │ Node 2 │ │ Node 3 │ │ Node 4 │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
Simple range partitioning on last name. Node 2 holds G through M.
The big advantage is that range scans are cheap. If you want all users whose last name starts with "H", you go to exactly one partition. If you want all orders placed between January 1st and January 31st (partitioned by date), you go to exactly the partitions that cover that month.
HBase and Google's BigTable use range partitioning, as does MongoDB when you choose a ranged shard key. Time-series databases use it constantly: the natural partition key is time, and you almost always query a time range.
Now consider what happens if you partition an IoT sensor database by timestamp, and every sensor in the world is sending data right now. All the writes go to the partition covering "today" — a single node handles all the current writes while every other node sits idle.
This is the fundamental problem with range partitioning: the range that is currently "hot" — whether that's today's data, the most popular product category, or the most active user segment — will concentrate load on a single partition. The other partitions sit underutilised.
There are a few ways to push back against this:
Prefix the key with something else. Instead of timestamp, use sensor_id + timestamp. Now writes spread across sensors, and you can still do range scans per sensor. The tradeoff: range scans across all sensors require hitting all partitions. (A small sketch of this key format follows after this list.)
Pre-split aggressively. Create more partitions than you need now, so you have room to move hot partitions to less-loaded nodes. Cassandra calls these "virtual nodes" and creates 256 by default per physical node for exactly this reason.
Accept the trade-off consciously. If range scans are critical and write hot spots are manageable (low total write volume, or concentrated writes are expected and short-lived), range partitioning is still the right choice.
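Returning to the key-prefix idea above, here is a tiny sketch of the composite row key. The key format is hypothetical (not any specific database's API); the point is that writes from different sensors scatter across the keyspace, while one sensor's readings stay contiguous for range scans.

```python
def row_key(sensor_id: str, ts_iso: str) -> str:
    # e.g. "sensor-0042#2024-01-15T09:30:00Z"
    return f"{sensor_id}#{ts_iso}"

# Writes arriving "now" from different sensors land in different key ranges:
row_key("sensor-0042", "2024-01-15T09:30:00Z")   # sorts under sensor-0042...
row_key("sensor-7781", "2024-01-15T09:30:00Z")   # ...far from sensor-7781

# A per-sensor range scan is still a single contiguous slice:
start = row_key("sensor-0042", "2024-01-01T00:00:00Z")
end   = row_key("sensor-0042", "2024-02-01T00:00:00Z")
# scan(start, end) touches only the partition(s) covering sensor-0042's range.
```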
Instead of assigning a key to a partition based on its value, you run the key through a hash function and assign it based on the hash. Because good hash functions distribute their outputs uniformly, the keys land evenly across partitions — regardless of what the keys look like.
Key → hash(key) → pick partition

hash("alice")   = 7823 → 7823 % 4 = 3 → Partition 3
hash("bob")     = 1204 → 1204 % 4 = 0 → Partition 0
hash("charlie") = 3901 → 3901 % 4 = 1 → Partition 1
hash("diana")   = 5547 → 5547 % 4 = 3 → Partition 3
hash("evan")    = 2890 → 2890 % 4 = 2 → Partition 2

Result: roughly even distribution across 4 partitions
Hash partitioning with modulo. Each key maps to exactly one partition.
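As a minimal illustration (not any particular database's algorithm), here is hash-based partition assignment in Python. Note the stable hash: Python's built-in hash() is randomized per process for strings, so it would assign keys differently on every restart.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (a sketch of the idea)."""
    # md5 here is for distribution, not security.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

for key in ["alice", "bob", "charlie", "diana", "evan"]:
    print(key, "->", partition_for(key, 4))
```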
Hash partitioning solves the sequential hot spot problem. Today's timestamp hashes to a completely different value than yesterday's, and different users land on different partitions, so the aggregate load spreads out. (A single key that is inherently hot is a different problem; we come back to it in the Hot Partitions section.)
Adjacent keys no longer live in the same partition. "alice" and "alicia" are neighbors in alphabetical order, but they hash to completely different values and live on different partitions. A query for "all users whose name starts with ali" must hit every partition and combine the results. This is called a scatter-gather query and it is expensive — proportional to the number of partitions, not the number of results.
This is not a fixable bug. It is the fundamental consequence of trading sort order for uniform distribution. DynamoDB, Cassandra (for the partition key), and most sharded MongoDB deployments make this trade.
If you have 4 partitions and you hash with hash(key) % 4, everything works until you need a 5th partition. Now you hash with % 5, and almost every key maps to a different partition. You have to move almost all the data. During a migration, this is crippling — you're moving data while still serving traffic.
Before (4 nodes): hash(key) % 4
After (5 nodes): hash(key) % 5
Key "alice" (hash=7823):
Before: 7823 % 4 = 3
After: 7823 % 5 = 3 ← same! (lucky)
Key "bob" (hash=1204):
Before: 1204 % 4 = 0
After: 1204 % 5 = 4 ← moved
Key "charlie" (hash=3901):
Before: 3901 % 4 = 1
After: 3901 % 5 = 1 ← same! (lucky)
Key "diana" (hash=5547):
Before: 5547 % 4 = 3
After: 5547 % 5 = 2 ← moved
In general, ~80% of keys move when adding just one node this way
With naive modulo hashing, adding one node forces most keys to move.
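You can verify the damage with a few lines of Python. This sketch (using hypothetical user:N keys) counts how many keys land on a different partition when the modulus changes from 4 to 5.

```python
import hashlib

def partition_for(key: str, n: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if partition_for(k, 4) != partition_for(k, 5))
print(f"{moved / len(keys):.0%} of keys move")   # roughly 80%
```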
This is the problem that consistent hashing solves.
Consistent hashing is a clever technique that reduces the number of keys that need to move when you add or remove a node. Instead of keys mapping to partitions, both keys and nodes are mapped onto a ring (a circular number line from 0 to some large number like 2³²).
The hash ring (0 → 2³²):
0
┌──────┐
N4 ●│ │● N1
│ │
2³²/4×3 │ │ 2³²/4
│ │
N3 ●│ │● N2
└──────┘
2³²/2
A key is assigned to the first node clockwise from its hash position.
hash("alice") = position 105 → next node clockwise = N1
hash("bob") = position 430 → next node clockwise = N2
hash("diana") = position 820 → next node clockwise = N3
The consistent hash ring. Each key goes to the first node clockwise from its position.
Now when you add a new node, you place it somewhere on the ring. It only "takes" keys from the node immediately counter-clockwise. Every other node is unaffected.
Before: N1, N2, N3, N4 on the ring
Add: N5 between N1 and N2
0
┌──────┐
N4 ●│ │● N1
│ │
N5● (new node added here)
│ │
N3 ●│ │● N2
└──────┘
Only the keys between N1 and N5 move from N2 to N5.
All other keys stay where they are.
~1/N of keys move, not ~(N-1)/N.
Adding a node to a consistent hash ring only moves keys from one adjacent node.
This is the core insight: instead of rehashing everything, you only rehash the fraction of keys that fall between the new node and its predecessor. If you have 10 nodes and add one more, roughly 1/10 of the keys move — not 90%.
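To make the mechanics concrete, here is a minimal Python sketch of a consistent hash ring built on bisect. It is illustrative only: one position per node, MD5 for distribution, no replication. Real systems add virtual nodes, as the next section explains.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Minimal sketch of a consistent hash ring (one position per node)."""

    def __init__(self, nodes=()):
        self._positions = []   # sorted hash positions on the ring
        self._node_at = {}     # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = _hash(node)
        bisect.insort(self._positions, pos)
        self._node_at[pos] = node

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position (wrap around at the top).
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._positions)
        return self._node_at[self._positions[idx]]

ring = ConsistentHashRing(["N1", "N2", "N3", "N4"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("N5")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
# Only the keys in N5's arc move: about 1/N on average, though with a single
# position per node the actual arc size varies a lot (which is what vnodes fix).
print(f"{moved / len(keys):.0%} of keys moved")
```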
There is a subtle problem with basic consistent hashing. When you place 4 nodes on a ring, there is no guarantee they are evenly spaced. Random placement means some nodes might end up responsible for 30% of the keyspace and others only 10%.
The solution is virtual nodes (vnodes). Instead of each physical machine occupying one position on the ring, it occupies many positions. A node with twice the capacity gets twice the virtual nodes. The ring still works the same way — keys go to the nearest node clockwise — but now each physical machine handles many small segments of the ring rather than one large one.
Without vnodes (uneven):
┌──────────────────────────────────┐
│ N1 (30%) │ N2 (10%) │ N3 (15%) │ N4 (45%) │
└──────────────────────────────────┘
With vnodes (256 positions per node — even):
N1● N3● N2● N4● N1● N2● N3● N4● N1● N3● ...
├──┤ ├──┤ ├──┤ ├──┤ ├──┤ ├──┤ ├──┤ ├──┤ ...
Each physical node responsible for many small, scattered segments.
Variance washes out → roughly 25% each.
Virtual nodes spread each physical machine across many ring positions, evening out the load.
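Continuing the ring sketch above, virtual nodes are just extra positions: hash each physical node under many labels instead of one. The class below reuses ConsistentHashRing and _hash from the previous sketch; the 256-position default matches the Cassandra figure quoted below but is otherwise arbitrary.

```python
import bisect
from collections import Counter

class VNodeRing(ConsistentHashRing):
    """Same ring, but each physical node is hashed to many positions."""

    def __init__(self, nodes=(), vnodes: int = 256):
        self._vnodes = vnodes
        super().__init__(nodes)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = _hash(f"{node}#vn{i}")
            bisect.insort(self._positions, pos)
            self._node_at[pos] = node

ring = VNodeRing(["N1", "N2", "N3", "N4"])
load = Counter(ring.node_for(f"user:{i}") for i in range(100_000))
print(load)   # roughly 25,000 keys per node: the variance washes out
```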
Cassandra uses 256 virtual nodes per physical node by default. DynamoDB uses a similar approach internally. This is why these systems handle node additions and removals gracefully without massive data movement.
Consistent hashing does not fix inherently hot keys, though. If user_id:12345 is a celebrity with 50 million followers and every interaction with their content hits that partition, consistent hashing doesn't help. You still need a different strategy to spread that specific key's load — which we cover in the Hot Partitions section.
We have been talking about partitioning strategies abstractly, but the real question is: what field do you partition on? This decision shapes everything else in your system — the queries that are fast, the queries that are slow, where hot spots appear, and how hard cross-partition work becomes.
There is no universal right answer. But there is a wrong process: choosing the partition key based on what feels natural, rather than based on your actual access patterns.
Before touching any partitioning strategy, write down your access patterns concretely: which reads and writes happen most often, what fields they filter on, and how frequent each one is.
The partition key should make the most common query pattern single-partition. If 90% of your queries are "get all orders for user X", partition by user_id. All of that user's orders are on one machine — one network call, no scatter-gather.
You have an orders table. What should the partition key be?
| Partition Key | Fast queries | Slow queries | Hot spot risk |
|---|---|---|---|
| user_id | "All orders for user X" ✓ single partition | "All orders placed today": scatter-gather ✗ | Power users with millions of orders (medium) |
| order_id (hash) | Single order lookup ✓ | "All orders for user X": scatter-gather ✗; "All orders today": scatter-gather ✗ | Very low ✓ |
| timestamp (range) | "All orders placed today" ✓ | "All orders for user X": scatter-gather ✗ | Severe write hot spot on the current time ✗ |
| user_id + order_id (compound) | "All orders for user X" ✓; single order lookup ✓ | "All orders placed today": scatter-gather ✗ | Power users (medium) |
Most e-commerce systems land on user_id as the partition key because "get all orders for a user" is by far the most common query. Analytical queries like "all orders placed today" go through a separate analytical pipeline, not the OLTP partition.
Many systems (Cassandra, DynamoDB) let you define a compound key: a partition key (determines which node) plus a sort key (determines order within the partition). This is a powerful combination.
DynamoDB / Cassandra compound key pattern:

Table: user_orders
┌──────────────────┬────────────────┬──────────────┐
│ partition_key    │ sort_key       │ attributes   │
│ user_id          │ created_at     │ amount, ...  │
├──────────────────┼────────────────┼──────────────┤
│ user:1001        │ 2024-01-15     │ $49.99       │
│ user:1001        │ 2024-01-22     │ $12.00       │  ← same partition
│ user:1001        │ 2024-02-03     │ $98.00       │
│ user:2047        │ 2024-01-18     │ $7.50        │  ← different partition
└──────────────────┴────────────────┴──────────────┘

Query: "all orders for user 1001 in January 2024"
→ Single partition hit (user:1001)
→ Range scan on sort key (2024-01-01 → 2024-01-31)
→ Fast.
A compound key gives you both partition locality and range scans within a partition.
This pattern is very common. You get the distribution benefit of hashing (on the partition key) and the range scan benefit of ordering (on the sort key) — within a single partition. The limitation is that range scans across partitions still require scatter-gather.
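As a concrete sketch, the query from the diagram might look like this with DynamoDB's Python SDK (boto3). The table name, key names, and ISO-8601 date format follow the example above; credentials and region configuration are assumed.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Assumes a table like the one sketched above: partition key "user_id",
# sort key "created_at" (ISO-8601 strings, so they sort chronologically).
table = boto3.resource("dynamodb").Table("user_orders")

# "All orders for user 1001 in January 2024": one partition, one range scan.
resp = table.query(
    KeyConditionExpression=(
        Key("user_id").eq("user:1001")
        & Key("created_at").between("2024-01-01", "2024-01-31")
    )
)
for item in resp["Items"]:
    print(item["created_at"], item["amount"])
```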
A hot partition is a partition that receives significantly more traffic than the others. Your overall cluster might have plenty of capacity — 90% of nodes are idle — but the hot partition is the bottleneck, and you can't scale past it without redesigning the data model.
"One node at 100% while nine others sit at 10% means you have a 10-node cluster with one node of effective capacity."
They appear when your data's access pattern is not uniform — which is almost always the case in real systems.
Celebrity / power user problem. Imagine a social network where you partition posts by author_id. A normal user might have 200 followers. A celebrity might have 40 million. Every time they post, every follower's feed query hits that one partition. You've put a celebrity's data on one machine that now handles requests proportional to their audience.
Temporal hot spots. Partition an event log by timestamp. All new writes go to the partition holding "now". It doesn't matter how many nodes you have — the write-hot partition is always the newest one.
Business hot spots. A Black Friday sale might cause 80% of all orders to go through ten SKUs that are all in the same partition. A viral moment can send millions of users to the same product page.
Hot partitions hide until they cause pain, because cluster-wide averages look healthy while one partition burns. You need to actively look for them: watch per-partition metrics (request rate, CPU, latency), not just aggregates. Once you've found one, there are several ways to deal with it:
Option 1: Change the partition key (the real fix). If you partitioned by author_id and celebrities are killing you, consider partitioning by something that distributes more evenly. This usually requires a significant migration.
Option 2: Add a random suffix to hot keys. For a hot key like celebrity:12345, instead of one partition, spread it across, say, 10 virtual partitions: celebrity:12345:0 through celebrity:12345:9. Writes pick one at random. Reads must query all 10 and merge. This is called key salting.
Key salting for a hot celebrity post:

Without salting:
  All reads for celebrity 12345 → single partition (bottleneck)

With salting (N=10):
  Write: pick random suffix 0-9
    "post:12345:3" → partition A
    "post:12345:7" → partition B
    "post:12345:1" → partition C
    ...
  Read: query all 10 partitions, merge results
    10× read amplification, but no single hot partition

Trade-off: reads become more expensive, writes become cheaper per partition.
Key salting splits a single hot key across multiple partitions at the cost of read amplification.
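A minimal sketch of the salting helpers, assuming a hypothetical store client and a fixed salt count of 10:

```python
import random

N_SALTS = 10   # how many ways to split the hot key

def salted_write_key(hot_key: str) -> str:
    # Writes pick a random suffix, so they spread across N_SALTS partitions.
    return f"{hot_key}:{random.randrange(N_SALTS)}"

def salted_read_keys(hot_key: str) -> list[str]:
    # Reads must query every salt and merge the results (read amplification).
    return [f"{hot_key}:{i}" for i in range(N_SALTS)]

# Hypothetical storage calls (store.put / store.get) standing in for your client:
# store.put(salted_write_key("post:12345"), payload)
# rows = [r for k in salted_read_keys("post:12345") for r in store.get(k)]
```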
Option 3: Cache aggressively in front of the hot partition. If the hot partition is read-dominated (as with a celebrity's profile), putting a cache in front can absorb most of the load without changing the data model. This doesn't help if the hot spot is write-dominated.
Option 4: Denormalize and pre-compute. Instead of reading from the hot partition at query time, pre-compute the result and store it somewhere less hot. Fan-out on write instead of fan-in on read. Twitter famously moved from fan-in (query a celebrity's partition at read time) to fan-out (push tweets to followers' timelines at write time) for exactly this reason.
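A sketch of the fan-out-on-write idea, with a hypothetical timeline_store standing in for whatever partitioned storage holds per-user timelines:

```python
def publish_post(author_id: str, post_id: str, followers: list[str], timeline_store) -> None:
    # The expensive work happens once, at post time. Each follower's timeline
    # lives in that follower's partition, so the writes spread across the cluster.
    for follower_id in followers:
        timeline_store.append(key=f"timeline:{follower_id}", value=post_id)

def read_timeline(user_id: str, timeline_store) -> list[str]:
    # Reading a home timeline is now a single-partition lookup,
    # never a query against the author's (hot) partition.
    return timeline_store.get(key=f"timeline:{user_id}")
```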
Your primary access pattern is efficient with the right partition key. But what about the secondary patterns? "Find all users in city X" or "find all products with price under $50"? These are queries that don't use the partition key, so they can't route to a single partition. This is the secondary index problem.
There are two approaches, and they make opposite tradeoffs.
Each partition maintains its own secondary index, covering only the data in that partition. When you write a record, you update the index on the same partition — one write, one partition, no cross-partition coordination.
Partition 1                          Partition 2
┌──────────────────────────┐         ┌──────────────────────────┐
│ Data: user_ids 1-1000    │         │ Data: user_ids 1001-2000 │
│                          │         │                          │
│ Local index:             │         │ Local index:             │
│ city:NY → [u1, u3, u7]   │         │ city:NY → [u1500]        │
│ city:CA → [u2, u5, u9]   │         │ city:CA → [u1200, u1807] │
└──────────────────────────┘         └──────────────────────────┘

Write "user u11, city=NY":
→ goes to Partition 1 only (user_id 11 is in its range)
→ Partition 1 updates its local city index
→ one partition touched ✓

Query "all users in NY":
→ must query ALL partitions (any of them might hold NY users)
→ merge results
→ scatter-gather ✗
Local indexes make writes cheap and reads expensive for non-partition-key queries.
This is how Cassandra's secondary indexes and MongoDB's per-shard indexes work, and it is the model Elasticsearch follows for its shards. Writes are cheap. But queries on the secondary index must fan out to all partitions and collect results: scatter-gather. For large clusters, this can mean hitting hundreds of nodes for a single query.
Instead of each partition holding its own local index, you build a global index and partition the index itself. The index for "users in city NY" lives on one specific index partition — not scattered across all data partitions.
Data partitions (by user_id) Index partitions (by city)
┌───────────────────────────┐ ┌──────────────────────────┐
│ P1: user_ids 1-1000 │ │ Index P1: cities A-M │
│ P2: user_ids 1001-2000 │ │ city:Chicago→[u45,u910] │
│ P3: user_ids 2001-3000 │ │ city:Miami→[u3,u1500] │
└───────────────────────────┘ ├──────────────────────────┤
│ Index P2: cities N-Z │
│ city:NY→[u7,u23,u890] │
│ city:Seattle→[u201] │
└──────────────────────────┘
Query "all users in NY":
→ Hit Index P2 (single partition) → get list of user_ids
→ Fetch those user_ids from data partitions
→ index read = single partition ✓
Write "user u11, city=NY":
→ Update data in data partition P1 (user_ids 1-1000)
→ Update index in Index P2 (cities N-Z)
→ two partitions touched, no atomic transaction ✗
Global indexes make reads fast but writes require updating two separate partitions.
Global indexes make reads on the secondary key much faster — often a single partition. But writes are more complex. Writing a record now requires updating two partitions: the data partition and the index partition. These two updates cannot be atomic without a distributed transaction. Most systems handle this by making global indexes eventually consistent — the index will catch up, but there is a window where it lags behind the data.
DynamoDB's global secondary index is eventually consistent for exactly this reason. This is not a design flaw: it is the correct trade-off, given that synchronous cross-partition coordination is expensive and fragile.
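A sketch of why the index lags: the data write and the index update are two separate steps against two different partitions, with an asynchronous hand-off in between. The data_store, index_queue, and index_store objects below are hypothetical stand-ins, not a real API.

```python
def write_user(data_store, index_queue, user_id: str, city: str) -> None:
    # Step 1: the data write commits on the data partition that owns user_id.
    data_store.put(partition_key=user_id, value={"user_id": user_id, "city": city})
    # Step 2: the index update is handed off asynchronously. It is NOT atomic
    # with step 1: if we crash here, the index lags until a repair/replay
    # mechanism catches it up.
    index_queue.enqueue({"term": f"city:{city}", "user_id": user_id})

def index_worker(index_queue, index_store) -> None:
    # Applies queued updates to the index partition that owns each term.
    for event in index_queue.consume():
        index_store.append(key=event["term"], value=event["user_id"])
```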
Partitioning gives you write scalability and storage capacity. What it takes away is the ability to easily work with data that lives on different nodes. This is not optional complexity you can design around — it is the fundamental cost of partitioning.
When a query needs data from multiple partitions, the coordinator sends the query to each relevant partition in parallel, waits for all of them to respond, and merges the results. This is called scatter-gather or fan-out.
The problem with scatter-gather is tail latency. Even if 99% of your partitions respond in 5ms, if one is slow (due to load, garbage collection, or network jitter), your query waits for it. The more partitions you fan out to, the higher the chance you hit a slow one. Google's research showed that large fan-out operations routinely hit the 1% tail latency because with enough servers, the 1% happens with certainty.
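A small asyncio simulation makes the tail-latency effect visible. The latency numbers are made up; the point is that the fan-out's total time is governed by its slowest member, not the average.

```python
import asyncio
import random

async def query_partition(partition_id: int) -> list[str]:
    # Simulated per-partition latency: usually fast, occasionally slow
    # (GC pause, overloaded node, network jitter).
    delay = 0.005 if random.random() < 0.99 else 0.250
    await asyncio.sleep(delay)
    return [f"row-from-p{partition_id}"]

async def scatter_gather(num_partitions: int) -> list[str]:
    # Fan out to every partition in parallel, then merge. Total latency equals
    # the latency of the slowest partition.
    results = await asyncio.gather(*(query_partition(p) for p in range(num_partitions)))
    return [row for part in results for row in part]

rows = asyncio.run(scatter_gather(100))
# With 100 partitions and a 1% chance of a 250 ms straggler, most requests
# (1 - 0.99**100 ≈ 63%) include at least one slow partition.
```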
In a relational database, joining two tables is handled by the query engine. In a partitioned system, if the data for the join lives on different nodes, you need to move data across the network to do the join. There are two approaches:
Repartition join (shuffle join): Move data to co-locate the join keys. Both sides are rehashed on the join key so matching records end up on the same node. This is what MapReduce and Spark do. Very expensive for OLTP, standard practice for analytical batch workloads.
Broadcast join: If one side of the join is small, send a copy of the entire small side to every partition. Each partition then joins locally. Spark calls this a "broadcast join". Only works when one side is small enough to fit in memory on every node.
Neither is free. For OLTP systems, the practical advice is: design your data model so that the common join doesn't need to cross partitions. Co-locate data that is often queried together under the same partition key.
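For the analytical case, a broadcast join is usually a one-line hint. Here is a hedged PySpark sketch; the S3 paths and join column are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical inputs: a large partitioned fact table and a small dimension table.
orders = spark.read.parquet("s3://bucket/orders/")        # large, partitioned
products = spark.read.parquet("s3://bucket/products/")    # small enough to broadcast

# broadcast() hints Spark to ship the small side to every executor, so each
# partition of `orders` joins locally with no shuffle of the large side.
joined = orders.join(broadcast(products), on="product_id")

# Without the hint (and above spark.sql.autoBroadcastJoinThreshold), Spark
# falls back to a shuffle join: both sides are repartitioned on product_id.
```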
This is the hardest case. You want to atomically update records on two different partitions — either both changes happen or neither does. Without this guarantee, you can end up in inconsistent states.
The standard mechanism is two-phase commit (2PC): a coordinator asks all involved partitions to "prepare" the change, and only commits if all agree. It works, but it is slow (two round-trips plus locking), and it fails badly when the coordinator crashes mid-transaction.
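In outline, the coordinator's side of 2PC looks like this. The participants and their prepare/commit/abort methods are hypothetical stand-ins, not a real library.

```python
def two_phase_commit(participants, txn) -> bool:
    # Phase 1: every partition must promise it can apply the change
    # (typically by durably logging it and holding locks).
    votes = [p.prepare(txn) for p in participants]
    if not all(votes):
        for p in participants:
            p.abort(txn)
        return False
    # Phase 2: once all have voted yes, the decision is final. If the
    # coordinator crashes here, participants sit holding locks until it
    # recovers -- the failure mode that makes 2PC fragile.
    for p in participants:
        p.commit(txn)
    return True
```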
Most high-throughput distributed systems avoid 2PC entirely. The common alternatives are to design the data model so that related records share a partition (keeping transactions single-partition), to accept eventual consistency and repair inconsistencies asynchronously, or to use sagas: a chain of single-partition transactions with compensating actions that undo earlier steps if a later one fails.
Over time, your cluster will change. You'll add nodes when you outgrow your current capacity. You'll remove nodes when hardware fails or you downsize. Data distributions will shift as your application evolves. Rebalancing is the process of moving data around to restore even distribution.
The challenge is that rebalancing must happen while the system is still serving traffic. You can't take a maintenance window on a database that's serving production load. This means:
More partitions than nodes. The classic mistake is having one partition per node. When you add a node, you have to split a partition in half and move data — messy and slow. If instead you create many more partitions than you have nodes (say, 1000 partitions on 10 nodes), adding a node just means moving some whole partitions to it. No splits needed. Cassandra, MongoDB, and Elasticsearch all default to this approach.
1 partition per node (fragile):
┌─────┐ ┌─────┐ ┌─────┐
│ P1 │ │ P2 │ │ P3 │ ← adding a node requires splitting partitions
│ N1 │ │ N2 │ │ N3 │
└─────┘ └─────┘ └─────┘
Many partitions per node (robust):
┌──────────────────────┐ ┌──────────────────────┐ ┌────────────────────┐
│ P1, P2, P3, P4, P5 │ │ P6, P7, P8, P9, P10 │ │ P11, P12, P13 ... │
│ N1 │ │ N2 │ │ N3 │
└──────────────────────┘ └──────────────────────┘ └────────────────────┘
Add N4: move P1, P6, P11 (and a few others) to N4.
No splits. Minimal data movement. Clean.
More partitions than nodes makes adding nodes a simple partition move, not a split.
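A toy sketch of the idea: with a fixed pool of 1,000 partitions, adding a node is just a reassignment of whole partitions until the counts even out. Real systems also weigh node capacity, respect rack awareness, and rate-limit the transfers.

```python
NUM_PARTITIONS = 1000   # fixed at cluster creation, far more than nodes

def rebalance(assignment: dict[int, str], nodes: list[str]) -> dict[int, str]:
    """Move whole partitions from over-loaded nodes onto under-loaded ones
    until every node holds roughly NUM_PARTITIONS / len(nodes)."""
    target = NUM_PARTITIONS // len(nodes)
    counts = {n: 0 for n in nodes}
    for node in assignment.values():
        counts[node] += 1
    new_assignment = dict(assignment)
    for p, node in assignment.items():
        if counts[node] > target:
            under = min(counts, key=counts.get)      # least-loaded node
            if counts[under] < target:
                new_assignment[p] = under            # move the whole partition
                counts[node] -= 1
                counts[under] += 1
    return new_assignment

# Start with 3 nodes, then add N4:
initial = {p: ["N1", "N2", "N3"][p % 3] for p in range(NUM_PARTITIONS)}
after = rebalance(initial, ["N1", "N2", "N3", "N4"])
moved = sum(1 for p in initial if initial[p] != after[p])
print(f"{moved} of {NUM_PARTITIONS} partitions move")   # 250 here: ~1/4, no splits
```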
Rate-limit the rebalance. Moving data consumes disk I/O and network. If unconstrained, a rebalance can saturate your disks and degrade the entire cluster's latency for hours. Good systems allow you to configure a maximum rebalance rate so it happens gradually in the background.
Don't trigger rebalancing automatically on transient failures. If a node goes down for 30 seconds due to a garbage collection pause and you immediately start rebalancing, you're doing unnecessary work when it comes back. Good systems wait for a configurable timeout before deciding a node is truly gone. Cassandra defaults to a "never auto-rebalance" approach; operators trigger it manually after a topology change is confirmed.
When you sit down to design the partitioning for a new system, work through these questions in order:
1. Do you actually need partitioning? A single well-tuned relational database can handle tens of thousands of writes per second and store terabytes. Partitioning adds enormous complexity. If you're not genuinely constrained by data volume or write throughput, a single-node or replicated setup may be the simpler, better choice.
2. What are your access patterns? Write down the 5 most common queries and 5 most common writes. Rank them by frequency. The partition key should make the top items fast.
3. Range scans or not? If yes, range partitioning. If no, hash partitioning. If both — this is where compound keys or denormalization can help.
4. What data should be co-located? Data that is almost always queried together should be in the same partition. This often means putting all of a user's data under user_id, all of an order's line items under order_id, and so on.
5. Where will your hot spots be? Name them explicitly. For every obvious hot spot, decide how you'll handle it before it hits production.
6. What are your cross-partition operations? For each one, decide: can you redesign to avoid it, or can you tolerate the cost? A system with three scatter-gather queries on every page load will struggle at scale. A system with zero cross-partition operations will have a constrained data model. The answer is somewhere in between, and it depends on your traffic and latency targets.
Choose the partition key that makes your most common operation hit exactly one partition — everything else follows from that decision.
The most common mistake is choosing a partition key that makes the data model feel natural (like a timestamp or a sequential ID) rather than one that matches the actual query patterns, and discovering the resulting hot spots six months later, when the system is under production load and hard to migrate.