How you split data across machines, and why getting it wrong is one of the most expensive mistakes in distributed systems.
Partitioning is the only way to scale beyond one machine's limits. Replication gives you more readers, but you still have one machine's worth of write capacity and storage. Partitioning gives you both.
The partition key is not a detail, it is the architecture. Everything downstream — query performance, hot spots, cross-partition complexity — follows from this one decision. Choose it before you choose anything else.
Range partitioning enables range scans but creates hot spots. Hash partitioning prevents hot spots but destroys range scans. There is no free lunch. Your access patterns decide which problem is worse to live with.
Hot partitions are the silent killer. One node at 100% CPU while others idle at 10% means your entire cluster is bottlenecked at one machine. Adding more machines doesn't help, because the load doesn't move with them.
Consistent hashing solves the rebalancing problem, not the hot-spot problem. When a node joins or leaves, only a fraction of keys move — not everything. But a key that is inherently hot is still hot even with consistent hashing.
Cross-partition operations are expensive by definition. Every join, aggregation, or transaction that spans partitions requires network round-trips, coordination, and careful handling of partial failures. Design your data model to make the common case single-partition.
Secondary indexes do not respect partition boundaries. A write goes to one partition; a secondary index query may touch all of them. Know which type of secondary index your system uses — document-partitioned (scatter-gather) or term-partitioned (eventual consistency trade-off).
Repartitioning mid-flight is painful but not catastrophic if planned. The systems that handle it gracefully are the ones that built rebalancing and routing into the design from the start, not bolted it on when the system was already under pressure.
Imagine you have a table with ten billion rows. Even on the fastest machine money can buy, there is a limit to how fast you can write to it, how much of it you can keep in memory, and how quickly you can scan through it. At some point, a single machine is simply not enough.
You could try to solve this with replication — make multiple copies of the data and spread reads across them. That helps with reads. But every copy still needs to accept every write. The write bottleneck and the storage limit remain. You've added redundancy, not capacity.
Partitioning (also called sharding) takes a different approach. Instead of putting all the data on one machine and making copies, you split the data. Each partition holds a subset of the data, and different partitions live on different machines. Now each machine handles a fraction of the writes and stores a fraction of the total data. If you add more machines, you can handle more load.
This sounds simple. It isn't. The moment you split data across machines, a question arises that you can't avoid: when a query comes in, which machine does it go to? And behind that question lurks a hundred more — about how you split the data, how you route requests, how you handle queries that need data from multiple machines, what happens when a machine goes down, and what happens when you need to add more machines later.
The answers to all these questions flow from one decision made early in the design: the partition key. Let's understand the strategies available to you first, and then talk about how to choose.
The most intuitive way to partition data is by range. You sort your data by some key, then divide it into consecutive ranges — like alphabetical sections in an encyclopedia.
All records sorted by last_name:
Partition 1 Partition 2 Partition 3 Partition 4
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ A ... F │ │ G ... M │ │ N ... S │ │ T ... Z │
│ │ │ │ │ │ │ │
│ Node 1 │ │ Node 2 │ │ Node 3 │ │ Node 4 │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
Simple range partitioning on last name. Node 2 holds G through M.
The big advantage is that range scans are cheap. If you want all users whose last name starts with "H", you go to exactly one partition. If you want all orders placed between January 1st and January 31st (partitioned by date), you go to exactly the partitions that cover that month.
HBase and Google's BigTable use range partitioning, as does MongoDB when you choose a ranged shard key. Time-series databases use it constantly: the natural partition key is time, and you almost always query a time range.
Now consider what happens if you partition an IoT sensor database by timestamp, and every sensor in the world is sending data right now. All the writes go to the partition covering "today" — a single node handles all the current writes while every other node sits idle.
This is the fundamental problem with range partitioning: the range that is currently "hot" — whether that's today's data, the most popular product category, or the most active user segment — will concentrate load on a single partition. The other partitions sit underutilised.
There are a few ways to push back against this:
Prefix the key with something else. Instead of timestamp, use sensor_id + timestamp. Now writes spread across sensors, and you can still do range scans per sensor. The tradeoff: range scans across all sensors require hitting all partitions. (A small sketch of this key format follows after this list.)
Pre-split aggressively. Create more partitions than you need now, so you have room to move hot partitions to less-loaded nodes. Cassandra calls these "virtual nodes" and creates 256 by default per physical node for exactly this reason.
Accept the trade-off consciously. If range scans are critical and write hot spots are manageable (low total write volume, or concentrated writes are expected and short-lived), range partitioning is still the right choice.
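Returning to the key-prefix idea above, here is a tiny sketch of the composite row key. The key format is hypothetical (not any specific database's API); the point is that writes from different sensors scatter across the keyspace, while one sensor's readings stay contiguous for range scans.

```python
def row_key(sensor_id: str, ts_iso: str) -> str:
    # e.g. "sensor-0042#2024-01-15T09:30:00Z"
    return f"{sensor_id}#{ts_iso}"

# Writes arriving "now" from different sensors land in different key ranges:
row_key("sensor-0042", "2024-01-15T09:30:00Z")   # sorts under sensor-0042...
row_key("sensor-7781", "2024-01-15T09:30:00Z")   # ...far from sensor-7781

# A per-sensor range scan is still a single contiguous slice:
start = row_key("sensor-0042", "2024-01-01T00:00:00Z")
end   = row_key("sensor-0042", "2024-02-01T00:00:00Z")
# scan(start, end) touches only the partition(s) covering sensor-0042's range.
```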
Instead of assigning a key to a partition based on its value, you run the key through a hash function and assign it based on the hash. Because good hash functions distribute their outputs uniformly, the keys land evenly across partitions — regardless of what the keys look like.
Key → hash(key) → pick partition

hash("alice")   = 7823 → 7823 % 4 = 3 → Partition 3
hash("bob")     = 1204 → 1204 % 4 = 0 → Partition 0
hash("charlie") = 3901 → 3901 % 4 = 1 → Partition 1
hash("diana")   = 5547 → 5547 % 4 = 3 → Partition 3
hash("evan")    = 2890 → 2890 % 4 = 2 → Partition 2

Result: roughly even distribution across 4 partitions
Hash partitioning with modulo. Each key maps to exactly one partition.
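As a minimal illustration (not any particular database's algorithm), here is hash-based partition assignment in Python. Note the stable hash: Python's built-in hash() is randomized per process for strings, so it would assign keys differently on every restart.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (a sketch of the idea)."""
    # md5 here is for distribution, not security.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

for key in ["alice", "bob", "charlie", "diana", "evan"]:
    print(key, "->", partition_for(key, 4))
```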
Hash partitioning solves the sequential hot spot problem. Today's timestamp hashes to a completely different value than yesterday's, and different users land on different partitions, so the aggregate load spreads out. (A single key that is inherently hot is a different problem; we come back to it in the Hot Partitions section.)
Adjacent keys no longer live in the same partition. "alice" and "alicia" are neighbors in alphabetical order, but they hash to completely different values and live on different partitions. A query for "all users whose name starts with ali" must hit every partition and combine the results. This is called a scatter-gather query and it is expensive — proportional to the number of partitions, not the number of results.
This is not a fixable bug. It is the fundamental consequence of trading sort order for uniform distribution. DynamoDB, Cassandra (for the partition key), and most sharded MongoDB deployments make this trade.
If you have 4 partitions and you hash with hash(key) % 4, everything works until you need a 5th partition. Now you hash with % 5, and almost every key maps to a different partition. You have to move almost all the data. During a migration, this is crippling — you're moving data while still serving traffic.
Before (4 nodes): hash(key) % 4
After (5 nodes): hash(key) % 5
Key "alice" (hash=7823):
Before: 7823 % 4 = 3
After: 7823 % 5 = 3 ← same! (lucky)
Key "bob" (hash=1204):
Before: 1204 % 4 = 0
After: 1204 % 5 = 4 ← moved
Key "charlie" (hash=3901):
Before: 3901 % 4 = 1
After: 3901 % 5 = 1 ← same! (lucky)
Key "diana" (hash=5547):
Before: 5547 % 4 = 3
After: 5547 % 5 = 2 ← moved
In general, ~80% of keys move when adding just one node this way
With naive modulo hashing, adding one node forces most keys to move.
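You can verify the damage with a few lines of Python. This sketch (using hypothetical user:N keys) counts how many keys land on a different partition when the modulus changes from 4 to 5.

```python
import hashlib

def partition_for(key: str, n: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if partition_for(k, 4) != partition_for(k, 5))
print(f"{moved / len(keys):.0%} of keys move")   # roughly 80%
```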
This is the problem that consistent hashing solves.
Consistent hashing is a clever technique that reduces the number of keys that need to move when you add or remove a node. Instead of keys mapping to partitions, both keys and nodes are mapped onto a ring (a circular number line from 0 to some large number like 2³²).
The hash ring (0 → 2³²):
0
┌──────┐
N4 ●│ │● N1
│ │
2³²/4×3 │ │ 2³²/4
│ │
N3 ●│ │● N2
└──────┘
2³²/2
A key is assigned to the first node clockwise from its hash position.
hash("alice") = position 105 → next node clockwise = N1
hash("bob") = position 430 → next node clockwise = N2
hash("diana") = position 820 → next node clockwise = N3
The consistent hash ring. Each key goes to the first node clockwise from its position.
Now when you add a new node, you place it somewhere on the ring. It only "takes" keys from the node immediately counter-clockwise. Every other node is unaffected.
Before: N1, N2, N3, N4 on the ring
Add: N5 between N1 and N2
0
┌──────┐
N4 ●│ │● N1
│ │
N5● (new node added here)
│ │
N3 ●│ │● N2
└──────┘
Only the keys between N1 and N5 move from N2 to N5.
All other keys stay where they are.
~1/N of keys move, not ~(N-1)/N.
Adding a node to a consistent hash ring only moves keys from one adjacent node.
This is the core insight: instead of rehashing everything, you only rehash the fraction of keys that fall between the new node and its predecessor. If you have 10 nodes and add one more, roughly 1/10 of the keys move — not 90%.
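To make the mechanics concrete, here is a minimal Python sketch of a consistent hash ring built on bisect. It is illustrative only: one position per node, MD5 for distribution, no replication. Real systems add virtual nodes, as the next section explains.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Minimal sketch of a consistent hash ring (one position per node)."""

    def __init__(self, nodes=()):
        self._positions = []   # sorted hash positions on the ring
        self._node_at = {}     # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = _hash(node)
        bisect.insort(self._positions, pos)
        self._node_at[pos] = node

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position (wrap around at the top).
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._positions)
        return self._node_at[self._positions[idx]]

ring = ConsistentHashRing(["N1", "N2", "N3", "N4"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("N5")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
# Only the keys in N5's arc move: about 1/N on average, though with a single
# position per node the actual arc size varies a lot (which is what vnodes fix).
print(f"{moved / len(keys):.0%} of keys moved")
```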
There is a subtle problem with basic consistent hashing. When you place 4 nodes on a ring, there is no guarantee they are evenly spaced. Random placement means some nodes might end up responsible for 30% of the keyspace and others only 10%.
The solution is virtual nodes (vnodes). Instead of each physical machine occupying one position on the ring, it occupies many positions. A node with twice the capacity gets twice the virtual nodes. The ring still works the same way — keys go to the nearest node clockwise — but now each physical machine handles many small segments of the ring rather than one large one.
Without vnodes (uneven):
┌──────────────────────────────────┐
│ N1 (30%) │ N2 (10%) │ N3 (15%) │ N4 (45%) │
└──────────────────────────────────┘
With vnodes (256 positions per node — even):
N1● N3● N2● N4● N1● N2● N3● N4● N1● N3● ...
├──┤ ├──┤ ├──┤ ├──┤ ├──┤ ├──┤ ├──┤ ├──┤ ...
Each physical node responsible for many small, scattered segments.
Variance washes out → roughly 25% each.
Virtual nodes spread each physical machine across many ring positions, evening out the load.
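Continuing the ring sketch above, virtual nodes are just extra positions: hash each physical node under many labels instead of one. The class below reuses ConsistentHashRing and _hash from the previous sketch; the 256-position default matches the Cassandra figure quoted below but is otherwise arbitrary.

```python
import bisect
from collections import Counter

class VNodeRing(ConsistentHashRing):
    """Same ring, but each physical node is hashed to many positions."""

    def __init__(self, nodes=(), vnodes: int = 256):
        self._vnodes = vnodes
        super().__init__(nodes)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = _hash(f"{node}#vn{i}")
            bisect.insort(self._positions, pos)
            self._node_at[pos] = node

ring = VNodeRing(["N1", "N2", "N3", "N4"])
load = Counter(ring.node_for(f"user:{i}") for i in range(100_000))
print(load)   # roughly 25,000 keys per node: the variance washes out
```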
Cassandra uses 256 virtual nodes per physical node by default. DynamoDB uses a similar approach internally. This is why these systems handle node additions and removals gracefully without massive data movement.
Consistent hashing does not fix inherently hot keys, though. If user_id:12345 is a celebrity with 50 million followers and every interaction with their content hits that partition, consistent hashing doesn't help. You still need a different strategy to spread that specific key's load — which we cover in the Hot Partitions section.
We have been talking about partitioning strategies abstractly, but the real question is: what field do you partition on? This decision shapes everything else in your system — the queries that are fast, the queries that are slow, where hot spots appear, and how hard cross-partition work becomes.
There is no universal right answer. But there is a wrong process: choosing the partition key based on what feels natural, rather than based on your actual access patterns.
Before touching any partitioning strategy, write down your access patterns concretely: which reads and writes happen most often, what fields they filter on, and how frequent each one is.
The partition key should make the most common query pattern single-partition. If 90% of your queries are "get all orders for user X", partition by user_id. All of that user's orders are on one machine — one network call, no scatter-gather.
You have an orders table. What should the partition key be?
| Partition Key | Fast queries | Slow queries | Hot spot risk |
|---|---|---|---|
| user_id | "All orders for user X" ✓ single partition | "All orders placed today": scatter-gather ✗ | Power users with millions of orders (medium) |
| order_id (hash) | Single order lookup ✓ | "All orders for user X": scatter-gather ✗; "All orders today": scatter-gather ✗ | Very low ✓ |
| timestamp (range) | "All orders placed today" ✓ | "All orders for user X": scatter-gather ✗ | Severe write hot spot on the current time ✗ |
| user_id + order_id (compound) | "All orders for user X" ✓; single order lookup ✓ | "All orders placed today": scatter-gather ✗ | Power users (medium) |
Most e-commerce systems land on user_id as the partition key because "get all orders for a user" is by far the most common query. Analytical queries like "all orders placed today" go through a separate analytical pipeline, not the OLTP partition.
Many systems (Cassandra, DynamoDB) let you define a compound key: a partition key (determines which node) plus a sort key (determines order within the partition). This is a powerful combination.
DynamoDB / Cassandra compound key pattern:

Table: user_orders
┌──────────────────┬────────────────┬──────────────┐
│ partition_key    │ sort_key       │ attributes   │
│ user_id          │ created_at     │ amount, ...  │
├──────────────────┼────────────────┼──────────────┤
│ user:1001        │ 2024-01-15     │ $49.99       │
│ user:1001        │ 2024-01-22     │ $12.00       │  ← same partition
│ user:1001        │ 2024-02-03     │ $98.00       │
│ user:2047        │ 2024-01-18     │ $7.50        │  ← different partition
└──────────────────┴────────────────┴──────────────┘

Query: "all orders for user 1001 in January 2024"
→ Single partition hit (user:1001)
→ Range scan on sort key (2024-01-01 → 2024-01-31)
→ Fast.
A compound key gives you both partition locality and range scans within a partition.
This pattern is very common. You get the distribution benefit of hashing (on the partition key) and the range scan benefit of ordering (on the sort key) — within a single partition. The limitation is that range scans across partitions still require scatter-gather.
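As a concrete sketch, the query from the diagram might look like this with DynamoDB's Python SDK (boto3). The table name, key names, and ISO-8601 date format follow the example above; credentials and region configuration are assumed.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Assumes a table like the one sketched above: partition key "user_id",
# sort key "created_at" (ISO-8601 strings, so they sort chronologically).
table = boto3.resource("dynamodb").Table("user_orders")

# "All orders for user 1001 in January 2024": one partition, one range scan.
resp = table.query(
    KeyConditionExpression=(
        Key("user_id").eq("user:1001")
        & Key("created_at").between("2024-01-01", "2024-01-31")
    )
)
for item in resp["Items"]:
    print(item["created_at"], item["amount"])
```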
A hot partition is a partition that receives significantly more traffic than the others. Your overall cluster might have plenty of capacity — 90% of nodes are idle — but the hot partition is the bottleneck, and you can't scale past it without redesigning the data model.
"One node at 100% while nine others sit at 10% means you have a 10-node cluster with one node of effective capacity."
They appear when your data's access pattern is not uniform — which is almost always the case in real systems.
Celebrity / power user problem. Imagine a social network where you partition posts by author_id. A normal user might have 200 followers. A celebrity might have 40 million. Every time they post, every follower's feed query hits that one partition. You've put a celebrity's data on one machine that now handles requests proportional to their audience.
Temporal hot spots. Partition an event log by timestamp. All new writes go to the partition holding "now". It doesn't matter how many nodes you have — the write-hot partition is always the newest one.
Business hot spots. A Black Friday sale might cause 80% of all orders to go through ten SKUs that are all in the same partition. A viral moment can send millions of users to the same product page.
Hot partitions hide until they cause pain, because cluster-wide averages look healthy while one partition burns. You need to actively look for them: watch per-partition metrics (request rate, CPU, latency), not just aggregates. Once you've found one, there are several ways to deal with it:
Option 1: Change the partition key (the real fix). If you partitioned by author_id and celebrities are killing you, consider partitioning by something that distributes more evenly. This usually requires a significant migration.
Option 2: Add a random suffix to hot keys. For a hot key like celebrity:12345, instead of one partition, spread it across, say, 10 virtual partitions: celebrity:12345:0 through celebrity:12345:9. Writes pick one at random. Reads must query all 10 and merge. This is called key salting.
Key salting for a hot celebrity post:

Without salting:
  All reads for celebrity 12345 → single partition (bottleneck)

With salting (N=10):
  Write: pick random suffix 0-9
    "post:12345:3" → partition A
    "post:12345:7" → partition B
    "post:12345:1" → partition C
    ...
  Read: query all 10 partitions, merge results
    10× read amplification, but no single hot partition

Trade-off: reads become more expensive, writes become cheaper per partition.
Key salting splits a single hot key across multiple partitions at the cost of read amplification.
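A minimal sketch of the salting helpers, assuming a hypothetical store client and a fixed salt count of 10:

```python
import random

N_SALTS = 10   # how many ways to split the hot key

def salted_write_key(hot_key: str) -> str:
    # Writes pick a random suffix, so they spread across N_SALTS partitions.
    return f"{hot_key}:{random.randrange(N_SALTS)}"

def salted_read_keys(hot_key: str) -> list[str]:
    # Reads must query every salt and merge the results (read amplification).
    return [f"{hot_key}:{i}" for i in range(N_SALTS)]

# Hypothetical storage calls (store.put / store.get) standing in for your client:
# store.put(salted_write_key("post:12345"), payload)
# rows = [r for k in salted_read_keys("post:12345") for r in store.get(k)]
```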
Option 3: Cache aggressively in front of the hot partition. If the hot partition is read-dominated (as with a celebrity's profile), putting a cache in front can absorb most of the load without changing the data model. This doesn't help if the hot spot is write-dominated.
Option 4: Denormalize and pre-compute. Instead of reading from the hot partition at query time, pre-compute the result and store it somewhere less hot. Fan-out on write instead of fan-in on read. Twitter famously moved from fan-in (query a celebrity's partition at read time) to fan-out (push tweets to followers' timelines at write time) for exactly this reason.
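A sketch of the fan-out-on-write idea, with a hypothetical timeline_store standing in for whatever partitioned storage holds per-user timelines:

```python
def publish_post(author_id: str, post_id: str, followers: list[str], timeline_store) -> None:
    # The expensive work happens once, at post time. Each follower's timeline
    # lives in that follower's partition, so the writes spread across the cluster.
    for follower_id in followers:
        timeline_store.append(key=f"timeline:{follower_id}", value=post_id)

def read_timeline(user_id: str, timeline_store) -> list[str]:
    # Reading a home timeline is now a single-partition lookup,
    # never a query against the author's (hot) partition.
    return timeline_store.get(key=f"timeline:{user_id}")
```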
Your primary access pattern is efficient with the right partition key. But what about the secondary patterns? "Find all users in city X" or "find all products with price under $50"? These are queries that don't use the partition key, so they can't route to a single partition. This is the secondary index problem.
There are two approaches, and they make opposite tradeoffs.
Each partition maintains its own secondary index, covering only the data in that partition. When you write a record, you update the index on the same partition — one write, one partition, no cross-partition coordination.
Partition 1                          Partition 2
┌──────────────────────────┐         ┌──────────────────────────┐
│ Data: user_ids 1-1000    │         │ Data: user_ids 1001-2000 │
│                          │         │                          │
│ Local index:             │         │ Local index:             │
│ city:NY → [u1, u3, u7]   │         │ city:NY → [u1500]        │
│ city:CA → [u2, u5, u9]   │         │ city:CA → [u1200, u1807] │
└──────────────────────────┘         └──────────────────────────┘

Write "user u11, city=NY":
→ goes to Partition 1 only (user_id 11 is in its range)
→ Partition 1 updates its local city index
→ one partition touched ✓

Query "all users in NY":
→ must query ALL partitions (any of them might hold NY users)
→ merge results
→ scatter-gather ✗
Local indexes make writes cheap and reads expensive for non-partition-key queries.
This is how Cassandra's secondary indexes and MongoDB's per-shard indexes work, and it is the model Elasticsearch follows for its shards. Writes are cheap. But queries on the secondary index must fan out to all partitions and collect results: scatter-gather. For large clusters, this can mean hitting hundreds of nodes for a single query.
Instead of each partition holding its own local index, you build a global index and partition the index itself. The index for "users in city NY" lives on one specific index partition — not scattered across all data partitions.
Data partitions (by user_id) Index partitions (by city)
┌───────────────────────────┐ ┌──────────────────────────┐
│ P1: user_ids 1-1000 │ │ Index P1: cities A-M │
│ P2: user_ids 1001-2000 │ │ city:Chicago→[u45,u910] │
│ P3: user_ids 2001-3000 │ │ city:Miami→[u3,u1500] │
└───────────────────────────┘ ├──────────────────────────┤
│ Index P2: cities N-Z │
│ city:NY→[u7,u23,u890] │
│ city:Seattle→[u201] │
└──────────────────────────┘
Query "all users in NY":
→ Hit Index P2 (single partition) → get list of user_ids
→ Fetch those user_ids from data partitions
→ index read = single partition ✓
Write "user u11, city=NY":
→ Update data in data partition P1 (user_ids 1-1000)
→ Update index in Index P2 (cities N-Z)
→ two partitions touched, no atomic transaction ✗
Global indexes make reads fast but writes require updating two separate partitions.
Global indexes make reads on the secondary key much faster — often a single partition. But writes are more complex. Writing a record now requires updating two partitions: the data partition and the index partition. These two updates cannot be atomic without a distributed transaction. Most systems handle this by making global indexes eventually consistent — the index will catch up, but there is a window where it lags behind the data.
DynamoDB's global secondary index is eventually consistent for exactly this reason. This is not a design flaw: it is the correct trade-off, given that synchronous cross-partition coordination is expensive and fragile.
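A sketch of why the index lags: the data write and the index update are two separate steps against two different partitions, with an asynchronous hand-off in between. The data_store, index_queue, and index_store objects below are hypothetical stand-ins, not a real API.

```python
def write_user(data_store, index_queue, user_id: str, city: str) -> None:
    # Step 1: the data write commits on the data partition that owns user_id.
    data_store.put(partition_key=user_id, value={"user_id": user_id, "city": city})
    # Step 2: the index update is handed off asynchronously. It is NOT atomic
    # with step 1: if we crash here, the index lags until a repair/replay
    # mechanism catches it up.
    index_queue.enqueue({"term": f"city:{city}", "user_id": user_id})

def index_worker(index_queue, index_store) -> None:
    # Applies queued updates to the index partition that owns each term.
    for event in index_queue.consume():
        index_store.append(key=event["term"], value=event["user_id"])
```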
Partitioning gives you write scalability and storage capacity. What it takes away is the ability to easily work with data that lives on different nodes. This is not optional complexity you can design around — it is the fundamental cost of partitioning.
When a query needs data from multiple partitions, the coordinator sends the query to each relevant partition in parallel, waits for all of them to respond, and merges the results. This is called scatter-gather or fan-out.
The problem with scatter-gather is tail latency. Even if 99% of your partitions respond in 5ms, if one is slow (due to load, garbage collection, or network jitter), your query waits for it. The more partitions you fan out to, the higher the chance you hit a slow one. Google's research showed that large fan-out operations routinely hit the 1% tail latency because with enough servers, the 1% happens with certainty.
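A small asyncio simulation makes the tail-latency effect visible. The latency numbers are made up; the point is that the fan-out's total time is governed by its slowest member, not the average.

```python
import asyncio
import random

async def query_partition(partition_id: int) -> list[str]:
    # Simulated per-partition latency: usually fast, occasionally slow
    # (GC pause, overloaded node, network jitter).
    delay = 0.005 if random.random() < 0.99 else 0.250
    await asyncio.sleep(delay)
    return [f"row-from-p{partition_id}"]

async def scatter_gather(num_partitions: int) -> list[str]:
    # Fan out to every partition in parallel, then merge. Total latency equals
    # the latency of the slowest partition.
    results = await asyncio.gather(*(query_partition(p) for p in range(num_partitions)))
    return [row for part in results for row in part]

rows = asyncio.run(scatter_gather(100))
# With 100 partitions and a 1% chance of a 250 ms straggler, most requests
# (1 - 0.99**100 ≈ 63%) include at least one slow partition.
```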
In a relational database, joining two tables is handled by the query engine. In a partitioned system, if the data for the join lives on different nodes, you need to move data across the network to do the join. There are two approaches:
Repartition join (shuffle join): Move data to co-locate the join keys. Both sides are rehashed on the join key so matching records end up on the same node. This is what MapReduce and Spark do. Very expensive for OLTP, standard practice for analytical batch workloads.
Broadcast join: If one side of the join is small, send a copy of the entire small side to every partition. Each partition then joins locally. Spark calls this a "broadcast join". Only works when one side is small enough to fit in memory on every node.
Neither is free. For OLTP systems, the practical advice is: design your data model so that the common join doesn't need to cross partitions. Co-locate data that is often queried together under the same partition key.
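For the analytical case, a broadcast join is usually a one-line hint. Here is a hedged PySpark sketch; the S3 paths and join column are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical inputs: a large partitioned fact table and a small dimension table.
orders = spark.read.parquet("s3://bucket/orders/")        # large, partitioned
products = spark.read.parquet("s3://bucket/products/")    # small enough to broadcast

# broadcast() hints Spark to ship the small side to every executor, so each
# partition of `orders` joins locally with no shuffle of the large side.
joined = orders.join(broadcast(products), on="product_id")

# Without the hint (and above spark.sql.autoBroadcastJoinThreshold), Spark
# falls back to a shuffle join: both sides are repartitioned on product_id.
```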
This is the hardest case. You want to atomically update records on two different partitions — either both changes happen or neither does. Without this guarantee, you can end up in inconsistent states.
The standard mechanism is two-phase commit (2PC): a coordinator asks all involved partitions to "prepare" the change, and only commits if all agree. It works, but it is slow (two round-trips plus locking), and it fails badly when the coordinator crashes mid-transaction.
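In outline, the coordinator's side of 2PC looks like this. The participants and their prepare/commit/abort methods are hypothetical stand-ins, not a real library.

```python
def two_phase_commit(participants, txn) -> bool:
    # Phase 1: every partition must promise it can apply the change
    # (typically by durably logging it and holding locks).
    votes = [p.prepare(txn) for p in participants]
    if not all(votes):
        for p in participants:
            p.abort(txn)
        return False
    # Phase 2: once all have voted yes, the decision is final. If the
    # coordinator crashes here, participants sit holding locks until it
    # recovers -- the failure mode that makes 2PC fragile.
    for p in participants:
        p.commit(txn)
    return True
```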
Most high-throughput distributed systems avoid 2PC entirely. The common alternatives are to design the data model so that related records share a partition (keeping transactions single-partition), to accept eventual consistency and repair inconsistencies asynchronously, or to use sagas: a chain of single-partition transactions with compensating actions that undo earlier steps if a later one fails.
Over time, your cluster will change. You'll add nodes when you outgrow your current capacity. You'll remove nodes when hardware fails or you downsize. Data distributions will shift as your application evolves. Rebalancing is the process of moving data around to restore even distribution.
The challenge is that rebalancing must happen while the system is still serving traffic. You can't take a maintenance window on a database that's serving production load. This means:
More partitions than nodes. The classic mistake is having one partition per node. When you add a node, you have to split a partition in half and move data — messy and slow. If instead you create many more partitions than you have nodes (say, 1000 partitions on 10 nodes), adding a node just means moving some whole partitions to it. No splits needed. Cassandra, MongoDB, and Elasticsearch all default to this approach.
1 partition per node (fragile):
┌─────┐ ┌─────┐ ┌─────┐
│ P1 │ │ P2 │ │ P3 │ ← adding a node requires splitting partitions
│ N1 │ │ N2 │ │ N3 │
└─────┘ └─────┘ └─────┘
Many partitions per node (robust):
┌──────────────────────┐ ┌──────────────────────┐ ┌────────────────────┐
│ P1, P2, P3, P4, P5 │ │ P6, P7, P8, P9, P10 │ │ P11, P12, P13 ... │
│ N1 │ │ N2 │ │ N3 │
└──────────────────────┘ └──────────────────────┘ └────────────────────┘
Add N4: move P1, P6, P11 (and a few others) to N4.
No splits. Minimal data movement. Clean.
More partitions than nodes makes adding nodes a simple partition move, not a split.
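A toy sketch of the idea: with a fixed pool of 1,000 partitions, adding a node is just a reassignment of whole partitions until the counts even out. Real systems also weigh node capacity, respect rack awareness, and rate-limit the transfers.

```python
NUM_PARTITIONS = 1000   # fixed at cluster creation, far more than nodes

def rebalance(assignment: dict[int, str], nodes: list[str]) -> dict[int, str]:
    """Move whole partitions from over-loaded nodes onto under-loaded ones
    until every node holds roughly NUM_PARTITIONS / len(nodes)."""
    target = NUM_PARTITIONS // len(nodes)
    counts = {n: 0 for n in nodes}
    for node in assignment.values():
        counts[node] += 1
    new_assignment = dict(assignment)
    for p, node in assignment.items():
        if counts[node] > target:
            under = min(counts, key=counts.get)      # least-loaded node
            if counts[under] < target:
                new_assignment[p] = under            # move the whole partition
                counts[node] -= 1
                counts[under] += 1
    return new_assignment

# Start with 3 nodes, then add N4:
initial = {p: ["N1", "N2", "N3"][p % 3] for p in range(NUM_PARTITIONS)}
after = rebalance(initial, ["N1", "N2", "N3", "N4"])
moved = sum(1 for p in initial if initial[p] != after[p])
print(f"{moved} of {NUM_PARTITIONS} partitions move")   # 250 here: ~1/4, no splits
```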
Rate-limit the rebalance. Moving data consumes disk I/O and network. If unconstrained, a rebalance can saturate your disks and degrade the entire cluster's latency for hours. Good systems allow you to configure a maximum rebalance rate so it happens gradually in the background.
Don't trigger rebalancing automatically on transient failures. If a node goes down for 30 seconds due to a garbage collection pause and you immediately start rebalancing, you're doing unnecessary work when it comes back. Good systems wait for a configurable timeout before deciding a node is truly gone. Cassandra defaults to a "never auto-rebalance" approach; operators trigger it manually after a topology change is confirmed.
When you sit down to design the partitioning for a new system, work through these questions in order:
1. Do you actually need partitioning? A single well-tuned relational database can handle tens of thousands of writes per second and store terabytes. Partitioning adds enormous complexity. If you're not genuinely constrained by data volume or write throughput, a single-node or replicated setup may be the simpler, better choice.
2. What are your access patterns? Write down the 5 most common queries and 5 most common writes. Rank them by frequency. The partition key should make the top items fast.
3. Range scans or not? If yes, range partitioning. If no, hash partitioning. If both — this is where compound keys or denormalization can help.
4. What data should be co-located? Data that is almost always queried together should be in the same partition. This often means putting all of a user's data under user_id, all of an order's line items under order_id, and so on.
5. Where will your hot spots be? Name them explicitly. For every obvious hot spot, decide how you'll handle it before it hits production.
6. What are your cross-partition operations? For each one, decide: can you redesign to avoid it, or can you tolerate the cost? A system with three scatter-gather queries on every page load will struggle at scale. A system with zero cross-partition operations will have a constrained data model. The answer is somewhere in between, and it depends on your traffic and latency targets.
Choose the partition key that makes your most common operation hit exactly one partition — everything else follows from that decision.
The most common mistake is choosing a partition key that makes the data model feel natural (like a timestamp or a sequential ID) rather than one that matches the actual query patterns, and discovering the resulting hot spots six months later, when the system is under production load and hard to migrate.