Chapter 8: System Design Interview Mastery

8.18 Design a key-value store

Here is the step-by-step design for a distributed Key-Value Store, similar to Amazon DynamoDB or Apache Cassandra.

1. Problem Definition and Scope

We are designing a distributed Key-Value (KV) store. This is a NoSQL database that allows applications to store data as a collection of key-value pairs, where the key is a unique identifier and the value is opaque binary data (the system does not need to understand the content).

  • Main User Groups:

    • Application Developers: They use the KV store as the backend for microservices (e.g., storing user sessions, shopping carts, or product preferences).

    • System Administrators: They manage the cluster hardware, although the system should be largely self-healing.

  • Scope: We will focus on the backend architecture required to store, retrieve, and replicate data at scale.

    • In Scope: Data partitioning (sharding), replication, consistency models (tunable), read/write paths, and handling node failures.

    • Out of Scope: Multi-region replication (we will focus on a single region with multiple availability zones), complex SQL querying, billing, and user authentication.

2. Clarify functional requirements

Must Have:

  • Put(key, value): Users can store or update a value associated with a key.

  • Get(key): Users can retrieve the value associated with a key.

  • Delete(key): Users can remove a key-value pair.

  • Configurable Consistency: Users can choose between strong consistency (slower) or eventual consistency (faster) for each request.

  • High Availability: The system must accept writes even if some nodes are down.

Nice to Have:

  • Versioning: Support for version numbers or timestamps to handle concurrent updates.
  • TTL (Time To Live): Automatically delete keys after a specified expiration period.


3. Clarify non-functional requirements

  • Scale: Support 100 TB of data and billions of keys.

  • Throughput: 100,000 queries per second (QPS).

  • Latency: Ultra-low. Reads and writes should complete in single-digit milliseconds (e.g., < 10 ms at the 99th percentile).

  • Availability: "Five nines" (99.999%). We prioritize Availability over Consistency (AP system in CAP theorem) by default, but this is tunable.

  • Durability: Zero data loss. Data must be replicated across multiple physical nodes (typically 3).

  • Scalability: Horizontal scaling. We should be able to add cheap servers to the cluster to handle growth.


4. Back of the envelope estimates

  • Storage Requirements:
    • Average object size: 2 KB.
    • Total Data Target: 100 TB.
    • Number of records: 100 TB / 2 KB = 50 Billion records.
    • Replication: We will store 3 copies of every piece of data.
    • Total Raw Storage = 100 TB * 3 = 300 TB.
  • Throughput & Bandwidth:
    • QPS: 100,000 requests/sec.
    • Let's assume 50% Reads / 50% Writes.
    • Incoming Bandwidth: $100,000 \times 2 \text{ KB} = 200 \text{ MB/sec}$.
    • Internal Replication Bandwidth: Write traffic ($50{,}000 \times 2 \text{ KB} = 100 \text{ MB/s}$) $\times$ 3 replicas = 300 MB/s of internal traffic just for writes.
  • Node Estimation:
    • If one modern server handles 2 TB of disk and 5,000 QPS safely:
    • Storage needs: $300 \text{ TB} / 2 \text{ TB} = 150 \text{ servers}$.
    • QPS needs: $100,000 / 5,000 = 20 \text{ servers}$.
    • We go with the higher number (storage bound). We need roughly 150 servers.
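As a sanity check, the arithmetic above can be reproduced in a few lines. The 2 KB object size, 3× replication, and per-server limits of 2 TB / 5,000 QPS are the assumptions already stated above:

```python
# Back-of-the-envelope check for the numbers above.
# Assumptions (from the text): 2 KB objects, 100 TB of logical data,
# 3x replication, one server handles 2 TB of disk and 5,000 QPS.
TB = 10**12
KB = 10**3

object_size = 2 * KB
logical_data = 100 * TB
replication_factor = 3
qps = 100_000

records = logical_data // object_size            # 50 billion records
raw_storage = logical_data * replication_factor  # 300 TB raw
incoming_bw = qps * object_size                  # 200 MB/s inbound

servers_for_storage = raw_storage // (2 * TB)    # 150 servers
servers_for_qps = qps // 5_000                   # 20 servers
servers_needed = max(servers_for_storage, servers_for_qps)

print(records, raw_storage // TB, servers_needed)  # 50000000000 300 150
```

The cluster is storage-bound, not QPS-bound, which is typical for KV stores holding cold-ish data.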


5. API design

We will use a RESTful API for simplicity, though internal communication between nodes would use gRPC for performance.

1. Put Data

  • Method: PUT /v1/kv/{key}
  • Body: { "value": "base64_encoded_data", "consistency": "QUORUM", "ttl": 3600 }
  • Response: 204 No Content (or 409 Conflict if version mismatch).

2. Get Data

  • Method: GET /v1/kv/{key}
  • Params: ?consistency=ONE (Optional)
  • Response: { "key": "user_123", "value": "...", "version": 102 }
  • Error: 404 Not Found.

3. Delete Data

  • Method: DELETE /v1/kv/{key}
  • Response: 204 No Content.

6. High level architecture

We will use a Leaderless Architecture (similar to DynamoDB or Cassandra). This means there is no single "Master" node. All nodes are peers. This removes the single point of failure and improves write availability.

Architecture Flow:

Client -> Load Balancer -> Coordinator Node -> Storage Nodes (Replicas)

  • Client: The application using our SDK.

  • Load Balancer: Distributes incoming HTTP requests to any available node in the cluster.

  • Coordinator Node: Any node in the cluster can act as a coordinator. It receives a request, calculates which specific nodes hold the data (based on the key), and forwards the request to them.

  • Storage Nodes: The servers that physically store the data on disk.

  • Gossip Protocol: A background communication mechanism where nodes talk to each other to share cluster state (which nodes are up/down).


7. Data model

We are building a KV store, but we need to define how the data is organized internally.

Storage Engine: LSM Trees (Log-Structured Merge Trees)

We will use LSM Trees (like LevelDB or RocksDB) instead of B-Trees.

  • Why: LSM trees are optimized for write-heavy workloads. Writes are appended sequentially to a log file (very fast) instead of requiring the random disk I/O that B-Tree updates incur.

Internal Schema (per node):

We don't use SQL tables. We use a sorted map structure on disk (SSTables).

  • Key: The partition key (hashed).
  • Value: Binary blob.
  • Timestamp: A wall-clock timestamp or vector clock used for conflict resolution.
  • Tombstone: A boolean flag marking if a record was deleted.

8. Core flows end to end

This is the most critical part of the design. To understand the flows, we must first define three variables for tunable consistency:

  • N: The number of replicas (usually 3).
  • W: Write Quorum (how many nodes must confirm a write).
  • R: Read Quorum (how many nodes must respond to a read).
  • Rule: If $R + W > N$, the read and write quorums must overlap in at least one node, so we guarantee Strong Consistency (a read will always return the latest write).
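The quorum rule is simple enough to encode directly; a minimal sketch:

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """R + W > N forces the read quorum and write quorum to overlap in
    at least one replica, so every read sees the latest committed write."""
    return r + w > n

# N=3: quorum reads and writes (R=2, W=2) overlap -> strong consistency.
# R=1, W=1 can miss the latest write -> eventual consistency only.
print(is_strongly_consistent(3, 2, 2), is_strongly_consistent(3, 1, 1))  # True False
```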

Flow 1: Write Path (PUT key, value)

This flow explains how data travels from the user to the disk safely.

  1. Request Arrival: The Client sends a PUT request. It hits the Load Balancer, which forwards it to a random node in the cluster (e.g., Node 1). Node 1 becomes the Coordinator for this specific request.

  2. Mapping the Key: The Coordinator hashes the key (e.g., hash("user_123") = 0x5A...). It looks at the Consistent Hash Ring map to see which node owns this hash range. It finds that Node A is the owner.

  3. Identifying Replicas: Since our Replication Factor ($N$) is 3, the Coordinator identifies Node A and its two clockwise neighbors, Node B and Node C, as the targets.

  4. Parallel Send: The Coordinator sends the data to Node A, Node B, and Node C simultaneously.

  5. Node Processing (The "Write"):

    • Inside Node A (and B, C), the data is first written to a Commit Log (append-only file) on the disk. This ensures that if the power fails instantly, the data is safe.

    • Next, the data is written to the MemTable (system RAM). This is very fast.

    • The Node sends a "Success" signal back to the Coordinator.

  6. Quorum Check: The Coordinator waits. If $W=2$ (Quorum), it waits for the first 2 success signals.

  7. Success Response: Once 2 nodes have confirmed, the Coordinator returns a success response (204 No Content) to the Client. The Client sees the write as "done," even if the 3rd node is still processing in the background.
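The quorum-wait logic of this write path can be sketched as follows. The Replica class and its put method are illustrative stand-ins for real storage nodes, and a real coordinator would contact replicas in parallel rather than in a loop:

```python
class Replica:
    """In-process stand-in for a storage node (illustrative only)."""
    def __init__(self, up=True):
        self.up = up
        self.store = {}          # stands in for commit log + MemTable

    def put(self, key, value):
        if not self.up:
            return False         # node is down: no acknowledgement
        self.store[key] = value
        return True

def replicate_write(replicas, key, value, w):
    """Coordinator sketch: forward the write to all N replicas and
    report success as soon as W acknowledgements arrive."""
    acks = 0
    for replica in replicas:
        if replica.put(key, value):
            acks += 1
        if acks >= w:
            return True
    return False

nodes = [Replica(), Replica(up=False), Replica()]  # one node is down
print(replicate_write(nodes, "user_123", "cart", w=2))  # True
```

Because W=2 < N=3, the write still succeeds with one replica down, which is exactly the availability property this flow is designed for.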


Flow 2: Read Path (GET key)

This flow explains how we handle potentially conflicting data from different nodes.

  1. Request Arrival: Client sends GET to the Load Balancer, which forwards to a Coordinator.

  2. Scatter: The Coordinator hashes the key, finds the 3 responsible nodes (A, B, C), and sends a read request to all of them.

  3. Gather & Compare:

    • The Coordinator waits for $R$ responses (e.g., 2 responses).
    • Suppose Node A returns: { value: "Hello", version: 1 }.
    • Suppose Node B returns: { value: "Hola", version: 2 }.
  4. Conflict Resolution: The Coordinator compares the versions. Version 2 is higher (newer) than Version 1. The Coordinator trusts the data with Version 2.

  5. Read Repair (Self-Healing):

    • Before replying to the user, or asynchronously in the background, the Coordinator sends the new data ("Hola", v2) to Node A.
    • This forces Node A to update its stale data. This is how the system "heals" itself over time.
  6. Final Response: The Coordinator returns "Hola" to the Client.
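The gather/compare/repair steps above can be sketched in a few lines. Records are modeled as illustrative (version, value) tuples, and the Replica class is a hypothetical in-process stand-in:

```python
class Replica:
    """In-process stand-in for a storage node (illustrative only)."""
    def __init__(self, store=None):
        self.store = store or {}

    def get(self, key):
        return self.store.get(key)

    def put(self, key, record):
        self.store[key] = record

def coordinator_read(replicas, key, r):
    """Gather R responses, pick the highest version, and read-repair
    any replica that returned stale data."""
    responses = []
    for replica in replicas:
        record = replica.get(key)
        if record is not None:
            responses.append((replica, record))
        if len(responses) >= r:
            break
    if not responses:
        return None
    winner = max(rec for _, rec in responses)   # (version, value): newest wins
    for replica, rec in responses:
        if rec < winner:
            replica.put(key, winner)            # read repair (self-healing)
    return winner[1]

a = Replica({"greeting": (1, "Hello")})   # stale copy
b = Replica({"greeting": (2, "Hola")})    # newest copy
print(coordinator_read([a, b], "greeting", r=2))  # Hola
print(a.store["greeting"])                        # (2, 'Hola') -- repaired
```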


Flow 3: Delete Path

  1. Soft Delete: We cannot simply delete the record from disk immediately, because a replica that missed the delete could later resurrect it by syncing the old value back during repair.

  2. Tombstone: Instead, we write a special object called a "Tombstone" with the key. It basically says, "This key is dead as of timestamp X."

  3. Compaction: Later, a background process (Garbage Collection) scans the disk. If it sees a Tombstone that is older than a certain threshold (e.g., 10 days), it permanently removes the data to reclaim space.
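The tombstone lifecycle can be sketched like this, with an in-memory dict standing in for an SSTable and the 10-day grace period from the example above (the helper names are illustrative):

```python
TOMBSTONE = object()          # sentinel marking a deleted key
GC_GRACE = 10 * 24 * 3600     # keep tombstones for 10 days (in seconds)

def delete(sstable, key, now):
    """Soft delete: write a tombstone instead of removing the record."""
    sstable[key] = (now, TOMBSTONE)

def compact(sstable, now):
    """Background GC: drop tombstones older than the grace period
    to permanently reclaim space."""
    return {k: (ts, v) for k, (ts, v) in sstable.items()
            if not (v is TOMBSTONE and now - ts > GC_GRACE)}

table = {"cart_42": (0, "3 items")}
delete(table, "cart_42", now=100)             # key is now a tombstone
table = compact(table, now=100 + GC_GRACE + 1)
print(table)  # {} -- the key is gone for good
```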


9. Caching and read performance

Since disk I/O is slow, we need layers to speed up reads.

  • MemTable (Level 1 Cache): Every write goes into memory (MemTable) first. When we read, we check the MemTable first. If the data was recently written, it is returned instantly from RAM.

  • Block Cache (Level 2 Cache): We allocate a portion of the server's RAM to cache uncompressed blocks of data from the disk (SSTables) that are frequently read.

  • Bloom Filters:

    • Problem: To check if a key exists on disk, we might have to scan many files.
    • Solution: A Bloom Filter is a memory-efficient structure that answers "Definitely No" or "Maybe Yes".
    • Before touching the disk, we ask the Bloom Filter: "Do you have this key?" If it says No, we skip the disk read entirely. This drastically reduces latency for keys that don't exist.
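A minimal Bloom filter sketch illustrates the "Definitely No / Maybe Yes" behavior. Real engines use purpose-built implementations with tuned bit-array size m and hash count k; this version derives its k hash positions from SHA-256 purely for illustration:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    Answers 'definitely not present' or 'maybe present'."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0                       # integer used as a bit array

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user_123")
print(bf.might_contain("user_123"))   # True -- never a false negative
print(bf.might_contain("user_999"))   # almost certainly False
```

If might_contain returns False we can skip the disk entirely; a True still requires checking the SSTable, since false positives are possible (though rare at sensible m and k).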

10. Storage, indexing and media

Primary Storage (SSTables):

  • Data starts in the MemTable (RAM).

  • When the MemTable is full (e.g., reaches 64MB), it is flushed to disk as an SSTable (Sorted String Table).

  • SSTables are Immutable (never modified).

  • Updates create a new SSTable with the new value.
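The MemTable-to-SSTable flush path can be sketched as follows. A 3-entry threshold stands in for the 64 MB limit, and the commit log is omitted for brevity:

```python
class LSMNode:
    """Sketch of the MemTable -> SSTable flush path. Real engines
    (LevelDB, RocksDB) also append to a commit log before the MemTable."""
    FLUSH_THRESHOLD = 3   # entries; real systems use bytes (e.g. 64 MB)

    def __init__(self):
        self.memtable = {}
        self.sstables = []   # newest last; each is an immutable sorted list

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        # SSTable = Sorted String Table: keys written out in sorted order
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # check RAM first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None

node = LSMNode()
for i in range(4):
    node.put(f"k{i}", i)
print(len(node.sstables), node.get("k0"), node.get("k3"))  # 1 0 3
```

Note that get checks the newest data first, which is why updates can simply land in a new SSTable: older copies are shadowed until compaction discards them.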

Compaction:

  • Over time, we have many SSTables (e.g., Level 0, Level 1).
  • A background process merges these files, keeps only the latest version of a key, and discards old overwritten values or tombstones.

Media:

  • Our KV store is for small binary data (User Profiles, Metadata, Carts).

  • We do not store large images or videos here.

  • We store media in Object Storage (like S3) and store the S3 URL in our KV store.

11. Scaling strategies

How do we grow from 150 nodes to 1,000 nodes?

1. Virtual Nodes (VNodes):

  • Problem: If we simply place 150 nodes on the hash ring, removing one node creates a huge gap that only one neighbor has to fill. That neighbor will be overwhelmed.

  • Solution: We divide the hash ring into many small segments.

  • Each physical server hosts many "Virtual Nodes" (e.g., Server A handles segments 5, 200, and 800).

  • When we add a new server, it "steals" a few small segments from every other server. This balances the load instantly and evenly.
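A consistent-hash ring with virtual nodes can be sketched as below. The choice of 100 vnodes per server (and MD5 as the ring hash) is illustrative; real systems also walk clockwise from the owner to pick the N-1 successor replicas:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes: each physical server owns
    many small segments, so adding or removing a server only moves a
    small, evenly spread slice of keys."""
    def __init__(self, vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, server) points

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def owner(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)  # clockwise
        return self.ring[idx][1]

ring = HashRing()
for name in ("A", "B", "C"):
    ring.add_server(name)
print(ring.owner("user_123") in {"A", "B", "C"})  # True
```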

2. Gossip Protocol:

  • We don't want a centralized "Master" registry that can fail.

  • Every second, each node picks a random other node and says: "Here is the list of nodes I know and their status."

  • Information spreads like a virus (gossip). Within seconds, the whole cluster knows if a new node joined or an old one died.
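One gossip exchange reduces to merging two version maps, keeping whichever entry has the higher heartbeat for each node; a sketch (the view format is illustrative):

```python
def gossip(view_a, view_b):
    """One gossip exchange: for each node, keep the (heartbeat, status)
    entry with the highest heartbeat seen so far. Repeated exchanges
    make all views converge."""
    merged = dict(view_a)
    for node, (heartbeat, status) in view_b.items():
        if node not in merged or heartbeat > merged[node][0]:
            merged[node] = (heartbeat, status)
    return merged

a = {"node1": (5, "UP"), "node2": (3, "UP")}
b = {"node2": (7, "DOWN"), "node3": (1, "UP")}
view = gossip(a, b)
print(view)  # node2 is DOWN (heartbeat 7 wins), node3 is now known
```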

12. Reliability, failure handling and backpressure

Handling Node Failures (Hinted Handoff):

  • If Node A is the owner of a key, but Node A is temporarily down:
    • The Coordinator writes the data to Node D (a neighbor) with a "Hint".
    • Note: "Please give this back to Node A when it comes back online."
    • This ensures writes don't fail even if the owner is down (Sloppy Quorum).

Anti-Entropy (Merkle Trees):

  • Nodes need to check if they are in sync without transferring terabytes of data.
  • They use Merkle Trees (hash trees). They compare the root hash. If it matches, data is identical. If not, they compare children.
  • This allows them to pinpoint exactly which single key is different and sync only that key.
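The root comparison can be sketched over a list of leaf hashes. A real anti-entropy pass recurses into mismatching subtrees to pinpoint the differing key range; only the root check is shown here:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Build a Merkle root: hash each leaf, then repeatedly hash
    adjacent pairs (duplicating the last node on odd-sized levels)
    until one root hash remains."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Two replicas compare roots first; only on a mismatch do they recurse.
replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]
print(merkle_root(replica_a) == merkle_root(replica_a))  # True: in sync
print(merkle_root(replica_a) == merkle_root(replica_b))  # False: diverged
```

Comparing one 32-byte root replaces comparing the full dataset; only a divergence forces any further traffic.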

Backpressure:

  • If a node is overwhelmed (write queue is full), it should reject requests immediately (503 Overloaded) rather than crashing. The client SDK should handle this with Exponential Backoff.
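A client SDK's retry delay can be sketched as exponential backoff with full jitter (the base and cap values here are illustrative):

```python
import random

def backoff_delay(attempt, base=0.05, cap=5.0):
    """Exponential backoff with full jitter: after a 503, wait a random
    time between 0 and min(cap, base * 2^attempt) before retrying.
    Jitter prevents all clients from retrying in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Upper bounds grow roughly 0.05s, 0.1s, 0.2s, ... capped at 5s.
delays = [backoff_delay(a) for a in range(8)]
print(all(0 <= d <= 5.0 for d in delays))  # True
```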

13. Security, privacy and abuse

  • Encryption at Rest: All SSTables written to disk are encrypted using AES-256.

  • Encryption in Transit: All communication (Client to Node, and Node to Node) happens over TLS/SSL.

  • Mutual TLS (mTLS): Nodes authenticate each other using certificates to prevent rogue servers from joining the cluster.

  • Rate Limiting: To prevent "noisy neighbor" problems, we implement token-bucket rate limiters per partition. If one customer sends 50k QPS to a single key, we throttle them so they don't crash the shard.
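A per-partition token bucket can be sketched as below. The rate and capacity are illustrative, and time is passed in explicitly so the behavior is deterministic:

```python
class TokenBucket:
    """Per-partition token-bucket limiter: tokens refill at `rate` per
    second up to `capacity`; each request spends one token."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should reply 503 and back off

bucket = TokenBucket(rate=10, capacity=5)     # 10 req/s, burst of 5
burst = [bucket.allow(now=0.0) for _ in range(6)]
print(burst)  # first 5 allowed, the 6th throttled
```

After 0.1 s one token has refilled, so the next allow(now=0.1) succeeds; a hot partition is throttled without affecting its neighbors.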

14. Bottlenecks and next steps

Current Bottlenecks:

  • Hot Keys: If a celebrity tweets, millions of people read one key. This hits one specific partition hard.
  • Repair Traffic: If a node dies and comes back, syncing data (Hinted Handoff + Merkle Tree repair) consumes lots of network bandwidth, slowing down regular requests.

Next Steps / Improvements:

  • Caching Layer: Add a dedicated distributed cache (like Redis) in front of the KV store for ultra-hot keys (The "Celebrity" problem).
  • Rack Awareness: Configure the Hash Ring so that replicas are placed on different physical racks or availability zones. This prevents data loss if a whole rack loses power.

Summary

  1. Architecture: Leaderless, peer-to-peer ring using Consistent Hashing.
  2. Storage: LSM Trees on disk for high write throughput.
  3. Consistency: Tunable (Quorum reads/writes) with eventual consistency repair mechanisms (Read Repair, Hinted Handoff).
  4. Scale: Uses Virtual Nodes to distribute data evenly across hundreds of servers.