## 2.1 Storage Fundamentals
Before you can choose a database, you need to understand the different ways computers store raw data.
Not every piece of data belongs in a database.
Some data belongs in object storage. Some belongs on a distributed file system.
Knowing the difference saves you from forcing a square peg into a round hole.
**Block Storage, Object Storage, and File Storage**
These are the three fundamental approaches to storing data, and each one organizes information differently.
Block storage divides data into fixed-size chunks called blocks and stores them on a disk. Each block has an address, and the storage system can read or write individual blocks without touching the rest. Your computer's hard drive or SSD uses block storage.
In cloud environments, services like AWS EBS (Elastic Block Store) and Azure Managed Disks provide block storage that attaches to virtual machines.
Block storage is fast and ideal for databases and operating systems because they need low-latency, random access to specific portions of data.
File storage organizes data into files and directories, exactly like the folder structure on your laptop. It uses protocols like NFS (Network File System) or SMB (Server Message Block) to let multiple machines access the same file system over a network.
AWS EFS and Azure Files are cloud examples.
File storage works well when multiple servers need to read and write the same files, like shared configuration files, media assets being processed by a rendering pipeline, or log files aggregated from multiple machines.
Object storage takes a fundamentally different approach. It stores data as discrete objects, each consisting of the data itself, metadata about the data, and a unique identifier.
There are no folders or hierarchy. You store an object with a key (like `users/avatars/user_42.jpg`) and retrieve it later with that same key. AWS S3, Google Cloud Storage, and Azure Blob Storage are object stores.
Object storage is designed for massive scale. S3 stores trillions of objects. It is cheap, durable (designed for 99.999999999% durability, that is eleven nines), and perfect for unstructured data like images, videos, backups, and log archives. The trade-off is latency. Object storage is not designed for low-latency random reads and writes the way block storage is. You would never run a database on object storage.
| Storage Type | Organizes Data As | Access Pattern | Latency | Best For |
|---|---|---|---|---|
| Block | Fixed-size blocks with addresses | Random read/write | Very low | Databases, OS volumes, VMs |
| File | Files in hierarchical directories | Sequential and shared access | Low to moderate | Shared files, media processing, logs |
| Object | Flat key-value objects with metadata | Write once, read many | Moderate to high | Images, videos, backups, archives, data lakes |
**Blob Storage for Unstructured Data**
Blob stands for Binary Large Object. It refers to any chunk of unstructured data: an image, a PDF, a video file, an audio recording, a machine learning model. Blob storage is essentially a practical use of object storage optimized for storing and retrieving large binary files.
When someone uploads a profile picture to your application, you do not store the image binary inside your database. You store it in blob storage (like S3) and save the URL or key in the database record. This keeps your database lean and fast. Your database handles structured queries efficiently, and blob storage handles large binary files cheaply.
This separation matters enormously at scale. A database storing 100 million user records with profile pictures embedded would be terabytes in size and painfully slow to query. The same database storing only text fields and a pointer to each image in S3 might be a few gigabytes and lightning fast.
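A minimal sketch of this pattern, with a Python dict standing in for the object store (in production the upload would go through an S3 client such as boto3; the key layout and names here are illustrative):

```python
import sqlite3

# Stand-in for an object store like S3. In production this would be
# an S3 client call; the dict just makes the pattern concrete.
object_store = {}

def upload_avatar(user_id, image_bytes):
    key = f"users/avatars/user_{user_id}.jpg"
    object_store[key] = image_bytes          # large binary goes to cheap bulk storage
    return key

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, avatar_key TEXT)")

key = upload_avatar(42, b"\x89JPEG-bytes...")
db.execute("INSERT INTO users VALUES (?, ?, ?)", (42, "Alex", key))

# The database row stays tiny; only the pointer lives here.
row = db.execute("SELECT avatar_key FROM users WHERE id = 42").fetchone()
print(row[0])                      # users/avatars/user_42.jpg
print(object_store[row[0]])        # the actual bytes, fetched by key
```

The database answers structured queries about users; the object store serves the bytes, possibly through a CDN.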
**Distributed File Systems (HDFS, GFS)**
A distributed file system spreads data across many machines, making it possible to store and process datasets that are far too large for any single server.
Google File System (GFS) was one of the first, designed internally at Google to handle their enormous data processing needs. It splits large files into chunks (typically 64 MB each), replicates each chunk across multiple servers, and coordinates access through a master node that tracks where each chunk lives.
HDFS (Hadoop Distributed File System) is the open-source version inspired by GFS. It follows the same architecture: large files are split into blocks (128 MB by default), stored across a cluster of DataNodes, and managed by a central NameNode. HDFS is optimized for sequential reads of large files, which makes it ideal for batch processing jobs that scan through massive datasets. It is not good for random access to small files or low-latency reads.
Distributed file systems are the backbone of big data infrastructure. If you are running MapReduce, Spark, or Hive on terabytes of data, that data almost certainly lives on HDFS or a cloud-native equivalent like S3 with a Hadoop-compatible interface.
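The chunking and replication described above can be sketched in a few lines. This is a toy model (round-robin placement instead of HDFS's rack-aware policy, made-up DataNode names), scaled down so a 300-byte "file" plays the role of a 300 MB one:

```python
BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, replication: int = 3):
    """Toy round-robin placement; real HDFS placement is rack-aware."""
    return {b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)}

# Scaled-down demo: a 300-byte "file" with 128-byte blocks mirrors how
# a 300 MB file becomes blocks of 128 MB, 128 MB, and 44 MB.
blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])   # [128, 128, 44]
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(placement[0])               # ['dn1', 'dn2', 'dn3']
```

The NameNode's job is essentially to maintain the `placement` map: which DataNodes hold each block of each file.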
**Storage Area Networks (SAN) vs. Network-Attached Storage (NAS)**
SAN and NAS are two approaches to making storage available to servers over a network. They come up less frequently in modern cloud-native system design, but you will encounter them in enterprise environments and interviews.
A SAN provides block-level storage over a dedicated high-speed network (using protocols like Fibre Channel or iSCSI).
Each server sees the SAN storage as if it were a local disk attached directly to the machine. SANs are fast, reliable, and expensive. They are used for databases, virtual machine volumes, and any workload that demands block-level performance with shared storage.
A NAS provides file-level storage over a standard network (using protocols like NFS or SMB). It is essentially a file server that multiple machines can access.
NAS devices are simpler and cheaper than SANs and work well for shared file access, backups, and media storage.
| Aspect | SAN | NAS |
|---|---|---|
| Storage type | Block-level | File-level |
| Protocol | Fibre Channel, iSCSI | NFS, SMB |
| Performance | High (dedicated network) | Moderate (shared network) |
| Cost | High | Lower |
| Best for | Databases, VM volumes | Shared files, backups, media |
In cloud environments, the distinction has blurred. AWS EBS (block storage) and AWS EFS (file storage) provide SAN-like and NAS-like functionality respectively, without the physical hardware management.
Most cloud-native applications use object storage (S3) for bulk data and block storage (EBS) for databases, making traditional SAN/NAS architecture less common in new designs.
**Interview-Style Question**
> Q: Your application lets users upload videos up to 2 GB each. Where should you store these files?
> A: Store the video files in object storage like AWS S3. Store only the metadata (title, uploader, upload time, S3 key) in your relational database. Object storage is designed for large binary files: it is cheap, highly durable, and scales to petabytes without effort. Serving the videos can go through a CDN that pulls from S3, so users get fast delivery from an edge server close to them. Storing 2 GB video files directly in a database would bloat it, degrade query performance, and cost significantly more.
_Block vs. file vs. object storage_
### KEY TAKEAWAYS
* Block storage is fast and used for databases and operating systems. Object storage is cheap, durable, and built for massive-scale unstructured data. File storage enables shared access across servers.
* Store binary data (images, videos, files) in blob/object storage, not in your database. Store only a reference in the database.
* Distributed file systems like HDFS split large files across many machines for big data processing. They are optimized for sequential reads, not random access.
* SAN provides high-performance block storage over a dedicated network. NAS provides file storage over a standard network. In cloud environments, managed services replace both.
## 2.2 Relational Databases (SQL)
Relational databases have been the backbone of software systems for over four decades. They store data in tables with rows and columns, enforce relationships between tables, and guarantee data integrity through strict rules.
If your data has a clear structure and relationships matter, a relational database is almost always where you should start.
**ACID Properties: Atomicity, Consistency, Isolation, Durability**
ACID is the set of guarantees that every relational database provides. These four properties are what make relational databases trustworthy for critical data.
* Atomicity means a transaction is all or nothing. If a bank transfer moves $500 from account A to account B, both the debit and the credit must succeed. If the credit fails, the debit is rolled back. You never end up in a state where the money left account A but never arrived in account B.
* Consistency means every transaction brings the database from one valid state to another. If you have a rule that says account balances cannot go below zero, no transaction can violate that rule, no matter what.
* Isolation means concurrent transactions do not interfere with each other. If two users buy the last item in stock at the same time, the database ensures only one of them succeeds. The other gets an error or a retry, not a corrupted inventory count.
* Durability means once a transaction is committed, it stays committed even if the server crashes a millisecond later. The data is written to persistent storage, not just held in memory.
ACID is why banks, hospitals, airlines, and e-commerce platforms use relational databases for their most critical data. When you cannot afford to lose or corrupt data, ACID is non-negotiable.
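Atomicity and consistency are easy to see in code. A minimal sketch using Python's built-in sqlite3 (account names and amounts are invented; the `CHECK` constraint plays the role of the "no negative balances" rule):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, "
           "balance INTEGER CHECK (balance >= 0))")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 1000), ("B", 200)])
db.commit()

def transfer(src, dst, amount):
    try:
        with db:  # one transaction: commits on success, rolls back on error
            db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                       (amount, src))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                       (amount, dst))
    except sqlite3.IntegrityError:
        return False  # CHECK constraint fired: the whole transfer rolled back
    return True

print(transfer("A", "B", 500))    # True: debit and credit applied together
print(transfer("A", "B", 9999))   # False: debit would go negative, nothing applied
print(db.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(500,), (700,)]
```

The failed transfer leaves no trace: the debit that briefly happened inside the transaction was rolled back along with everything else.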
**Schema Design and Normalization (1NF, 2NF, 3NF, BCNF)**
Normalization is the process of organizing your database tables to minimize redundancy and prevent data anomalies. Each "normal form" adds a stricter rule.
* First Normal Form (1NF): Each column contains a single atomic value. No lists or arrays stuffed into a single field. A column called "phone\_numbers" containing "555-1234, 555-5678" violates 1NF. Instead, you create a separate phone\_numbers table with one row per phone number.
* Second Normal Form (2NF): Everything in 1NF, plus every non-key column depends on the entire primary key, not just part of it. If your table has a composite key of (student\_id, course\_id) and a column called student\_name that depends only on student\_id, move student\_name to a separate students table.
* Third Normal Form (3NF): Everything in 2NF, plus no non-key column depends on another non-key column. If you have an employees table with columns department\_id and department\_name, and department\_name depends on department\_id (not on the employee), move department\_name to a departments table.
* Boyce-Codd Normal Form (BCNF): A stricter version of 3NF that handles certain edge cases involving functional dependencies. In practice, if your tables are in 3NF, they are usually in BCNF as well.
| Normal Form | Rule | Purpose |
|---|---|---|
| 1NF | Atomic values only, no repeating groups | Eliminate lists in columns |
| 2NF | All non-key columns depend on the full primary key | Eliminate partial dependencies |
| 3NF | No non-key column depends on another non-key column | Eliminate transitive dependencies |
| BCNF | Every determinant is a candidate key | Handle remaining dependency anomalies |
Most production databases aim for 3NF as the standard. Going beyond that (to BCNF, 4NF, or 5NF) is rarely necessary and adds complexity without proportional benefit.
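The 3NF fix for the employees/departments example can be shown end to end. A small sketch with sqlite3 (table contents are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Before 3NF, department_name sat in employees and depended on
# department_id, not on the employee. After 3NF, the transitive
# dependency lives in its own table.
db.executescript("""
    CREATE TABLE departments (
        department_id   INTEGER PRIMARY KEY,
        department_name TEXT NOT NULL
    );
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        department_id INTEGER REFERENCES departments(department_id)
    );
""")
db.execute("INSERT INTO departments VALUES (1, 'Engineering')")
db.executemany("INSERT INTO employees VALUES (?, ?, ?)",
               [(10, "Alex", 1), (11, "Sam", 1)])

# Renaming the department now touches exactly one row, with no risk
# of some employee rows keeping the stale name.
db.execute("UPDATE departments SET department_name = 'Platform Engineering' "
           "WHERE department_id = 1")
rows = db.execute("""
    SELECT e.name, d.department_name
    FROM employees e JOIN departments d USING (department_id)
    ORDER BY e.employee_id
""").fetchall()
print(rows)  # [('Alex', 'Platform Engineering'), ('Sam', 'Platform Engineering')]
```

The cost is the join at read time, which is exactly the trade-off denormalization (next section) reverses.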
**Denormalization: When and Why to Denormalize**
Normalization reduces redundancy, but it increases the number of tables and the number of joins your queries need.
A query that touches five tables to assemble a user's order history is slower than reading a single denormalized table that already has all the data in one place.
Denormalization deliberately introduces redundancy to improve read performance.
Instead of joining the orders table with the products table and the users table every time someone views their order history, you might store the product name and user name directly in the orders table.
The data is duplicated, but reads are faster because no joins are needed.
The trade-off is write complexity. When a product name changes, you need to update it in every order row that references it, not just in one place.
Denormalization trades write simplicity for read speed.
Denormalize when your system is read-heavy and join performance is a bottleneck. Keep data normalized when your system is write-heavy or when data integrity is more critical than read speed.
Many production systems start normalized and selectively denormalize specific tables or fields as performance requirements demand.
**Indexing: B-Trees, B+ Trees, Hash Indexes, Composite Indexes**
An index is a data structure that lets the database find rows quickly without scanning every row in the table.
Without an index, finding a user by email in a table of 10 million rows means checking all 10 million rows.
With an index on the email column, the database jumps directly to the matching row.
B-trees are the most common index structure. They organize data in a balanced tree where each node contains multiple keys, and searches work by traversing from the root to the leaf nodes. B-tree lookups take O(log n) time.
B+ trees are a variation where all actual data pointers live in the leaf nodes, and leaf nodes are linked together in a sorted sequence. This makes range queries (like "find all users created between January and March") very efficient because the database can start at the first matching leaf and scan forward through the linked list. Most relational databases (PostgreSQL, MySQL InnoDB) use B+ trees for their default indexes.
Hash indexes use a hash function to map key values directly to storage locations. Lookups for exact matches are O(1), faster than B-trees. But hash indexes cannot handle range queries at all. They only answer "find the row where email = X," not "find all rows where created\_date > Y."
Composite indexes cover multiple columns. An index on (last\_name, first\_name) speeds up queries that filter by last name alone or by both last name and first name. But it does not help queries that filter only by first name, because the index is ordered by last name first. Column order in a composite index matters.
| Index Type | Lookup Speed | Range Queries | Best For |
|---|---|---|---|
| B-tree | O(log n) | Yes | General-purpose, most queries |
| B+ tree | O(log n), fast sequential access | Excellent | Range queries, sorted results, default choice |
| Hash | O(1) for exact match | No | Exact lookups only (find by ID, email) |
| Composite | Depends on columns matched | Yes (for leftmost prefix) | Queries filtering on multiple columns |
Adding indexes is not free.
Every index speeds up reads but slows down writes, because the database must update the index every time a row is inserted, updated, or deleted.
Over-indexing a write-heavy table can destroy write performance. Index strategically based on your actual query patterns.
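The leftmost-prefix rule for composite indexes can be verified directly. A sketch using sqlite3's `EXPLAIN QUERY PLAN` (the exact wording of the plan text varies by SQLite version, but SEARCH vs. SCAN is stable; table and column names are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (last_name TEXT, first_name TEXT, email TEXT)")
db.execute("CREATE INDEX idx_name ON people (last_name, first_name)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether SQLite will SEARCH via an
    # index or SCAN the whole table.
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# Leftmost prefix matched: the composite index is usable.
print(plan("SELECT * FROM people WHERE last_name = 'Smith'"))
# Leftmost column missing: the index is ordered by last_name first,
# so this query cannot use it and falls back to a scan.
print(plan("SELECT * FROM people WHERE first_name = 'Jo'"))
```

The first plan is a SEARCH using `idx_name`; the second is a SCAN, exactly the column-order effect described above.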
**SQL Query Optimization and Execution Plans**
Writing a SQL query tells the database what data you want.
The database engine decides how to get it.
The execution plan is the strategy the database chooses: which indexes to use, in what order to join tables, whether to do a sequential scan or an index scan.
You can inspect execution plans using EXPLAIN, which shows the chosen plan without running the query, or EXPLAIN ANALYZE (PostgreSQL), which executes the query and reports actual row counts and timing.
The most common problems you will find are full table scans where an index should be used, nested loop joins on large tables where a hash join would be faster, and sorting large result sets without an index that supports the sort order.
Practical optimization tips:

* Add indexes on columns used in WHERE clauses and JOIN conditions.
* Avoid SELECT \* when you only need specific columns.
* Use LIMIT to avoid fetching millions of rows when you only need the first 10.
* Be cautious with functions on indexed columns (like WHERE YEAR(created\_at) = 2025), because wrapping an indexed column in a function often prevents the database from using the index.
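The function-on-indexed-column pitfall is easy to demonstrate. In this sqlite3 sketch, `strftime` plays the role of MySQL's `YEAR()` (table and column names are invented; plan wording varies by SQLite version, but SEARCH vs. SCAN is stable):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
db.execute("CREATE INDEX idx_created ON events (created_at)")

def plan(sql):
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# The function call hides created_at from the index: full scan.
print(plan("SELECT id FROM events WHERE strftime('%Y', created_at) = '2025'"))
# The equivalent range predicate can use the index.
print(plan("SELECT id FROM events WHERE created_at >= '2025-01-01'"
           " AND created_at < '2026-01-01'"))
```

Rewriting the predicate as a range on the raw column is the standard fix, and it works the same way in PostgreSQL and MySQL.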
**Stored Procedures, Views, Materialized Views**
Stored procedures are precompiled SQL programs stored in the database and executed on the server side. They reduce network round trips by performing complex operations in a single call. The downside is that business logic embedded in stored procedures is harder to version control, test, and debug compared to application code. Most modern systems keep logic in the application layer and use the database purely for storage and retrieval.
Views are saved SQL queries that act like virtual tables. A view does not store data; it runs the underlying query every time you select from it. Views simplify complex queries and restrict access to sensitive columns. But since the query runs on every access, views on complex joins can be slow.
Materialized views store the result of a query physically on disk. They are precomputed, so reading from a materialized view is as fast as reading from a regular table.
The trade-off is freshness: the materialized view is a snapshot. You need to refresh it periodically (or on a trigger) to keep it current.
Materialized views are excellent for dashboards, reports, and any query that is expensive to compute but does not need to reflect the latest second of data.
**Transactions and Isolation Levels**
A transaction groups multiple database operations into a single logical unit. Either all operations succeed (commit) or none of them do (rollback). But when multiple transactions run concurrently, the isolation level determines how much they can see of each other's uncommitted work.
Read Uncommitted: A transaction can see uncommitted changes from other transactions. This is fast but dangerous because you might read data that gets rolled back (a "dirty read"). Almost never used in production.
Read Committed: A transaction only sees changes that have been committed. No dirty reads, but if you read the same row twice within your transaction, the value might change between reads because another transaction committed in between (a "non-repeatable read"). This is the default in PostgreSQL.
Repeatable Read: Once a transaction reads a row, it sees the same value for that row for the duration of the transaction, even if another transaction modifies and commits it. This prevents non-repeatable reads but can still allow "phantom reads" where new rows appear in a range query. This is the default in MySQL InnoDB.
Serializable: The strictest level. Transactions execute as if they ran one at a time in sequence. No dirty reads, no non-repeatable reads, no phantom reads. The cost is performance: the database uses more locks or more complex concurrency control, which reduces throughput.
| Isolation Level | Dirty Reads | Non-Repeatable Reads | Phantom Reads | Performance |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Fastest |
| Read Committed | Prevented | Possible | Possible | Good |
| Repeatable Read | Prevented | Prevented | Possible | Moderate |
| Serializable | Prevented | Prevented | Prevented | Slowest |
Choose the isolation level based on your consistency needs. Financial transactions typically need Serializable.
A dashboard reading analytics data can comfortably use Read Committed.
**Interview-Style Question**
> Q: You are building an e-commerce system. A user places an order, which involves reducing inventory and creating an order record. How do you ensure these two operations stay in sync?
> A: Wrap both operations in a single database transaction. The transaction deducts the inventory count and inserts the order record. If either operation fails, the entire transaction rolls back, and neither change is applied. Use at least Read Committed isolation to prevent reading uncommitted inventory counts from other transactions. If you need to prevent two users from buying the last item simultaneously, use Serializable isolation or a SELECT FOR UPDATE lock on the inventory row to ensure only one transaction can modify it at a time.
_Normalization_
### KEY TAKEAWAYS
* ACID properties (Atomicity, Consistency, Isolation, Durability) guarantee data integrity. They are the reason relational databases are trusted for critical data.
* Normalize to 3NF to eliminate redundancy. Denormalize selectively when read performance matters more than write simplicity.
* Indexes dramatically speed up reads but slow down writes. B+ trees are the default. Use composite indexes for multi-column queries. Do not over-index write-heavy tables.
* Use EXPLAIN to understand your query execution plans. Full table scans, inefficient joins, and missing indexes are the most common performance killers.
* Materialized views precompute expensive queries for fast reads but require periodic refreshes.
* Choose your isolation level based on your consistency requirements. Higher isolation means more safety but less throughput.
## 2.3 NoSQL Databases
Relational databases are powerful, but they are not the right tool for every job.
When your data does not fit neatly into rows and columns, when you need horizontal scalability across hundreds of machines, or when your schema changes rapidly, NoSQL databases offer alternatives that relational databases cannot match.
NoSQL is not a single technology. It is a family of databases, each designed for a specific data model and access pattern.
Choosing the right one requires understanding what each type does well and where it falls short.
**Key-Value Stores (Redis, DynamoDB, Memcached)**
A key-value store is the simplest type of database. You store data with a key and retrieve it with that same key. Think of it as a giant distributed HashMap.
Redis is the most popular key-value store. It keeps data in memory for microsecond-level reads and supports rich data structures: strings, lists, sets, sorted sets, hashes, and streams. Redis is used for caching, session storage, real-time leaderboards, rate limiting, and pub/sub messaging. It can optionally persist data to disk, but its primary strength is speed from in-memory storage.
DynamoDB is Amazon's managed key-value (and document) database. It provides single-digit millisecond performance at any scale, with automatic replication across multiple availability zones. DynamoDB is a strong choice when you need a managed, highly available key-value store and your access patterns are well-defined (lookups by primary key or a combination of partition key and sort key).
Memcached is a simpler, pure-caching key-value store. It stores opaque byte values in memory and nothing else. No persistence, no complex data structures. What it lacks in features it makes up for in raw throughput for basic cache workloads.
Key-value stores excel at simple lookups: "give me the user session for token ABC" or "what is the cached result for query XYZ." They struggle when you need to query data by anything other than the key, or when you need relationships between records.
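The session-lookup pattern above can be sketched with a toy in-memory store. This class mimics the SETEX/GET flow a Redis client would use, with lazy expiry (the class, keys, and TTL are all illustrative, not a real Redis API):

```python
import time

class TTLStore:
    """Toy in-memory key-value store with expiry, mimicking the
    set-with-TTL / get pattern used for session caching."""
    def __init__(self):
        self._data = {}

    def setex(self, key, ttl_seconds, value):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]   # lazy expiry: expired entries read as misses
            return None
        return value

sessions = TTLStore()
sessions.setex("session:ABC", 1800, '{"user_id": 42}')   # 30-minute session
print(sessions.get("session:ABC"))   # '{"user_id": 42}'
print(sessions.get("session:XYZ"))   # None: unknown token
```

Note what is missing: there is no way to ask "which sessions belong to user 42" without scanning everything, which is exactly the key-only-access limitation described above.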
**Document Stores (MongoDB, CouchDB, Firestore)**
Document stores extend the key-value model by storing structured documents (usually JSON or BSON) as values. Unlike key-value stores, document databases understand the internal structure of the document and can query, index, and filter on fields within it.
MongoDB stores documents in collections (the NoSQL equivalent of tables). A user document might look like `{"name": "Alex", "email": "alex@example.com", "orders": [...]}`. You can query on any field, create indexes on nested fields, and even embed related data directly within the document.
Document stores shine when your data is hierarchical or semi-structured. A blog post with comments, a product listing with variable attributes (a laptop has different specs than a t-shirt), or user profiles with optional fields all map naturally to documents. Schema flexibility means you can add new fields without altering a table structure or running migrations.
The trade-off is that relationships between documents are weaker than in relational databases. If you need to join data across collections frequently, you end up either embedding data (which duplicates it) or performing multiple queries (which is slower than a SQL join).
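The embed-versus-reference trade-off is easier to see with concrete document shapes. A toy sketch in plain Python dicts (a document store like MongoDB would hold these as BSON; all field names and data are invented):

```python
# 1. Embedded: one document, one read, but comments cannot be queried
#    or paginated independently of their post.
post_embedded = {
    "_id": "post1",
    "title": "Hello",
    "comments": [
        {"author": "sam", "text": "Nice!"},
        {"author": "alex", "text": "+1"},
    ],
}

# 2. Referenced: normalized like SQL, but reassembling the post takes
#    a second query instead of one read.
post = {"_id": "post1", "title": "Hello"}
comments = [
    {"post_id": "post1", "author": "sam", "text": "Nice!"},
    {"post_id": "post1", "author": "alex", "text": "+1"},
]

def load_post_with_comments(post_doc, all_comments):
    # The second "query": an application-side join on post_id.
    attached = [c for c in all_comments if c["post_id"] == post_doc["_id"]]
    return {**post_doc, "comments": attached}

result = load_post_with_comments(post, comments)
print([c["author"] for c in result["comments"]])   # ['sam', 'alex']
```

Embedding wins when the child data is always read with the parent and bounded in size; referencing wins when children are shared, unbounded, or queried on their own.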
**Wide-Column Stores (Cassandra, HBase, Bigtable)**
Wide-column stores organize data into rows and columns, but unlike relational databases, each row can have a different set of columns. Columns are grouped into column families, and data within a family is stored together on disk for efficient retrieval.
Cassandra was originally built at Facebook for inbox search and later open-sourced. It provides massive write throughput, linear horizontal scalability, and multi-datacenter replication with no single point of failure. Every node in a Cassandra cluster is equal (no master node), and data is distributed using consistent hashing.
HBase runs on top of HDFS and provides random read/write access to data stored in the Hadoop ecosystem. Bigtable is Google's proprietary wide-column store that inspired HBase.
Wide-column stores are built for write-heavy workloads at enormous scale: event logging, time-series data, IoT sensor data, messaging history. They handle millions of writes per second across hundreds of nodes.
The trade-off is limited query flexibility.
You must design your data model around your query patterns upfront.
Cassandra queries are efficient only when they match the partition key and clustering column structure.
Ad-hoc queries and joins are either impossible or extremely slow.
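The consistent hashing that Cassandra uses to distribute rows can be sketched minimally. This toy ring uses one hash point per node (real systems add virtual nodes for smoother balance; node and key names are invented):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each key belongs to the first
    node clockwise from the key's hash."""
    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")

# Adding a node only reassigns the keys that fall in the new node's
# arc; every other key keeps its owner, unlike naive hash(key) % N.
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = sum(ring.node_for(f"user:{i}") != bigger.node_for(f"user:{i}")
            for i in range(1000))
print(f"{moved} of 1000 keys changed owner, all of them to node-d")
```

This stability under membership changes is what lets Cassandra add or remove nodes without reshuffling the entire dataset.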
**Graph Databases (Neo4j, Amazon Neptune, FlockDB)**
Graph databases store data as nodes (entities) and edges (relationships between entities). They are optimized for traversing connections, which makes them ideal for data where relationships are the primary focus.
Neo4j is the most established graph database. It uses a query language called Cypher that reads almost like English: `MATCH (user)-[:FOLLOWS]->(friend) WHERE user.name = "Alex" RETURN friend`.
Graph databases are the right choice for social networks (who follows whom, mutual friends), recommendation engines (users who bought X also bought Y), fraud detection (finding suspicious transaction patterns), and knowledge graphs (how concepts relate to each other).
The trade-off is that graph databases are not designed for bulk data processing or simple key-value lookups. They excel at relationship queries but are overkill for use cases where a relational or document database would work fine.
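The Cypher query above is a one-hop traversal, and "users who bought X also bought Y" style recommendations are two hops. A toy adjacency-list version in Python makes the mechanics concrete (the graph and names are invented):

```python
# who-follows-whom as an adjacency list: node -> set of edges
follows = {
    "Alex":  {"Blair", "Casey"},
    "Blair": {"Casey", "Drew"},
    "Casey": {"Drew"},
    "Drew":  set(),
}

def friends(user):
    return follows.get(user, set())

def friend_of_friend_suggestions(user):
    """Two-hop traversal: people followed by the people `user`
    follows, excluding `user` and anyone already followed."""
    candidates = set()
    for friend in friends(user):
        candidates |= friends(friend)
    return candidates - friends(user) - {user}

print(sorted(friends("Alex")))                       # ['Blair', 'Casey']
print(sorted(friend_of_friend_suggestions("Alex")))  # ['Drew']
```

A graph database does the same traversal, but keeps it fast at depth three, four, and beyond, where the equivalent SQL self-joins become unwieldy.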
**Time-Series Databases (InfluxDB, TimescaleDB, Prometheus)**
Time-series databases are optimized for data that is indexed and queried primarily by time.
Metrics, sensor readings, stock prices, application logs, and IoT data all follow a pattern: data arrives in chronological order, recent data is queried far more than old data, and aggregations over time windows (averages, sums, counts per minute) are the most common operations.
InfluxDB is purpose-built for time-series data with its own query language (Flux) and built-in retention policies that automatically delete old data.
TimescaleDB is a PostgreSQL extension that adds time-series optimizations while keeping full SQL compatibility.
Prometheus is the standard for infrastructure monitoring metrics, storing time-series data from application and system exporters.
Time-series databases optimize for high write throughput (millions of data points per second), time-range queries, and automatic data compaction and downsampling (storing minute-level detail for recent data but only hourly averages for data older than 30 days).
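The downsampling idea (per-window aggregates) is simple enough to sketch by hand. A toy rollup of the kind a time-series database automates (readings and timestamps are invented):

```python
from collections import defaultdict

# (seconds_since_start, value) pairs, e.g. CPU readings every 15 seconds
points = [(0, 40), (15, 50), (30, 60), (60, 80), (75, 100)]

def downsample_per_minute(points):
    """Bucket points into 60-second windows and average each window."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 60].append(value)   # floor to the window start
    return {window: sum(v) / len(v) for window, v in sorted(buckets.items())}

print(downsample_per_minute(points))   # {0: 50.0, 60: 90.0}
```

A production system applies this continuously: keep raw points for a day, minute averages for a month, hourly averages forever, reclaiming storage as detail ages out.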
**Vector Databases (Pinecone, Weaviate, Milvus, pgvector)**
Vector databases are the newest addition to the NoSQL family, driven by the explosion of AI and machine learning applications. They store and search high-dimensional vectors, the numerical representations (embeddings) that AI models produce from text, images, audio, or other data.
When you search "comfortable running shoes for flat feet" on an AI-powered platform, the system converts your query into a vector and finds the product vectors that are closest in meaning, not just matching keywords.
This similarity search is what vector databases are built for.
Pinecone is a fully managed vector database optimized for production AI applications.
Weaviate and Milvus are open-source options with rich feature sets. pgvector is a PostgreSQL extension that adds vector search to an existing PostgreSQL database, which is convenient when you want to keep your vectors alongside your relational data without running a separate database.
Vector databases are essential for RAG (Retrieval-Augmented Generation) pipelines, semantic search, recommendation systems, and image similarity search. They are covered in more depth in Part VII, Lesson 2 (Generative AI & LLM System Design).
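At its core, the similarity search described above is nearest-neighbor search by cosine similarity. A brute-force sketch in pure Python (the 3-dimensional "embeddings" are invented; real ones have hundreds or thousands of dimensions and come from an embedding model, and real vector databases use approximate indexes instead of scanning everything):

```python
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

catalog = {
    "running shoes":  [0.9, 0.1, 0.0],
    "trail boots":    [0.8, 0.3, 0.1],
    "coffee grinder": [0.0, 0.1, 0.9],
}

def nearest(query_vector, k=2):
    # Brute-force scan; ANN indexes (HNSW, IVF) avoid this at scale.
    scored = sorted(catalog.items(),
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# A query vector close in meaning to footwear ranks shoes first.
print(nearest([0.85, 0.2, 0.05]))   # ['running shoes', 'trail boots']
```

The query never mentions the word "boots"; proximity in the embedding space is what surfaces it, which is the difference between semantic search and keyword search.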
**BASE Properties: Basically Available, Soft State, Eventual Consistency**
Where relational databases follow ACID, most NoSQL databases follow BASE. This is not an inferior guarantee. It is a different set of trade-offs designed for distributed systems at scale.
Basically Available means the system remains available even during failures. It might return stale data or partial results, but it will not refuse to answer.
Soft State means the system's state can change over time even without new input, as updates propagate across replicas.
Eventual Consistency means that after a write, all replicas will eventually converge to the same value, but there may be a brief window where different replicas return different results.
BASE allows NoSQL databases to achieve horizontal scalability and high availability at the cost of immediate consistency.
For many applications, this trade-off is perfectly acceptable.
A social media post that takes 500 milliseconds to appear on a friend's feed is not a problem. A bank balance that takes 500 milliseconds to reflect a withdrawal is a problem.
**SQL vs. NoSQL: Decision Framework**
This is one of the most common questions in system design interviews, and the answer is never "SQL is better" or "NoSQL is better." It depends on your data, your access patterns, and your scale requirements.
| Criteria | Choose SQL | Choose NoSQL |
|---|---|---|
| Data structure | Well-defined, stable schema | Flexible, evolving, or varied schema |
| Relationships | Complex relationships, frequent joins | Minimal relationships, self-contained records |
| Consistency | Strong consistency is required | Eventual consistency is acceptable |
| Scale | Moderate scale, can scale vertically | Massive scale, needs horizontal distribution |
| Query patterns | Complex queries, ad-hoc reporting | Simple lookups by key, known access patterns |
| Examples | Financial systems, inventory, ERP | User sessions, real-time analytics, content catalogs |
Many production systems use both. An e-commerce platform might use PostgreSQL for orders and inventory (where ACID matters), Redis for session caching (where speed matters), Elasticsearch for product search (where full-text queries matter), and S3 for product images (where cheap storage matters). Picking the right database for each job is far more valuable than committing to a single technology for everything.
**Interview-Style Question**
> Q: You are designing a social media platform where users create posts, follow other users, and see a personalized feed. Which database(s) would you use?
> A: Use multiple databases, each for what it does best. A relational database (PostgreSQL) for core user data: profiles, authentication, and relationships, because these need consistency and support complex queries. A wide-column store (Cassandra) or document store for the feed timeline, because feeds are write-heavy (every post fans out to followers), read-heavy (every user loads their feed frequently), and benefit from horizontal scaling. Redis for caching hot feeds and session data. A graph database (Neo4j) could be added for advanced friend-of-friend recommendations, though many teams compute these offline in batch jobs instead.
**KEY TAKEAWAYS**
* NoSQL is not one technology. Key-value, document, wide-column, graph, time-series, and vector databases each solve different problems.
* Key-value stores are fastest for simple lookups. Document stores offer schema flexibility. Wide-column stores handle massive write throughput. Graph databases excel at relationship traversal.
* Time-series databases are purpose-built for metrics, IoT, and monitoring data. Vector databases power AI-driven similarity search.
* BASE (Basically Available, Soft State, Eventual Consistency) trades immediate consistency for scalability and availability.
* Most production systems use multiple database types. Let your data model and access patterns drive the choice, not personal preference or popularity.
## 2.4 Database Scaling Techniques
Your database worked perfectly when you had 1,000 users. Now you have 10 million, and queries that used to take 5 milliseconds take 5 seconds.
The database is running out of memory.
Replication lag is growing.
The on-call engineer gets paged every night. It is time to scale.
**Vertical Scaling vs. Horizontal Scaling for Databases**
Vertical scaling (scaling up) means giving your existing database server more power: more CPU, more RAM, faster SSDs. You take a machine with 16 GB of RAM and replace it with one that has 256 GB.
Queries run faster because more data fits in memory.
Writes are faster because the CPU is more powerful.
Vertical scaling is the simplest approach because nothing about your application changes. Same database, same connection string, same queries.
The ceiling is the biggest machine available. At some point, no amount of hardware can keep up with your growth, and the biggest machines are disproportionately expensive.
Horizontal scaling (scaling out) means spreading your data and queries across multiple machines.
Instead of one powerful database server, you have ten or fifty smaller ones, each handling a portion of the workload. This is harder to implement but has no theoretical ceiling. You can always add more machines.
Most teams start with vertical scaling because it buys time without complexity.
When vertical scaling hits its limits, you shift to horizontal scaling through read replicas, sharding, or federation.
**Read Replicas: Master-Slave Replication**
Read replicas are the first horizontal scaling technique most teams adopt because they are relatively simple to set up.
You have one primary database (the master) that handles all writes.
One or more replica databases (slaves) receive a copy of every write from the primary and serve read queries.
Since most applications are read-heavy (often 80% to 95% of queries are reads), offloading reads to replicas dramatically reduces the load on the primary.
Your application sends writes to the primary and reads to any available replica. A load balancer or a smart database driver can distribute read queries across replicas automatically.
The trade-off is replication lag.
Writes hit the primary first, and there is a small delay (milliseconds to seconds, depending on the setup) before replicas receive the update.
During that window, a user might write a comment and then not see it on the next page load because the read hit a replica that has not received the write yet.
For most applications, this brief lag is acceptable.
For operations where users must see their own writes immediately, you can route those specific reads to the primary.
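The routing logic described above can be sketched in a few lines. This is an illustrative Python sketch, not a real driver: the `RoutingPool` class and the connection names are assumptions, and production systems usually get this behavior from a load balancer or a database driver with read/write splitting built in.

```python
import random

class RoutingPool:
    """Toy read/write splitter: writes go to the primary, ordinary
    reads go to a random replica, and consistency-critical reads
    (read-your-writes) are pinned back to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def route(self, is_write=False, read_own_writes=False):
        # Writes and "must see my own write" reads always hit the primary.
        if is_write or read_own_writes:
            return self.primary
        # Everything else is spread across the replicas.
        return random.choice(self.replicas)

pool = RoutingPool("primary-db", ["replica-1", "replica-2"])
pool.route(is_write=True)         # always the primary
pool.route(read_own_writes=True)  # also the primary, to dodge replication lag
pool.route()                      # one of the replicas
```

The `read_own_writes` flag is the key design choice: the application decides per query whether a few hundred milliseconds of staleness is acceptable, rather than paying primary-read costs everywhere.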
**Multi-Master Replication and Conflict Resolution**
Multi-master replication allows multiple database nodes to accept writes. This improves write availability (if one master goes down, the others keep accepting writes) and enables geographically distributed writes (a user in Europe writes to a European master while a user in Asia writes to an Asian master).
The challenge is conflict resolution.
What happens if two masters receive conflicting writes at the same time?
User A updates their email to "alex@new.com" on master 1, while user B (or even user A from another device) updates it to "alex@other.com" on master 2 at the same instant.
Conflict resolution strategies include last-write-wins (the most recent timestamp takes precedence, simple but can lose data), application-level resolution (the application gets both versions and decides which one to keep based on business logic), and CRDTs (Conflict-free Replicated Data Types, data structures designed to merge automatically without conflicts).
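Last-write-wins is the simplest of these strategies to show concretely. A hedged sketch, assuming each node stamps a value with a timestamp: versions are `(timestamp, value)` pairs, and merging takes the larger pair. The email addresses below are the example values from the text, not real data.

```python
def last_write_wins(a, b):
    """Merge two conflicting versions of a record.
    Each version is (timestamp, value); tuple comparison picks the
    later timestamp, breaking ties on value so every node converges
    to the same answer."""
    return max(a, b)

v1 = (1700000000.1, "alex@new.com")    # written on master 1
v2 = (1700000000.9, "alex@other.com")  # written on master 2, slightly later
merged = last_write_wins(v1, v2)
# The later write survives; the earlier one is silently discarded,
# which is exactly the data-loss risk last-write-wins accepts.
```

Note how little code this is: the simplicity is why last-write-wins is so common, and the silent discard of `v1` is why it is so dangerous for data you cannot afford to lose.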
Multi-master replication is significantly more complex than single-master replication. Use it only when you genuinely need multi-region write availability.
For most applications, a single primary with read replicas is sufficient.
**Database Sharding: Strategies, Consistent Hashing, Resharding**
Sharding splits your database horizontally.
Instead of one database holding all 500 million user records, you split them across 10 shards, each holding roughly 50 million records. Each shard is an independent database running on its own server.
The sharding key determines which record goes to which shard.
Common strategies include hash-based sharding (hash the user ID, modulo the number of shards), range-based sharding (users with IDs 1 to 50 million go to shard 1, 50 million to 100 million go to shard 2), and geography-based sharding (European users on one shard, Asian users on another).
Hash-based sharding distributes data evenly but makes range queries across all shards expensive.
Range-based sharding supports range queries efficiently but can create hot spots if one range gets more traffic than others.
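Both strategies reduce to a small routing function. A sketch under stated assumptions: 10 shards, user IDs as keys, and MD5 as the stable hash (Python's built-in `hash()` is salted per process, so it cannot be used for routing across machines).

```python
import hashlib

NUM_SHARDS = 10

def hash_shard_for(user_id: int) -> int:
    """Hash-based sharding: hash the key, then modulo the shard count.
    Even distribution, but consecutive IDs scatter across shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard_for(user_id: int, shard_size: int = 50_000_000) -> int:
    """Range-based sharding: contiguous ID ranges map to shards,
    so range scans stay on one shard but hot ranges become hot shards."""
    return user_id // shard_size

hash_shard_for(42)            # same shard every time, on every machine
range_shard_for(75_000_000)   # IDs 50M-100M all live on shard 1
```

The two functions make the trade-off visible: `range_shard_for(n)` and `range_shard_for(n + 1)` almost always land on the same shard, while the hash version deliberately destroys that locality.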
Consistent hashing (covered in Part II, Lesson 4) solves the resharding problem.
With simple modulo-based sharding, adding a new shard changes the shard assignment for most records, requiring a massive data migration.
With consistent hashing, adding a shard only redistributes a small fraction of records from neighboring shards.
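A minimal ring makes that property concrete. This is a teaching sketch (no virtual nodes, no replication), with assumed shard names; real implementations place many virtual nodes per server to smooth out the distribution.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: a key maps to the first node at
    or after its hash position, wrapping around. Adding a node only
    claims the keys between it and its predecessor on the ring."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        bisect.insort(self._ring, (self._hash(node), node))

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap past the end

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
before = {k: ring.node_for(k) for k in map(str, range(1000))}
ring.add("shard-d")
after = {k: ring.node_for(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
# Only the keys that landed on shard-d's new arc moved; every key
# that moved now lives on shard-d, and the rest stayed put.
```

Contrast this with modulo sharding, where going from 3 to 4 shards changes `hash % n` for roughly three quarters of all keys.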
Resharding (adding or removing shards) is one of the most painful operations in a distributed system. It requires migrating data between shards while the system is still running, ensuring no data is lost or duplicated, and updating routing logic to reflect the new shard layout.
This is why teams try to choose a sharding strategy that minimizes the need for resharding.
Some teams over-provision shards from the start (creating 256 logical shards even if they initially live on only 4 physical servers) so that resharding means moving logical shards between physical servers rather than splitting and migrating data.
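The logical-shard trick is mostly a routing table. A sketch with assumed server names (`db-1` through `db-5`); real systems keep this mapping in a configuration service rather than in code.

```python
import hashlib

LOGICAL_SHARDS = 256
physical_servers = ["db-1", "db-2", "db-3", "db-4"]

# Routing table: each of the 256 logical shards is pinned to one of
# the 4 physical servers. The key-to-logical-shard mapping never
# changes; only this table does.
shard_map = {s: physical_servers[s % len(physical_servers)]
             for s in range(LOGICAL_SHARDS)}

def stable_hash(key) -> int:
    # Python's built-in hash() is salted per process, so use md5.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

def server_for(user_id: int) -> str:
    logical = stable_hash(user_id) % LOGICAL_SHARDS
    return shard_map[logical]

# "Resharding" is now a metadata update plus a bulk copy of one
# logical shard's data -- no per-row rehashing across the fleet.
shard_map[7] = "db-5"
```

Because `stable_hash(user_id) % 256` is fixed forever, growth only ever means copying whole logical shards to new servers and flipping entries in `shard_map`.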
**Federation (Functional Partitioning)**
Federation splits your database by function rather than by rows. Instead of one database holding users, orders, products, and analytics, you create separate databases for each domain: a users database, an orders database, a products database.
Each database can be independently scaled, optimized, and maintained. The users database might be a PostgreSQL instance with heavy indexing on email lookups.
The analytics database might be a ClickHouse instance optimized for aggregation queries. Each database handles less traffic because it only serves its own domain.
The limitation is cross-domain queries.
If you need to join user data with order data, you cannot do it with a single SQL query because the data lives in different databases. You either denormalize (store user name in the orders database) or perform the join in your application code.
Federation aligns naturally with microservices architecture, where each service owns its own data. It is less common in monolithic applications where all data traditionally lives in one database.
**Connection Pooling and Query Caching**
These are two optimizations that squeeze more performance out of your existing database before you need to add replicas or shards.
Connection pooling maintains a pool of open database connections that your application reuses instead of opening and closing connections per request. Opening a new database connection involves a TCP handshake, authentication, and session setup, which can take 20 to 50 milliseconds.
With connection pooling, a request grabs an already-open connection from the pool, uses it, and returns it.
Tools like PgBouncer (for PostgreSQL) and ProxySQL (for MySQL) manage connection pools and can reduce connection overhead by an order of magnitude.
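The mechanism itself fits in a few lines. A toy sketch, not a real pooler: the fake connection factory stands in for the expensive handshake, and production pools (PgBouncer, HikariCP) add health checks, timeouts, and connection lifetimes on top of this core idea.

```python
import queue

class ConnectionPool:
    """Toy connection pool: pay the connection setup cost once per
    slot, then hand the same connections out to request after request."""

    def __init__(self, create_conn, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(create_conn())  # setup cost paid up front

    def acquire(self):
        return self._pool.get()  # blocks if every connection is in use

    def release(self, conn):
        self._pool.put(conn)

# Stand-in for an expensive connect (TCP handshake, auth, session setup).
opened = {"count": 0}
def fake_connect():
    opened["count"] += 1
    return object()

pool = ConnectionPool(fake_connect, size=3)
for _ in range(100):      # 100 "requests" reuse the same 3 connections
    conn = pool.acquire()
    pool.release(conn)
# opened["count"] is still 3: zero new connections after startup
```

One hundred requests, three connections opened. At 20 to 50 milliseconds per handshake, that is the order-of-magnitude saving the text describes.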
Query caching stores the results of frequent database queries so they can be returned without re-executing the query. Application-level caching with Redis (covered in Part II, Lesson 3) is the most common approach.
Some databases have built-in query caches, but external caching gives you finer control over what is cached, TTLs, and invalidation.
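The external-caching pattern is usually cache-aside with a TTL. A minimal in-process sketch, assuming a hypothetical `run_query()` standing in for the real database call; with Redis, `self._store` would be replaced by `SETEX`/`GET` calls.

```python
import time

class QueryCache:
    """Cache-aside with TTL: check the cache first, fall back to the
    loader (the database) on a miss, and store the result with an
    expiry so stale entries age out."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, result)

    def get(self, key, loader):
        entry = self._store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]                       # cache hit
        result = loader()                         # miss: run the query
        self._store[key] = (time.time() + self.ttl, result)
        return result

calls = []
def run_query():  # hypothetical stand-in for the real database query
    calls.append(1)
    return {"id": 42, "name": "Alex"}

cache = QueryCache(ttl_seconds=60)
cache.get("user:42", run_query)  # miss: executes the query
cache.get("user:42", run_query)  # hit: served from the cache
# len(calls) == 1 -- the database was queried exactly once
```

The TTL is the control knob the text mentions: shorter TTLs mean fresher data and more database load, longer TTLs mean the opposite, and explicit invalidation on write sidesteps the trade-off for keys you can track.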
Both techniques are low-effort, high-reward optimizations. Implement connection pooling and caching before you invest in replicas or sharding. They often buy enough headroom to delay more complex scaling work by months.
**Interview-Style Question**
> Q: Your user table has 500 million rows and queries are getting slow. Walk through how you would approach scaling this database.
> A: Start with the cheapest optimizations first. Add or optimize indexes on frequently queried columns. Implement connection pooling if it is not already in place. Add application-level caching with Redis for frequent lookups like user profiles. If those measures are not enough, add read replicas to offload read traffic from the primary. If the primary write throughput is also a bottleneck, shard the database by user ID using consistent hashing. With 500 million users, 16 to 32 shards would bring each shard down to 15 to 30 million rows, which is very manageable. Plan for resharding by starting with a high number of logical shards mapped to fewer physical servers, so future growth means moving logical shards rather than splitting data.
**KEY TAKEAWAYS**
* Start with vertical scaling and optimization (indexes, connection pooling, caching) before adding horizontal complexity.
* Read replicas offload read traffic from the primary. Accept small replication lag or route critical reads to the primary.
* Multi-master replication enables multi-region writes but introduces conflict resolution complexity. Use it only when you genuinely need it.
* Sharding splits data across multiple database servers. Choose your shard key carefully because it determines query efficiency and data distribution.
* Federation splits databases by function (users, orders, products). It aligns with microservices but makes cross-domain queries harder.
* Connection pooling and query caching are low-effort optimizations that should be implemented before any horizontal scaling.
## 2.5 Data Modeling
All the storage technologies and scaling techniques in the world will not save you if your data model is wrong. How you structure your data determines how fast your queries run, how easily your system scales, and how painful it is to add new features six months from now.
**Schema Design for Different Use Cases**
There is no one-size-fits-all schema design. The right structure depends entirely on how the data will be accessed.
For a transactional system (e-commerce orders, banking), normalize aggressively. Each entity gets its own table. Relationships are explicit through foreign keys. This minimizes redundancy and makes writes safe and consistent. Queries involve joins, but transactional databases handle joins well when properly indexed.
For a read-heavy analytical system (dashboards, reports), denormalize. Precompute joins and store the results in wide tables or materialized views. Analytical queries scan large amounts of data and perform aggregations. Having all the data in one table eliminates join overhead and makes those scans faster.
For a document-based system (content management, user profiles), embed related data within the document. A blog post document might include the title, body, author name, tags, and comments all in one object. This eliminates the need for joins and makes reads fast because everything you need is in one place. Updates to embedded data (like changing an author's name across all their posts) become the pain point.
For a time-series system (metrics, IoT), design around time partitions. Data arrives in chronological order and is queried by time ranges. Partition by time intervals (hourly, daily, weekly) so that queries that ask "show me metrics from the last 24 hours" only touch recent partitions and ignore the rest.
**Entity-Relationship Modeling**
Entity-relationship (ER) modeling is the process of identifying the entities in your system, their attributes, and the relationships between them. It is the first step in designing a relational schema.
An entity is a thing your system tracks: a user, an order, a product, a payment. Each entity has attributes: a user has a name, an email, and a creation date. Relationships connect entities: a user places many orders (one-to-many), an order contains many products and a product can be in many orders (many-to-many).
The three relationship types you will use repeatedly are one-to-one (a user has one settings record), one-to-many (a user has many orders), and many-to-many (students enroll in many courses, each course has many students).
Many-to-many relationships require a junction table (also called a join table or bridge table). A `student_courses` table with columns `student_id` and `course_id` connects the students and courses tables.
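The junction table is easiest to see in working SQL. A minimal sketch using Python's built-in SQLite driver and an in-memory database; the table and column names follow the students-and-courses example above, and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory database: two entity tables plus the junction table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE student_courses (
        student_id INTEGER REFERENCES students(id),
        course_id  INTEGER REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)  -- one enrollment per pair
    );
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO courses  VALUES (10, 'Databases'), (20, 'Networks');
    INSERT INTO student_courses VALUES (1, 10), (1, 20), (2, 10);
""")

# "Which students take Databases?" -- join through the junction table.
rows = db.execute("""
    SELECT s.name
    FROM students s
    JOIN student_courses sc ON sc.student_id = s.id
    WHERE sc.course_id = 10
    ORDER BY s.name
""").fetchall()
# rows == [('Ada',), ('Lin',)]
```

The composite primary key on `(student_id, course_id)` is doing real work: it both prevents duplicate enrollments and gives the join an index to use.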
ER modeling forces you to think clearly about what your system stores and how the pieces relate before you write any code. In an interview, sketching an ER diagram on the whiteboard shows the interviewer that you think about data systematically rather than jumping straight to API design.
**Data Partitioning Strategies: Range, Hash, List, Composite**
When your dataset grows beyond what a single machine can handle, you partition it. The partitioning strategy determines how data is distributed across partitions.
Range partitioning assigns rows to partitions based on value ranges. Orders from January go to partition 1, February to partition 2, and so on. Range partitioning is natural for time-ordered data and supports efficient range queries. The risk is uneven distribution: if December has 10x the orders of July, the December partition is overloaded.
Hash partitioning applies a hash function to a column value and assigns rows to partitions based on the hash result. This distributes data evenly regardless of the actual values, eliminating hot partitions. The downside is that range queries become expensive because related rows may scatter across different partitions.
List partitioning assigns rows to specific partitions based on discrete values. Users in the US go to partition A, users in Europe go to partition B, users in Asia go to partition C. This is useful when you want geographic isolation or when specific values naturally group together.
Composite partitioning combines two strategies. You might range-partition by date and then hash-partition within each date range by user ID. This gives you efficient time-range queries while distributing the data evenly within each time range.
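That composite scheme reduces to a small partition-naming function. A sketch under stated assumptions: monthly range partitions, 8 hash sub-partitions, and a made-up `orders_<month>_h<bucket>` naming convention.

```python
import hashlib
from datetime import date

NUM_SUBPARTITIONS = 8

def partition_for(order_date: date, user_id: int) -> str:
    """Composite partitioning: range-partition by month, then
    hash-partition within the month by user ID."""
    month = order_date.strftime("%Y-%m")  # range component
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) \
        % NUM_SUBPARTITIONS                # hash component
    return f"orders_{month}_h{bucket}"

partition_for(date(2024, 12, 3), 42)
# A "show me December's orders" query touches only that month's
# 8 buckets; within the month, user traffic spreads evenly across them.
```

The two levels solve different problems: the month prefix lets time-range queries prune partitions, while the hash suffix stops any single busy day or popular user from creating a hot partition.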
| Strategy | Distributes By | Even Distribution? | Range Queries? | Best For |
|---|---|---|---|---|
| Range | Value ranges | Can be uneven | Excellent | Time-series, ordered data |
| Hash | Hash of a column | Very even | Poor (data scattered) | User data, session data |
| List | Explicit value mapping | Depends on data | Within partition only | Geographic data, categorical data |
| Composite | Two strategies combined | Good | Good for primary partition key | Large-scale systems needing both even distribution and range queries |
**Choosing the Right Database for the Job**
Picking a database is not an academic exercise. It is a decision that your team will live with for years. Here is a practical framework.
First, ask what kind of data you are storing. Structured data with relationships points to SQL. Semi-structured or hierarchical data points to document stores. Time-ordered data points to time-series databases. High-dimensional embeddings point to vector databases.
Second, ask what your access patterns look like. Simple key-based lookups favor key-value stores. Complex queries with joins favor relational databases. Relationship traversals favor graph databases. Full-text search favors search engines like Elasticsearch.
Third, ask what your scale and availability requirements are. If you need to handle hundreds of thousands of writes per second across multiple regions, Cassandra or DynamoDB are built for that. If you have moderate scale and need strong consistency, PostgreSQL with read replicas is often enough.
Fourth, ask what your team knows. A database your team has deep experience with will outperform a theoretically better database that nobody knows how to operate, tune, or troubleshoot. Operational expertise matters more than technical specifications on paper.
Fifth, accept that you will likely use multiple databases. Very few production systems of any significant scale rely on a single database technology. Use the right tool for each job. Let PostgreSQL handle your transactional data. Let Redis handle your caching. Let Elasticsearch handle your search. Let S3 handle your files. Each technology does its job better than any single database could do all of them.
**Interview-Style Question**
> Q: You are designing a ride-sharing service like Uber. What databases would you use and why?
> A: Multiple databases for different needs. PostgreSQL for core transactional data: user accounts, payment methods, ride history, and driver profiles, where ACID compliance matters for payments and billing. Redis for real-time data: driver locations updated every few seconds, session tokens, and surge pricing calculations that need sub-millisecond access. A time-series database (like InfluxDB or TimescaleDB) for operational metrics: ride request rates, average wait times, driver utilization. A spatial database or geospatial index (PostGIS extension on PostgreSQL, or a dedicated solution) for proximity queries like "find all available drivers within 2 kilometers." Cassandra or DynamoDB for the location event stream if the write volume is extremely high (millions of location updates per second across all active drivers).
_Entity Relationship Diagram_
**KEY TAKEAWAYS**
* Schema design should follow your access patterns. Normalize for transactional systems. Denormalize for analytical systems. Embed for document systems.
* Entity-relationship modeling clarifies your data structure before you write code. Identify entities, attributes, and relationships first.
* Range partitioning is efficient for time-ordered data. Hash partitioning distributes data evenly. Composite partitioning combines both strengths.
* Choose databases based on data type, access patterns, scale needs, and your team's operational expertise, in that order.
* Most production systems use multiple databases. Use each technology for what it does best rather than forcing one tool to do everything.
> Up Next: Your system can now store and retrieve data efficiently across multiple storage technologies. But the difference between a good system and a great one is often speed, and nothing improves speed like caching. Part II, Lesson 3 covers caching fundamentals, strategies, eviction policies, and the real-world challenges that make caching one of the hardest problems to get right.