## Scaling Strategies
Scalability is your system's ability to handle increasing load without degrading performance.
A system that responds in 100 milliseconds with 1,000 users but takes 10 seconds with 100,000 users is not scalable.
A system that maintains its performance characteristics as load grows, whether by adding resources or by distributing work more efficiently, is scalable.
**Vertical Scaling (Scale Up) vs. Horizontal Scaling (Scale Out)**
Vertical scaling means making your existing machine more powerful. You upgrade from 8 GB of RAM to 64 GB. You swap a 4-core CPU for a 32-core one. You replace spinning hard drives with NVMe SSDs.
The application does not change.
The same code runs on a beefier machine, and everything gets faster.
Horizontal scaling means adding more machines instead of upgrading existing ones.
Instead of one server handling all requests, you run ten servers behind a load balancer, each handling a tenth of the traffic.
Instead of one database storing all your data, you shard it across four database servers, each storing a quarter.
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| What changes | Bigger machine | More machines |
| Application changes | None | Often significant (stateless design, sharding) |
| Cost curve | Exponential (2x power costs 3-4x price) | Linear (2x machines costs ~2x price) |
| Ceiling | Limited by largest available machine | No theoretical limit |
| Downtime | Usually requires restart | Can add machines with zero downtime |
| Complexity | Low | Higher (distributed systems challenges) |
Vertical scaling is the easiest first response to performance problems because it requires zero code changes. But it has a hard ceiling.
The most powerful single server you can buy in 2026 still cannot handle what ten modest servers can handle together.
And as you approach the top end of available hardware, costs escalate dramatically.
A server with 1 TB of RAM costs far more than ten servers with 100 GB each.
Horizontal scaling has no theoretical ceiling. You can always add another machine. But it introduces distributed systems challenges: data consistency across machines, network communication overhead, service discovery, and the need for stateless application design.
Moving from one server to two is the hardest jump.
Moving from two to twenty is comparatively easier because you have already solved the distribution problems.
**Choosing the Right Scaling Approach for Your Application**
The decision is not binary. Most production systems use both strategies at different layers.
Your application servers are usually the easiest to scale horizontally.
Make them stateless (covered below), put a load balancer in front, and add instances as traffic grows. Your database is usually harder to scale horizontally. Read replicas are a mild form of horizontal scaling.
Full sharding is a major engineering effort. Many teams vertically scale their database (bigger machine, more RAM for caching) as long as possible before committing to horizontal database scaling.
A practical decision framework: start with vertical scaling for your database and horizontal scaling for your application tier.
Vertically scale the database until costs become unreasonable or performance plateaus despite hardware upgrades. Then introduce read replicas.
Shard only when read replicas are not enough and write throughput is the bottleneck.
For application servers, start with two instances behind a load balancer from day one.
Not because you need two for capacity, but because it forces you to solve the statelessness problem early and gives you redundancy against single-machine failures.
Adding more instances later is trivial once the architecture supports it.
**Stateless Services and Session Management**
A service is stateless when any instance can handle any request without relying on data from a previous request. The server does not remember anything about the client between calls. Every request carries all the information the server needs to process it.
This is the single most enabling property for horizontal scaling.
If any server can handle any request, the load balancer can send each request to whichever server is least busy.
Adding a new server is instant because it does not need to sync state from existing servers. Removing a server has no impact because it is not holding any unique data.
The enemy of statelessness is server-local session data.
When a user logs in and the server stores their session in local memory, that user's requests must always go to that specific server (sticky sessions).
Sticky sessions prevent free load distribution, make scaling harder, and cause session loss when a server goes down.
The fix is to externalize session state.
Store sessions in a shared external store that all servers can access. Redis is the most common choice for session storage because of its low latency and built-in TTL support. Every server reads from and writes to the same Redis instance.
The user can hit any application server on any request, and their session is always available.
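The externalized-session pattern can be sketched in a few lines. The dict-backed store below is a stand-in so the example stays self-contained; in production the backing store would be Redis, where `SETEX` gives you the TTL expiry natively. The class name and interface are illustrative, not from any specific framework.

```python
import time
import uuid

class SessionStore:
    """Shared session store that every application server reads and writes.

    A dict stands in for Redis here. With redis-py the create/get calls
    would map roughly to r.setex(...) and r.get(...), and Redis would
    handle key expiry itself.
    """

    def __init__(self, ttl_seconds=1800):
        self.ttl = ttl_seconds
        self._store = {}  # session_id -> (expires_at, session_data)

    def create(self, user_data):
        session_id = str(uuid.uuid4())
        # With Redis: r.setex(f"session:{session_id}", self.ttl, json.dumps(user_data))
        self._store[session_id] = (time.time() + self.ttl, user_data)
        return session_id

    def get(self, session_id):
        entry = self._store.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        if time.time() > expires_at:  # Redis would expire the key on its own
            del self._store[session_id]
            return None
        return data

# Because every server talks to the same store, any server can resolve
# any session: the state lives with the store, not with a server.
store = SessionStore(ttl_seconds=60)
sid = store.create({"user_id": 42})
```

The session ID travels with the client (typically in a cookie), and each request resolves it against the shared store, so the load balancer is free to pick any server.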
Other state that must be externalized includes file uploads (store in object storage like S3, not on the local filesystem), in-memory caches that all instances need (use a distributed cache like Redis), and scheduled tasks (use a centralized scheduler, not a cron job on each server).
**Database Scaling Strategies Recap**
Database scaling was covered in depth in Part II, Lesson 2, Section 2.4. Here is a concise summary of the progression.
Step 1: Optimize what you have. Add indexes, optimize queries, implement connection pooling. This is free performance that many teams skip.
Step 2: Vertically scale. More RAM means more data fits in the database's buffer pool, reducing disk reads. Faster CPUs speed up query processing. This buys time without complexity.
Step 3: Add a caching layer. Put Redis between your application and database. Cache frequently read data so most requests never reach the database. This often reduces database load by 80% or more.
Step 4: Add read replicas. Offload read queries to replicas. Keep writes on the primary. Works well for read-heavy workloads.
Step 5: Shard the database. Split data across multiple database servers by a partition key. This is the nuclear option. It scales writes and storage but adds significant application complexity (cross-shard queries, resharding, data distribution management).
Each step is more complex and expensive than the previous one. Exhaust the cheaper options before escalating to the next level.
**Auto-Scaling: Policies, Metrics, and Cooldowns**
Auto-scaling automatically adds or removes server instances based on real-time metrics. Instead of manually spinning up servers before a traffic spike, the auto-scaler detects the spike and adds capacity within minutes.
Scaling policies define the rules. A target tracking policy might say "keep average CPU utilization across all instances at 60%." If utilization rises above 60%, add instances. If it drops below, remove them. A step scaling policy defines thresholds: "if CPU exceeds 70%, add 2 instances; if it exceeds 90%, add 5 instances."
Scaling metrics are the signals the auto-scaler watches. Common metrics include CPU utilization, memory usage, request count per instance, request latency, and custom metrics like queue depth. The best metric depends on your workload. CPU-bound applications scale well on CPU utilization. I/O-bound applications might scale better on request latency or queue depth because their CPU stays low even when they are overloaded.
Cooldown periods prevent the auto-scaler from making rapid, oscillating changes. After scaling out (adding instances), the cooldown period (typically 3 to 5 minutes) prevents further scaling actions until the new instances are warmed up and handling traffic. Without a cooldown, the auto-scaler might see high CPU, add 10 instances, then see low CPU because the new instances are still starting up, remove 8 instances, see high CPU again, and add more. This oscillation wastes resources and destabilizes the system.
Predictive scaling goes a step further by learning your traffic patterns and scaling proactively. If your system consistently gets a traffic surge every Monday at 9 AM, predictive scaling starts adding instances at 8:45 AM before the surge arrives. AWS Auto Scaling, Google Cloud Autoscaler, and Kubernetes Horizontal Pod Autoscaler all support some form of predictive scaling.
| Policy Type | How It Works | Best For |
|---|---|---|
| Target tracking | Maintain a metric at a target value | Steady, predictable scaling |
| Step scaling | Add/remove specific amounts at specific thresholds | Aggressive response to sudden spikes |
| Scheduled scaling | Scale at predetermined times | Known traffic patterns (business hours, events) |
| Predictive scaling | ML-based forecasting of future traffic | Recurring traffic patterns |
**Interview-Style Question**
Q: Your web application auto-scales based on CPU utilization with a target of 60%. After a marketing campaign, traffic spikes 5x. The auto-scaler adds instances, but users still experience slow responses for several minutes. What is going wrong and how do you fix it?
A: Two likely issues. First, instance startup time. New instances need to boot, initialize the application, warm up caches, and establish database connections. This takes 2 to 5 minutes during which the instance is not handling traffic but the auto-scaler counts it as capacity. Fix: use pre-built AMIs or container images with everything pre-installed, and implement readiness probes that only route traffic to an instance after it is fully ready. Second, the scaling policy may be too conservative. A target tracking policy at 60% CPU reacts after utilization exceeds the target, which is already too late during a 5x spike. Fix: add a scheduled scaling action before known campaigns, use step scaling for faster reaction (add 10 instances immediately when CPU exceeds 80% rather than gradually adding 1 or 2), and lower the target to 40% so scaling triggers earlier.
_Autoscaling Timeline_
### KEY TAKEAWAYS
* Vertical scaling is simple but has a ceiling. Horizontal scaling is unlimited but requires distributed design. Use both strategically at different layers.
* Make application servers stateless from the start. Externalize sessions to Redis, files to S3, and caches to a distributed cache.
* Scale your database in stages: optimize, vertically scale, add caching, add read replicas, and shard only when necessary.
* Auto-scaling adds capacity automatically based on metrics. Choose the right metric for your workload, set appropriate cooldown periods, and pre-scale for known traffic events.
* Instance startup time is the hidden cost of auto-scaling. Use pre-built images, readiness probes, and proactive scaling to minimize the gap between a traffic spike and available capacity.
**Scalability Bottlenecks**
Scaling a system is not just about adding resources. It is about finding the constraints that prevent your system from handling more load and eliminating them.
A bottleneck is any component whose capacity limits the throughput of the entire system.
Adding servers does nothing if the bottleneck is your database. Adding database replicas does nothing if the bottleneck is a single lock in your application code.
**Identifying Bottlenecks: CPU, Memory, I/O, Network**
Every server has four fundamental resources, and any one of them can become the constraint that limits everything else.
- CPU bottlenecks show up when your processors are running at or near 100% utilization. Requests queue up waiting for CPU time, and latency climbs even though memory and I/O look healthy. Common causes are computation-heavy work and inefficient algorithms. Fix by optimizing hot code paths, caching computed results, or scaling out.
- Memory bottlenecks appear when your server runs out of RAM. The operating system starts swapping to disk (which is orders of magnitude slower than RAM), garbage collection pauses become longer and more frequent, and processes get killed by the OOM (Out of Memory) killer. Memory issues often stem from caching too much data in-process, loading entire datasets into memory when only a subset is needed, or memory leaks that accumulate over time. Fix by profiling memory allocation, reducing cache sizes, fixing leaks, or vertically scaling to more RAM.
- I/O bottlenecks involve disk or network I/O. Disk I/O bottlenecks are common with databases, where queries wait for disk reads. Network I/O bottlenecks appear when a service makes many outbound calls or transfers large payloads. Symptoms include high disk wait times (iowait), slow queries despite low CPU usage, and timeouts on network calls. Fix disk I/O with SSDs, database indexing, and caching. Fix network I/O with connection pooling, payload compression, and reducing the number of outbound calls through batching.
- Network bottlenecks specifically relate to bandwidth saturation. If your servers have a 1 Gbps network link and you are pushing 900 Mbps of traffic, adding more application logic does not help. You need more bandwidth. Network bottlenecks are common when serving large files (video, images) without a CDN, when inter-service communication is excessive (chatty microservices), or when data replication between regions consumes bandwidth.
| Resource | Symptoms | Common Causes | Fixes |
|---|---|---|---|
| CPU | High utilization, request queuing | Computation-heavy work, inefficient algorithms | Optimize, cache results, scale out |
| Memory | Swapping, OOM kills, GC pauses | Large caches, memory leaks, full dataset loading | Profile, fix leaks, scale up |
| Disk I/O | High iowait, slow queries | Missing indexes, insufficient RAM for buffer pool | SSD, indexing, caching |
| Network | Bandwidth saturation, high latency | Large payload transfer, chatty services | CDN, compression, batching |
The first step in fixing any performance problem is measurement.
Do not guess which resource is the bottleneck.
Instrument your system with monitoring (covered in Part IV, Lesson 2) and let the data tell you.
Engineers who guess wrong end up adding more application servers when the real problem is a missing database index.
**Single Points of Failure (SPOF)**
A single point of failure is any component whose failure brings down the entire system.
If your architecture has one database server, one load balancer, or one DNS server, the failure of that one component means total outage.
SPOFs are the most dangerous scalability and reliability risk in any system. They do not just limit growth. They create catastrophic failure modes where a single hardware malfunction or software crash takes down everything.
Identifying SPOFs requires tracing the path of a user request through your entire system and asking at each step: "if this component fails, what happens?" If the answer is "everything stops," you have found a SPOF.
Eliminating SPOFs requires redundancy at every layer.
Run at least two application servers behind a load balancer.
Run the load balancer in an active-passive or active-active configuration. Use a database with a hot standby that takes over automatically if the primary fails. Use multiple DNS providers or at least multiple DNS servers.
Store data in multiple availability zones.
Not every SPOF needs to be eliminated immediately.
A startup with 100 users can tolerate a few SPOFs because the cost of redundancy outweighs the business impact of occasional downtime.
A payment platform processing billions of dollars cannot tolerate any. Your tolerance for SPOFs should match your system's business criticality and your availability SLA.
**Hot Spots and Data Skew**
Even when you scale horizontally, some components can be overwhelmed if traffic is unevenly distributed. A hot spot is a server or partition that receives disproportionately more traffic than its peers.
Hot spots commonly occur in sharded databases.
If you shard by user ID and one user (say, a celebrity with 50 million followers) generates 1,000x more read traffic than average users, the shard holding that user becomes overloaded while other shards sit idle. This is data skew: the data or traffic is not evenly distributed across partitions.
Hot spots also appear in caches (a viral piece of content hammers one cache key), message queues (one partition receives far more messages than others), and application servers (if sticky sessions route all of a popular user's traffic to one server).
Fixes depend on the layer.
For caches, replicate hot keys across multiple nodes (covered in Part II, Lesson 3).
For databases, add a secondary index or a separate cache specifically for hot records.
For sharding, use consistent hashing with virtual nodes to distribute data more evenly, or split hot shards into smaller pieces. Some systems detect hot spots automatically and redistribute traffic in real time.
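Consistent hashing with virtual nodes, mentioned above as a fix for skew, can be sketched as follows. Each physical node is hashed onto the ring many times, which smooths out the key distribution. The shard names and the choice of MD5 are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent hash ring with virtual nodes.

    Each node is placed on the ring `vnodes` times, so keys spread more
    evenly and adding or removing a node only moves a small slice of keys.
    """

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")      # one ring position per virtual node
            bisect.insort(self._ring, (h, node))

    def get_node(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):             # wrap around past the last position
            idx = 0
        return self._ring[idx][1]              # first node clockwise from the key

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
```

With a single ring position per node, one node can end up owning a huge arc of the ring; the virtual nodes shrink each arc so no single node becomes a structural hot spot.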
**Connection Limits and Thread Pool Exhaustion**
Every server has a finite number of connections it can maintain simultaneously. A database server might support 500 concurrent connections.
A web server might handle 10,000 concurrent TCP connections. When those limits are reached, new requests queue up or get rejected.
Thread pool exhaustion is a related problem. Many application frameworks process requests using a pool of worker threads. If all threads are busy (especially if some are waiting on slow I/O operations), new requests sit in a queue. If the queue fills up, requests are dropped.
These limits are insidious because they do not show up as high CPU or memory usage. The server's resources look fine, but it stops accepting new work because all its connection slots or threads are occupied.
Fixes include connection pooling (sharing a smaller number of database connections across many requests), asynchronous I/O (using non-blocking frameworks like Node.js or Netty that handle thousands of connections per thread), timeout tuning (ensuring that stuck connections are closed rather than holding slots forever), and increasing the maximum connections (though this is a band-aid; the underlying problem is usually slow downstream calls or missing timeouts).
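A bounded connection pool, the first of those fixes, can be sketched with a blocking queue. The `make_conn` factory is a stand-in for a real database driver's connect call; the timeout turns silent queueing into a visible, handleable failure.

```python
import queue

class ConnectionPool:
    """Sketch of a fixed-size connection pool.

    Many request handlers share a small set of connections instead of each
    opening its own, which keeps the database's connection count bounded.
    """

    def __init__(self, make_conn, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(make_conn())

    def acquire(self, timeout=2.0):
        # Blocks until a connection frees up; raises queue.Empty on timeout
        # so an exhausted pool fails fast instead of queueing forever.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

counter = iter(range(1000))
pool = ConnectionPool(lambda: f"conn-{next(counter)}", size=3)
```

The timeout on `acquire` is the important detail: it is exactly the "timeout tuning" fix above, ensuring a slow downstream cannot hold callers hostage indefinitely.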
**Thundering Herd Problem**
A thundering herd happens when a large number of processes or requests are all waiting for the same event, and when that event occurs, they all activate simultaneously and overwhelm the system.
The most common scenarios are cache expiration (a popular cache entry expires and thousands of requests simultaneously hit the database to rebuild it, which was covered in Part II, Lesson 3 as cache stampede), service recovery (a failed service comes back online and all queued requests flood it at once), lock release (many threads waiting on a lock all wake up when it is released, but only one can acquire it; the rest waste CPU competing), and auto-scaling (new instances come online and health checks mark them as healthy simultaneously, causing a sudden traffic shift).
The general pattern for preventing thundering herds is to stagger the activation.
Use jittered retries so that failed requests retry at random intervals rather than all at the same moment.
Use gradual traffic ramp-up when bringing a recovered service back online.
Use lock-based cache rebuild (covered in Part II, Lesson 3) so only one request rebuilds the cache while others wait.
Use gradual health check promotion so new instances absorb traffic incrementally rather than all at once.
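The jittered-retry technique from the list above can be sketched with the "full jitter" variant: each attempt waits a random amount up to an exponentially growing cap, so clients that failed together do not all retry at the same instant. The parameter values are illustrative defaults.

```python
import random

def backoff_delays(attempts, base=0.1, cap=30.0, rng=random.random):
    """Compute full-jitter retry delays.

    For attempt n, the wait is uniform in [0, min(cap, base * 2**n)),
    so retries both back off exponentially and spread out in time.
    `rng` is injectable to keep the function testable.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng() * ceiling)             # jitter: random point below the ceiling
    return delays
```

Without the jitter, every client computes the same deterministic backoff schedule and the herd simply reconvenes at each retry interval; the randomness is what actually disperses the load.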
**Interview-Style Question**
Q: Your system has 50 microservices. After a deployment, one service starts responding slowly (500ms instead of 50ms). Within minutes, response times degrade across the entire system. What is happening?
A: This is a cascading failure caused by a bottleneck in one service. The slow service is holding open connections from all upstream services that call it. Those upstream services exhaust their connection pools and thread pools waiting for the slow service, causing them to slow down as well. Their callers also back up, and the slowness propagates through the entire system like a wave. The fix has two parts: immediate and preventive. Immediately, circuit break the slow service so callers fail fast instead of waiting 500ms. Preventively, set aggressive timeouts on all inter-service calls (if normal response time is 50ms, set the timeout to 200ms). Implement circuit breakers on every service boundary. Use bulkheads to isolate connection pools per downstream service so that one slow dependency cannot exhaust all connections.
_Cascading Failure vs. Circuit Breaker_
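The circuit breaker from the answer above can be sketched as a small state machine: after enough consecutive failures it opens and callers fail fast, and after a reset timeout it lets one trial call through (half-open) to probe for recovery. Thresholds and the exception type are illustrative.

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (closed -> open -> half-open)."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, now=None):
        now = time.time() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_timeout_s:
                # Open: fail fast without tying up a thread or connection.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

The fail-fast path is what stops the cascade: callers of the slow service get an immediate error instead of holding a connection for 500ms, so their own pools never exhaust.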
### KEY TAKEAWAYS
* Bottlenecks can be CPU, memory, disk I/O, or network. Measure before you guess. Adding servers does nothing if the bottleneck is a missing index.
* Single points of failure are components whose failure crashes the entire system. Eliminate them with redundancy at every layer, proportional to business criticality.
* Hot spots and data skew cause uneven load distribution even in horizontally scaled systems. Detect them through monitoring and fix them with replication, cache distribution, or shard rebalancing.
* Connection limits and thread pool exhaustion silently cap your throughput. Use connection pooling, async I/O, and strict timeouts.
* Thundering herds happen when many processes activate simultaneously. Prevent them with jitter, gradual ramp-up, and lock-based coordination.
* Cascading failures propagate a single bottleneck across the entire system. Circuit breakers, timeouts, and bulkheads contain the blast radius.
**Designing Highly Scalable Systems**
You now understand scaling strategies and the bottlenecks that limit them.
This section ties everything together into architectural patterns that let systems grow from thousands to millions to hundreds of millions of users.
**Partitioning and Sharding at Every Layer**
Sharding is not just a database technique.
The principle of splitting a workload into independent pieces applies at every layer of your system.
At the application layer, you partition by function. The user-facing web tier, the internal API tier, and the background processing tier run as separate services that scale independently.
If your background job workers need 40 servers during a batch processing window but your web tier only needs 10, you scale them separately.
At the data layer, you shard databases by partition key (covered in Part II, Lesson 2) and distribute cache entries across a cache cluster using consistent hashing. Each shard or node handles a manageable subset of the total data.
At the queue layer, Kafka topics are divided into partitions so multiple consumers process events in parallel.
SQS and RabbitMQ distribute messages across multiple worker instances.
At the CDN and edge layer, content is distributed geographically.
Users in different regions are served by different edge servers, each handling a partition of the global traffic.
The common thread is divide and conquer. Wherever you have a single component bearing the full load, ask yourself whether it can be split into pieces that operate independently. Pieces that do not share state can scale independently without coordination overhead.
**Asynchronous Processing and Event-Driven Design**
Synchronous request chains create scaling bottlenecks because the entire chain is only as fast as its slowest component.
If your order flow synchronously calls the payment service, inventory service, notification service, and analytics service, the user waits for all four, and the system's throughput is limited by whichever service is slowest.
Event-driven design breaks these chains. The order service processes the payment (the one synchronous step the user must wait for), then publishes an "order completed" event. The inventory, notification, and analytics services each consume the event independently at their own pace. The user gets a response as soon as the payment succeeds. The other operations happen asynchronously.
This pattern does more than improve user-facing latency. It fundamentally changes how your system scales. Each consumer scales independently based on its own workload.
The notification service might need 5 instances while the analytics service needs 20. They do not compete for the same resources or block each other.
Event-driven architecture also handles failure more gracefully at scale.
If the analytics service goes down during a traffic spike, events accumulate in the queue.
When the service recovers, it processes the backlog. No events are lost, and the failure does not cascade to other services.
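The order-flow example above can be sketched with an in-memory event bus. This stand-in delivers events synchronously to keep the example runnable; a real broker (Kafka, SQS, RabbitMQ) would deliver them asynchronously and durably. All names here are illustrative.

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a message broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)  # a real broker would deliver asynchronously

bus = EventBus()
inventory_updates, notifications = [], []

# Each downstream service subscribes independently; none of them appears
# in the order service's code, and each can scale on its own.
bus.subscribe("order_completed", lambda e: inventory_updates.append(e["sku"]))
bus.subscribe("order_completed", lambda e: notifications.append(e["user_id"]))

def place_order(user_id, sku):
    # ... the one synchronous step (payment) happens here ...
    bus.publish("order_completed", {"user_id": user_id, "sku": sku})
    return "payment accepted"  # user gets a response immediately

place_order(42, "sku-123")
```

The key structural point: `place_order` has no knowledge of inventory, notifications, or analytics. Adding a fifth consumer means adding a subscription, not editing the order service.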
**Read/Write Splitting**
Most applications read data far more often than they write it.
A social media platform might have a 100:1 read-to-write ratio. Users scroll through feeds millions of times per hour. They post new content thousands of times per hour. Treating reads and writes identically means the rare writes compete with the frequent reads for the same database resources.
Read/write splitting directs write queries to the primary database and read queries to read replicas.
The primary handles the relatively small write load. The replicas handle the massive read load. Because reads do not modify data, you can add as many replicas as needed.
At the application level, you implement this with a routing layer that inspects each query.
Framework-level support exists in most ORMs and database drivers. Django, Rails, Spring, and other frameworks can be configured to automatically route reads to replicas and writes to the primary.
The complication is replication lag.
After a user creates a post, their next read might hit a replica that has not received the write yet. The user does not see their own post.
The solution is read-your-writes consistency: immediately after a write, direct that specific user's reads to the primary for a short window (a few seconds).
Once the replication lag catches up, subsequent reads can go back to replicas.
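The routing layer with read-your-writes pinning can be sketched as below. Writes stamp the user with a timestamp; for a short pin window afterward, that user's reads also route to the primary. The replica names and the round-robin read strategy are illustrative.

```python
import time

class QueryRouter:
    """Sketch of read/write splitting with read-your-writes pinning."""

    def __init__(self, replicas, pin_window_s=5.0):
        self.replicas = replicas
        self.pin_window_s = pin_window_s   # expected worst-case replication lag
        self._last_write = {}              # user_id -> timestamp of last write
        self._rr = 0                       # round-robin index over replicas

    def route_write(self, user_id, now=None):
        self._last_write[user_id] = time.time() if now is None else now
        return "primary"                   # all writes go to the primary

    def route_read(self, user_id, now=None):
        now = time.time() if now is None else now
        last = self._last_write.get(user_id)
        if last is not None and now - last < self.pin_window_s:
            return "primary"               # recent writer: read their own write
        self._rr = (self._rr + 1) % len(self.replicas)
        return self.replicas[self._rr]     # everyone else spreads across replicas

router = QueryRouter(["replica-1", "replica-2"])
```

Only the user who just wrote pays the cost of hitting the primary, and only briefly; the bulk of read traffic still lands on replicas.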
**CQRS (Command Query Responsibility Segregation)**
CQRS takes read/write splitting to an architectural level. Instead of splitting at the database layer, CQRS uses completely separate models (and often separate databases) for reads and writes.
The write side (command side) receives commands like "create order" or "update profile." It validates the command, applies business rules, and writes to a database optimized for writes. This might be a normalized relational database or an event store.
The read side (query side) serves read requests like "show user's order history" or "display product catalog." It reads from a database optimized for reads, often a denormalized database, a search index, or a materialized view that is structured specifically for the query patterns the application needs.
The two sides are kept in sync through events. When the write side processes a command, it publishes an event.
A projection component consumes the event and updates the read-side database.
There is a delay (milliseconds to seconds) between a write and its appearance on the read side. This is eventual consistency by design.
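The command-event-projection loop can be sketched end to end. Lists and dicts stand in for the write database, the broker, and the read database; the event shape and handler names are illustrative, and the projection is run explicitly here where a real one would run continuously.

```python
events = []           # stand-in for the message broker
write_store = {}      # command side: normalized, keyed by order_id
order_history = {}    # query side: denormalized, keyed by user_id

def handle_create_order(order_id, user_id, total):
    """Command side: validate, persist, then publish an event."""
    if total <= 0:
        raise ValueError("total must be positive")
    write_store[order_id] = {"user_id": user_id, "total": total}
    events.append({"type": "order_created", "order_id": order_id,
                   "user_id": user_id, "total": total})

def run_projection():
    """Read side: consume events and update the query-optimized view.

    In a real system this runs continuously and lags the write side by
    milliseconds to seconds -- the eventual consistency described above.
    """
    while events:
        e = events.pop(0)
        order_history.setdefault(e["user_id"], []).append(
            {"order_id": e["order_id"], "total": e["total"]})

handle_create_order("o-1", user_id=42, total=99.0)
handle_create_order("o-2", user_id=42, total=15.0)
run_projection()
```

Notice that "show user 42's order history" is now a single dictionary lookup: the read model is pre-shaped for the query, which is the whole point of paying for two models.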
| Aspect | Traditional Architecture | CQRS |
|---|---|---|
| Data model | One model for reads and writes | Separate models for reads and writes |
| Database | One database (or replicas) | Potentially different databases for each side |
| Read optimization | Limited by write-side schema | Read-side schema tailored to query patterns |
| Write optimization | Limited by read-side queries | Write-side schema optimized for command processing |
| Consistency | Immediate (same database) | Eventual (sync through events) |
| Complexity | Lower | Higher |
CQRS is powerful when your read and write patterns are dramatically different.
If your writes are complex (validating business rules, maintaining consistency across entities) and your reads are simple but high-volume (displaying precomputed views), CQRS lets you optimize each side independently.
CQRS is overkill for simple CRUD applications where the read and write models are nearly identical.
The added complexity of maintaining two models, syncing through events, and handling eventual consistency is not justified unless the performance or scaling benefit is substantial.
**Scaling from Zero to Millions of Users: A Step-by-Step Guide**
Here is how a system typically evolves as it grows, using a concrete progression that ties together every concept covered in Parts II and III.
Stage 1: Single server (0 to 1,000 users). One machine runs the application, database, and a basic web server. Everything is on one box. Deployments are simple. Monitoring is basic. You are optimizing for speed of development, not scale.
Stage 2: Separate the database (1,000 to 10,000 users). Move the database to its own server. The application server and database server can now be scaled independently. Add connection pooling. Set up automated backups.
Stage 3: Add a load balancer and multiple app servers (10,000 to 100,000 users). Make the application stateless. Move sessions to Redis. Put two or more application servers behind a load balancer. Add health checks. Set up auto-scaling for the application tier.
Stage 4: Add caching and a CDN (100,000 to 500,000 users). Deploy Redis as an application-level cache for frequent database queries. Put static assets on a CDN. Add HTTP caching headers. Your database load drops dramatically.
Stage 5: Add read replicas (500,000 to 2 million users). The database primary is becoming a bottleneck for reads. Add read replicas and route read queries to them. Implement read-your-writes consistency for the user who just wrote.
Stage 6: Introduce message queues and async processing (2 million to 10 million users). Move email sending, notifications, analytics, and other non-critical work to background queues. The request path becomes faster because it only does the essential synchronous work. Workers process the rest asynchronously.
Stage 7: Shard the database and go multi-region (10 million to 100+ million users). The primary database cannot handle write volume alone. Shard by user ID or another partition key. Deploy to multiple geographic regions with GSLB routing users to the nearest region. Each region has its own application tier, cache layer, and database shards. Cross-region replication keeps data synchronized.
This is not a rigid template. Some systems hit database bottlenecks before adding caching. Others need message queues before they need read replicas. The stages represent a common pattern, not a universal law. The key is to solve each bottleneck as it becomes real rather than pre-building infrastructure for problems you do not have yet.
_Architecture Evolution for Scaling_
### Beginner Mistake to Avoid
The most expensive scaling mistake is optimizing for a problem you do not have yet.
Engineers who read about sharding, CQRS, and event sourcing sometimes try to implement all of them on day one for an application with 200 users. They spend months building infrastructure for a million-user scale and never reach 10,000 users because the product itself was not validated.
Scale when you need to, not when you can.
Stage 1 through 3 of the progression above can be built in weeks.
Stages 4 through 7 should happen only in response to real, measured bottlenecks.
**Interview-Style Question**
Q: You are designing a social media application. Currently it has 50,000 users, and the product team expects to reach 5 million users within a year. How do you plan for this growth?
A: The system is currently in Stage 3 or 4 of the progression. For the immediate term: ensure the application tier is stateless and auto-scaling, implement Redis caching for the feed and profile queries that make up the bulk of reads, and put all static content on a CDN. For the medium term (approaching 1 million users): add read replicas to the database and implement read/write splitting. Move feed generation, notification delivery, and analytics to asynchronous processing through a message queue. For the longer term (approaching 5 million): monitor database write throughput closely. If the primary becomes a write bottleneck, shard by user ID. Evaluate whether CQRS makes sense for the feed system, where the write model (new posts) and the read model (personalized feeds) are structurally different. Plan for multi-region deployment if the user base is geographically distributed. At each stage, measure first and scale in response to real bottlenecks, not projections.
### KEY TAKEAWAYS
* Partition and shard at every layer: application, data, queue, and CDN. Wherever a single component bears the full load, split it into independent pieces.
* Event-driven design decouples services and lets each one scale independently. Move all non-essential processing off the synchronous request path.
* Read/write splitting directs the majority of queries to read replicas, protecting the primary database for writes. Handle read-your-writes consistency for the user who just wrote.
* CQRS separates read and write models entirely, allowing each to be optimized for its specific access patterns. Use it when read and write patterns are dramatically different, not for simple CRUD.
* Systems evolve through stages: single server, separated database, load-balanced app tier, caching and CDN, read replicas, async processing, sharding and multi-region. Scale in response to real bottlenecks, not theoretical ones.
* The most expensive mistake is building for a million users before you have a thousand. Solve the scaling problem you have today, and design so that solving tomorrow's problem does not require a full rewrite.