## Load Balancing Fundamentals
A load balancer sits between clients and your backend servers. It receives incoming requests and decides which server should handle each one. That decision, repeated millions of times per day, is what keeps your system responsive even when traffic surges.
**Why Load Balancers Are Essential**
Imagine you have one server handling your entire application. It works fine with 100 users.
At 1,000 users, response times start creeping up. At 10,000 users, the server runs out of CPU and memory, requests start timing out, and your application effectively goes offline.
The obvious fix is to add more servers.
But more servers create a new problem: how does each incoming request know which server to go to?
You cannot give users ten different IP addresses and ask them to pick one.
You need something in front of those servers that accepts all the traffic and distributes it intelligently. That is the load balancer.
Load balancers solve three problems at once. They distribute traffic so no single server is overwhelmed. They enable horizontal scaling because you can add or remove servers behind the load balancer without changing anything on the client side.
And they provide fault tolerance because if one server goes down, the load balancer stops sending traffic to it and routes requests to healthy servers instead.
Without a load balancer, scaling beyond a single server is practically impossible in a production environment. It is one of the first infrastructure components you add when a system outgrows a single machine.
**Hardware vs. Software Load Balancers**
Hardware load balancers are dedicated physical appliances built specifically for high-throughput traffic distribution.
Companies like F5 and Citrix manufacture them. They sit in your data center rack and can handle millions of connections per second with extremely low latency. They are also expensive.
A single hardware load balancer can cost tens of thousands of dollars, and you typically need at least two for redundancy.
Software load balancers run as programs on standard servers. Nginx, HAProxy, and Envoy are the most widely used.
Cloud providers offer managed software load balancers: AWS has the Application Load Balancer (ALB) and Network Load Balancer (NLB), Google Cloud has Cloud Load Balancing, and Azure has Azure Load Balancer.
Software load balancers have largely replaced hardware load balancers for most use cases. They cost a fraction of the price, are easier to configure and update, and scale horizontally by running multiple instances.
Hardware load balancers still exist in specialized environments where maximum throughput per device matters, like financial trading platforms or telecom networks.
| Aspect | Hardware Load Balancer | Software Load Balancer |
|---|---|---|
| Cost | High (tens of thousands of dollars) | Low (free open-source or pay-per-use cloud) |
| Performance | Extremely high, purpose-built silicon | High, limited by underlying server hardware |
| Flexibility | Limited, vendor-specific configuration | Highly configurable, scriptable |
| Scaling | Buy more appliances | Add more instances or use managed cloud services |
| Maintenance | Vendor support contracts | Community or in-house |
| Best for | Ultra-high-throughput, specialized environments | Everything else |
**Layer 4 (Transport) vs. Layer 7 (Application) Load Balancing**
Load balancers can operate at different layers of the network stack, and this distinction changes what information they can use to make routing decisions.
A Layer 4 load balancer works at the transport layer. It sees TCP or UDP connections and makes routing decisions based on IP addresses and port numbers. It does not inspect the content of the request.
When a connection arrives, the load balancer picks a backend server and forwards all packets for that connection to that server. It is fast because it does minimal processing per packet.
A Layer 7 load balancer works at the application layer. It understands HTTP, HTTPS, and other application protocols. It can read request headers, URLs, cookies, and even the request body.
This means it can make much smarter routing decisions: send API requests to one set of servers and static file requests to another, route requests based on geographic headers, or direct authenticated users to premium servers.
| Aspect | Layer 4 | Layer 7 |
|---|---|---|
| Operates at | TCP/UDP (transport) | HTTP/HTTPS (application) |
| Sees | IP addresses, ports | URLs, headers, cookies, body |
| Routing decisions | Based on connection metadata | Based on request content |
| Performance | Faster (less processing per packet) | Slower (parses application data) |
| Use cases | High-throughput TCP traffic, database connections, gaming | HTTP APIs, web apps, microservice routing |
| SSL termination | No (passes encrypted traffic through) | Yes (decrypts, inspects, re-encrypts or forwards) |
Most web applications use Layer 7 load balancers because the ability to route based on request content is essential for modern architectures.
You use Layer 4 when you need raw speed and do not need content-aware routing, such as distributing database connections or handling non-HTTP protocols.
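The difference is easy to see in a sketch. The snippet below contrasts the information each layer can act on; the pool and server names are made up for illustration, and a real balancer would apply a full algorithm within each pool:

```python
# Illustrative sketch of L4 vs. L7 routing decisions. Pool and server names
# are hypothetical; this is not any real load balancer's API.

def route_l4(client_ip: str, client_port: int, servers: list[str]) -> str:
    # Layer 4 sees only connection metadata: source IP and port.
    # (A real L4 balancer would hash the connection tuple consistently.)
    return servers[hash((client_ip, client_port)) % len(servers)]

def route_l7(path: str, api_pool: list[str], web_pool: list[str]) -> str:
    # Layer 7 sees the decrypted HTTP request, so it can branch on the URL path.
    pool = api_pool if path.startswith("/api/") else web_pool
    return pool[0]  # a real balancer applies an algorithm within the chosen pool

print(route_l7("/api/users", ["api-1"], ["web-1"]))   # api-1
print(route_l7("/index.html", ["api-1"], ["web-1"]))  # web-1
```

Note that `route_l4` never sees the path at all; that is exactly why content-based routing requires Layer 7.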
**Interview-Style Question**
> Q: Your system has two groups of servers: one for serving the public website and one for handling API requests. How would you route traffic to the correct group?
> A: Use a Layer 7 load balancer. Configure it to inspect the URL path of incoming requests. Requests to paths starting with `/api/` get routed to the API server group. All other requests get routed to the website server group. A Layer 4 load balancer could not do this because it does not see the URL, only the IP address and port. You could alternatively run the API on a separate subdomain (api.example.com) and use DNS routing, but a Layer 7 load balancer gives you finer control and keeps everything behind a single domain.
_L4 vs. L7 Load Balancing_
### KEY TAKEAWAYS
* Load balancers distribute traffic across multiple servers, enabling horizontal scaling and fault tolerance.
* Hardware load balancers offer extreme throughput but are expensive and inflexible. Software load balancers handle the vast majority of production workloads at a fraction of the cost.
* Layer 4 load balancers route based on IP and port. Layer 7 load balancers route based on request content like URLs, headers, and cookies.
* Most web applications need Layer 7 load balancing for content-aware routing. Layer 4 is reserved for high-throughput scenarios where content inspection is unnecessary.
## Load Balancing Algorithms
The load balancer knows about your backend servers.
A request arrives.
Which server gets it?
The algorithm that answers this question has a direct impact on how evenly your traffic is distributed, how well your system handles heterogeneous hardware, and how users experience your application.
**Round Robin and Weighted Round Robin**
Round robin is the simplest algorithm.
The load balancer maintains a list of servers and sends each request to the next server in the list, cycling through in order.
Server A gets request 1, server B gets request 2, server C gets request 3, then back to server A for request 4.
This works perfectly when all your servers have identical hardware and every request requires roughly the same amount of processing.
In practice, neither of those conditions is always true.
Weighted round robin adds a refinement. You assign a weight to each server based on its capacity.
A server with twice the CPU and memory gets twice the weight, meaning it receives twice as many requests.
If server A has weight 3 and server B has weight 1, server A handles three out of every four requests.
| Variant | How It Works | Best For |
|---|---|---|
| Round robin | Rotate through servers equally | Identical servers, similar request costs |
| Weighted round robin | Rotate with proportional distribution | Servers with different capacities |
The weakness of both variants is that they ignore the current state of each server.
A server might be slow because it is processing an expensive query, but round robin sends the next request to it anyway because it is next in line.
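Both variants fit in a few lines. Here is a minimal sketch with illustrative server names; note that production implementations (such as nginx's smooth weighted round robin) interleave weighted servers rather than sending them in bursts as this naive version does:

```python
import itertools

# Plain round robin: cycle through the server list in order.
servers = ["A", "B", "C"]
rr = itertools.cycle(servers)
print([next(rr) for _ in range(4)])  # ['A', 'B', 'C', 'A']

# Naive weighted round robin: a server with weight 3 appears three times
# per cycle, so it receives three times the traffic of a weight-1 server.
weights = {"A": 3, "B": 1}
wrr = itertools.cycle([s for s, w in weights.items() for _ in range(w)])
print([next(wrr) for _ in range(8)])  # ['A', 'A', 'A', 'B', 'A', 'A', 'A', 'B']
```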
**Least Connections and Weighted Least Connections**
Least connections routes each request to the server with the fewest active connections at that moment.
If server A has 15 active connections and server B has 8, the next request goes to server B.
This algorithm naturally adapts to uneven request processing times.
A server that is handling a slow, expensive request accumulates connections and gets fewer new ones. A server that finishes requests quickly sheds connections and gets more traffic. The distribution self-corrects in real time.
Weighted least connections combines this with capacity weights.
A large server with weight 3 and 30 connections has an effective load of 30/3 = 10.
A small server with weight 1 and 8 connections has an effective load of 8/1 = 8. The small server has lower effective load, so it gets the next request. This balances both current load and server capacity simultaneously.
Least connections is an excellent default choice for APIs and web applications where request processing times vary significantly. It does require the load balancer to track active connections per server, which adds a small amount of state management overhead.
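The selection rule reduces to a single comparison. The sketch below uses the same numbers as the example above (server names and counts are illustrative):

```python
# Weighted least connections sketch: route to the server with the lowest
# effective load = active connections / capacity weight.

def pick_server(servers: dict[str, dict]) -> str:
    return min(servers, key=lambda s: servers[s]["conns"] / servers[s]["weight"])

pool = {
    "large": {"conns": 30, "weight": 3},  # effective load 30/3 = 10
    "small": {"conns": 8,  "weight": 1},  # effective load 8/1 = 8
}
choice = pick_server(pool)
print(choice)  # small
pool[choice]["conns"] += 1  # the balancer updates its per-server state
```

Plain (unweighted) least connections is the same function with every weight set to 1.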
**IP Hash and Consistent Hashing**
IP hash takes the client's IP address, runs it through a hash function, and uses the result to determine which server handles the request.
The same IP address always maps to the same server (as long as the server pool stays constant).
This gives you session affinity without cookies or other tracking.
If a user with IP 203.0.113.50 gets routed to server B on their first request, every subsequent request from that IP also goes to server B. This is useful for applications that store session state locally on the server.
The problem with simple IP hash is that adding or removing a server changes the hash mapping for most clients, causing widespread session disruption.
Consistent hashing solves this. Servers and request keys are placed on a virtual ring. Each request gets routed to the nearest server clockwise on the ring. When a server is added or removed, only the keys that were mapped to the immediate neighbors of that server get redistributed.
Everything else stays in place. Most of your users keep hitting the same server even when the pool changes.
Consistent hashing is used widely beyond load balancing.
Distributed caches (like Memcached clusters), database sharding, and CDN routing all rely on consistent hashing to minimize disruption when the infrastructure changes.
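A minimal hash ring makes the stability property concrete. This sketch omits virtual nodes for brevity (real implementations place many virtual points per server to even out the spread), and the server names are illustrative:

```python
import bisect
import hashlib

def h(key: str) -> int:
    # A stable hash; Python's built-in hash() varies between runs.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        keys = [p for p, _ in self.points]
        i = bisect.bisect(keys, h(key)) % len(self.points)  # nearest clockwise
        return self.points[i][1]

ring3 = Ring(["server-a", "server-b", "server-c"])
ring4 = Ring(["server-a", "server-b", "server-c", "server-d"])
moved = sum(
    1 for i in range(1000)
    if ring3.lookup(f"client-{i}") != ring4.lookup(f"client-{i}")
)
# Only keys falling in server-d's arc move (and they all move TO server-d);
# every other key keeps its original server.
print(f"{moved} of 1000 keys moved")
```

Contrast this with a simple `hash(key) % len(servers)` scheme, where changing the pool size remaps almost every key.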
**Random Selection**
Random selection picks a server at random for each request. No tracking, no state, no ordering.
It sounds simplistic, but random selection performs well in practice when the server pool is large and request costs are uniform.
With 50 identical servers, random distribution produces a nearly even spread because of the law of large numbers.
The variance decreases as the number of servers and requests increases.
The advantage is zero overhead.
No connection tracking, no hash computation, no round-robin counters.
For extremely high-throughput systems where the load balancer itself is a bottleneck, random selection removes all per-request bookkeeping.
The disadvantage is that it ignores server health, capacity, and current load entirely.
Combine it with health checks so unhealthy servers are removed from the pool, and random selection becomes a viable option for certain workloads.
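That combination is only a few lines. In this sketch the health flags are illustrative stand-ins for real health check results:

```python
import random

# Random selection restricted to healthy servers.
servers = {"a": True, "b": False, "c": True}  # name -> healthy?

def pick(pool: dict[str, bool]) -> str:
    healthy = [s for s, ok in pool.items() if ok]
    return random.choice(healthy)  # no counters, no connection tracking

# "b" is unhealthy, so it is never selected.
print(pick(servers) in ("a", "c"))  # True
```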
**Resource-Based (Adaptive) Load Balancing**
Resource-based load balancing takes the most sophisticated approach. Each backend server periodically reports its current resource utilization: CPU usage, memory consumption, active request count, response time, or a custom health score.
The load balancer uses this information to route requests to the server that is in the best condition to handle them.
If server A is at 90% CPU and server B is at 30% CPU, the next request goes to server B.
If server C is responding in 10ms and server D is responding in 200ms, traffic shifts toward server C.
This approach adapts to real conditions rather than relying on assumptions about capacity or connection counts. It handles heterogeneous hardware, unpredictable workloads, and varying request costs gracefully.
The trade-off is complexity.
Backend servers need an agent or endpoint that reports health metrics.
The load balancer needs to collect, process, and act on those metrics in near-real time. Stale metrics can lead to oscillation, where all traffic suddenly shifts to a server that reported good health a second ago but is now overwhelmed.
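One common defensive pattern is to treat stale metrics pessimistically, so a server whose reports stop arriving does not keep attracting traffic on the strength of an old "idle" reading. The sketch below shows this with CPU utilization; all numbers and the report format are illustrative:

```python
import time

MAX_AGE = 5.0  # seconds before a report is considered stale (illustrative)

def effective_cpu(report: dict, now: float) -> float:
    if now - report["ts"] > MAX_AGE:
        return 1.0  # assume worst case when the metric is out of date
    return report["cpu"]

def pick(reports: dict[str, dict], now: float) -> str:
    # Route to the server with the lowest (effective) CPU utilization.
    return min(reports, key=lambda s: effective_cpu(reports[s], now))

now = time.time()
reports = {
    "a": {"cpu": 0.90, "ts": now},       # busy
    "b": {"cpu": 0.30, "ts": now},       # lightly loaded
    "c": {"cpu": 0.05, "ts": now - 60},  # looks idle, but the report is stale
}
print(pick(reports, now))  # b
```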
| Algorithm | Tracks State? | Adapts to Load? | Complexity | Best For |
|---|---|---|---|---|
| Round robin | No | No | Very low | Identical servers, uniform requests |
| Weighted round robin | No (static weights) | No | Low | Servers with known, fixed capacity differences |
| Least connections | Yes (connection count) | Yes | Medium | APIs with variable request processing times |
| IP hash | No | No | Low | Session affinity without cookies |
| Consistent hashing | No | No | Medium | Distributed caches, minimal disruption on pool changes |
| Random | No | No | Very low | Large, homogeneous server pools |
| Resource-based | Yes (health metrics) | Yes | High | Heterogeneous hardware, unpredictable workloads |
**Interview-Style Question**
> Q: You are designing a system where some API endpoints take 10ms to process and others take 5 seconds. All servers are identical. Which load balancing algorithm would you choose?
> A: Least connections. Round robin would blindly send the same number of requests to every server, but a server processing a 5-second request is effectively occupied much longer than one processing a 10ms request. With round robin, servers handling slow requests would accumulate a backlog while servers handling fast requests sit idle. Least connections automatically compensates by routing new requests to servers that have finished their work and freed up capacity. If you wanted even finer control, resource-based balancing using response time as the metric would further optimize distribution.
### KEY TAKEAWAYS
* Round robin is simple and works when servers are identical and requests are uniform. Weighted round robin handles servers with different capacities.
* Least connections is the best general-purpose algorithm for APIs where request processing times vary significantly.
* Consistent hashing minimizes disruption when servers are added or removed. It is essential for distributed caches and sharded databases.
* Random selection has zero overhead and works well for large, homogeneous server pools.
* Resource-based balancing adapts to real server conditions but adds complexity from health metric collection and processing.
* There is no universally best algorithm. The right choice depends on your server hardware, request patterns, and whether you need session affinity.
## Advanced Load Balancing
Once you understand the basics of how load balancers distribute traffic, the next set of questions is about what else they can do and how to keep them running reliably.
Load balancers in production systems handle far more than simple request routing.
**SSL/TLS Termination at the Load Balancer**
When a client connects to your application over HTTPS, the data is encrypted using TLS. Someone needs to decrypt that data before your application can process it. You have two choices: let each backend server handle encryption and decryption, or let the load balancer do it.
SSL/TLS termination at the load balancer means the load balancer decrypts incoming HTTPS requests, then forwards them as plain HTTP to backend servers over your internal network. The backend servers never deal with encryption overhead.
This has several advantages.
Encryption and decryption are CPU-intensive, especially during the TLS handshake. Offloading this work to the load balancer frees your application servers to spend their CPU cycles on actual business logic.
Certificate management is centralized in one place instead of being scattered across dozens of servers.
And a Layer 7 load balancer that terminates TLS can inspect the decrypted request content to make intelligent routing decisions.
The trade-off is that traffic between the load balancer and your backend servers travels unencrypted.
On a trusted internal network within the same data center, this is generally acceptable.
If your compliance requirements or security posture demand end-to-end encryption, you can use SSL re-encryption, where the load balancer decrypts, inspects, and then re-encrypts the traffic before forwarding it to the backend. This adds latency but satisfies strict security requirements.
**Session Persistence (Sticky Sessions)**
Some applications store user session data locally on the server.
When a user logs in on server A, their session information (authentication state, shopping cart contents, form progress) lives in server A's memory.
If the next request from that user gets routed to server B, server B has no idea who they are.
Sticky sessions solve this by ensuring that all requests from the same user go to the same server.
The load balancer identifies the user (using a cookie, IP address, or other identifier) and routes all their requests to the server that handled their first request.
This works, but it introduces problems.
If the sticky server goes down, the user loses their session and has to start over. Traffic distribution becomes uneven because some servers accumulate more long-lived sessions than others.
And scaling becomes harder because you cannot freely add or remove servers without disrupting existing sessions.
The better long-term solution is to make your application stateless. Store session data in a shared external store like Redis. Every server can handle any request from any user because session data is accessible from anywhere.
This eliminates the need for sticky sessions entirely and makes your system much easier to scale and recover from failures.
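The stateless pattern looks like this in miniature. A plain dict stands in for the shared store here; in production you would use a Redis client with an expiration (TTL) on each session key:

```python
# Stateless servers with a shared session store (a dict standing in for Redis).
session_store: dict[str, dict] = {}  # shared by ALL servers

def handle_request(server: str, session_id: str) -> dict:
    # Any server can load any session, so no sticky routing is needed.
    session = session_store.setdefault(session_id, {"cart": []})
    session["last_server"] = server
    return session

handle_request("server-A", "sess-123")["cart"].append("book")
state = handle_request("server-B", "sess-123")  # different server, same session
print(state["cart"])  # ['book']
```

Because the cart survives the switch from server-A to server-B, the load balancer is free to route every request to whichever server is least loaded.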
**Health Checks and Circuit Breaking**
A load balancer is only useful if it knows which servers are healthy and which ones are not. Health checks provide this information.
Active health checks mean the load balancer periodically sends a request (usually an HTTP GET to a dedicated `/health` endpoint) to each backend server.
If the server responds with a 200 status code, it is considered healthy.
If it fails to respond or returns an error, the load balancer marks it as unhealthy and stops routing traffic to it.
Once the server starts passing health checks again, it gets reintroduced into the rotation.
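An active health checker is essentially a loop over the pool with a failure counter per server. In this sketch the `probe` function is a stand-in for an HTTP GET to each server's `/health` endpoint, and the threshold is illustrative:

```python
# Active health check sketch: mark a server unhealthy after
# `fail_threshold` consecutive probe failures.

def check_pool(pool: dict, probe, fail_threshold: int = 3) -> None:
    for server in pool.values():
        if probe(server["addr"]) == 200:
            server["fails"] = 0
            server["healthy"] = True   # recovered servers rejoin the rotation
        else:
            server["fails"] += 1
            if server["fails"] >= fail_threshold:
                server["healthy"] = False

pool = {
    "a": {"addr": "10.0.0.1", "fails": 0, "healthy": True},
    "b": {"addr": "10.0.0.2", "fails": 2, "healthy": True},  # 2 prior failures
}
fake_probe = lambda addr: 200 if addr == "10.0.0.1" else 503  # simulated responses
check_pool(pool, fake_probe)
print(pool["a"]["healthy"], pool["b"]["healthy"])  # True False
```

Requiring several consecutive failures before evicting a server prevents a single dropped packet from needlessly shrinking the pool.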
Passive health checks monitor actual traffic instead of sending synthetic requests.
If the load balancer notices that a server is returning a high rate of 5xx errors or timing out on real requests, it marks that server as unhealthy.
This catches problems that active health checks might miss, like a server that responds to health checks but fails on real traffic.
Circuit breaking takes this a step further.
Instead of waiting for a server to fail completely, a circuit breaker tracks error rates and trips when a threshold is exceeded (say, 50% of requests to server C are failing).
While the circuit is open, no traffic is sent to server C at all.
After a cooldown period, the circuit enters a half-open state and sends a few test requests to check if the server has recovered. If it passes, the circuit closes and normal traffic resumes.
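The closed/open/half-open cycle can be sketched as a small state machine. The thresholds, window size, and cooldown below are illustrative, not values from any particular product:

```python
# Minimal circuit breaker sketch:
#   closed -> open      when the failure rate over the window crosses the threshold
#   open -> half-open   after the cooldown elapses (one test request allowed)
#   half-open -> closed on a successful probe, or back to open on failure

class CircuitBreaker:
    def __init__(self, threshold=0.5, window=10, cooldown=30.0):
        self.results = []          # last `window` outcomes: True = success
        self.window = window
        self.threshold = threshold
        self.cooldown = cooldown
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state = "half-open"  # let a test request through
        return self.state != "open"

    def record(self, success: bool, now: float) -> None:
        if self.state == "half-open":
            self.state = "closed" if success else "open"
            self.opened_at = now
            self.results.clear()
            return
        self.results = (self.results + [success])[-self.window:]
        if (len(self.results) == self.window
                and self.results.count(False) / self.window >= self.threshold):
            self.state = "open"
            self.opened_at = now

cb = CircuitBreaker()
for ok in [True] * 5 + [False] * 5:  # 50% of the window fails -> trip
    cb.record(ok, now=0.0)
print(cb.state)            # open
print(cb.allow(now=31.0))  # True -- cooldown elapsed, circuit is half-open
cb.record(True, now=31.0)  # test request succeeds
print(cb.state)            # closed
```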
**Global Server Load Balancing (GSLB)**
Standard load balancers distribute traffic across servers within a single data center or region. GSLB distributes traffic across multiple data centers or regions around the world.
When a user in Singapore makes a request, GSLB routes them to the nearest healthy data center in Asia.
When a user in Germany makes a request, they get routed to a European data center.
If the Asian data center goes down, GSLB redirects Asian users to the next closest healthy region.
GSLB typically works at the DNS level.
When a user resolves your domain, the DNS system returns the IP address of the optimal data center based on the user's geographic location, the health of each data center, and current load distribution.
Some implementations use anycast routing, where the same IP address is announced from multiple locations and network routing automatically sends users to the nearest one.
GSLB is what makes global applications possible.
Without it, a user in Tokyo accessing a server in Virginia would experience 150+ milliseconds of network latency on every single request.
With GSLB, they hit a server that might be 10 milliseconds away.
**Load Balancing vs. Failover**
Load balancing and failover solve different problems, though they are often handled by the same infrastructure.
Load balancing distributes traffic across multiple healthy servers to optimize performance and resource utilization.
All servers are actively handling traffic at the same time.
Failover is about what happens when a server (or an entire data center) goes down.
Traffic that was going to the failed server needs to be redirected to surviving servers.
Failover can be automatic (the load balancer detects the failure via health checks and redirects within seconds) or manual (an operator changes the configuration).
In practice, a well-configured load balancer provides both. It distributes traffic across healthy servers (load balancing) and automatically stops routing to failed servers (failover). GSLB adds another layer by failing over between entire regions.
**Stateless vs. Stateful Load Balancing**
A stateless load balancer treats every request independently. It does not remember which server handled the previous request from the same client.
Round robin and random selection are inherently stateless algorithms.
A stateful load balancer tracks information across requests. It remembers connection counts (for least connections), client-to-server mappings (for sticky sessions), and server health status (for adaptive routing).
Stateless load balancers are simpler to operate and scale. You can run multiple load balancer instances without synchronizing state between them. Each instance makes independent routing decisions.
Stateful load balancers offer more sophisticated routing but require state synchronization if you run multiple instances.
If load balancer A knows that server 3 has 50 active connections but load balancer B does not, their routing decisions will be inconsistent.
Most production systems run stateless load balancer algorithms (like round robin or random) combined with external state (like a shared health check registry) to get the benefits of both approaches.
**High Availability for Load Balancers: Active-Passive, Active-Active**
If your load balancer goes down, your entire system becomes unreachable.
The load balancer itself is a single point of failure unless you make it redundant.
Active-passive configuration runs two load balancers. The primary handles all traffic. The secondary sits idle, monitoring the primary's health.
If the primary fails, the secondary takes over its IP address (using a virtual IP or floating IP mechanism) and starts handling traffic.
The failover typically takes a few seconds. The downside is that the secondary's resources are wasted during normal operation.
Active-active configuration runs two or more load balancers, all handling traffic simultaneously. DNS returns multiple IP addresses for your domain, and traffic is distributed across all active instances.
If one fails, DNS health checks remove its IP address, and remaining instances absorb the traffic. Active-active uses resources more efficiently and provides better throughput than active-passive.
| Configuration | How It Works | Failover Time | Resource Efficiency | Complexity |
|---|---|---|---|---|
| Active-passive | Secondary takes over when primary fails | Seconds | Low (secondary sits idle) | Low |
| Active-active | All instances handle traffic, redistribute on failure | Near-instant (DNS-based) | High (all instances active) | Higher (state sync, DNS management) |
Most cloud-managed load balancers (like AWS ALB) are active-active by default and handle redundancy transparently.
If you are running your own load balancers on-premises, you will need to set up redundancy yourself, typically using tools like keepalived for virtual IP failover in an active-passive configuration, or DNS-based distribution for active-active.
**Beginner Mistake to Avoid**
New engineers sometimes add a load balancer and consider the problem solved. They forget that the load balancer itself needs to be redundant. They forget to configure health checks, so traffic keeps going to dead servers. They use sticky sessions instead of making their application stateless.
And they pick round robin by default without considering whether their request processing times are uniform.
A load balancer is not a set-and-forget box. It is a critical component that needs careful configuration, monitoring, and redundancy planning.
**Interview-Style Question**
> Q: Your application runs in two AWS regions: us-east-1 and eu-west-1. Users report high latency. How do you ensure users are routed to the closest region, and what happens if one region goes down?
> A: Use Global Server Load Balancing; in AWS, this means Route 53 with latency-based routing. Route 53 resolves your domain to the IP address of the region closest to the user based on network latency measurements. Configure health checks on both regions. If us-east-1 fails its health checks, Route 53 automatically stops returning its IP address, and all users (including those previously served by us-east-1) are routed to eu-west-1. Within each region, an Application Load Balancer distributes traffic across multiple servers. When us-east-1 recovers and passes health checks again, Route 53 resumes routing nearby users to it. The failover happens at the DNS level, so clients may experience a brief delay (depending on DNS TTL) during the switch.
_Global Server Load Balancing_
### KEY TAKEAWAYS
* SSL/TLS termination at the load balancer offloads encryption overhead from backend servers and centralizes certificate management.
* Sticky sessions solve the problem of server-local state but create scaling and reliability issues. Making your application stateless is the better long-term approach.
* Health checks (active and passive) and circuit breakers ensure the load balancer only sends traffic to healthy servers.
* GSLB routes users to the nearest data center globally, reducing latency and enabling regional failover.
* Load balancing distributes traffic for performance. Failover redirects traffic from failed servers. A well-configured load balancer provides both.
* The load balancer itself must be redundant. Active-passive is simpler; active-active is more efficient. Cloud-managed load balancers handle this transparently.
> Up Next: Your load balancer distributes traffic across your servers. But how do you serve static content like images, videos, and JavaScript files to users across the globe without forcing every request to travel back to your origin servers? That is the job of a Content Delivery Network. Part II, Lesson 5 covers how CDNs work, push vs. pull strategies, edge computing, and how to choose the right CDN for your system.