Chapter 4: System Design Advanced Topics

4.4 Rate Limiting & Throttling

## Why Rate Limiting Is Essential

Rate limiting controls how many requests a client can make to your system within a given time period.

When the limit is exceeded, additional requests are rejected (usually with HTTP 429 Too Many Requests) or delayed until the client is allowed to make more requests.

Without rate limiting, a single client can consume resources that belong to everyone.

One misbehaving script sending 10,000 requests per second to your API can saturate your servers, slow down the database, exhaust connection pools, and degrade the experience for every other user. Rate limiting prevents this by enforcing fair access to shared resources.

Rate limiting serves several purposes simultaneously.

* Protection against abuse: Malicious actors use brute-force login attempts, credential stuffing, web scraping, and API abuse to attack your system. Rate limiting does not stop these attacks entirely, but it makes them dramatically slower and less effective. A brute-force attack that can try 1,000 passwords per second is dangerous. One limited to 5 attempts per minute is practically harmless.

* Cost control: If your system calls a third-party API that charges per request (like a payment processor, an SMS provider, or an AI model), a bug in your code that creates an infinite loop of API calls can rack up thousands of dollars in minutes. Rate limiting on outbound calls caps the damage.

* Fair resource allocation: In a multi-tenant SaaS platform, one large customer should not be able to monopolize all the server capacity. Per-tenant rate limits ensure every customer gets a fair share of system resources.

* System stability during traffic spikes: Even legitimate traffic can overwhelm your system during viral moments, product launches, or flash sales. Rate limiting sheds excess load gracefully by rejecting requests above the threshold rather than letting the entire system crash under pressure.

* Preventing cascading failures: If a downstream service is struggling, uncontrolled upstream traffic makes things worse. Rate limiting upstream requests protects the downstream service and gives it room to recover.

Rate limiting is one of the most practical topics in system design interviews. The interviewer wants to see that you understand both the algorithms and the architectural decisions behind implementing rate limiting at scale.

## Token Bucket Algorithm

The token bucket is the most widely used rate limiting algorithm. It is flexible, simple to implement, and handles bursty traffic gracefully.

**How It Works**

Imagine a bucket that holds a fixed number of tokens.

Tokens are added to the bucket at a constant rate (say, 10 tokens per second).

The bucket has a maximum capacity (say, 50 tokens).

When a request arrives, it takes one token from the bucket. If the bucket has tokens, the request is allowed.

If the bucket is empty, the request is rejected.

The constant refill rate enforces the average rate limit: 10 requests per second over time.

The bucket capacity allows short bursts: a client that has been idle accumulates tokens up to the maximum (50) and can then fire 50 requests in rapid succession.

After the burst, the client is throttled back to the refill rate.

**Example**

Configuration: 10 tokens per second, bucket capacity of 50.

A client has been idle for 5 seconds.

The bucket is full at 50 tokens.

The client sends 30 requests instantly. 30 tokens are consumed. 20 remain.

Over the next 2 seconds, 20 more tokens are added.

If the client sends 15 requests during those 2 seconds, they consume 15 tokens and 25 remain (20 starting + 20 refilled - 15 consumed = 25). The client is never throttled because their burst and sustained rate stay within the limits.

If the client tries to send 60 requests instantly from a full bucket of 50, the first 50 succeed and the remaining 10 are rejected.

The client must wait for tokens to refill before sending more.

**Implementation**

Token bucket can be implemented with just two values stored per client: the current token count and the timestamp of the last refill.

On each request, calculate how many tokens should have been added since the last refill, update the count (capped at the bucket capacity), and then check if a token is available.

```
tokens_to_add = (current_time - last_refill_time) * refill_rate
current_tokens = min(current_tokens + tokens_to_add, bucket_capacity)
last_refill_time = current_time

if current_tokens >= 1:
    current_tokens -= 1
    allow request
else:
    reject request (429)
```

This is O(1) per request with minimal memory: two values per client.
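As a concrete illustration, here is a minimal, self-contained sketch of that refill-on-read logic in Python. The class name and the injectable `now` parameter (used so the example is deterministic and testable) are my own choices for this sketch, not taken from any particular library:

```python
import time

class TokenBucket:
    """Minimal token bucket: `refill_rate` tokens/sec, capped at `capacity`."""

    def __init__(self, refill_rate, capacity, now=None):
        self.refill_rate = refill_rate
        self.capacity = capacity
        self.tokens = capacity  # an idle client starts with a full bucket
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill lazily based on elapsed time, capped at the bucket capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # consume one token for this request
            return True
        return False          # bucket empty: reject (HTTP 429)
```

With `refill_rate=10` and `capacity=50`, a full bucket allows a burst of 50 immediate requests, after which the client is held to roughly 10 per second.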

**Why It Is Popular**

Token bucket is the default choice for most rate limiters because it naturally accommodates traffic bursts (a real user might click a button rapidly, then go idle), the two parameters (rate and burst capacity) are intuitive to configure, implementation is simple and memory-efficient, and it works well for both API rate limiting and network traffic shaping.

AWS API Gateway, Stripe, GitHub, and many other platforms use token bucket or a variant of it.

## Leaky Bucket Algorithm

The leaky bucket processes requests at a fixed, constant rate regardless of how they arrive. It is the strict counterpart to the token bucket.

**How It Works**

Requests enter a queue (the bucket).

The queue drains (leaks) at a fixed rate.

If the queue is full when a new request arrives, the request is rejected.

If there is space, the request joins the queue and waits until it is processed.

The difference from token bucket is that leaky bucket smooths out bursts entirely.

Even if 100 requests arrive in the same second, they are processed one at a time at the configured drain rate (say, 10 per second).

The first 10 are processed in the first second.

The next 10 in the next second. And so on, as long as the queue has capacity.
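One common way to sketch this without an actual queue is the "leaky bucket as a meter" variant, where a counter approximates the current queue depth and drains continuously. This is an illustrative sketch with time injected explicitly, not a specific library's API:

```python
class LeakyBucket:
    """Leaky bucket as a meter: `level` approximates the current queue depth."""

    def __init__(self, drain_rate, capacity, now=0.0):
        self.drain_rate = drain_rate  # requests drained per second
        self.capacity = capacity      # maximum queue depth
        self.level = 0.0
        self.last_leak = now

    def allow(self, now):
        # Drain whatever has leaked out since the last check.
        self.level = max(0.0, self.level - (now - self.last_leak) * self.drain_rate)
        self.last_leak = now
        if self.level + 1 <= self.capacity:
            self.level += 1   # the request joins the queue
            return True
        return False          # queue full: reject
```

With `drain_rate=10` and `capacity=10`, the eleventh simultaneous request is rejected, and capacity reopens as the queue drains.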

**Token Bucket vs. Leaky Bucket**

| Aspect | Token Bucket | Leaky Bucket |
| --- | --- | --- |
| Burst handling | Allows bursts up to bucket capacity | Smooths all traffic to a constant rate |
| Request processing | Immediate if tokens available | Queued, processed at fixed rate |
| User experience | Better (bursts are allowed) | Stricter (all requests wait in queue) |
| Implementation | Counter-based (simpler) | Queue-based (more complex) |
| Best for | API rate limiting, user-facing systems | Traffic shaping, network bandwidth management |

Leaky bucket is most useful when you need to protect a downstream system that cannot handle any burst at all.

If your payment processor accepts exactly 100 transactions per second and crashes at 101, a leaky bucket ensures exactly 100 per second reach it, regardless of how the requests arrive.

For most API rate limiting scenarios, token bucket is preferred because users expect their requests to be processed immediately, not queued.

## Fixed Window Counter

The fixed window counter is the simplest rate limiting algorithm to understand and implement. It divides time into fixed windows and counts requests within each window.

**How It Works**

Define a time window (say, 1 minute). Maintain a counter for the current window. Each request increments the counter.

If the counter exceeds the limit (say, 100 requests per minute), reject the request.

When the window expires, reset the counter to zero.

For a limit of 100 requests per minute, the windows might be 10:00-10:01, 10:01-10:02, 10:02-10:03.

A client can make 100 requests in each window. At 10:01:00, the counter resets regardless of when the previous requests were made.

**The Boundary Problem**

Fixed window has a well-known weakness at window boundaries.

If a client sends 100 requests at 10:00:55 (end of one window) and another 100 requests at 10:01:05 (start of the next window), they have made 200 requests within a 10-second span even though the limit is 100 per minute.

Both windows show 100 requests (within limits), but the burst at the boundary exceeds the intended rate.

This boundary problem means the actual peak rate can be up to 2x the configured limit during the transition between windows.

For many applications, this is an acceptable imprecision. For systems that need strict enforcement, the sliding window algorithms (below) address this flaw.

**Implementation**

Fixed window counter needs only one counter per client per window.

In Redis, you can implement it with a single key: `rate_limit:{client_id}:{window_start_timestamp}` with a value of the count and an expiry equal to the window duration. Each request does an INCR on the key and checks if the value exceeds the limit. Redis handles expiry automatically.
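The same pattern can be sketched in-memory, with a Python dict standing in for the Redis keys (in production the lookup and increment would be a single Redis `INCR`, with the key's expiry handling cleanup automatically):

```python
import time

counters = {}  # (client_id, window_start) -> request count; a Redis stand-in

def fixed_window_allow(client_id, limit, window_secs, now=None):
    now = time.time() if now is None else now
    # All requests in the same fixed window share one counter key.
    window_start = int(now // window_secs) * window_secs
    key = (client_id, window_start)
    counters[key] = counters.get(key, 0) + 1  # the INCR step
    return counters[key] <= limit
```

A new window produces a new key, so the counter effectively resets to zero at each boundary, exactly as described above.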

## Sliding Window Log

The sliding window log eliminates the fixed window's boundary problem by tracking the exact timestamp of every request.

**How It Works**

For each client, maintain a sorted log of request timestamps.

When a new request arrives, remove all entries older than the window duration (e.g., remove all timestamps older than 1 minute ago). Count the remaining entries.

If the count is below the limit, allow the request and add its timestamp to the log. If not, reject it.

This creates a true sliding window. At any point in time, the system checks exactly how many requests occurred in the last N seconds.

There is no boundary problem because the window slides continuously with each request.

**Trade-offs**

Sliding window log is the most accurate rate limiting algorithm. It never allows more than the configured limit within any window of the specified duration.

The cost is memory.

Every request timestamp must be stored. For a client making 1,000 requests per minute, the log holds 1,000 entries.

Across millions of clients, this adds up. Each rate limit check also involves scanning and pruning the log, which is more expensive than a simple counter increment.

In Redis, you can implement a sliding window log using a sorted set.

The score is the timestamp and the member is a unique request identifier. ZREMRANGEBYSCORE removes old entries.

ZCARD counts current entries. This is efficient in Redis but still consumes more memory than counter-based approaches.
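An in-memory sketch of the same idea, with a plain sorted list standing in for the Redis sorted set (the `bisect` prune mirrors ZREMRANGEBYSCORE and the length check mirrors ZCARD):

```python
import bisect

class SlidingWindowLog:
    """Track every request timestamp; prune entries older than the window."""

    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window = window_secs
        self.log = []  # sorted request timestamps

    def allow(self, now):
        # Remove timestamps that have fallen out of the sliding window.
        cutoff = now - self.window
        del self.log[:bisect.bisect_right(self.log, cutoff)]
        if len(self.log) < self.limit:
            self.log.append(now)  # record this request
            return True
        return False
```

Memory grows with the number of logged requests per client, which is exactly the trade-off discussed above.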

---

## Sliding Window Counter

The sliding window counter is a hybrid that combines the accuracy of the sliding window log with the memory efficiency of the fixed window counter. It is the algorithm most commonly used in production rate limiters.

**How It Works**

Maintain two fixed window counters: one for the current window and one for the previous window.

When a request arrives, calculate a weighted count by combining the two windows based on how far into the current window you are.

For example, with a limit of 100 requests per minute and a current time of 10:01:20 (20 seconds into the current window):

Previous window (10:00:00 to 10:01:00) had 80 requests.

Current window (10:01:00 to 10:02:00) has 30 requests so far.

The current position is 20 seconds into the 60-second window, or 33% through. The previous window's weight is 1 - 0.33 = 0.67. Estimated request count = (0.67 × 80) + 30 = 53.6 + 30 = 83.6.

Since 83.6 is below the limit of 100, the request is allowed.

This approach approximates the true sliding window by assuming requests in the previous window were evenly distributed.

The approximation is not exact, but experiments show it is accurate within a few percent for most real traffic patterns.
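The weighted estimate is a one-line calculation. This sketch reproduces the worked example above (the small difference from the text's 83.6 comes from the text rounding the weight to 0.67):

```python
def sliding_window_allow(prev_count, curr_count, elapsed_in_window, window_secs, limit):
    """Approximate a sliding window from two fixed-window counters."""
    # Weight the previous window by how much of it still overlaps the sliding window.
    prev_weight = 1 - elapsed_in_window / window_secs
    estimated = prev_weight * prev_count + curr_count
    return estimated < limit
```

For the example above (80 previous, 30 current, 20 seconds in), the estimate is (2/3 × 80) + 30 ≈ 83.3, below the limit of 100, so the request is allowed.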

**Why It Is the Best Practical Choice**

Sliding window counter needs only two counters per client (previous window count and current window count), regardless of how many requests the client makes.

Memory usage is constant per client, just like the fixed window counter.

But it avoids the 2x boundary problem of fixed windows by weighting the previous window's contribution.

| Algorithm | Accuracy | Memory Per Client | Computation Per Request | Burst Handling |
| --- | --- | --- | --- | --- |
| Token bucket | Good | 2 values (count + timestamp) | O(1) | Allows controlled bursts |
| Leaky bucket | Strict (constant rate) | Queue size | O(1) | No bursts (queued) |
| Fixed window | Boundary problem (up to 2x) | 1 counter | O(1) | Up to 2x limit at boundaries |
| Sliding window log | Exact | O(n) per client (all timestamps) | O(n) scan and prune | Exact enforcement |
| Sliding window counter | Near-exact (within ~few %) | 2 counters | O(1) | Near-exact enforcement |

**Interview-Style Question**

> Q: You are designing a public API that allows 1,000 requests per hour per API key. Which rate limiting algorithm would you choose and why?

> A: Sliding window counter. It provides near-exact enforcement without the memory cost of storing every request timestamp. Fixed window would allow up to 2,000 requests in a boundary-spanning hour, which is a 2x overshoot on a public API that paying customers rely on for fairness. Sliding window log would be accurate but storing up to 1,000 timestamps per API key across potentially millions of keys is expensive. Sliding window counter needs only two counters per key (16 bytes total), approximates the true sliding window within a few percent, and runs in O(1) per request. For a public API where clients expect predictable, fair limits but exact-to-the-request precision is not critical, the sliding window counter is the optimal trade-off between accuracy and efficiency.

## Distributed Rate Limiting Challenges

Rate limiting on a single server is straightforward.

The counter or token bucket lives in that server's memory. But when your system runs behind a load balancer with 20 application servers, each server has its own local counter.

A client sending 100 requests might hit 20 different servers, and each server's local counter shows only 5 requests.

The client effectively gets 20x the intended limit.

**The Core Problem: Shared State**

Distributed rate limiting requires a shared state that all servers can read from and write to atomically.

Redis is the standard solution. Each server checks and updates the rate limit counter in Redis for every request.

Since Redis is centralized, all servers see the same counter, and the rate limit is enforced globally.

Redis is fast enough for this role.

A single Redis instance can handle hundreds of thousands of operations per second with sub-millisecond latency.

For most applications, a single Redis instance (with a replica for failover) handles rate limiting for the entire system.

**Race Conditions**

Even with Redis, concurrent requests from the same client can create race conditions.

Two requests arrive at the same millisecond.

Both read the current counter value (99 out of a 100 limit).

Both increment to 100.

Both are allowed.

But the counter should have been 101, and one of the requests should have been rejected.

Redis solves this with atomic operations. Using INCR (atomic increment), the first request atomically increments from 99 to 100 and is allowed.

The second request atomically increments from 100 to 101, sees it exceeds the limit, and is rejected.

Lua scripts in Redis can combine multiple operations (check, increment, compare to limit) into a single atomic execution.
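A sketch of what such a script and its logic look like. The Lua source in the comment is an assumed shape, not copied from any particular product, and the Python function below it is a stand-in for the same increment-then-compare step (atomic because `INCR` returns the post-increment value, leaving no read-then-write gap):

```python
# In Redis, a Lua script along these lines runs as one atomic unit:
#
#   local count = redis.call('INCR', KEYS[1])
#   if count == 1 then redis.call('EXPIRE', KEYS[1], ARGV[2]) end
#   if count > tonumber(ARGV[1]) then return 0 end
#   return 1
#
# Pure-Python stand-in for the same increment-then-compare logic:
_counts = {}

def atomic_check(key, limit):
    _counts[key] = _counts.get(key, 0) + 1  # INCR: no separate read step
    return _counts[key] <= limit
```

Two concurrent requests at a count of 99 against a limit of 100 land on 100 and 101 respectively, so exactly one is rejected, which is the behavior the race condition above violates.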

**Latency Overhead**

Every request now includes a Redis round trip (0.5 to 2ms) for the rate limit check.

For latency-sensitive APIs, this overhead matters.

Strategies to reduce it include checking rate limits asynchronously (allow the request and check afterward, rejecting subsequent requests if the limit was exceeded), local approximation (each server maintains a local counter and periodically syncs with Redis, accepting some imprecision for lower latency), and batching (check every 10th request against Redis and estimate the rest locally).

**Multi-Region Rate Limiting**

If your system runs in multiple regions, each with its own Redis instance, rate limits are enforced per region rather than globally.

A client could get 100 requests per minute in US-East and another 100 in EU-West, effectively doubling their limit.

True global rate limiting requires either a single global Redis instance (introducing cross-region latency for every check), cross-region synchronization between Redis instances (adding complexity and eventual consistency), or accepting per-region limits that sum to a higher global allowance (the simplest approach, and often sufficient).

Most production systems choose per-region limits with a slightly lower per-region threshold so that the global total stays within acceptable bounds.

A client limited to 60 requests per minute per region across 3 regions has a theoretical maximum of 180 globally, which is managed as an acceptable overshoot.

## Rate Limiting at the API Gateway vs. Application Layer

Rate limiting can be implemented at different points in your architecture, and the choice affects what you can control and how fine-grained your limits can be.

**API Gateway Rate Limiting**

The API gateway (covered in Part II, Lesson 1) is the most common place to implement rate limiting.

All external traffic passes through the gateway, making it the natural enforcement point.

Gateway-level rate limiting handles coarse-grained policies: requests per API key per minute, requests per IP address per second, requests per tier (free users get 100/hour, paid users get 10,000/hour). Managed gateways like AWS API Gateway, Kong, and Cloudflare include rate limiting as a built-in feature with minimal configuration.

The advantage is simplicity. You configure the limit once at the gateway, and every request is checked before it reaches any backend service. Rejected requests consume minimal resources because they never pass the gateway.

The limitation is granularity.

The gateway typically does not understand your application's business logic. It cannot enforce rules like "each user can create at most 5 projects per day" or "a customer can initiate at most 3 refunds per month."

These rules depend on application-level context that the gateway does not have.

**Application Layer Rate Limiting**

Application-level rate limiting runs inside your service code. It has access to the full request context: the authenticated user, the specific operation being performed, the user's subscription tier, and any business-specific attributes.

This enables fine-grained limits: a user can post 10 comments per hour but only create 2 accounts per day.

A free-tier customer can run 100 API queries per day but a premium customer gets 10,000.

A seller can list 50 products per hour but only update prices 200 times per day.

Application-level rate limiting is more complex to implement because every service that needs rate limiting must include the logic, the Redis integration, and the configuration management.

Libraries and middleware frameworks (like rack-attack for Ruby, django-ratelimit for Python, or express-rate-limit for Node.js) simplify this.

**Combining Both Layers**

Production systems typically use both.

The API gateway enforces broad, infrastructure-level limits: total requests per API key, per IP, and per endpoint. These limits protect the system from abuse and overload.

The application layer enforces business-specific limits: per-user quotas, per-operation throttles, and tier-based allowances.

| Aspect | API Gateway | Application Layer |
| --- | --- | --- |
| Enforcement point | Before traffic reaches backend | Inside the service |
| Context available | API key, IP, endpoint, headers | Full user identity, business context |
| Granularity | Coarse (per key, per IP) | Fine (per user, per operation, per tier) |
| Implementation | Gateway configuration (often built-in) | Code in each service (libraries, middleware) |
| Resource consumption | Minimal (rejected at the edge) | Some (request reaches the service before rejection) |
| Best for | Global abuse prevention, DDoS mitigation | Business logic quotas, user-level throttling |

## Communicating Limits to Clients

Good rate limiting is transparent.

Clients should know their limits, their current usage, and when they can retry after being throttled.

Standard HTTP headers communicate this information.

`X-RateLimit-Limit: 100` tells the client the maximum requests allowed in the window. `X-RateLimit-Remaining: 23` tells the client how many requests they have left. `X-RateLimit-Reset: 1680000000` tells the client when the window resets (Unix timestamp). `Retry-After: 30` (on a 429 response) tells the client how many seconds to wait before retrying.

Including these headers lets well-behaved clients self-regulate. They can slow down as they approach the limit and back off when throttled, reducing unnecessary rejected requests.
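A small illustrative helper (the function name is my own for this sketch, not a standard library API) shows how these headers might be assembled on the server side:

```python
def rate_limit_headers(limit, remaining, reset_ts, retry_after=None):
    """Build the standard rate-limit response headers as a dict of strings."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if retry_after is not None:  # only set on 429 responses
        headers["Retry-After"] = str(retry_after)
    return headers
```

A middleware layer would attach these to every response so that clients can track their remaining budget without waiting for a rejection.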

**Beginner Mistake to Avoid**

New engineers sometimes implement rate limiting only at the application layer and skip the gateway. This means every abusive request still reaches your backend, consumes network bandwidth, passes through the load balancer, and occupies an application thread before being rejected.

A volumetric attack sending millions of requests per second will overwhelm your application even if every request is rejected, because the rejection itself consumes resources.

Gateway-level (or CDN-level) rate limiting stops abusive traffic before it reaches your backend infrastructure.

Always have at least a basic rate limit at the gateway, even if your fine-grained limits live in the application layer.

**Interview-Style Question**

> Q: You are designing a rate limiter for a multi-tenant API platform. Free-tier customers get 100 requests per hour. Paid customers get 10,000 requests per hour. The platform runs across 3 regions with 15 servers per region. How do you implement this?

> A: Two layers. At the API gateway, enforce a global per-IP rate limit (like 1,000 requests per minute) to protect against DDoS and unauthenticated abuse. This catches volumetric attacks before they reach the backend. At the application layer, implement per-customer rate limiting using a sliding window counter in Redis. When a request arrives, identify the customer from their API key, look up their tier (free or paid), and check their counter in Redis against the corresponding limit (100 or 10,000 per hour). Use Redis INCR with TTL for atomic counter management. For multi-region: run a Redis instance in each region and enforce per-region limits. Set per-region limits to the full per-customer limit divided by the number of regions (33 for free, 3,333 for paid per region), with a small buffer for uneven distribution. This slightly under-allocates but prevents a customer from exceeding their global limit by spreading requests across regions. Return `X-RateLimit-Remaining` and `Retry-After` headers on every response so clients can self-regulate.

_Rate Limiting Architecture_


**KEY TAKEAWAYS**

* Rate limiting protects your system from abuse, controls costs, ensures fair resource allocation, and prevents cascading failures during traffic spikes.

* Token bucket is the most popular algorithm. It allows controlled bursts while enforcing an average rate. It is simple, memory-efficient, and handles real-world traffic patterns well.

* Leaky bucket enforces a strict constant rate with no bursts. Use it when downstream systems cannot handle any traffic spikes.

* Fixed window counter is the simplest but allows up to 2x the limit at window boundaries. Sliding window counter fixes this with minimal added complexity.

* Sliding window log is the most accurate but consumes the most memory. Sliding window counter is the best practical compromise between accuracy and efficiency.

* Distributed rate limiting requires shared state, typically Redis. Use atomic operations (INCR, Lua scripts) to prevent race conditions.

* Implement rate limiting at both the API gateway (coarse, infrastructure-level protection) and the application layer (fine-grained, business-logic-aware limits).

* Always communicate limits to clients through standard HTTP headers so well-behaved clients can self-regulate.