Chapter 8: System Design Interview Mastery

8.15 Design a rate limiter

1. Problem Definition and Scope

We are designing a distributed middleware system that limits the number of requests a user or client can send to an API within a specific time period. The goal is to prevent abuse, ensure fair usage, and protect backend services from being overwhelmed.

  • Main User Groups:
    • End Users/Clients: Mobile apps, websites, or 3rd party developers calling the API.
    • Internal Services: The API Gateway, which enforces the decisions.
    • Admins: Engineers who configure the rules (e.g., “100 requests per minute”).
  • Scope:
    • We will design a Server-side Rate Limiter integrated into an API Gateway.
    • We will focus on the Sliding Window Counter algorithm (specifically the weighted average approach) for high accuracy and efficiency.
    • Out of Scope: Client-side throttling (easily bypassed) or Network-layer DDoS mitigation (best handled by Cloudflare/AWS Shield).

2. Clarify Functional Requirements

Must Have:

  • Throttle Requests: Accurately block requests that exceed a defined threshold (e.g., 10 req/sec).
  • Distributed State: Limits must be global. If a user hits Server A, then Server B, the count must be cumulative.
  • Granularity: Support limiting by various keys: User ID, IP Address, or API Key.
  • Standard Rejection: Return HTTP 429 Too Many Requests with a Retry-After header.
  • High Performance: The check must happen in real-time (< 20ms) as it blocks every request.

Nice to Have:

  • Soft Limits: Log a warning if a user crosses a threshold, but allow the request (monitoring mode).
  • Dynamic Rules: Ability to update limits without restarting the servers.

3. Clarify Non-Functional Requirements

  • Traffic: High scale. Assume 100 Million Daily Active Users (DAU) and 1 Billion requests per day.
  • Latency: Critical. The limiter sits on the “hot path”. It must add less than 10ms latency.
  • Availability: Very High (99.99%).
  • Reliability: Fail-Open. If the rate limiter service goes down or times out, we must allow the traffic to prevent a total system outage.
  • Consistency: Eventual consistency is acceptable, but synchronization must be fast enough (sub-second) to prevent massive abuse.
  • Data Retention: Ephemeral. Counters are only needed for the duration of the window (e.g., 1 minute).

4. Back of the envelope estimates

  • Throughput (QPS):
    • 1 Billion requests / day.
    • Seconds in a day ≈ 86,400 (use 100,000 for simple math).
    • Average QPS = 1,000,000,000 / 100,000 = 10,000 QPS.
    • Peak QPS (assume 5x burst) = 50,000 QPS.
  • Storage (Memory):
    • We need to store counters for active users in RAM (Redis).
    • Assume 10% of users are active in a specific window (10M users).
    • Per User: Key (User ID ≈ 8 bytes) + Value (2 Integers ≈ 8 bytes) + Redis Overhead ≈ 50 bytes. Total ≈ 66 bytes.
    • Total Memory = 10,000,000 * 66 bytes ≈ 660 MB.
    • Conclusion: The entire dataset fits easily into the RAM of a single Redis node, though we will shard it for CPU throughput.
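
These estimates can be sanity-checked with a few lines of Python (the figures are the ones assumed above):

```python
# Back-of-the-envelope figures from this section.
requests_per_day = 1_000_000_000
seconds_per_day = 100_000            # rounded from 86,400 for simple math

avg_qps = requests_per_day // seconds_per_day
peak_qps = avg_qps * 5               # assumed 5x burst factor

active_users = 10_000_000            # 10% of 100M DAU
bytes_per_user = 66                  # key + counters + Redis overhead (rough)
total_memory_mb = active_users * bytes_per_user / 1_000_000

print(avg_qps, peak_qps, total_memory_mb)  # 10000 50000 660.0
```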

5. API design

The rate limiter is internal middleware, but we can define the communication interface.

1. Internal Check Method (RPC/Local Call)

  • check_limit(key, limit, window_size)
    • key: String (e.g., “user_123:login”)
    • limit: Integer (e.g., 5)
    • window_size: Integer (e.g., 60 seconds)
  • Returns:
    • is_allowed: Boolean
    • remaining: Integer
    • retry_after: Integer (seconds)
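
A minimal Python sketch of this interface (the function body is a stub for illustration; a real implementation would run the Redis check described later):

```python
from dataclasses import dataclass

@dataclass
class LimitDecision:
    is_allowed: bool
    remaining: int
    retry_after: int  # seconds until the client may retry (0 if allowed)

def check_limit(key: str, limit: int, window_size: int) -> LimitDecision:
    """Ask the rate-limiter store whether `key` may proceed.

    Stub: always allows. A real implementation would execute the
    atomic sliding-window check against Redis.
    """
    return LimitDecision(is_allowed=True, remaining=limit - 1, retry_after=0)

decision = check_limit("user_123:login", 5, 60)
```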

2. Client Response (HTTP Headers)

When the Gateway responds to the client:

  • X-Ratelimit-Limit: 100
  • X-Ratelimit-Remaining: 0
  • X-Ratelimit-Retry-After: 30 (Only if blocked)

3. Admin Configuration API

  • POST /rules
    • Body: { "endpoint": "/api/v1/payment", "limit": 10, "window": "60s" }

6. High-level architecture

We will place the Rate Limiter logic inside the API Gateway. This ensures traffic is rejected at the edge of our infrastructure, saving resources for backend services.

  1. Client: Sends an HTTP request.
  2. Load Balancer: Distributes traffic to API Gateway instances.
  3. API Gateway:
    • Serves as the entry point.
    • Runs the Rate Limiter Middleware.
    • The middleware calculates the user’s key and asks Redis: “Can this user proceed?”
    • If Yes: Forwards request to Backend Service.
    • If No: Returns HTTP 429.
  4. Redis Cluster: Stores the request counters in memory. We use Redis for its speed and atomic operations (Lua scripts).
  5. Backend Services: The actual application servers (e.g., Order Service).

7. Data Model

We will use Redis (Key-Value Store) because we need extremely fast writes and atomic counters.

Algorithm: Sliding Window Counter (Weighted Average)

We choose this over “Fixed Window” (which has edge-case spikes) and “Sliding Log” (which consumes too much memory).

  • Logic: We track the count for the Current Window (e.g., current minute) and the Previous Window.
  • Formula: Estimated = (Previous_Count * Overlap_Percentage) + Current_Count
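
As a sketch, the formula in Python (the function name `estimated_rate` is illustrative; the example uses the counts from the Redis schema below):

```python
def estimated_rate(previous_count: int, current_count: int,
                   elapsed_in_window: float, window_size: float) -> float:
    """Sliding Window Counter estimate.

    The previous window is weighted by how much of the sliding window
    still overlaps it: weight = 1 - elapsed / window_size.
    """
    weight = 1 - elapsed_in_window / window_size
    return current_count + previous_count * weight

# 15s into a 60s window: the sliding window overlaps 75% of the previous minute.
print(estimated_rate(previous_count=40, current_count=15,
                     elapsed_in_window=15, window_size=60))  # 45.0
```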

Redis Schema:

We use a Redis Hash to store both counters for a user.

  • Key: limiter:{user_id}:{endpoint} (e.g., limiter:user_55:login)
  • Value (Hash):
    • current_ts: timestamp (e.g., 10:01)
    • current_count: integer (e.g., 15)
    • previous_count: integer (e.g., 40)

8. Core flows end to end

We will focus on the most critical path: The Request Evaluation Flow. This flow determines whether a user’s API request is accepted or rejected.

Since this is a distributed system, we must ensure the check is atomic (happens all at once) to prevent race conditions where two concurrent requests read the same counter value simultaneously.

Flow 1: The “Check-and-Act” Cycle (Happy Path)

This process happens for every single incoming request. It must complete in under 10 milliseconds.

1. Request Interception

  • Action: A client sends a GET /api/orders request.
  • Gateway: The API Gateway (middleware) intercepts the request before it reaches the backend Order Service.
  • Identification: The middleware extracts the User_ID (e.g., user_55) from the JWT token or API Key.

2. Constructing the Keys

  • The middleware identifies the current time window and the previous window.
  • Example:
    • Current Time: 10:01:15 (15 seconds into the minute).
    • Current Key: limiter:user_55:10:01
    • Previous Key: limiter:user_55:10:00
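
A sketch of the key construction in Python. Note one assumption: production keys typically use an epoch-based window index rather than the human-readable "10:01" labels used above:

```python
from datetime import datetime, timezone

def window_keys(user_id: str, now: datetime, window_seconds: int = 60):
    """Derive the current/previous window keys plus seconds elapsed in window."""
    epoch = int(now.timestamp())
    window_index = epoch // window_seconds
    current_key = f"limiter:{user_id}:{window_index}"
    previous_key = f"limiter:{user_id}:{window_index - 1}"
    elapsed = epoch % window_seconds  # seconds into the current window
    return current_key, previous_key, elapsed
```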

3. Atomic Check (The Lua Script)

  • The Gateway sends a single Lua script to the Redis Cluster.
  • Why Lua? Redis executes Lua scripts atomically. No other request can interrupt the script while it runs. This performs a “Get, Calculate, and Set” operation in one go, preventing race conditions.
  • The Script Logic:
    • Fetch: It retrieves the values for Current Key and Previous Key.
    • Calculate: It runs the “Weighted Average” formula (detailed below).
    • Decision:
      • If result ≤ Limit: It increments the Current Key and returns ALLOW.
      • If result > Limit: It returns BLOCK without incrementing.
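
As a sketch, the script logic can be mirrored in plain Python against an in-memory dictionary (class and field names are illustrative; in production this exact logic lives inside the Lua script so Redis executes it atomically):

```python
import time
from typing import Optional

class SlidingWindowLimiter:
    """In-memory sketch of the check the Lua script performs atomically.

    Redis gets atomicity by running the whole script in one step; this
    single-threaded Python version needs no lock, but a multi-instance
    gateway must keep this logic inside Redis.
    """

    def __init__(self, limit: int, window: int = 60):
        self.limit = limit
        self.window = window
        self.buckets = {}  # key -> {window_index: count}

    def check(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        win = int(now // self.window)
        weight = 1 - (now % self.window) / self.window

        counts = self.buckets.setdefault(key, {})
        current = counts.get(win, 0)
        previous = counts.get(win - 1, 0)

        estimated = current + previous * weight
        if estimated > self.limit:
            return False               # BLOCK without incrementing
        counts[win] = current + 1      # ALLOW: count this request
        return True
```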

4. Forwarding the Request

  • Gateway: Since Redis returned ALLOW, the Gateway forwards the HTTP request to the backend Order Service.
  • Response: The Order Service processes the data and returns it to the client. The user perceives no delay.

Flow 2: Deep Dive into the “Weighted Average” Calculation

This is the “brain” of the rate limiter. We use the Sliding Window Weighted Average algorithm to smooth out traffic spikes at the edges of the minute.

The Scenario

  • Rule: 10 requests per minute.
  • Context: It is 10:01:15.
    • The user sent 20 requests in the previous minute (10:00).
    • The user has sent 2 requests so far in the current minute (10:01).

The Math (Inside Lua Script)

  • Calculate Overlap:
    • We are 15 seconds into the current window.
    • This means our “sliding window” (the last 60 seconds) overlaps 75% with the previous minute and 25% with the current minute.
    • Weight = 1 - (15 / 60) = 0.75
  • Estimate Current Rate:
    • We take 100% of the current count and add 75% of the previous count.
    • Rate = Current_Count + (Previous_Count × Weight)
    • Rate = 2 + (20 × 0.75) = 2 + 15 = 17
  • The Verdict:
    • The estimated rate is 17.
    • The limit is 10.
    • Since 17 > 10, the request is BLOCKED.

Flow 3: Handling Rejection (The “Blocked” Path)

If the Lua script returns BLOCK, the request flow changes immediately.

1. Short-Circuiting

  • The API Gateway does not forward the request to the backend service. This protects the backend from load.

2. Local Caching (Optimization)

  • The Gateway stores a flag in its own local RAM: user_55:BLOCKED for 1 second.
  • Why? If a malicious bot is spamming 1,000 requests/sec, we don’t want to hit Redis 1,000 times. We check Redis once, block them, and then reject the subsequent 999 requests instantly at the Gateway level.
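
A minimal sketch of such a local block cache, assuming a simple dictionary with per-key expiry timestamps:

```python
import time

class BlockCache:
    """Tiny per-gateway cache of recently blocked keys (illustrative sketch).

    A key found here is rejected without a Redis round trip.
    """

    def __init__(self, ttl: float = 1.0):
        self.ttl = ttl
        self.blocked = {}  # key -> expiry timestamp

    def mark_blocked(self, key: str) -> None:
        self.blocked[key] = time.monotonic() + self.ttl

    def is_blocked(self, key: str) -> bool:
        expiry = self.blocked.get(key)
        if expiry is None:
            return False
        if time.monotonic() >= expiry:
            del self.blocked[key]     # entry expired; fall through to Redis
            return False
        return True
```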

3. Constructing the Error Response

  • The Gateway constructs a standard HTTP 429 response.
  • Headers:
    • X-Ratelimit-Limit: 10
    • X-Ratelimit-Remaining: 0
    • Retry-After: 45 (Telling the client to wait 45 seconds until the window resets).

Flow 4: Failure Handling (Fail-Open Strategy)

What happens if Redis crashes or is too slow?

1. Timeout Detection

  • The Gateway sends the Lua script but does not receive a reply within 10ms.

2. Default Decision

  • The Gateway catches the timeout exception.
  • Instead of returning an error to the user, it defaults to ALLOW.

3. Logging

  • The Gateway logs an error (“Redis Timeout”) for the DevOps team but lets the traffic pass. This ensures the Rate Limiter never causes a site-wide outage.
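
The fail-open behavior can be sketched as a thin wrapper (here `redis_check` is a stand-in for the call that executes the Lua script with a ~10ms socket timeout):

```python
def check_with_fail_open(redis_check, key: str) -> bool:
    """Fail-open wrapper: any Redis error or timeout means ALLOW."""
    try:
        return redis_check(key)
    except Exception as exc:        # timeout, connection refused, etc.
        # Log for DevOps, but never block traffic because of our own outage.
        print(f"rate limiter degraded, allowing request: {exc!r}")
        return True
```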

9. Caching and read performance

Redis is fast, but 50k QPS is a lot of network calls. We can optimize.

  • Redis as Primary Store: It holds the definitive “truth” for counters.
  • Local Memory “Short Circuit”:
    • If a user is blocked by Redis, the API Gateway instance caches that “BLOCKED” decision in its own local RAM for a short time (e.g., 1 second).
    • Benefit: If a bot scripts 1,000 requests/sec, the first one hits Redis. The next 999 are rejected instantly by the Gateway’s local memory. This saves network bandwidth and protects Redis.
  • Rule Caching:
    • Limit rules (e.g., “login = 5 req/min”) are stored in a database but cached in the Gateway’s memory. We update this cache every few minutes.

10. Storage, indexing and media

  • Storage: Redis Cluster.
  • Data Retention (TTL):
    • We set a TTL (Time To Live) on the Redis keys equal to Window_Size * 2.
    • If a user stops sending requests, their counters expire and are deleted automatically. This keeps memory usage low.
  • Persistence:
    • We can disable Redis disk persistence (RDB/AOF) or make it very infrequent.
    • Trade-off: If Redis crashes, we lose the counters. Users get a “limit reset.” This is an acceptable risk for a rate limiter in exchange for maximum write performance.

11. Scaling strategies

  • Sharding (Partitioning):
    • A single Redis node can handle ~50k-80k ops/sec. We need to handle peaks and future growth.
    • Strategy: Shard by User ID.
    • Shard = Hash(User_ID) % Number_Of_Nodes.
    • This distributes the counters evenly across the cluster. If we need more throughput, we simply add more Redis nodes.
  • Stateless Gateways:
    • The API Gateway layer is stateless. We can auto-scale the number of Gateway instances based on CPU load to handle increased traffic volume.
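
The shard-selection rule can be sketched as follows. Note one detail: we use a stable hash rather than Python's built-in `hash()`, which is randomized per process and would break routing:

```python
import hashlib

def shard_for(user_id: str, num_nodes: int) -> int:
    """Map a user to a Redis shard: Hash(User_ID) % Number_Of_Nodes."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_nodes
```

One trade-off of plain modulo hashing: adding a node remaps most keys. Consistent hashing reduces that churn, but for ephemeral counters a brief remap only resets some limits, which is usually acceptable.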

12. Reliability, failure handling and backpressure

  • Fail-Open Strategy:
    • If Redis is down, unreachable, or times out (> 10ms), the Gateway must allow the request.
    • Reason: The rate limiter is a helper tool. It should never be the cause of a total system outage. We prefer to let a few spammers through than to block all legitimate revenue-generating traffic.
  • Circuit Breakers:
    • If Redis errors spike, the Gateway opens a circuit breaker and bypasses the rate check entirely for 30 seconds, defaulting to “Allow” to let Redis recover.
  • Jitter:
    • When many users are blocked, they might retry at the exact same second. We add random “jitter” to the Retry-After header (e.g., 30s, 31s, 29s) to smooth out the retry traffic.
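
A minimal jitter helper, assuming a ±2 second spread:

```python
import random

def retry_after_with_jitter(base_seconds: int, spread: int = 2) -> int:
    """Add +/- `spread` seconds of random jitter to a Retry-After value."""
    return max(1, base_seconds + random.randint(-spread, spread))
```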

13. Security, privacy and abuse

  • Identity:
    • Prefer User ID (Authenticated) over IP. IPs are unreliable (NAT, shared WiFi).
    • Use IP limiting only as a fallback for unauthenticated endpoints (like Sign Up or Login).
  • Security of Headers:
    • For public-facing or sensitive endpoints, consider hiding the X-Ratelimit-Remaining header so attackers cannot easily reverse-engineer the exact detection thresholds.
  • Key Exhaustion:
    • Attackers might try to create millions of random keys. The TTL on Redis keys prevents memory from filling up with garbage data.

14. Bottlenecks and next steps

  • Bottleneck: Network Latency.
    • Calling Redis for every request adds latency.
    • Next Step: Implement Synchronization Batching. Gateways count requests locally for 5 seconds and then send a single INCRBY command to Redis. This drastically reduces Redis load but makes the limiter “eventually consistent” (less precise).
  • Bottleneck: Hot Keys.
    • A celebrity user or major integration might overwhelm a single Redis shard.
    • Next Step: Introduce specific “Hot User” logic where their limits are cached and enforced entirely in the Gateway’s local memory, bypassing Redis.
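
The synchronization-batching idea can be sketched as a local counter that is flushed periodically (the `flush` output is what a gateway would send to Redis as INCRBY commands):

```python
from collections import Counter

class BatchingCounter:
    """Sketch of synchronization batching: count locally, flush periodically.

    Requests are counted in gateway memory with no network call; every few
    seconds the accumulated increments are pushed to Redis in one batch,
    trading precision (eventual consistency) for far fewer Redis calls.
    """

    def __init__(self):
        self.local = Counter()

    def record(self, key: str) -> None:
        self.local[key] += 1        # no network call on the hot path

    def flush(self) -> dict:
        batch = dict(self.local)
        self.local.clear()
        return batch                # key -> increment, one INCRBY each
```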

Summary

  1. Architecture: API Gateway Middleware talking to a sharded Redis Cluster.
  2. Algorithm: Sliding Window Counter (Weighted Average) using Lua scripts for atomicity.
  3. Performance: “Fail-Open” policy and local caching for blocked users ensure <10ms overhead.
  4. Scaling: Sharding by User ID allows the system to scale linearly to billions of requests.