1. Problem Definition and Scope
We are designing a real-time suggestion system similar to the search bar on Google or Amazon. As a user types a query, the system predicts and displays a list of the most likely completed search terms to save the user time.
- User Groups:
- Searchers: End-users typing in the search box who need instant feedback.
- Data Analytics: Background processes that analyze search history to determine which terms are popular.
- Main Actions:
- User types a prefix (e.g., "iPho").
- System returns a ranked list of suggestions (e.g., "iPhone 15", "iPhone case").
- Scope:
- We will focus on the Suggestion Service (backend API) and the Data Aggregation Pipeline (ranking algorithm).
- Out of Scope: We will not design the actual Search Engine Results Page (SERP), spell checking, or complex personalization (user-specific history) in this iteration.
2. Clarify functional requirements
- Must Have:
- Prefix Matching: The system must return valid phrases that start with the user's input (e.g., "des" -> "design", "desk").
- Popularity Ranking: Suggestions must be ranked by historical search frequency (e.g., "facebook" appears before "face mask" if more people search for it).
- Low Latency: The system must respond fast enough to keep up with the user's typing speed.
- Data Freshness: The rankings should update periodically (e.g., daily) to reflect new search trends.
- Nice to Have:
- Fuzzy Matching: Handling small typos (e.g., "amzon" -> "amazon").
- Trending: Detecting viral news events within minutes.
3. Clarify non-functional requirements
- Target Users: 100 million Daily Active Users (DAU).
- Read vs. Write: Extremely read-heavy. Every keystroke triggers a read request, but users only "submit" a final search once per session.
- Latency: Critical. The API must respond in < 100ms (ideally < 20ms internal processing time) to ensure the UI doesn't feel laggy.
- Availability: High (99.99%). If the service is down, the search bar degrades to a simple text box, but we want to avoid this.
- Consistency: Eventual consistency is acceptable. It is okay if the "popularity count" for a query is a few hours old.
- Data Retention: We only need to serve the top N popular queries. We don't need to index rarely used strings.
4. Back of the envelope estimates
- Traffic (QPS):
- 100M DAU.
- Assume 10 searches per user per day.
- Total Searches = 1 billion/day.
- Assume an average of 5 keystrokes per search (each keystroke is one request).
- Total Requests = 5 billion/day.
- Average QPS = $5,000,000,000 / 86,400 \approx 58,000$ QPS.
- Peak QPS (2x average) = ~116,000 QPS.
- Storage (The Index):
- We need to store the "dictionary" of popular queries.
- Assume we keep the top 100 million distinct search terms.
- Average term length = 30 bytes.
- Index Size = $100,000,000 \times 30 \text{ bytes} \approx \mathbf{3 \text{ GB}}$.
- Conclusion: The entire dataset fits comfortably in the RAM of a modern server (which usually has 64GB+). This is the key constraint that defines our architecture.
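The arithmetic above is easy to sanity-check in a few lines. This sketch just re-derives the traffic and storage numbers from the stated assumptions:

```python
# Back-of-the-envelope numbers, re-derived from the assumptions above.
DAU = 100_000_000
SEARCHES_PER_USER_PER_DAY = 10
KEYSTROKES_PER_SEARCH = 5      # each keystroke is one read request
SECONDS_PER_DAY = 86_400

daily_searches = DAU * SEARCHES_PER_USER_PER_DAY            # 1 billion/day
daily_requests = daily_searches * KEYSTROKES_PER_SEARCH     # 5 billion/day
avg_qps = daily_requests / SECONDS_PER_DAY                  # ~58,000 QPS
peak_qps = 2 * avg_qps                                      # ~116,000 QPS

DISTINCT_TERMS = 100_000_000
BYTES_PER_TERM = 30
index_bytes = DISTINCT_TERMS * BYTES_PER_TERM               # ~3 GB

print(f"avg QPS ~ {avg_qps:,.0f}, peak ~ {peak_qps:,.0f}")
print(f"index size ~ {index_bytes / 1e9:.0f} GB")
```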
5. API design
We need a lightweight read API for the client.
1. Get Suggestions
- Method: GET
- Path: /v1/suggestions
- Parameters:
- q (string): The prefix typed so far (e.g., "sys").
- limit (int, optional): Max results to return (default 5).
- Response:
{
  "suggestions": [
    "system design",
    "system design interview",
    "system of a down"
  ]
}
- Status Codes: 200 OK, 503 Service Unavailable.
Note: We generally do not expose a POST API for adding terms. Terms are learned asynchronously from search logs.
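To make the contract concrete, here is a minimal sketch of the handler logic behind GET /v1/suggestions. The `SUGGESTIONS` dict is a hypothetical stand-in for the in-memory Trie lookup described later; the framework plumbing is omitted:

```python
# Hypothetical stand-in for the in-memory Trie: prefix -> ranked completions.
SUGGESTIONS = {
    "sys": ["system design", "system design interview", "system of a down"],
}

def get_suggestions(q: str, limit: int = 5) -> dict:
    """Return up to `limit` ranked completions for prefix `q`.
    Mirrors the GET /v1/suggestions response shape."""
    matches = SUGGESTIONS.get(q, [])
    return {"suggestions": matches[:limit]}
```

A call like `get_suggestions("sys", limit=2)` returns the top two completions; an unknown prefix returns an empty list rather than an error, matching the fail-open philosophy used elsewhere in the design.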
6. High level architecture
We will separate the system into two decoupled parts: the Online System (Read Path) for serving suggestions fast, and the Offline System (Write Path) for processing data.
Online Path (Read)
Client -> Load Balancer -> Autocomplete Service (In-Memory Trie)
Offline Path (Write/Update)
Client (Search Logs) -> Analytics Service -> Kafka -> Aggregators -> Database -> Trie Builder -> Object Storage
- Client: Sends a request on every keystroke (debounced).
- Load Balancer: Distributes traffic across Autocomplete servers.
- Autocomplete Service: Stateless web servers that hold the search index in RAM. They do not query a database during the request.
- Analytics Service / Kafka: Buffers the massive stream of search logs when users hit "Enter".
- Aggregators: Workers that count query frequencies (e.g., "iphone" was searched 10k times).
- Database: Stores the master list of query counts.
- Trie Builder: Periodically scans the DB, builds the optimized data structure, and saves it to S3.
7. Data model
Since standard SQL queries (LIKE 'sys%') are too slow for 116k QPS, we need a specialized data structure.
The Trie (Prefix Tree)
A Trie is a tree where each node represents a character.
- Root: Empty.
- Node Structure:
- char: The character (e.g., 'a').
- children: Map/Link to next nodes.
- is_word: Boolean (true if this ends a valid query).
- Top-K Cache (Crucial Optimization):
- Searching a subtree to find the top 5 most popular completions is slow ($O(N)$).
- Solution: We pre-calculate the Top 5 queries for every node during the build process and store them in the node itself.
- Example: The node 's' -> 'y' -> 's' stores a list: ["system design", "system failure", "systolic"].
- Complexity: Lookup becomes $O(1)$ (constant time) because we just read the pre-computed list.
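A minimal sketch of this node structure and the read path, assuming the `top_queries` lists are populated offline by the Trie Builder (shown in a later section):

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.is_word = False    # True if a valid query ends here
        self.top_queries = []   # pre-computed Top-K completions for this prefix

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term: str) -> None:
        """Add a query term character by character."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def lookup(self, prefix: str) -> list:
        """O(len(prefix)) walk down the tree, then an O(1) read
        of the cached Top-K list. No subtree search at query time."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.top_queries
```

Note that the per-character `char` field from the description is implicit here: each node is reached via its character key in the parent's `children` map.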
Storage Schema (Offline DB)
For the persistent storage (Cassandra or DynamoDB):
- Table: SearchFrequencies
- Columns:
- query_term (PK): String ("system design")
- count (Int): Frequency (15,400)
- last_updated: Timestamp
8. Core flows end to end
Section 8 is the "heartbeat" of the system design. It connects the static components (database, cache, servers) into dynamic workflows.
The most important takeaway from this section is that the system is split into two completely decoupled timelines: The Hot Path (Read) which happens in milliseconds, and The Cold Path (Write) which happens over hours or days.
Flow 1: The "Hot" Path (Serving Suggestions)
Goal: Speed. Everything here is optimized to respond in less than 10 milliseconds.
Constraint: No databases. No complex calculations.
This flow is what happens while the user is typing.
- User Interaction: The user types a prefix, for example, "ama".
- Request Dispatch: The browser sends a GET request. This hits a Load Balancer (LB).
- Stateless Routing: Because every Autocomplete Server has a complete copy of the index (the Trie) in its RAM, the LB can send this request to any random server. No sharding logic is required here.
- In-Memory Traversal (The Critical Step):
- The server receives "ama".
- It looks into its RAM (Heap).
- It traverses the Trie data structure: Root -> a -> m -> a.
- It lands on the node representing the prefix "ama".
- The "Pre-Computed" Fetch:
- The server does not look at all the children of "ama" to calculate popularity on the fly (that would be too slow).
- Instead, it simply reads a pre-existing list stored right there in that node: ["amazon", "amazing", "amazon prime"].
- Return: The list is serialized to JSON and sent back.
Why this works: We traded Memory for Time. By storing the answers (Top 5) inside every node of the tree, we reduced a complex graph search algorithm into a simple $O(1)$ lookup.
Flow 2: The "Cold" Path (Data Aggregation)
Goal: Accuracy and Consistency.
Constraint: Throughput. It must handle massive streams of data without losing events.
This flow runs in the background to figure out what should be in that Top 5 list tomorrow.
- The Trigger: A user searches for "amazon basics". This action generates a log event.
- Buffering (Kafka): We don't write directly to the database because search traffic is "bursty" (spiky). If 10 million people search at once (e.g., during the Super Bowl), a database might crash. Kafka acts as a shock absorber (buffer), queueing up the logs.
- Aggregation (Spark/Streaming):
- Raw logs are noisy. We might have 1,000 separate log entries for "amazon basics".
- Aggregators combine these into a single update: {"term": "amazon basics", "add_count": 1000}.
- Database Commit: These aggregated counts are written to the main database (Cassandra/DynamoDB). This DB holds the "source of truth" for all query counts.
- The Trie Builder (The Batch Job):
- Once a day (or hourly), a powerful server wakes up.
- It pulls the top 100M terms from the DB.
- It builds the Trie structure from scratch.
- Crucial Logic: For every node it creates, it looks at all descendants, finds the top 5 by count, and writes them into the node's metadata.
- Snapshotting: The builder saves this finished Trie structure as a file to Object Storage (S3).
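The builder's "crucial logic" step can be sketched as a post-order walk: each node merges its children's already-computed top lists with its own term (if any) and keeps the k highest counts. This is a simplified sketch using plain dicts rather than a production serialization format:

```python
import heapq

def build_trie(counts: dict) -> dict:
    """Build a plain-dict trie from a {term: frequency} map."""
    root = {"children": {}, "is_word": False}
    for term in counts:
        node = root
        for ch in term:
            node = node["children"].setdefault(ch, {"children": {}, "is_word": False})
        node["is_word"] = True
    return root

def precompute_top_k(node: dict, counts: dict, prefix: str = "", k: int = 5) -> None:
    """Post-order walk: fill node["top"] with the k highest-count
    completions under this prefix, so query-time lookups are O(1) reads."""
    candidates = []
    if node["is_word"]:
        candidates.append((counts[prefix], prefix))
    for ch, child in node["children"].items():
        precompute_top_k(child, counts, prefix + ch, k)
        # A child's top list already summarizes its entire subtree.
        candidates.extend((counts[term], term) for term in child["top"])
    node["top"] = [term for _, term in heapq.nlargest(k, candidates)]

counts = {"amazon": 100, "amazing": 50, "amazon prime": 80}
root = build_trie(counts)
precompute_top_k(root, counts)
# The node for prefix "ama" now holds ["amazon", "amazon prime", "amazing"]
```

Because each node reuses its children's top lists instead of re-scanning all descendants, the whole pass is a single traversal of the tree.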
Flow 3: The "Zero Downtime" Update
Section 8 briefly mentions how the "Cold" path updates the "Hot" path. This is a common interview stumbling block.
The Problem: How do we update the Trie in RAM on the live servers without stopping them from answering user queries?
The Solution (Double Buffering):
- The live Autocomplete Server detects a new file is available on S3.
- It downloads the file.
- It loads the new Trie into a separate block of memory (Side B), while still serving requests from the old Trie (Side A).
- Once Side B is fully loaded and validated, the server performs an Atomic Swap. It changes the pointer from Side A to Side B.
- Side A's memory is then freed (garbage collected).
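The steps above can be sketched as a small holder object. A dict stands in for the Trie here; the key point is that requests always read the pointer once and the swap is a single reference assignment (atomic in CPython), so no reader ever sees a half-built structure:

```python
class TrieHolder:
    """Double buffering: serve from the active Trie while a new one loads."""

    def __init__(self, initial_trie):
        self._active = initial_trie      # Side A

    def lookup(self, prefix):
        trie = self._active              # read the pointer once per request
        return trie.get(prefix, [])

    def swap_in(self, new_trie):
        # Side B (new_trie) is fully downloaded and validated before this
        # call; the swap itself is one reference assignment. The old Trie
        # becomes unreachable and is garbage collected.
        self._active = new_trie
```

In a language without a garbage collector the same pattern works with reference counting or an epoch/RCU scheme to decide when Side A can be freed.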
Summary of Differences
| Feature | Flow 1 (Hot / Read) | Flow 2 (Cold / Write) |
|---|---|---|
| Primary Goal | Low Latency (<10ms) | High Throughput & Data Integrity |
| Data Source | In-Memory RAM (Trie) | Search Logs -> Database |
| Tech Stack | Web Server + Trie Cache | Kafka, Spark, Cassandra, S3 |
| Complexity | Simple Lookup ($O(1)$) | Complex MapReduce/Aggregation |
9. Caching and read performance
- Browser Cache:
- Extremely effective here. If a user types "fast", backspaces to "fas", then types "fast" again, the browser should serve the result from its local cache.
- We set Cache-Control: max-age=300 (5 minutes).
- Server-Side Cache (The Trie):
- The Trie is the cache. By keeping the entire dataset in RAM, we eliminate database latency.
- CDN:
- We can cache responses for short, high-volume prefixes (like "a", "th", "wh") at the CDN edge. These results rarely change but account for huge traffic.
10. Storage, indexing and media
- Primary Data (Logs/Counts):
- Cassandra or DynamoDB: We need a database that handles massive write throughput (logging searches) and simple key-value updates.
- Index Storage (S3):
- We treat the Trie as a static "artifact" or blob. We store versioned snapshots in Object Storage.
- This allows us to rollback instantly if a bad update (e.g., a spam attack) pollutes the index—we just tell servers to load yesterday's file.
11. Scaling strategies
- Replication (Read Scaling):
- Since the index is ~3GB, it fits on any standard server.
- We do not need to shard (split) the data initially. We can simply use Replication.
- We spin up 20, 50, or 100 Autocomplete Servers. Each one holds the full 3GB Trie.
- The Load Balancer uses Round Robin to distribute requests.
- Sampling (Write Scaling):
- Logging every completed search (~1 billion/day, roughly 11,600 writes/sec on average) is expensive and unnecessary.
- We can sample 1 out of every 100 searches for the analytics pipeline.
- Statistically, the most popular terms will still rise to the top, but we save 99% of the processing cost.
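Sampling is a one-line decision per event; the only subtlety is scaling the counts back up when aggregating so frequencies remain comparable. A minimal sketch (`sink` is a hypothetical stand-in for the Kafka producer):

```python
import random

SAMPLE_RATE = 0.01  # keep 1 in 100 search events

def maybe_log(query: str, sink: list) -> None:
    """Forward a sampled subset of search events to the analytics pipeline."""
    if random.random() < SAMPLE_RATE:
        sink.append(query)

def estimated_count(sampled_count: int) -> int:
    """Scale a sampled count back up to an estimate of the true frequency."""
    return int(sampled_count / SAMPLE_RATE)
```

For ranking purposes the scaling factor cancels out anyway (every term is sampled at the same rate), but storing scaled estimates keeps the numbers meaningful for thresholds like "minimum count to index".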
12. Reliability, failure handling and backpressure
- Fail Open: If the Autocomplete Service crashes or times out, the frontend handles it gracefully. The suggestions dropdown simply doesn't appear; the user can still search manually.
- Circuit Breakers: If the Analytics pipeline (Kafka) is full, we drop log events. Losing some search history is acceptable; crashing the user's browser is not.
- Zero Downtime Updates: When loading a new Trie version from S3, the server loads it into a separate block of memory first. Once fully loaded, it swaps the pointer. This ensures the server never stops answering requests during an update.
13. Security, privacy and abuse
- HTTPS: Encrypt all traffic to protect user search privacy.
- Sensitive Data Filter (Privacy):
- Rule: Only index terms that have been searched by at least $N$ unique users (e.g., > 50 distinct user IDs).
- This prevents a user's private data (like a pasted SSN or password) from becoming a global suggestion for everyone else.
- Blocklist:
- The Trie Builder must filter out profanity, hate speech, and illegal terms so they never make it into the production index.
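Both filters run inside the Trie Builder before a term ever reaches the index. A minimal sketch, assuming the aggregation pipeline can track distinct user IDs per term (the threshold and names are illustrative):

```python
MIN_UNIQUE_USERS = 50  # illustrative threshold

def publishable_terms(term_users: dict, blocklist: set) -> list:
    """Keep only terms searched by enough distinct users and not blocklisted.
    `term_users` maps each term to the set of user IDs that searched it."""
    return [
        term
        for term, users in term_users.items()
        if len(users) >= MIN_UNIQUE_USERS and term not in blocklist
    ]
```

At real scale the per-term distinct-user sets would be approximated (e.g., with HyperLogLog) rather than stored exactly, but the gate is the same: popularity alone is not enough; breadth across users is what makes a term safe to suggest.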
14. Bottlenecks and next steps
- Bottleneck: Real-time Trending.
- The current design updates daily/hourly. Breaking news (e.g., "earthquake") won't appear instantly.
- Next Step: Introduce a small "Trending Trie" (using Redis) that holds the last 15 minutes of viral queries. The API queries both the Main Trie and the Trending Trie and merges the results.
- Bottleneck: Personalization.
- Results are the same for everyone.
- Next Step: Store a user's recent history in a separate cache. When a request comes in, rank matches from their personal history higher than global popularity.
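The merge step for the Trending Trie extension can be as simple as interleaving the fresh results ahead of the main index and de-duplicating. A sketch of one plausible policy (trending-first; the function name and ordering rule are assumptions, not part of the original design):

```python
def merge_suggestions(main: list, trending: list, limit: int = 5) -> list:
    """Rank fresh trending matches ahead of the main index, de-duplicated,
    capped at `limit` results."""
    merged = []
    for term in trending + main:
        if term not in merged:
            merged.append(term)
        if len(merged) == limit:
            break
    return merged
```

A production version would likely blend by score (e.g., recency-boosted counts) rather than strictly prepending trending terms, but the same merge-and-dedupe shape applies.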
Summary
- In-Memory Trie: We fit the data in RAM to achieve <10ms latency.
- Top-K Optimization: We pre-compute results at every node to make lookups $O(1)$.
- Decoupling: We separate the high-speed Read path from the complex Write/Aggregation path.
- Replication: We scale by simply adding more servers with full copies of the small dataset.