1. Problem Definition and Scope
We are designing a large-scale web search engine.
The goal is to build a system that autonomously crawls the internet, indexes the content to make it searchable, and serves relevant results to users with extremely low latency.
- User Groups:
- Searchers: Users typing queries to find information.
- Webmasters (Implicit): Content creators whose pages we crawl.
- Scope:
- We will focus on the three core pillars:
- Crawler: Discovering and fetching pages.
- Indexer: Processing text and building the search index.
- Searcher: Serving ranked results for user queries.
- Out of Scope: Image search, Video search, News, Ads, Personalized user history, and "Knowledge Graph" widgets.
2. Clarify functional requirements
Must Have:
- Crawling: The system must discover new URLs and re-visit existing ones to keep content fresh.
- Indexing: The system must parse HTML, extract text, and build an efficient index mapping words to documents.
- Search: Users can input a text query and receive a list of relevant URLs.
- Ranking: Results must be sorted by relevance (using signals such as keyword frequency and page popularity).
- Snippets: Results should display the page title and a short text preview containing the search terms.
Nice to Have:
- Typeahead: Autocomplete suggestions as the user types.
- Deduplication: The system should recognize and merge duplicate content.
Functional Requirements
3. Clarify non-functional requirements
- Scale:
- Index Size: ~50 billion web pages.
- Traffic: ~5 billion searches per day (~60,000 QPS average).
- Latency:
- Search: Extremely fast. Results must return in < 500ms (p95).
- Crawl/Index: Slower is acceptable. New pages can take minutes or hours to appear in search.
- Availability:
- 99.99% uptime for the search service. Partial results are better than no results.
- Consistency:
- Eventual Consistency. It is acceptable if the index is slightly behind the live web.
- Read/Write Pattern:
- Read-Heavy: High QPS for search.
- Write-Heavy: Continuous high-throughput background writing for the crawler.
Non-Functional Requirements
4. Back of the envelope estimates
- Traffic Estimates (Search):
- 5 billion searches / day.
- Seconds in a day ≈ 86,400.
- Average QPS = $5 \times 10^9 / 86{,}400 \approx 58{,}000$ QPS.
- Peak QPS ≈ 2x Average ≈ 116,000 QPS.
- Storage Estimates (Index):
- Total pages: 50 billion.
- Average page size (compressed text + metadata): 100 KB.
- Raw Data: $50 \times 10^9 \times 100 \text{ KB} = 5 \text{ Petabytes (PB)}$.
- Inverted Index: Typically 20-30% of raw size ≈ 1.5 PB.
- Bandwidth (Crawling):
- To refresh ~20% of the web daily (10 billion pages).
- $10 \times 10^9 \text{ pages} \times 100 \text{ KB} / 86,400 \text{ sec} \approx 11.5 \text{ GB/s}$ ingress.
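The estimates above can be sanity-checked with a few lines of arithmetic (decimal units, matching the figures in the text):

```python
# Back-of-the-envelope checks for the numbers above (pure arithmetic).
SEARCHES_PER_DAY = 5e9
SECONDS_PER_DAY = 86_400
PAGES = 50e9
AVG_PAGE_BYTES = 100e3               # 100 KB compressed text + metadata
REFRESH_FRACTION = 0.20              # refresh ~20% of the web daily

avg_qps = SEARCHES_PER_DAY / SECONDS_PER_DAY
peak_qps = 2 * avg_qps
raw_storage_pb = PAGES * AVG_PAGE_BYTES / 1e15
inverted_index_pb = raw_storage_pb * 0.30          # upper end of 20-30%
ingress_gb_s = PAGES * REFRESH_FRACTION * AVG_PAGE_BYTES / SECONDS_PER_DAY / 1e9

print(f"avg QPS ≈ {avg_qps:,.0f}, peak ≈ {peak_qps:,.0f}")
print(f"raw ≈ {raw_storage_pb:.1f} PB, index ≈ {inverted_index_pb:.1f} PB")
print(f"crawl ingress ≈ {ingress_gb_s:.1f} GB/s")
```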
Back-of-the-envelope estimation
5. API design
We expose a simple REST API for the search client. The Crawler and Indexer are internal background processes.
Endpoint: GET /v1/search
- Parameters:
  - q (string): The search query (e.g., "system design").
  - offset (int): Pagination start index (default 0).
  - limit (int): Number of results (default 10).
  - lang (string): User language (e.g., "en-US").
- Response:
  {
    "meta": { "took_ms": 145, "total_results": 230000 },
    "results": [
      {
        "title": "System Design Interview Guide",
        "url": "https://example.com/guide",
        "snippet": "Learn how to design scalable systems..."
      },
      ...
    ]
  }
- Status Codes: 200 OK, 400 Bad Request, 503 Service Unavailable.
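A client would assemble the query string from these parameters. A minimal sketch (the hostname `search.example.com` is illustrative, not part of the spec):

```python
# Hypothetical client-side URL construction for GET /v1/search.
from urllib.parse import urlencode

def build_search_url(q: str, offset: int = 0, limit: int = 10, lang: str = "en-US") -> str:
    """Assemble the query string exactly as the parameters above define it."""
    params = {"q": q, "offset": offset, "limit": limit, "lang": lang}
    return "https://search.example.com/v1/search?" + urlencode(params)

print(build_search_url("system design"))
```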
6. High level architecture
We divide the system into three decoupled subsystems: Ingestion (Crawling), Processing (Indexing), and Serving (Querying).
Component Explanations:
- URL Frontier: A priority queue that manages which URL to visit next. It handles "politeness" (not hammering one site).
- Fetcher: Downloads the HTML of the page.
- Content Store: A distributed database (BigTable) storing the raw downloaded HTML.
- Index Builder: Background workers (MapReduce) that parse text and build the "Inverted Index".
- Index Store: Distributed file system storing the inverted index shards.
- Search Service: Orchestrates the user request, fans it out to index shards, and aggregates results.
- Link Graph: Stores the network of links between pages to calculate PageRank.
High-level Architecture
7. Data model
We need specialized data structures for scale.
1. Content Store (BigTable / HBase)
Stores the "truth" of the web.
- Key: ReverseURL (e.g., org.wikipedia.en/wiki/System_design for en.wikipedia.org/wiki/System_design).
- Why? Keeps pages from the same domain physically close, optimizing compression and read performance.
- Columns: HTML_Content, Parsed_Text, Last_Crawled.
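The key transformation can be sketched in a few lines (illustrative, not BigTable's exact key scheme):

```python
# Sketch of the ReverseURL row key: reversing the hostname makes pages
# from the same domain sort together lexicographically.
from urllib.parse import urlparse

def reverse_url_key(url: str) -> str:
    parts = urlparse(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + (parts.path or "/")

print(reverse_url_key("https://en.wikipedia.org/wiki/System_design"))
```

All of `en.wikipedia.org`, `de.wikipedia.org`, etc. now share the `org.wikipedia.` prefix, which is what makes same-domain rows physically adjacent.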
2. Inverted Index (Custom Distributed Store)
Maps words to documents.
- Key: Term (hashed word, e.g., "apple").
- Value: PostingList.
- Structure: [ {DocID, Frequency, Positions[]}, ... ].
- Why? Allows O(1) lookup to find all docs containing a word. Positions are needed for phrase searches (e.g., "Steve Jobs" requires "Jobs" to appear right after "Steve").
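The positional phrase check described above can be sketched with toy in-memory structures (the real index is sharded and compressed):

```python
# Minimal positional posting lists: term -> {DocID: [positions]}.
index = {
    "steve": {7: [4, 21]},
    "jobs":  {7: [5, 40], 9: [2]},
}

def phrase_match(term_a: str, term_b: str) -> list[int]:
    """DocIDs where term_b appears immediately after term_a."""
    hits = []
    for doc, pos_a in index.get(term_a, {}).items():
        pos_b = set(index.get(term_b, {}).get(doc, []))
        if any(p + 1 in pos_b for p in pos_a):
            hits.append(doc)
    return hits

print(phrase_match("steve", "jobs"))   # doc 7 has "jobs" right after "steve"
```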
3. Link Graph
Used for ranking.
- Key: DocID.
- Value: List of Incoming_DocIDs (pages that link to this one).
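A toy PageRank over such an incoming-links map, using basic power iteration with damping (real systems run this as a distributed batch job over billions of edges):

```python
# incoming: DocID -> list of DocIDs linking to it (matching the model above).
def pagerank(incoming: dict[str, list[str]], damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    pages = list(incoming)
    out_deg = {p: 0 for p in pages}
    for srcs in incoming.values():
        for s in srcs:
            out_deg[s] += 1                     # derive out-degrees from in-links
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        rank = {
            p: (1 - damping) / len(pages)
               + damping * sum(rank[s] / out_deg[s] for s in incoming[p])
            for p in pages
        }
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # B and C link to A, etc.
ranks = pagerank(graph)
print(ranks)
```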
8. Core flows end to end
To understand how the system works at scale, we must look beyond the static architecture and follow the data.
We can break the system into three distinct pipelines: Crawling (The Explorer), Indexing (The Organizer), and Serving (The Librarian).
Note: Flow 1 and Flow 2 happen in the background (asynchronous) to build the database. Flow 3 happens in real-time (synchronous) when the user actually searches.
Flow 1: Crawling (The Write Path)
Goal: Systematically discover new content and refresh existing pages without overloading the internet or our servers.
- Prioritization (URL Frontier): The lifecycle begins at the URL Frontier. This isn't a simple FIFO queue; it is a smart priority scheduler. It selects the next URL to visit based on:
- Freshness: News sites are crawled every few minutes; static blogs every few weeks.
- Politeness: This is critical. The system checks, "Have we hit wikipedia.org in the last second?" If yes, it delays the request to avoid unintentionally DDoS-ing the target server.
- DNS Resolution: The Fetcher (worker node) picks up the URL. Before connecting, it must resolve the hostname to an IP address. Since standard DNS lookups are slow (20-100ms), the Fetcher uses a dedicated, high-performance DNS Cache to drop this latency to <1ms.
- Fetching & Validation: The Fetcher reads the site's robots.txt file to verify we are allowed to crawl the content. If allowed, it downloads the HTML.
- Deduplication (The "Fingerprint" Check): The web is filled with duplicate content (mirrors, reposts). We calculate a 64-bit checksum (fingerprint) of the downloaded HTML.
  - Check: We compare this hash against the Content Store.
  - Action: If the hash exists, the page hasn't changed; we discard the data to save processing power. If it is new, we proceed.
- Storage & Extraction: The raw HTML is compressed and saved to the Content Store (BigTable). Simultaneously, we parse the HTML to extract outgoing links (<a href>), which are sent back to the URL Frontier to drive future crawling.
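The fingerprint check can be sketched as follows. A truncated SHA-256 stands in for whatever 64-bit checksum is used; production crawlers often prefer SimHash, which also catches near-duplicates:

```python
# Sketch of the 64-bit fingerprint deduplication check.
import hashlib

seen: set[int] = set()   # stands in for the fingerprint column in the Content Store

def fingerprint(html: str) -> int:
    """Truncate a SHA-256 digest to 64 bits (8 bytes)."""
    return int.from_bytes(hashlib.sha256(html.encode()).digest()[:8], "big")

def should_process(html: str) -> bool:
    fp = fingerprint(html)
    if fp in seen:
        return False        # unchanged or duplicate: discard
    seen.add(fp)
    return True

print(should_process("<html>hello</html>"))   # first sighting
print(should_process("<html>hello</html>"))   # fingerprint already seen
```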
Flow 1
Flow 2: Indexing (The Processing Path)
Goal: Transform unstructured "Documents containing Words" into a structured "Inverted Index" (Words pointing to Documents).
This is a massive batch process (typically MapReduce) that trades high CPU usage now for fast lookups later.
- Cleaning & Tokenization: The Index Builder reads raw HTML from the Content Store. It strips tags (<div>, <script>) to leave only human-readable text. It then splits the text into tokens (words).
  - Input: "The quick brown fox." → ["the", "quick", "brown", "fox"]
- Normalization: We standardize the tokens to ensure searches match variations:
  - Case Folding: "Apple" → "apple".
  - Stemming: "Running", "runs", "ran" → "run".
- Inversion (The "Map" Step): The system outputs a stream of {Term, DocID} pairs.
  - Doc1 Content: "apple banana"
  - Output: ("apple", Doc1), ("banana", Doc1)
- Aggregation (The "Reduce" Step): We group all DocIDs associated with a single term to create the Posting List.
  - Term: "apple" → [Doc1, Doc5, Doc100...]
  - Optimization: The list is sorted by DocID and compressed using Delta Encoding (storing the difference between IDs: 100, +5, +3 instead of 100, 105, 108) to reduce storage by ~70%.
- Sharding: The resulting index is too large for one computer. We split it into thousands of shards (based on DocID) and write them to the distributed Index Store.
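The tokenize → normalize → map → reduce chain can be illustrated on two tiny documents (lowercase-only normalization here; real stemming needs a library, and the real job runs as MapReduce across thousands of workers):

```python
# Toy end-to-end index build: tokenize, normalize, emit {term, doc}
# pairs ("map"), then group them into sorted posting lists ("reduce").
import re
from collections import defaultdict

docs = {1: "The quick brown fox", 5: "Quick apple pie"}

def tokenize(text: str) -> list[str]:
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
postings: dict[str, list[int]] = defaultdict(list)
for term, doc_id in sorted(pairs):
    if doc_id not in postings[term]:
        postings[term].append(doc_id)

print(dict(postings))
```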
Flow 2
Flow 3: Serving (The Read Path)
Goal: Find the "needle in the haystack" from 50 billion pages in under 500ms.
- Query Understanding: A user searches for "system design." The Search Service cleans the query (spell check) and expands it (e.g., adding synonyms like "software architecture").
- Scatter (Fan-Out): Since we use Document Partitioning (see Section 11), valid results could be on any shard. The Search Service sends the query to all index shards in parallel.
- Local Lookup (The "Filter"): Each shard works independently in memory:
  - Fetch: It retrieves the posting lists for "system" and "design".
  - Intersect: It finds documents containing both words.
  - Score: It calculates a "Local Score" using Term Frequency (how often the word appears in the doc).
  - Cutoff: It returns only the Top-N (e.g., 10) results to the aggregator. It does not return the full document, only the DocID and score.
- Gather (Fan-In): The Search Service receives results from all shards. (If there are 1,000 shards, it now has 10,000 candidates.)
- Global Re-Ranking: With the candidate list narrowed down to 10,000, the Search Service applies expensive "Global Signals" that require cross-referencing, primarily PageRank (link authority). It sorts the candidates by Final_Score = Local_Relevance + PageRank.
- Hydration & Response: The system takes the final Top 10 DocIDs, fetches their titles and snippets from the Snippet Cache/Content Store, and returns the JSON response.
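The per-shard "Intersect" step above can be sketched as a two-pointer merge over sorted posting lists (illustrative; real engines add skip lists or galloping search to jump over long runs):

```python
# Two-pointer intersection of two sorted posting lists of DocIDs.
def intersect(a: list[int], b: list[int]) -> list[int]:
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

system = [3, 8, 12, 40, 41]   # docs containing "system"
design = [8, 12, 13, 41, 99]  # docs containing "design"
print(intersect(system, design))
```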
Coach's Note: The "Funnel" Architecture. In an interview, it is vital to explain Flow 3 as a funnel.
Step 3 (Local Lookup) uses cheap, fast math (TF-IDF) on billions of documents to find thousands of candidates.
Step 5 (Global Re-Ranking) uses expensive, slow math (ML Models, PageRank) on those thousands of candidates to find the top ten.
We cannot run expensive ranking on all billions of documents instantly; that is why the multi-stage approach is necessary.
Flow 3
9. Caching and read performance
Caching is critical to meet the < 500ms latency target.
- Result Cache:
- Store the full JSON response for popular queries (e.g., "World Cup").
- TTL: Short (seconds/minutes) to ensure freshness.
- Index Cache:
- Cache the Posting Lists for common words (e.g., "video", "news") in memory on the Index Servers.
- This avoids disk I/O for the most frequent terms.
- Snippet Cache:
- Cache the title and summary text for high-ranking DocIDs to avoid hitting the heavy Content Store during result construction.
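A minimal TTL result cache, as a stand-in for whatever store (Redis, Memcached) actually backs the Result Cache:

```python
# Tiny TTL cache: entries expire after ttl_seconds, matching the short-TTL
# freshness requirement above.
import time

class ResultCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, query: str):
        entry = self._store.get(query)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        return None                      # missing or expired

    def put(self, query: str, response: object) -> None:
        self._store[query] = (time.monotonic(), response)

cache = ResultCache(ttl_seconds=60)
cache.put("world cup", {"results": ["..."]})
print(cache.get("world cup"))
```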
10. Storage, indexing and media
- Storage Technology: We rely on a distributed file system (like GFS/HDFS) for the index files and a wide-column store (BigTable) for raw docs.
- Compression:
- The Inverted Index is compressed using Delta Encoding.
- Instead of storing DocIDs [100, 105, 108], we store differences [100, 5, 3]. This significantly reduces storage (using Varint encoding).
- Media:
- Images/Videos are stored in Object Storage (S3/GCS).
- We only store the reference URL in the search index.
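The Delta + Varint compression described above can be sketched as follows (an unsigned LEB128-style varint; illustrative, not a production wire format):

```python
# Delta-encode DocID gaps, then varint-encode each gap: 7 data bits per
# byte, high bit set means "more bytes follow".
def encode(doc_ids: list[int]) -> bytes:
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

encoded = encode([100, 105, 108])        # gaps: 100, 5, 3 -> one byte each
print(len(encoded), "bytes")             # vs. 12 bytes as three 32-bit ints
```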
11. Scaling strategies
We use Sharding to handle 50 billion pages.
Strategy: Document Partitioning (Row Sharding)
- How it works: We split the web into $N$ shards (e.g., 2,000 shards). Each shard holds the full index for a subset of documents (e.g., Shard 1 has DocIDs 1 to 25M).
- Querying: The Search Service must query all shards for every request.
- Pros: Easy to manage updates. If Shard 1 goes down, we only lose results from those documents, not specific words.
- Cons: "Scatter-Gather" overhead.
- Why not Term Partitioning? Partitioning by word (A-M on Server 1) creates "hot spots" for common words. Document partitioning distributes load evenly.
Mitigation for Scatter-Gather:
We use a Tiered Index.
- Tier 1 (Hot): Stores high-quality, popular pages (top 10%). Search this first.
- Tier 2 (Cold): Stores the rest. Only search if Tier 1 has few results.
12. Reliability, failure handling and backpressure
- Hedge Requests:
- To reduce tail latency, the Search Service sends the query to a shard replica. If it doesn't reply in a short window (e.g., 100ms), the service sends a second request to another replica. It uses whichever response arrives first.
- Crawler Politeness:
- We must not inadvertently DDoS websites. The URL Frontier enforces a delay between requests to the same domain.
- Circuit Breakers:
- If an index shard is slow, the Search Service times out and returns results from the remaining shards ("Partial Results").
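The hedge-request pattern can be sketched with asyncio (latencies here are simulated with random sleeps; replica names and the 100ms hedge delay are illustrative):

```python
# Hedged request: fire the primary; if it hasn't answered within the hedge
# delay, fire a backup and take whichever response arrives first.
import asyncio
import random

async def query_replica(name: str) -> str:
    await asyncio.sleep(random.uniform(0.01, 0.2))   # simulated shard latency
    return f"results from {name}"

async def hedged_query(hedge_delay: float = 0.1) -> str:
    primary = asyncio.ensure_future(query_replica("replica-1"))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()                      # primary beat the delay
    backup = asyncio.ensure_future(query_replica("replica-2"))
    done, pending = await asyncio.wait({primary, backup},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                                # drop the loser
    return done.pop().result()

print(asyncio.run(hedged_query()))
```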
13. Security, privacy and abuse
- SEO Spam:
- "Link Farms" try to game PageRank. We use algorithms (like TrustRank) to penalize sites that have unnatural linking patterns.
- Crawler Traps:
- Malicious sites with infinite dynamic URLs. We limit crawl depth and max pages per domain.
- SafeSearch:
- Classifier models tag content as "Adult" during the indexing phase so it can be filtered at query time.
14. Bottlenecks and next steps
- Bottleneck: Index Updates.
- Issue: Rebuilding the index in batch (MapReduce) takes too long.
- Solution: Implement Lambda Architecture. Use a small "Instant Index" (in-memory) for documents crawled in the last hour, and merge it with the main "Batch Index" (disk) periodically.
- Bottleneck: Scatter-Gather Latency.
- Issue: Waiting for the slowest shard determines the response time.
- Solution: Increase the number of replicas per shard and tune the "Hedge Request" timing to mask slow nodes.
Summary
- Architecture: Decoupled Ingestion (Crawler/Indexer) from Serving.
- Data: Used ReverseURL for locality and Inverted Indexes with Delta Encoding for efficiency.
- Scale: Handled 50B pages using Document Partitioning and Tiered Indexing.
- Resilience: Used Hedge Requests and accepted Eventual Consistency to ensure high availability.