Chapter 8: System Design Interview Mastery

8.12 Design a web crawler

1. Restate the Problem and Pick the Scope

We are designing a large-scale web crawler, a system that systematically downloads web pages from the internet, extracts their content and links, and stores everything for later use (such as powering a search engine index, analytics, or archival). The crawler must be polite (respect robots.txt and rate limits), efficient (avoid re-crawling unchanged pages), and scalable (handle billions of URLs).

Main user groups and actions:

  • Internal systems (search engine indexer, data pipeline) -- consume the crawled content for indexing, analysis, or archival.
  • System operators -- configure crawl seeds (starting URLs), crawl policies (depth, frequency, domain priorities), and monitor progress.

Scope decisions:

  • We will focus on the core crawl loop: discovering URLs, fetching web pages (HTML), extracting links, and storing the raw content.
  • We will NOT cover: full-text indexing/search ranking (that is a separate system that consumes our output), JavaScript rendering (we crawl static HTML; a JS renderer can be added as a separate stage), image/video/media crawling, or the end-user search interface.

2. Clarify Functional Requirements

Must-Have Features

  • The system can be seeded with an initial set of URLs to begin crawling.
  • The system fetches the HTML content of each URL via HTTP/HTTPS.
  • The system extracts all hyperlinks from each fetched page and adds newly discovered URLs to the crawl queue.
  • The system stores the raw HTML content of every crawled page along with metadata (URL, fetch timestamp, HTTP status, content hash).
  • The system respects robots.txt rules for each domain -- it checks which paths are allowed before crawling.
  • The system enforces politeness: it does not send more than N requests per second to any single domain (e.g., 1 request per second per domain).
  • The system avoids crawling the same URL twice in a single crawl cycle (URL deduplication).
  • The system detects when page content has not changed since the last crawl (content deduplication) and avoids storing duplicate content.
  • The system supports configurable crawl depth (e.g., follow links up to 5 hops from the seed).
  • The system supports re-crawling pages periodically to keep content fresh.

Nice-to-Have Features

  • Priority-based crawling: more important or frequently-changing pages are crawled more often.
  • Domain-level crawl budgets: limit the total number of pages crawled per domain.
  • Monitoring dashboard showing crawl progress, error rates, and throughput.


3. Clarify Non-Functional Requirements

  • Scale of the web -- ~2 billion web pages worth crawling (out of tens of billions total)
  • Crawl target -- crawl 1 billion pages per month (re-crawl cycle of ~2 months for the full corpus)
  • Throughput -- ~400 pages per second average; ~1,000 pages per second peak
  • Fetch latency per page -- varies widely (100 ms to 5 seconds depending on the remote server); our system must tolerate slow servers without blocking
  • Storage per page -- average raw HTML: ~100 KB (some pages are 10 KB, some are 500 KB+)
  • Availability -- 99.9%; the crawler should run continuously, and brief pauses are tolerable
  • Consistency -- eventual consistency is fine everywhere. We do not need real-time guarantees. Duplicate fetches are wasteful but not harmful.
  • Data retention -- store the latest version of each page, optionally keeping 1-2 historical versions for change detection. Retain for 1 year.


4. Back-of-the-Envelope Estimates

Throughput

Pages/month = 1 billion
Pages/day = 1B / 30 ≈ 33 million
Pages/sec = 33M / 86,400 ≈ 385 QPS (average)
Peak QPS = 385 × 3 ≈ 1,150 QPS

Storage

Average page size: 100 KB of raw HTML.

Storage/day = 33M × 100 KB = 3.3 TB/day
Storage/month = 3.3 TB × 30 = ~100 TB/month
Storage/year = ~1.2 PB/year (raw HTML only)

With compression (gzip reduces HTML by ~70%):

Compressed storage/year ≈ 1.2 PB × 0.3 = ~360 TB/year

This is large but manageable with object storage (e.g., S3).

URL Frontier Size

We need to track billions of URLs. Each URL record is ~200 bytes (URL string + metadata).

URL frontier size = 2 billion × 200 bytes = ~400 GB

This fits in a distributed key-value store or a sharded database.

Bandwidth

Outgoing (fetching pages from the internet):

Bandwidth = 385 pages/sec × 100 KB = ~38.5 MB/s average = ~308 Mbps
Peak bandwidth ≈ 1 Gbps

This is a moderate amount of bandwidth, but we need many outgoing connections in parallel because each fetch can take 1-5 seconds.

Concurrency

If each fetch takes ~2 seconds on average, and we need 385 fetches/sec:

Concurrent connections = 385 × 2 = ~770
Peak concurrent = 1,150 × 2 = ~2,300


We need thousands of concurrent outgoing HTTP connections. This is best handled with async I/O or a large pool of lightweight workers.
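The concurrency arithmetic above is an instance of Little's law (items in flight = arrival rate × mean time in system). A few lines of Python make the estimate repeatable; the numbers are the chapter's assumptions, not measurements:

```python
def concurrent_connections(fetches_per_sec: float, mean_fetch_sec: float) -> int:
    """Little's law: how many HTTP fetches are in flight at any moment."""
    return round(fetches_per_sec * mean_fetch_sec)

print(concurrent_connections(385, 2))    # average load -> 770
print(concurrent_connections(1150, 2))   # peak load -> 2300
```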

5. API Design

A web crawler is not a traditional user-facing API service. It is a background data pipeline. However, it needs internal control and monitoring APIs for operators.

5.1 Add Seed URLs

  • Method & Path -- POST /api/v1/seeds
  • Request body -- { "urls": ["https://example.com", "https://news.site.com"], "priority": "high", "max_depth": 5 }
  • Success (201) -- { "added": 2, "duplicates_skipped": 0 }
  • Error codes -- 400 (invalid URL format)

5.2 Get Crawl Status

  • Method & Path -- GET /api/v1/status
  • Success (200) -- { "pages_crawled_today": 33000000, "pages_in_queue": 150000000, "current_qps": 392, "error_rate": 0.03, "domains_active": 85000 }

5.3 Get Page Content (for downstream consumers)

  • Method & Path -- GET /api/v1/pages?url={encoded_url}
  • Success (200) -- { "url": "...", "html_storage_path": "s3://crawl-data/...", "content_hash": "...", "fetched_at": "...", "http_status": 200 }
  • Error codes -- 404 (URL not yet crawled)

5.4 Configure Domain Policy

  • Method & Path -- PUT /api/v1/domains/{domain}/policy
  • Request body -- { "crawl_rate": 2, "max_pages": 50000, "priority": "medium" }
  • Success (200) -- { "domain": "example.com", "policy": { ... } }

Data Output

The primary output of the crawler is not an API response. It is a stream of crawled page records written to object storage and/or a message queue (Kafka). Downstream systems (indexer, analytics) consume from these outputs.

6. High-Level Architecture


Component Responsibilities

  • Seed URLs / Re-crawl Scheduler -- provides the initial set of URLs and schedules periodic re-crawls of previously fetched pages based on their change frequency.
  • URL Frontier -- a massive distributed priority queue holding all URLs to be crawled. It orders URLs by priority and ensures per-domain politeness by releasing URLs for a domain only when the rate limit allows. This is the heart of the crawler.
  • Domain Rate Limiter -- enforces the rule "at most N requests per second per domain." It groups URLs by domain and spaces out fetches accordingly. This prevents our crawler from overwhelming any single website.
  • Fetcher Workers -- a large pool of workers (hundreds to thousands) that make outgoing HTTP requests to fetch web pages. They use async I/O to handle thousands of concurrent connections efficiently. Each worker fetches a URL, downloads the HTML, and passes it downstream.
  • robots.txt Cache -- before fetching any page, the system checks whether the URL is allowed by the domain's robots.txt. This file is fetched once per domain and cached for 24 hours.
  • Content Store (S3) -- stores the raw HTML of every crawled page as a compressed object. Cheap, durable, and infinitely scalable.
  • Page Metadata DB -- stores one record per crawled URL: the URL, content hash (for change detection), fetch timestamp, HTTP status, and a pointer to the content in S3. This is the source of truth for "what have we crawled?"
  • Link Extractor + URL Normalizer -- parses the fetched HTML, extracts all <a href> links, and normalizes them (resolve relative URLs, remove fragments, lowercase the domain, etc.).
  • URL Deduplicator -- checks whether a newly discovered URL has already been crawled or is already in the frontier. Uses a Bloom filter for fast approximate checks (with a small false-positive rate) backed by the metadata DB for definitive checks. This prevents the frontier from growing without bound.
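The Bloom-filter half of the URL Deduplicator can be sketched in a few lines. This is a toy in-memory version with illustrative sizes; a production filter for billions of URLs would use ~10 bits per element, as estimated later in this chapter:

```python
import hashlib
import math

class BloomFilter:
    """Toy Bloom filter for URL deduplication. No false negatives;
    a small false-positive rate that depends on bits_per_element."""

    def __init__(self, capacity: int, bits_per_element: int = 10):
        self.m = capacity * bits_per_element                    # total bits
        self.k = max(1, round(bits_per_element * math.log(2)))  # optimal hash count
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, url: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        """False => definitely new; True => probably seen (confirm in the DB)."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

bf = BloomFilter(capacity=1_000_000)
bf.add("https://example.com/page/42")
print(bf.might_contain("https://example.com/page/42"))  # True (never a false negative)
print(bf.might_contain("https://example.com/page/43"))  # almost certainly False
```

A "True" answer still requires the definitive metadata-DB lookup; only "False" short-circuits it.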

7. Data Model

Database Choice

  • Distributed key-value store (e.g., RocksDB-based, or sharded PostgreSQL) for the URL metadata. The primary access pattern is point lookups by URL (or URL hash), which is perfect for a key-value model.
  • S3 / object storage for raw HTML content. We never query by content -- we only store and retrieve by key.
  • Redis or a custom distributed queue for the URL frontier. It needs fast enqueue/dequeue, priority ordering, and per-domain bucketing.

URL Metadata Store

  • url_hash -- CHAR(64), primary key. SHA-256 of the normalized URL (64 hex characters), used for fast lookups.
  • url -- TEXT. The full normalized URL.
  • content_hash -- CHAR(64). SHA-256 of the page content, compared on re-crawl to detect changes.
  • content_path -- VARCHAR(500). S3 object key where the HTML is stored.
  • http_status -- SMALLINT. Last HTTP response code (200, 301, 404, etc.).
  • fetched_at -- TIMESTAMP. When the page was last fetched.
  • next_fetch_at -- TIMESTAMP. When the page should be re-crawled; indexed for the re-crawl scheduler.
  • priority -- SMALLINT. Crawl priority (higher = crawl sooner).
  • depth -- SMALLINT. Number of hops from a seed URL.

Indexes

  • Primary key on url_hash -- fast lookups for deduplication ("have we seen this URL before?").
  • Index on next_fetch_at -- the re-crawl scheduler scans for URLs whose next_fetch_at is in the past.

URL Frontier (Queue)

The frontier is not a simple FIFO queue. It is structured as:

Per-domain buckets:
example.com -> [url1, url2, url3] (ordered by priority)
news.site.com -> [url4, url5]
blog.dev -> [url6]

Each domain bucket releases URLs at the configured rate (e.g., 1 per second). The frontier picks the highest-priority URL from whichever domain is next eligible. This structure ensures politeness while maximizing throughput across many domains.
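The per-domain bucket structure can be sketched as a small in-memory class. The real frontier is sharded and persistent, but the enqueue/dequeue contract is the same; the round-robin scan below is the simplest possible scheduling policy:

```python
import heapq
import time
from collections import defaultdict
from typing import Optional

class Frontier:
    """Sketch of the URL frontier: a priority heap per domain, plus an
    earliest-next-fetch timestamp per domain enforcing its crawl rate."""

    def __init__(self, default_delay: float = 1.0):
        self.buckets = defaultdict(list)        # domain -> heap of (-priority, url)
        self.next_allowed = defaultdict(float)  # domain -> earliest fetch time
        self.delay = default_delay              # seconds between fetches per domain

    def enqueue(self, domain: str, url: str, priority: int = 0) -> None:
        heapq.heappush(self.buckets[domain], (-priority, url))

    def dequeue(self, now: Optional[float] = None) -> Optional[str]:
        """Return the highest-priority URL from some domain whose rate limit allows it."""
        now = time.monotonic() if now is None else now
        for domain, heap in self.buckets.items():
            if heap and self.next_allowed[domain] <= now:
                self.next_allowed[domain] = now + self.delay
                return heapq.heappop(heap)[1]
        return None  # every non-empty domain is currently rate-limited
```

A production frontier would keep domains themselves in a heap keyed by next_allowed instead of scanning them, but the politeness invariant is identical.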

Content in Object Storage

Bucket: crawl-content
Key: pages/{url_hash}/{timestamp}.html.gz
Body: Gzip-compressed raw HTML

We include the timestamp in the key to optionally keep multiple versions for change tracking.
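A sketch of how a worker might build the object key and compressed body under the pages/{url_hash}/{timestamp}.html.gz scheme above. The actual S3 upload call is omitted:

```python
import gzip
import hashlib
import time
from typing import Optional, Tuple

def build_content_object(url: str, html: str,
                         fetched_at: Optional[int] = None) -> Tuple[str, bytes]:
    """Produce the (key, body) pair for the content store."""
    fetched_at = int(time.time()) if fetched_at is None else fetched_at
    url_hash = hashlib.sha256(url.encode()).hexdigest()
    key = f"pages/{url_hash}/{fetched_at}.html.gz"
    body = gzip.compress(html.encode("utf-8"))   # ~70% smaller for typical HTML
    return key, body

key, body = build_content_object("https://example.com/page/42",
                                 "<html>...</html>", fetched_at=1700000000)
print(key)  # pages/<sha256 hex>/1700000000.html.gz
```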

8. Core Flows -- End to End

Flow 1: Crawl a Single Page (the Core Loop)

This is the fundamental operation that runs billions of times. Every other flow builds on this.

  • Step 1 -- URL Frontier releases a URL. The frontier maintains per-domain buckets of URLs to crawl. When domain X's rate limit allows another fetch (e.g., at least 1 second has passed since the last fetch to domain X), the frontier dequeues the highest-priority URL from that domain's bucket: https://example.com/page/42. The frontier hands this URL to an available fetcher worker.

  • Step 2 -- Check robots.txt. Before fetching, the worker checks the robots.txt cache for example.com. If the cache has a fresh copy (less than 24 hours old), it checks whether /page/42 is allowed. If the path is disallowed, the worker discards the URL and moves on. If the cache does not have robots.txt for this domain (first time seeing this domain), the worker fetches https://example.com/robots.txt first, parses it, stores it in the cache, and then checks the path. This adds one extra HTTP request per new domain, but it is cached for all subsequent pages on that domain.

  • Step 3 -- Fetch the page via HTTP. The worker sends an HTTP GET request to https://example.com/page/42 with a polite User-Agent header (e.g., MyCrawler/1.0; +https://mycrawler.com/about). The worker sets a timeout of 10 seconds. Several outcomes are possible:

    • 200 OK with HTML body: The normal case. Proceed to Step 4.
    • 301/302 Redirect: The worker follows up to 5 redirects. The final URL is the one we store.
    • 404 Not Found: The worker records this in the metadata DB (so we do not re-crawl a dead page) and moves on.
    • 429 Too Many Requests or 5xx Server Error: The worker backs off. The URL is re-enqueued with a delay (e.g., retry in 10 minutes). After 3 retries, it is marked as failed.
    • Timeout (10 seconds): Treated like a server error -- retry with backoff.
  • Step 4 -- Compute content hash and check for changes. The worker computes a hash (e.g., SHA-256) of the downloaded HTML. It then checks the metadata DB: has this URL been crawled before? If yes, is the new content hash the same as the stored one?

    • Content unchanged: Update fetched_at and next_fetch_at in the metadata DB. Do NOT store the HTML again (saves storage). Move on.
    • Content changed (or first crawl): Proceed to Step 5.
  • Step 5 -- Store the HTML content. The worker compresses the HTML with gzip and uploads it to S3 at pages/{url_hash}/{timestamp}.html.gz. For a typical 100 KB page, compressed to ~30 KB, this upload takes about 20-50 ms to S3 within the same region.

  • Step 6 -- Update the metadata DB. The worker writes (or updates) a record in the URL metadata store: url_hash, url, content_hash, content_path (S3 key), http_status, fetched_at, and a computed next_fetch_at (based on how frequently this page changes -- pages that change often are re-crawled sooner).

  • Step 7 -- Publish the crawled page to downstream consumers. The worker publishes a message to a Kafka topic (crawled-pages): { url, content_path, content_hash, fetched_at }. The search engine indexer, analytics pipeline, or any other downstream system consumes from this topic. This decouples the crawler from its consumers.

  • Step 8 -- Extract links from the HTML. The worker (or a separate link extraction worker) parses the HTML and extracts all hyperlinks (<a href="...">). Each extracted link is normalized: resolve relative URLs against the page's base URL, remove URL fragments (#section), lowercase the domain, and strip tracking parameters where possible. This produces a list of candidate URLs to crawl next.

  • Step 9 -- Deduplicate and enqueue new URLs. For each extracted URL, the worker checks the URL deduplicator:

    • Fast check (Bloom filter): A Bloom filter holding all known URLs is checked. If the Bloom filter says "definitely not seen," the URL is new. If it says "probably seen," we do a definitive check.

    • Definitive check (metadata DB): Look up the url_hash in the metadata DB. If the URL exists and was recently crawled, skip it. If it does not exist, it is truly new.

    • New URL: Insert it into the URL frontier with an appropriate priority (based on domain importance, depth from seed, etc.). It will be crawled in a future cycle.

      Why a Bloom filter? At 2 billion URLs, checking the metadata DB for every single extracted link would be extremely slow. A Bloom filter with 10 bits per element uses ~2.5 GB of memory and answers "have we seen this URL?" in microseconds with a ~1% false-positive rate. The 1% false positives mean we occasionally skip a URL we have not actually seen -- this is an acceptable trade-off for massive speed gain.

  • What the "user" sees: There is no human user in the loop. The internal monitoring dashboard shows a counter ticking up: pages crawled, pages in queue, pages per second. The downstream indexer receives a steady stream of freshly crawled content via Kafka.
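Step 8's URL normalization maps cleanly onto Python's standard library. This sketch resolves relative links, strips fragments, and lowercases the scheme and host; tracking-parameter stripping would need a site-specific list and is omitted:

```python
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

def normalize(base_url: str, href: str) -> str:
    """Normalize an extracted link relative to the page it came from."""
    absolute = urljoin(base_url, href)         # resolve relative URLs
    absolute, _fragment = urldefrag(absolute)  # remove #section
    parts = urlsplit(absolute)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

print(normalize("https://Example.COM/a/b", "../c#intro"))
# https://example.com/c
```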


Flow 2: Seeding the Crawler and Managing the Frontier

This flow describes how the crawler is initialized and how the frontier stays healthy over time.

  • Step 1 -- Operator submits seed URLs. An operator calls POST /api/v1/seeds with a list of high-quality starting URLs (e.g., major news sites, Wikipedia, popular domains). These are inserted directly into the URL frontier with the highest priority and depth 0.

  • Step 2 -- Seed pages are crawled (Flow 1). The fetcher workers crawl the seed pages and extract links. These "depth 1" links are added to the frontier. As those are crawled, "depth 2" links are discovered, and so on. The crawler gradually explores the web outward from the seeds, like a breadth-first search.

  • Step 3 -- Depth and budget limits are enforced. Each URL carries a depth counter. When the depth exceeds the configured maximum (e.g., 5), the URL is not added to the frontier. Additionally, per-domain crawl budgets cap the total number of pages crawled per domain (e.g., max 50,000 pages from any single domain), preventing the crawler from spending all its resources on one massive site.

  • Step 4 -- Re-crawl scheduler runs continuously. A background process scans the metadata DB for URLs where next_fetch_at < NOW(). These URLs are re-enqueued into the frontier for re-crawling. The next_fetch_at value is set adaptively: pages that change frequently (detected by content_hash changes) get shorter intervals (e.g., re-crawl in 1 day), while pages that rarely change get longer intervals (e.g., re-crawl in 30 days). This focuses crawl resources on content that is actually changing.

  • Step 5 -- Frontier balancing. The frontier continuously rebalances its per-domain buckets. If a domain has thousands of queued URLs but a crawl rate of 1/sec, those URLs will take a long time to process. The frontier prioritizes domains with high-value pages and avoids starving small domains. A simple round-robin with priority weighting works: cycle through domains, spending more time on higher-priority ones.
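Step 4's adaptive scheduling can be sketched as a halve-on-change, double-on-no-change rule. The 1-day and 30-day bounds come from the text; the multiplicative factors themselves are an illustrative assumption:

```python
from datetime import datetime, timedelta
from typing import Tuple

MIN_INTERVAL = timedelta(days=1)   # frequently changing pages
MAX_INTERVAL = timedelta(days=30)  # rarely changing pages

def next_fetch_at(last_fetch: datetime, prev_interval: timedelta,
                  content_changed: bool) -> Tuple[datetime, timedelta]:
    """Shorten the re-crawl interval when content changed, lengthen it when
    it did not, clamped to [1 day, 30 days]."""
    interval = prev_interval / 2 if content_changed else prev_interval * 2
    interval = max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
    return last_fetch + interval, interval
```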


Flow 3: Handling a Domain for the First Time

This flow illustrates what happens when the crawler encounters a link to a brand-new domain it has never seen before.

  • Step 1 -- New domain discovered. During link extraction, the crawler finds a link to https://newsite.io/article/1. The domain newsite.io has never been crawled before.

  • Step 2 -- Fetch and cache robots.txt. Before crawling any page on this domain, the crawler fetches https://newsite.io/robots.txt. The response is parsed and stored in the robots.txt cache (Redis) with a 24-hour TTL. If the fetch fails (404 or timeout), the crawler assumes all paths are allowed (standard practice).

  • Step 3 -- Initialize a domain bucket in the frontier. The frontier creates a new per-domain bucket for newsite.io with the default crawl rate (e.g., 1 request per second). The first URL (/article/1) is added to this bucket.

  • Step 4 -- DNS resolution and caching. The fetcher resolves newsite.io to an IP address. DNS results are cached locally for 1 hour to avoid redundant DNS lookups (DNS is slow -- 20-100 ms per lookup). The cached IP is used for all subsequent fetches to this domain.

  • Step 5 -- The page is fetched and processed normally (Flow 1, steps 3 onward). Links extracted from this page may point to more pages on newsite.io or to other new domains, continuing the discovery process.

  • Step 6 -- Crawl rate adapts. If newsite.io starts responding slowly or returning errors, the domain rate limiter automatically reduces the crawl rate for this domain (e.g., from 1/sec to 1 every 5 seconds). If the server recovers, the rate gradually increases. This adaptive politeness protects both the target server and our own resources.
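Step 6's adaptive politeness is essentially multiplicative decrease on errors with gradual recovery on success. The factors and bounds below are illustrative assumptions, not values from the text:

```python
class AdaptiveRate:
    """Per-domain crawl rate controller: back off quickly on errors,
    recover slowly on successes."""

    def __init__(self, rate: float = 1.0, min_rate: float = 0.2, max_rate: float = 1.0):
        self.rate = rate          # requests per second for this domain
        self.min_rate = min_rate  # floor, e.g. 1 request every 5 seconds
        self.max_rate = max_rate

    def record(self, ok: bool) -> float:
        if ok:
            self.rate = min(self.max_rate, self.rate * 1.1)  # gradual recovery
        else:
            self.rate = max(self.min_rate, self.rate / 2)    # fast back-off
        return self.rate
```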


9. Caching and Read Performance

What We Cache

  • robots.txt rules (Redis): robots:{domain} -- parsed rules for each domain. TTL: 24 hours. Checked before every fetch. Without caching, we would need to re-fetch robots.txt before every single page fetch.
  • DNS resolutions (local in-memory cache): dns:{domain} -- IP addresses. TTL: 1 hour. Avoids slow DNS lookups on every fetch.
  • URL deduplication (Bloom filter, in-memory): A probabilistic set of all known URLs. Checked for every extracted link. Much faster than hitting the metadata DB.
  • Recently fetched content hashes (local cache on fetcher workers): The last N thousand content hashes, used for quick "has this page changed?" checks before hitting the metadata DB.
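The robots.txt rules we cache can be parsed and queried with Python's standard library. In production the parsed rules would be serialized into Redis under robots:{domain} with a 24-hour TTL; the check itself looks like this:

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory copy of a domain's robots.txt.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Checked by every fetcher worker before issuing the HTTP GET.
print(rules.can_fetch("MyCrawler/1.0", "https://example.com/page/42"))    # True
print(rules.can_fetch("MyCrawler/1.0", "https://example.com/private/x"))  # False
```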

Where the Cache Sits

  • robots.txt cache: shared Redis instance, accessed by all fetcher workers before each fetch.
  • DNS cache: local to each fetcher worker process (no need to share -- each worker resolves independently).
  • Bloom filter: loaded into memory on each fetcher worker (replicated, not shared). Periodically rebuilt from the metadata DB.

Cache Update and Invalidation

  • robots.txt: Fetched on first access to a domain, cached with a 24-hour TTL. After TTL expires, re-fetched on the next crawl attempt to that domain.
  • DNS: Standard TTL-based expiry.
  • Bloom filter: Rebuilt periodically (e.g., every 6 hours) from the metadata DB. Between rebuilds, newly crawled URLs are added to the local Bloom filter in memory. A small window of inconsistency is acceptable (worst case: we re-crawl a URL we already crawled -- wasteful but not harmful).

Eviction Policy

LRU for the robots.txt cache (domains not crawled recently are evicted). The Bloom filter does not use eviction -- it is rebuilt periodically. DNS uses TTL-based expiry.

10. Storage, Indexing, and Media

Primary Data Storage

  • S3 (object storage) for raw HTML content. At ~360 TB/year (compressed), S3 is the only cost-effective option. Pages are stored as gzip-compressed objects keyed by {url_hash}/{timestamp}.
  • Distributed key-value store for URL metadata. At ~400 GB for 2 billion URLs, this fits in a sharded RocksDB cluster or a sharded PostgreSQL deployment.

Indexes

  • Primary key on url_hash -- the most critical index. Every deduplication check and every metadata update is a point lookup by url_hash.
  • Index on next_fetch_at -- the re-crawl scheduler scans for URLs due for re-crawling. This is a range scan, not a point lookup.
  • Index on domain (optional) -- for per-domain analytics and crawl budget enforcement.

Media Storage

We are crawling HTML only (per scope). The HTML itself is "media" in this context, stored in S3. If we expanded to crawl images or videos, they would go to the same S3 bucket under different key prefixes, with metadata in the DB.

CDN

Not applicable for the crawler itself (we are fetching from the web, not serving to end users). The downstream search engine would use a CDN to serve cached page snapshots, but that is outside our scope.

Trade-offs

  • S3 cost: ~360 TB/year at $0.023/GB = ~$8,300/year for storage. Cheap. But S3 PUT requests at 385/sec = ~1 billion PUTs/year at $0.005/1000 = ~$5,000/year. Total S3 cost is ~$13K/year -- very reasonable.
  • Metadata DB cost: 400 GB of SSD-backed storage is modest. The main cost is the write throughput (385 writes/sec for metadata updates).
  • Bloom filter memory: ~2.5 GB per worker for 2 billion URLs. Acceptable for modern servers.

11. Scaling Strategies

Version 1: Simple Setup

For a small crawl (millions of pages):

  • A single machine running 100 concurrent fetcher threads with async I/O.
  • A single Redis instance for the URL frontier and robots.txt cache.
  • A single PostgreSQL instance for URL metadata.
  • A single S3 bucket for content.

This can crawl ~50-100 pages/second and handle tens of millions of pages.

Growing the System

Horizontal scaling of fetchers: The fetcher pool is stateless. Add more machines to increase throughput. 10 machines with 100 concurrent connections each = 1,000 concurrent fetches = ~400-500 pages/second. 50 machines = ~2,000-2,500 pages/second. Each machine pulls URLs from the shared frontier.

Distributed URL Frontier: At scale, a single Redis instance cannot hold billions of URLs. Shard the frontier by domain: hash(domain) % N_shards determines which frontier shard holds a domain's queue. Each shard is a separate Redis instance (or a custom queue service). Fetcher workers pull from multiple shards.

Metadata DB sharding: Shard by url_hash (consistent hashing). Each shard holds a range of url_hashes. Deduplication lookups are routed to the correct shard based on the URL's hash. This scales reads and writes linearly.
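Both sharding schemes reduce to one routing function. One caveat worth encoding: Python's built-in hash() is randomized per process, so a crawler fleet must use a stable hash such as SHA-256 for hash(domain) % N_shards to agree across machines:

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Stable hash(key) % n_shards routing, identical on every machine.
    Used with key=domain for frontier shards and key=url_hash for metadata shards."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

print(shard_for("example.com", 16))  # same shard index on every worker
```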

Geographic distribution: For a truly global crawl, deploy fetcher clusters in multiple regions (US, Europe, Asia). Each cluster crawls domains geographically close to it (lower latency, faster fetches, less bandwidth cost). A central coordinator assigns domain-to-region mappings.

Separate fetch and processing: At very high scale, separate the "fetch" stage (HTTP download) from the "process" stage (link extraction, dedup, storage). Fetchers write raw responses to a Kafka topic. Processing workers consume from Kafka, extract links, update the frontier, and store content. This decouples the two stages and allows independent scaling.

Handling Bursts

  • Kafka between stages: Kafka buffers fetched pages if the processing stage falls behind. Fetchers keep fetching; processors catch up when they can.
  • Adaptive rate limiting: If the metadata DB or S3 is under heavy load, fetcher workers automatically slow down (backpressure). This prevents cascading failures.

12. Reliability, Failure Handling, and Backpressure

Removing Single Points of Failure

  • Fetcher workers: Stateless. If one crashes, its in-flight URLs are returned to the frontier (via a visibility timeout -- the frontier re-exposes a URL if no completion acknowledgment arrives within 60 seconds). No data is lost.
  • URL Frontier: Sharded across multiple Redis instances with persistence (AOF or RDB snapshots). If a shard dies, it restarts from its latest snapshot. Some URLs may be re-enqueued (duplicates are caught by the deduplicator -- wasted work but no data corruption).
  • Metadata DB: Replicated with automatic failover. Write-ahead log ensures no committed data is lost.
  • S3: 11 nines of durability. Effectively indestructible.
  • Kafka: Multi-broker with replication factor 3.

Timeouts, Retries, and Idempotency

  • HTTP fetch timeout: 10 seconds per page. Slow servers do not block the worker -- the async I/O model means other fetches continue in parallel.
  • Retries: Failed fetches are re-enqueued with exponential backoff (10 min, 30 min, 2 hours). After 3 failures, the URL is marked as permanently failed and removed from the frontier.
  • Idempotency: Crawling is naturally idempotent. If we accidentally fetch the same page twice, the content_hash check in Step 4 detects the duplicate and avoids storing it again. At worst, we waste a fetch.
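The retry schedule above (10 min, 30 min, 2 hours, then permanent failure) is small enough to write down literally:

```python
from typing import Optional

RETRY_DELAYS_SEC = [10 * 60, 30 * 60, 2 * 60 * 60]  # 10 min, 30 min, 2 hours

def next_retry(attempt: int) -> Optional[int]:
    """Delay in seconds before retry number `attempt` (1-based), or None
    once the URL should be marked permanently failed."""
    if attempt > len(RETRY_DELAYS_SEC):
        return None
    return RETRY_DELAYS_SEC[attempt - 1]
```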

Backpressure

  • Frontier to fetchers: If all fetcher workers are busy, the frontier simply waits. URLs remain in the queue until a worker is available.
  • Fetchers to storage: If S3 uploads are slow, fetcher workers buffer responses in memory (up to a limit) and reduce their fetch rate. If the buffer fills, they stop pulling from the frontier.
  • Kafka backpressure: If downstream consumers (indexer) fall behind, Kafka retains messages. The crawler is unaffected -- it continues crawling and publishing.

Behavior Under Overload

  • Drop lower-priority URLs: If the frontier is overwhelmed, drop low-priority URLs (deep pages, low-value domains) to keep high-priority crawls on schedule.
  • Reduce crawl rate: Globally reduce the target QPS to give storage and processing time to catch up.
  • Skip re-crawls: Under extreme load, postpone re-crawl cycles and focus on new, never-seen URLs.

13. Security, Privacy, and Abuse

Authentication and Authorization

  • The crawler's control API (seed URLs, status, policies) is internal-only, protected by API keys and network-level access controls (VPN or private subnet).
  • No public-facing endpoints.

Encryption

  • In transit: All HTTP fetches to external sites use HTTPS where available. Internal communication between services uses TLS.
  • At rest: S3 server-side encryption for stored HTML content. Metadata DB disk encryption.

Responsible Crawling (Ethics and Politeness)

  • Respect robots.txt: Always check and obey. This is not just polite -- it may have legal implications.
  • Identify yourself: Use a descriptive User-Agent string with a link to a page explaining who you are and how to contact you. Site owners who want to block your crawler can do so by User-Agent in robots.txt.
  • Crawl rate limits: Never hammer a site. Enforce per-domain rate limits and back off on errors.
  • Do not crawl sensitive data: Avoid crawling pages behind login forms, personal data pages, or sites that have asked to be excluded.
  • Honor noindex and nofollow meta tags: If a page says <meta name="robots" content="noindex">, do not include it in downstream search results (though you may still need to fetch it to discover this tag).

Abuse Scenarios

  • Spider traps: Some sites generate infinite URLs (e.g., calendar pages with links to every future date). Mitigation: Enforce depth limits and per-domain page budgets. Detect and blocklist trap patterns.
  • Malicious content: Downloaded HTML may contain malware or exploit payloads. Mitigation: The crawler stores raw content but does not render or execute it. Processing workers parse HTML with a safe parser (not a browser). Isolate the processing environment.

14. Bottlenecks and Next Steps

Main Bottlenecks and Risks

  • URL frontier size and management. With billions of URLs, the frontier becomes the most complex component. Mitigation (already in design): Shard by domain. Next step: Build a custom distributed priority queue optimized for the crawler's access patterns (per-domain bucketing with rate limiting built in), rather than relying on general-purpose Redis.

  • Bloom filter staleness. The Bloom filter is rebuilt periodically, so newly crawled URLs may not be reflected immediately, causing redundant fetches. Mitigation: Add newly crawled url_hashes to the local Bloom filter in real time. Next step: Use a distributed Bloom filter (or a counting Bloom filter) that is updated across all workers, or switch to a distributed set like a sharded hash table.

  • Slow/unresponsive domains. Some websites respond very slowly, tying up fetcher connections. Mitigation (already in design): 10-second timeout, async I/O. Next step: Maintain a "slow domain" list and reduce their concurrency to 1 connection, freeing resources for faster domains.

  • DNS resolution bottleneck. At 385+ fetches/second across millions of domains, DNS lookups can become a bottleneck. Mitigation (already in design): Local DNS caching. Next step: Run a local DNS resolver (e.g., Unbound) on each fetcher machine to avoid depending on external DNS servers.

  • Storage costs at scale. 360 TB/year of compressed HTML adds up over multiple years. Next step: Implement content-level deduplication (many pages share identical boilerplate). Store shared boilerplate once and delta-encode page-specific content. Also implement lifecycle policies to delete very old versions.

Design Summary

  • Crawl loop -- URL frontier -> fetcher workers -> link extraction -> dedup -> back to frontier. Trade-off: a simple loop that scales horizontally; the complexity lives in frontier management.
  • URL frontier -- distributed priority queue, sharded by domain, with per-domain rate limiting. Trade-off: ensures politeness and fairness, but adds complexity vs. a simple FIFO queue.
  • Deduplication -- Bloom filter (fast, approximate) + metadata DB (slow, definitive). Trade-off: ~1% false positives (missed URLs) traded for a massive speed improvement.
  • Content storage -- S3 with gzip compression. Trade-off: cheap and durable, but no query capability (the metadata DB handles that).
  • Change detection -- content hashing on re-crawl. Trade-off: avoids storing duplicate content; re-crawl frequency adapts to change rate.
  • Scaling path -- add fetcher machines horizontally; shard the frontier and metadata DB. Trade-off: linear scaling; each component can be scaled independently.

This design is built around one central insight: a web crawler is an enormous loop -- fetch a page, extract links, enqueue them, repeat -- and the hard part is managing that loop at billion-URL scale.

The URL frontier (with per-domain politeness), the deduplication layer (Bloom filter + DB), and the storage layer (S3 for content, sharded DB for metadata) are the three pillars.

By keeping fetcher workers stateless and decoupling stages with Kafka, we can scale each piece independently and handle failures gracefully.

The system is polite by design, efficient with storage through content hashing, and adaptive through priority-based crawling and dynamic re-crawl scheduling.