## Full-Text Search Architecture: Crawling, Indexing, Querying
Search is one of those features that looks simple from the outside and is extraordinarily complex on the inside.
A user types a few words, hits enter, and expects relevant results in under 200 milliseconds. Making that happen across millions or billions of documents requires a three-stage pipeline that every search system follows.
**Crawling**
Crawling is the process of discovering and collecting content that should be searchable.
For a web search engine like Google, crawlers visit web pages, follow links, download the page content, and store it for processing.
For an internal application search (like searching products on an e-commerce site), crawling might mean reading from your database, consuming events from a change data capture stream, or polling an API for updates.
The crawler's job is to answer one question: what content exists and what does it say?
It extracts text, metadata, and structure from each source document. A product crawler might extract the title, description, category, price, brand, and customer reviews from each product listing.
Crawling must also handle updates.
Content changes.
Products go out of stock.
Blog posts get edited.
Web pages disappear.
The crawler needs a strategy for re-visiting sources to detect changes.
Frequency depends on how often the content changes: a news site might be re-crawled every few minutes, while a product catalog might be re-crawled hourly.
**Indexing**
Indexing transforms raw content into a data structure optimized for fast search. The most common structure is the inverted index (covered in detail below).
During indexing, the system processes each document through a text analysis pipeline: tokenizing text into individual terms, normalizing those terms (lowercasing, removing punctuation, stemming), and storing the mapping of terms to the documents that contain them.
Indexing is the most compute-intensive phase. Building an index over billions of documents takes significant processing power and storage. But once the index is built, queries against it are extremely fast because the index eliminates the need to scan every document.
**Querying**
Querying is what the user actually sees.
A search query comes in, the system looks up the query terms in the index, finds the matching documents, ranks them by relevance, and returns the top results. This entire process happens in milliseconds for a well-built search system.
The query pipeline typically includes query parsing (understanding what the user means, handling special syntax), term lookup (finding matching documents in the index), relevance scoring (ranking the results), and post-processing (applying filters, facets, highlighting matched terms in snippets).
_3-Stage Search Pipeline_
## Inverted Indexes and Tokenization
The inverted index is the data structure at the heart of every search engine. Understanding how it works is essential for designing any system that involves full-text search.
**How an Inverted Index Works**
A regular index (a forward index) maps documents to the words they contain. Document 1 contains "the quick brown fox." Document 2 contains "the lazy brown dog." A forward index looks like a table of contents: Document 1 → \[the, quick, brown, fox\], Document 2 → \[the, lazy, brown, dog\].
An inverted index flips this relationship. It maps each word to the list of documents that contain it.
| Term | Document IDs |
|---|---|
| the | \[1, 2\] |
| quick | \[1\] |
| brown | \[1, 2\] |
| fox | \[1\] |
| lazy | \[2\] |
| dog | \[2\] |
When a user searches for "brown fox," the system looks up "brown" (documents 1 and 2) and "fox" (document 1), intersects the results, and returns Document 1. No scanning through documents is needed.
The lookup is essentially a dictionary lookup per term, followed by a set intersection. This is why search engines can query billions of documents in milliseconds.
Each entry in the inverted index can also store additional information beyond just the document ID: the position of the term within the document (enabling phrase matching), the frequency of the term (used for relevance scoring), and field-level metadata (whether the term appeared in the title, body, or tags).
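The structure above can be sketched in a few lines of Python. This is a toy model (whitespace tokenization, sets as posting lists); real engines store postings as compressed sorted lists and support more than AND semantics:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-match: intersect the posting sets of every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "the quick brown fox", 2: "the lazy brown dog"}
index = build_inverted_index(docs)
print(search(index, "brown fox"))  # {1}
```

The query cost is one dictionary lookup per term plus a set intersection, which is why it stays fast regardless of how many documents do not match.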
**Tokenization and Text Analysis**
Before content enters the inverted index, it passes through a text analysis pipeline that transforms raw text into searchable terms.
Tokenization splits text into individual tokens (usually words). "The quick brown fox" becomes \["The", "quick", "brown", "fox"\]. Tokenization decisions matter. Should "New York" be one token or two? Should "user-friendly" be split at the hyphen? Different tokenizers handle these edge cases differently.
Lowercasing normalizes case so that "Brown," "brown," and "BROWN" all match. Virtually every search system applies this.
Stop word removal strips common words like "the," "is," "and," "a" that appear in nearly every document and add no search value. Removing them reduces index size and speeds up queries. Some modern search engines have moved away from stop word removal because it can hurt phrase searches ("to be or not to be" loses all meaning without stop words).
Stemming reduces words to their root form. "Running," "runs," and "ran" all stem to "run." This lets a search for "running" match documents containing "runs." Stemming algorithms (like the Porter Stemmer) use rules to strip suffixes. They are fast but imperfect. "University" and "universe" might incorrectly stem to the same root.
Lemmatization is a more accurate alternative to stemming. It uses vocabulary knowledge and morphological analysis to find the correct base form. "Better" lemmatizes to "good" (which stemming would miss). Lemmatization is slower but produces better results for languages with complex morphology.
| Analysis Step | Input | Output | Purpose |
|---|---|---|---|
| Tokenization | "The Quick-Brown fox" | \["The", "Quick", "Brown", "fox"\] | Split text into terms |
| Lowercasing | \["The", "Quick", "Brown", "fox"\] | \["the", "quick", "brown", "fox"\] | Case-insensitive matching |
| Stop word removal | \["the", "quick", "brown", "fox"\] | \["quick", "brown", "fox"\] | Remove noise terms |
| Stemming | \["quick", "brown", "fox"\] | \["quick", "brown", "fox"\] | Reduce to root forms (no change here) |
The same analysis pipeline must be applied to both the indexed documents and the search queries. If you stem "running" to "run" during indexing but search for the un-stemmed "running" at query time, you will not find a match. Consistency between index-time and query-time analysis is critical.
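A toy version of the pipeline illustrates the steps. The stemmer here is a deliberately naive suffix-stripper standing in for Porter/Snowball; note it produces "runn" rather than "run", which illustrates that rule-based stemming is imperfect. Because the same `analyze` function is used at both index time and query time, the imperfect stems still match each other:

```python
import re

STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to"}

def naive_stem(term):
    # Toy suffix-stripping; real systems use Porter or Snowball stemmers.
    for suffix in ("ing", "ly", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def analyze(text):
    """Tokenize, lowercase, drop stop words, stem.

    Must be applied identically at index time and query time."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("The quick brown fox is running"))
# ['quick', 'brown', 'fox', 'runn']
```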
**Elasticsearch and Apache Solr**
Two search engines dominate the landscape for application search. Both are built on top of Apache Lucene, the underlying library that provides the inverted index, text analysis, and query execution engine.
**Elasticsearch**
Elasticsearch is the most widely used search engine for application-level search. It stores data as JSON documents, provides a powerful query DSL (Domain-Specific Language) for full-text search, filtering, and aggregation, and distributes data across a cluster of nodes for horizontal scalability.
Elasticsearch organizes data into indexes (analogous to database tables). Each index is divided into shards, and each shard is a self-contained Lucene index. Shards are distributed across nodes in the cluster, and each shard can have one or more replicas for fault tolerance and read throughput.
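As a concrete sketch, a minimal index definition might look like the following. The index and field names are illustrative; the body would be sent as JSON with a `PUT` request when creating the index:

```python
import json

# Sketch of an Elasticsearch "create index" request body.
index_settings = {
    "settings": {
        "number_of_shards": 3,    # fixed at creation; each shard is a Lucene index
        "number_of_replicas": 1,  # copies per shard, adjustable at runtime
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},     # analyzed for full-text search
            "brand": {"type": "keyword"},  # exact-match, usable as a facet
            "price": {"type": "float"},
        }
    },
}
print(json.dumps(index_settings, indent=2))
```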
Beyond search, Elasticsearch is widely used for log analysis (the "E" in the ELK stack: Elasticsearch, Logstash, Kibana), metrics and monitoring, and analytics on semi-structured data. Its near-real-time indexing (new documents are searchable within 1 second of being indexed) makes it suitable for applications where freshness matters.
**Apache Solr**
Solr is the other major search platform built on Lucene. It predates Elasticsearch and has been in production at large organizations for over 15 years. Solr provides similar full-text search capabilities, faceting, highlighting, and distributed search through its SolrCloud mode.
Solr's strengths include its mature faceted search capabilities, its robust caching layer, and its stability in large-scale deployments. It has traditionally been stronger than Elasticsearch for structured data search and complex faceted navigation.
**Choosing Between Them**
| Aspect | Elasticsearch | Solr |
|---|---|---|
| Data format | JSON-native, schema-free (dynamic mapping) | XML/JSON, schema-defined (more structured) |
| Query language | JSON-based Query DSL | Lucene query syntax, JSON API |
| Real-time indexing | Near-real-time (< 1 second) | Near-real-time (configurable) |
| Ecosystem | ELK stack, APM, SIEM | Strong standalone, Hadoop integration |
| Community | Larger, more active | Mature, stable |
| Scaling | Easier shard management | SolrCloud (more manual configuration) |
| Best for | Log analytics, general-purpose search, modern apps | Enterprise search, complex faceted navigation |
For most new projects, Elasticsearch is the default choice because of its larger community, simpler scaling model, and richer ecosystem. Solr remains a solid choice for organizations already invested in its ecosystem or for use cases that lean heavily on faceted search.
**Search Ranking and Relevance Scoring (TF-IDF, BM25)**
Finding documents that match a query is the easy part. Ranking them so the most relevant results appear first is the hard part. Two algorithms form the foundation of relevance scoring in almost every search engine.
**TF-IDF (Term Frequency-Inverse Document Frequency)**
TF-IDF scores how important a word is to a specific document within a collection of documents. It combines two signals.
Term Frequency (TF) measures how often a term appears in a document. A document that mentions "kubernetes" 15 times is probably more about Kubernetes than one that mentions it once. Higher frequency within the document means higher relevance.
Inverse Document Frequency (IDF) measures how rare a term is across all documents. A term that appears in nearly every document (like "the" or "system") is not useful for distinguishing relevance. A term that appears in only 5 out of 10,000 documents (like "kubernetes") is highly discriminating. The rarer the term across the corpus, the more weight it carries.
The TF-IDF score for a term in a document is: TF × IDF. A term that is frequent in the document and rare across all documents gets the highest score. A term that is common everywhere gets a low score regardless of how often it appears in any single document.
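A minimal sketch of the computation (the corpus is invented and tiny; production engines use smoothed IDF variants, but the shape is the same):

```python
import math

def tf_idf(term, doc_terms, corpus):
    """Classic TF-IDF: raw term frequency times log-scaled
    inverse document frequency."""
    tf = doc_terms.count(term)
    df = sum(1 for d in corpus if term in d)       # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["kubernetes", "cluster", "pods"],
    ["the", "system", "design"],
    ["the", "system", "kubernetes"],
    ["the", "design", "notes"],
]
# "kubernetes" appears in 2 of 4 docs; "the" in 3 of 4, so it carries less weight.
print(tf_idf("kubernetes", corpus[0], corpus))  # ≈ 0.693
print(tf_idf("the", corpus[1], corpus))         # ≈ 0.288
```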
**BM25 (Best Match 25)**
BM25 is the evolution of TF-IDF that addresses its shortcomings. TF-IDF has a linear relationship with term frequency: a document with "kubernetes" 100 times scores 10x higher than one with it 10 times. In practice, the difference between 10 mentions and 100 mentions should not be that significant. After a certain point, additional occurrences add diminishing value.
BM25 introduces a saturation curve for term frequency. The first few occurrences of a term boost the score significantly. Additional occurrences have diminishing impact. A document with "kubernetes" 10 times scores much higher than one with it 1 time, but a document with 100 mentions scores only marginally higher than one with 10.
BM25 also accounts for document length. A 10,000-word article that mentions "kubernetes" 10 times is less focused on the topic than a 200-word article that mentions it 10 times. BM25 normalizes term frequency by document length so that shorter, focused documents are not disadvantaged against longer, less focused ones.
BM25 is the default ranking algorithm in both Elasticsearch and Solr. It replaced TF-IDF as the industry standard because it produces consistently better rankings across diverse search workloads.
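The saturation and length-normalization behavior can be seen in a small sketch of the per-term BM25 formula (using the Lucene-style IDF variant; `k1` and `b` are the standard tuning parameters, and the counts below are invented):

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 contribution of one term: saturating TF times IDF,
    normalized by document length relative to the corpus average."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm

common = dict(df=50, n_docs=10_000, doc_len=300, avg_len=300)
s1 = bm25_term(tf=1, **common)
s10 = bm25_term(tf=10, **common)
s100 = bm25_term(tf=100, **common)
# Saturation: s10 is roughly 2x s1, but s100 is only ~1.1x s10.
print(s1, s10, s100)
```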
**Interview-Style Question**
> Q: A user searches for "best Italian restaurant" on your local business search platform. How does the ranking system decide which results to show first?
> A: BM25 handles the text relevance component: businesses whose names and descriptions contain "Italian" and "restaurant" prominently (high TF) and where those terms are relatively discriminating (IDF) score higher. But text relevance alone is insufficient. The ranking system combines BM25 with additional signals: user ratings and review count (popularity), geographic proximity to the user (a restaurant 1 mile away ranks higher than one 20 miles away), business hours (open restaurants rank higher than closed ones), and possibly personalization (the user's past searches or preferences). The final score is a weighted combination of text relevance and these business-specific signals. The weights are tuned through offline evaluation and A/B testing.
**Typeahead / Autocomplete Systems**
Typeahead is the feature that shows suggestions as you type in a search box. You type "how to le" and the system suggests "how to learn system design," "how to learn python," and "how to learn guitar." The suggestions appear within 50 to 100 milliseconds of each keystroke.
**How Typeahead Works**
The core data structure for typeahead is the trie (prefix tree). A trie stores strings character by character, with each path from the root to a leaf representing a complete string. Sharing common prefixes across many strings makes the trie memory-efficient and extremely fast for prefix lookups.
When the user types "sys," the system traverses the trie from root → s → y → s and finds all completions that branch from that node: "system design," "system architecture," "system administrator." Each suggestion is annotated with a popularity score (based on search frequency, recent trends, or personalization) so the most popular completions appear first.
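A minimal trie-based typeahead sketch. The phrases and scores are illustrative; production systems usually precompute and store the top-k completions at each node rather than walking the subtree on every keystroke:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.score = None  # popularity score if a complete suggestion ends here

class Typeahead:
    def __init__(self):
        self.root = TrieNode()

    def add(self, phrase, score):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.score = score

    def suggest(self, prefix, k=3):
        """Walk to the prefix node, collect completions, rank by popularity."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.score is not None:
                results.append((text, cur.score))
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return [t for t, _ in sorted(results, key=lambda x: -x[1])[:k]]

t = Typeahead()
t.add("system design", 90)
t.add("system architecture", 70)
t.add("system administrator", 40)
print(t.suggest("sys"))
# ['system design', 'system architecture', 'system administrator']
```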
For large-scale systems, the trie is too large to fit in a single server's memory. The suggestions are precomputed and distributed across multiple nodes. Common approaches include sharding by prefix (one node handles all prefixes starting with "a" through "f," another handles "g" through "m") and caching the most popular prefixes in Redis or Memcached.
**Design Considerations**
Latency budget. Typeahead must respond in under 100ms because the user is typing continuously. If the response takes 300ms, the suggestions feel sluggish and the user has already typed more characters, making the suggestions stale.
Update frequency. Trending searches change constantly. A typeahead system needs a mechanism to incorporate new popular queries. Some systems update the trie in real time from a stream of recent searches. Others rebuild the trie periodically (every 15 minutes to every few hours) from aggregated search logs.
Personalization. Generic typeahead shows the same suggestions to every user. Personalized typeahead weights suggestions based on the user's search history, location, and preferences. Combining a global trie (popular suggestions) with a per-user trie (personalized suggestions) and merging the results produces the most relevant experience.
Offensive content filtering. Autocomplete suggestions appear before the user has fully committed to a search. Suggesting offensive or inappropriate completions is a significant product risk. Typeahead systems maintain a blocklist of terms that should never appear as suggestions, and new suggestions are typically reviewed or filtered before entering the trie.
**Faceted Search and Filtering**
Faceted search lets users narrow their search results by selecting attributes without typing a new query. On an e-commerce site, after searching for "laptop," you see facets on the side: brand (Apple, Dell, Lenovo), price range ($500-$1000, $1000-$2000), screen size (13", 15", 17"), and rating (4 stars and above). Clicking a facet refines the results instantly.
**How Facets Work**
Facets are computed by aggregating the values of specific fields across all documents that match the current query. For a search returning 5,000 laptops, the system counts how many have brand="Apple" (1,200), brand="Dell" (900), brand="Lenovo" (800), and so on. These counts are displayed alongside the facet values so the user knows how many results each filter produces.
Elasticsearch and Solr both support facets natively, Elasticsearch through its aggregations API and Solr through its faceting API. Faceted search requires the faceted fields to be indexed in a specific way (as keyword fields, not analyzed text) so that "Apple" and "apple" are not treated as separate facet values.
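A toy facet aggregation over an in-memory result set shows the idea (the documents and field names are invented; a real engine does this against the matching document set inside the index):

```python
from collections import Counter

def facet_counts(hits, fields):
    """Count keyword-field values across the documents matching a query."""
    return {f: Counter(doc[f] for doc in hits if f in doc) for f in fields}

laptops = [  # the documents matching the current query
    {"brand": "Apple", "screen": '13"'},
    {"brand": "Dell", "screen": '15"'},
    {"brand": "Apple", "screen": '15"'},
    {"brand": "Lenovo", "screen": '13"'},
]
print(facet_counts(laptops, ["brand", "screen"]))
# {'brand': Counter({'Apple': 2, 'Dell': 1, 'Lenovo': 1}), 'screen': ...}
```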
**Performance Considerations**
Computing facet counts across large result sets is expensive.
If a query matches 2 million documents and you need facet counts for 10 attributes, each with 50 possible values, the system is performing significant aggregation work.
Optimization strategies include pre-computing facet counts for popular queries and caching them, limiting the number of facet values returned (show the top 10 brands, not all 500), and using approximate counts for very large result sets where exact numbers are not critical.
Faceted search also interacts with filtering. When a user selects "Brand: Apple," the system applies the filter and recomputes facet counts for the remaining attributes based on the filtered result set. This means every facet click triggers a new aggregation query. Keeping this fast requires careful index design and often dedicated infrastructure for aggregation-heavy workloads.
**Fuzzy Matching and Spell Correction**
Users make typos. They search for "restrant" instead of "restaurant," "iphne" instead of "iphone," and "kubernets" instead of "kubernetes." A search system that returns zero results for these queries fails the user. Fuzzy matching and spell correction handle these cases.
**Edit Distance (Levenshtein Distance)**
The mathematical foundation of fuzzy matching is edit distance: the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. "cat" to "car" has an edit distance of 1 (substitute t→r). "kitten" to "sitting" has an edit distance of 3.
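A compact dynamic-programming implementation (the two-row variant of the standard Levenshtein algorithm):

```python
def edit_distance(a, b):
    """Levenshtein distance: cell [i][j] of the DP table is the cost of
    transforming a[:i] into b[:j]; only the previous row is kept."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free on match)
            ))
        prev = cur
    return prev[-1]

print(edit_distance("cat", "car"))               # 1
print(edit_distance("kitten", "sitting"))        # 3
print(edit_distance("kubernets", "kubernetes"))  # 1
```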
Elasticsearch supports fuzzy queries natively. A fuzzy search for "kubernets" with a maximum edit distance of 2 will match "kubernetes" because only one character is missing. The system finds terms in the index that are within the specified edit distance of the query term.
The trade-off is that higher edit distances produce more false positives. An edit distance of 1 is precise and fast. An edit distance of 2 catches most typos but may return irrelevant matches. An edit distance of 3 is almost always too broad for practical use.
**Spell Correction Approaches**
Did-you-mean suggestions show a corrected query above the results. "Showing results for 'restaurant'. Search instead for 'restrant'?" The correction is computed by finding the closest match in a dictionary of known terms (the search engine's own index vocabulary or a separate dictionary), weighted by term popularity. "Restaurant" is a far more common search term than "restrant," so it is the obvious correction.
Automatic correction silently corrects the query without asking the user. The system detects that "restrant" is not in the index, finds that "restaurant" is the closest popular term, and searches for "restaurant" instead. This is more aggressive but provides a smoother experience when the correction is confident.
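A toy did-you-mean corrector sketches the dictionary-plus-popularity idea. It uses Python's stdlib `difflib` similarity as a stand-in for index-backed edit-distance lookup, and the vocabulary and popularity counts are invented. Note that "restraint" is textually closer to "restrant" than "restaurant" is, but popularity weighting picks the right correction:

```python
import difflib

# Hypothetical search vocabulary with query popularity counts.
SEARCH_VOCAB = {
    "restaurant": 50_000,
    "iphone": 120_000,
    "kubernetes": 8_000,
    "restraint": 900,
}

def correct(query):
    """Did-you-mean: among the closest vocabulary terms,
    prefer the most popular one."""
    candidates = difflib.get_close_matches(query, SEARCH_VOCAB, n=3, cutoff=0.7)
    if not candidates:
        return query  # no confident correction; leave the query alone
    return max(candidates, key=lambda t: SEARCH_VOCAB[t])

print(correct("restrant"))  # restaurant
print(correct("iphne"))     # iphone
```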
Phonetic matching handles cases where the user knows how a word sounds but not how it is spelled. Algorithms like Soundex, Metaphone, and Double Metaphone convert words into phonetic codes. "Smith" and "Smyth" produce the same phonetic code, so a search for one matches the other. Phonetic matching is especially useful for name search in contact directories and customer databases.
**Distributed Search at Scale**
A single Elasticsearch node can handle millions of documents for a small application. But when your index grows to billions of documents, or your query volume reaches thousands per second, a single node is not enough.
Distributing search across a cluster of machines while maintaining fast query response is one of the most interesting challenges in system design.
**Sharding the Index**
The primary technique for scaling search is sharding. The index is split into multiple shards, each containing a subset of the documents. Shards are distributed across nodes in the cluster.
When a query arrives, it is executed against all shards in parallel, and the results from each shard are merged into a single ranked list.
The number of shards determines the maximum parallelism for a single query and the maximum size of the index.
An index with 10 shards can, in the ideal case, answer a query roughly 10x faster than a single shard holding all the data (each shard searches only a tenth of the documents, in parallel) and can store 10x more data. In practice the merge step and the slowest shard eat into that speedup.
Choosing the right number of shards is a trade-off. Too few shards limits parallelism and capacity. Too many shards creates overhead: each shard consumes memory for its data structures, and merging results from hundreds of shards adds latency.
A common guideline for Elasticsearch is to keep each shard between 10 GB and 50 GB and to avoid exceeding a few hundred shards per node.
**Replication for Throughput and Fault Tolerance**
Each shard can have one or more replicas. Replicas serve two purposes. They provide fault tolerance (if the primary shard's node fails, a replica is promoted). They also increase read throughput (queries can be served by any replica, distributing the load across more nodes).
For read-heavy search workloads, adding replicas is the simplest way to scale query throughput.
If a single replica of a shard handles 500 queries per second and you need to handle 2,000, add three more replicas.
**The Scatter-Gather Pattern**
Distributed search follows the scatter-gather pattern.
A coordinator node receives the query, scatters it to all shards (or their replicas), each shard searches its local data and returns its top results, and the coordinator gathers and merges the results from all shards into a final ranked list.
The merge step is where distributed search gets subtle. Each shard returns its top N results based on local scoring. The coordinator merges N results from each of S shards, re-ranks the combined set, and returns the global top N.
This works correctly for scoring algorithms like BM25 where document-level scores can be compared across shards, but it requires that term statistics (IDF values) are either global (computed across the entire index) or closely approximated by local shard statistics.
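A toy scatter-gather sketch. The scores are precomputed stand-ins for BM25, and the shard calls run sequentially here; a real coordinator issues them in parallel over the network:

```python
import heapq

def shard_search(shard_docs, query_term, top_n):
    """One shard scores its local documents and returns its top N."""
    hits = [(score, doc_id)
            for doc_id, (terms, score) in shard_docs.items()
            if query_term in terms]
    return heapq.nlargest(top_n, hits)

def coordinator_search(shards, query_term, top_n=3):
    """Scatter the query to every shard, gather each local top-N list,
    merge into the global top-N."""
    gathered = []
    for shard in shards:  # in production: parallel RPCs
        gathered.extend(shard_search(shard, query_term, top_n))
    return [doc_id for _, doc_id in heapq.nlargest(top_n, gathered)]

shards = [
    {"d1": ({"fox"}, 2.1), "d2": ({"dog"}, 1.0)},
    {"d3": ({"fox"}, 3.4), "d4": ({"fox"}, 0.7)},
]
print(coordinator_search(shards, "fox"))  # ['d3', 'd1', 'd4']
```

Each shard only ships its top N over the network, not its full hit list, which keeps the gather step cheap even when a query matches millions of documents.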
**Query Routing and Optimization**
Not every query needs to hit every shard. If you shard your product index by category, a query with a category filter ("search laptops in Electronics") can be routed to only the Electronics shard, avoiding the scatter to all other shards entirely. This is called routing and dramatically reduces query cost for filtered searches.
Time-based indexes are another optimization. For log search (the primary use case for the ELK stack), indexes are typically created per time period (one index per day or per week). A query for "errors in the last hour" only needs to search the most recent index, not the entire history.
**Challenges at Scale**
Index update latency. When a product's price changes, how quickly is the change reflected in search results? Elasticsearch's near-real-time indexing has a configurable refresh interval (default 1 second). For most applications, a 1-second delay is invisible. For real-time trading or auction platforms, it might not be fast enough.
Cluster management. A large Elasticsearch cluster (hundreds of nodes, thousands of shards) requires careful management. Shard allocation, node capacity planning, rolling upgrades, and index lifecycle management all need automation and monitoring.
Relevance tuning. Ranking quality is not a solved problem. Default BM25 scores are a starting point. Production search systems layer additional signals (popularity, freshness, personalization, geographic relevance) and continuously A/B test ranking changes to improve result quality.
**Beginner Mistake to Avoid**
New engineers sometimes use their primary database (PostgreSQL's full-text search, MongoDB's text indexes) for search and wonder why it is slow and the results are poor. Database-native search features are useful for basic use cases (searching a few thousand records), but they lack the inverted index optimization, relevance scoring, faceting, fuzzy matching, and distributed scaling that purpose-built search engines provide.
If search is a core feature of your application and your dataset exceeds a few hundred thousand records, use a dedicated search engine (Elasticsearch or Solr) and keep your database for transactional data.
**Interview-Style Question**
> Q: You are designing the search system for an e-commerce platform with 50 million products. Users should see results in under 200ms with support for full-text search, faceted filtering, and typo correction. Walk through your high-level design.
> A: Use Elasticsearch as the search engine. Index all 50 million products with fields for title, description, category, brand, price, rating, and availability. Shard the index into 20 to 30 shards (targeting 20 to 30 GB per shard) distributed across a cluster of 10 to 15 nodes. Add one replica per shard for fault tolerance and doubled read throughput. For the indexing pipeline, consume product change events from a Kafka topic (published by the product service whenever a product is created, updated, or deleted) and update the Elasticsearch index in near-real-time. For querying, use BM25 for text relevance combined with custom scoring that boosts popular products, in-stock products, and products with high ratings. Enable fuzzy matching with an edit distance of 1 for typo tolerance. Use Elasticsearch aggregations for faceted search on category, brand, price range, and rating. Place a Redis cache in front of Elasticsearch for the most common queries (the top 1,000 queries typically account for 30 to 50 percent of traffic). Typeahead runs against a precomputed trie of popular search terms, served from Redis or a dedicated in-memory service. Total query latency target: under 100ms at the 95th percentile from Elasticsearch, plus 10 to 20ms for network overhead, well under the 200ms budget.
_Distributed Search Architecture_
**KEY TAKEAWAYS**
* Full-text search follows a three-stage pipeline: crawling (collecting content), indexing (building the inverted index), and querying (searching and ranking results).
* The inverted index maps terms to documents, enabling millisecond lookups across billions of documents. Consistent text analysis (tokenization, stemming, lowercasing) between indexing and querying is critical.
* Elasticsearch is the default choice for most application search. Solr is a mature alternative strong in faceted search. Both are built on Apache Lucene.
* BM25 is the industry-standard relevance scoring algorithm. It improves on TF-IDF with term frequency saturation and document length normalization.
* Typeahead systems use tries for prefix lookup and must respond in under 100ms. Precompute suggestions and cache aggressively.
* Faceted search computes attribute counts across search results for filtering. Keep facet fields indexed as keywords, not analyzed text.
* Fuzzy matching uses edit distance to handle typos. Spell correction suggests or silently applies corrections based on dictionary proximity and term popularity.
* Distributed search scales through sharding (parallelism and capacity) and replication (throughput and fault tolerance). The scatter-gather pattern executes queries across shards in parallel.
For a detailed system design understanding, check out Grokking the System Design Interview.