Chapter 7: Modern & Emerging Topics

7.1 AI/ML System Design

ML System Design Lifecycle: Data, Training, Serving, Monitoring

Building a machine learning model in a Jupyter notebook and deploying a machine learning system in production are two entirely different engineering challenges.

The model itself is often the smallest piece of the puzzle.

The infrastructure around it (collecting data, engineering features, training at scale, serving predictions, and monitoring model health) is where the real system design lives.

The ML lifecycle has four stages, and each one presents distinct infrastructure challenges.

Data

Everything starts with data.

A recommendation model needs user interaction data (clicks, purchases, ratings).

A fraud detection model needs transaction data with labeled outcomes.

A search ranking model needs queries paired with the results users clicked.

The data stage involves collection (instrumenting your application to emit the right events), storage (data lakes, data warehouses), processing (cleaning, joining, aggregating, transforming raw data into training datasets), and validation (checking for data quality issues like missing values, schema changes, distribution drift, and label errors).

The data pipeline is the most fragile part of the ML system.

A schema change in a source database, a logging bug that drops events, or a silent data quality degradation can corrupt your training data without any obvious error. Models trained on bad data produce bad predictions, and the problem might not surface until users complain.
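The validation step described above can be sketched in a few lines of plain Python. The expected schema and the 5% null-rate threshold here are hypothetical choices, not a prescription:

```python
# Minimal data-validation sketch: check schema types and null rates before a
# batch of records is allowed into the training set.
# EXPECTED_SCHEMA and the 5% threshold are hypothetical.

EXPECTED_SCHEMA = {"user_id": str, "amount": float, "label": int}

def validate_batch(rows, max_null_rate=0.05):
    """Return a list of human-readable problems found in a batch of records."""
    problems = []
    null_counts = {col: 0 for col in EXPECTED_SCHEMA}
    for row in rows:
        for col, expected_type in EXPECTED_SCHEMA.items():
            value = row.get(col)
            if value is None:
                null_counts[col] += 1
            elif not isinstance(value, expected_type):
                problems.append(
                    f"{col}: expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    for col, count in null_counts.items():
        if rows and count / len(rows) > max_null_rate:
            problems.append(f"{col}: null rate {count / len(rows):.0%} exceeds threshold")
    return problems

batch = [
    {"user_id": "u1", "amount": 12.5, "label": 0},
    {"user_id": "u2", "amount": None, "label": 1},   # missing value
    {"user_id": "u3", "amount": "9.9", "label": 0},  # schema drift: string amount
]
print(validate_batch(batch))  # flags the string amount and the null rate
```

Checks like these run before training, so a schema change or logging bug fails loudly in the pipeline instead of silently corrupting the model.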

Training

Training takes the processed dataset and produces a model.

For simple models (logistic regression, decision trees), training runs on a single machine in minutes.

For large models (deep neural networks, large language models), training runs on clusters of GPUs for days or weeks.

The training stage involves experiment management (tracking which hyperparameters, data versions, and code versions produced which results), infrastructure provisioning (allocating GPU clusters, managing training jobs), and reproducibility (ensuring you can recreate any previous model by replaying the same data, code, and configuration).

Serving

Serving is the stage where the trained model produces predictions for real users.

A serving system needs to be fast (predictions in milliseconds for real-time use cases), reliable (always available, with fallbacks for model failures), and scalable (handling thousands to millions of prediction requests per second).

Monitoring

Once in production, models degrade. User behavior changes. The world changes.

A model trained on pre-pandemic shopping data performs poorly during a pandemic.

A fraud model trained on historical patterns misses new fraud techniques.

Monitoring detects this degradation through metrics like prediction accuracy, feature distributions, and business outcomes.

When degradation is detected, it triggers retraining.

These four stages form a continuous loop, not a linear pipeline.

Monitoring discovers data drift, which triggers new data collection and processing, which triggers retraining, which produces a new model to serve.

ML Lifecycle

Feature Stores and Feature Engineering Pipelines

A feature is a measurable attribute that the model uses to make predictions.

For a fraud detection model, features might include transaction amount, time since last transaction, number of transactions in the past hour, device fingerprint, and geographic distance from the user's usual location.

Raw data becomes features through a transformation process called feature engineering.

The Problem Feature Stores Solve

In most organizations, feature engineering is duplicated across teams.

The fraud team computes "average transaction amount over 30 days" in their pipeline.

The recommendation team computes the same metric independently in a different pipeline using slightly different logic.

The two values might not even match because of different data sources or different computation methods. When a new team needs the same feature, they build it a third time.

A feature store is a centralized system that manages the computation, storage, and serving of features.

Teams define features once, the feature store computes and stores them, and any team or model can access them consistently.

Feature Store Architecture

A feature store has two components.

The offline store holds historical feature values used for model training.

When you train a model, you need features computed as of past points in time (what was the user's 30-day average transaction amount at the moment each historical transaction occurred?). This is called point-in-time correctness, and getting it wrong causes data leakage: the model accidentally trains on information it would not have had in production.
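Point-in-time correctness can be illustrated with a minimal lookup: for each training event, use the latest feature value computed at or before the event's timestamp, never a later one. The feature history and integer timestamps here are toy values:

```python
# Point-in-time lookup sketch: for each training event, use the latest feature
# value computed at or before the event's timestamp, never after it.

def point_in_time_value(feature_history, event_ts):
    """feature_history: list of (timestamp, value) pairs, sorted ascending."""
    latest = None
    for ts, value in feature_history:
        if ts <= event_ts:
            latest = value
        else:
            break
    return latest

# Hypothetical 30-day-average feature, recomputed daily.
avg_txn_30d = [(1, 40.0), (2, 45.0), (3, 80.0)]

# A transaction that occurred at time 2 must see 45.0, not the later 80.0;
# using 80.0 would leak future information into training.
print(point_in_time_value(avg_txn_30d, 2))  # 45.0
```

Offline stores implement the same idea at scale, typically as a point-in-time join across the full training dataset.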

The online store holds the latest feature values for real-time serving. When a prediction request arrives, the serving system fetches the user's current features from the online store in milliseconds.

The online store is typically backed by a low-latency key-value store like Redis or DynamoDB.

Feature engineering pipelines compute features from raw data.

Batch pipelines (Spark, SQL) run periodically (hourly, daily) and write to the offline store. Streaming pipelines (Flink, Kafka Streams) compute features in real time and update the online store continuously.

Feast (open source), Tecton, and AWS SageMaker Feature Store are the leading feature store platforms.

Component     | Purpose                          | Storage                            | Latency                          | Used For
Offline store | Historical features for training | Data warehouse, S3, Parquet files  | Seconds to minutes (batch reads) | Model training, batch scoring
Online store  | Latest features for serving      | Redis, DynamoDB                    | Milliseconds                     | Real-time predictions

Model Training Infrastructure: Distributed Training, GPU Clusters

Training a simple model on a laptop is straightforward.

Training a deep neural network on terabytes of data across a cluster of GPUs is an infrastructure challenge that requires careful orchestration.

Single-Node Training

For models that fit in memory on a single machine (most tabular models, small neural networks), training is simple. You provision a machine with the right GPU (or CPU), load the data, run the training script, and save the resulting model.

Cloud platforms offer GPU instances (AWS p4d, Google Cloud A2, Azure NC) with NVIDIA GPUs optimized for ML workloads.

Distributed Training

When the model or the dataset is too large for a single machine, distributed training splits the work across multiple GPUs or machines.

  • Data parallelism is the most common approach. The dataset is split across multiple GPUs. Each GPU trains a copy of the full model on its subset of data. After each batch, the GPUs synchronize their gradients (using all-reduce communication patterns) so all copies converge to the same model. Data parallelism scales training throughput roughly linearly with the number of GPUs (doubling GPUs roughly halves training time), up to the point where communication overhead dominates.
  • Model parallelism splits the model itself across GPUs when it is too large to fit in a single GPU's memory. Different layers or segments of the model run on different GPUs. Data flows through them sequentially. Model parallelism is necessary for very large models (like LLMs with billions of parameters) but is more complex to implement than data parallelism.
  • Pipeline parallelism combines aspects of both. The model is split across GPUs (like model parallelism), but multiple micro-batches are processed concurrently, with each GPU working on a different micro-batch at each step (like an assembly line). This keeps all GPUs busy and reduces idle time.
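The gradient-averaging step at the heart of data parallelism can be simulated in plain Python, with no GPUs or real framework involved. Each "worker" computes gradients on its own data shard, then an all-reduce averages them so every replica applies the same update; a toy linear model stands in for the network:

```python
# Data-parallelism sketch: each worker computes gradients on its own shard,
# then an all-reduce averages the gradients so every replica applies the
# identical update. A toy model y = w * x stands in for the network.

def local_gradient(w, shard):
    # Gradient of mean squared error on this shard for the model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    # Stand-in for the collective communication step (e.g. an NCCL all-reduce).
    return sum(values) / len(values)

# The full dataset (generated from y = 3x) split across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]

w = 0.0
for _ in range(200):
    grads = [local_gradient(w, shard) for shard in shards]
    w -= 0.01 * all_reduce_mean(grads)  # identical update on every replica

print(round(w, 3))  # converges toward the true weight, 3.0
```

Real frameworks (PyTorch DistributedDataParallel, Horovod) overlap this communication with computation, but the logic is the same averaging step.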

GPU Cluster Management

Training clusters are expensive.

A single NVIDIA A100 GPU costs roughly $3 per hour on AWS. A training job using 64 GPUs costs $192 per hour, $4,608 per day.

Efficient cluster management is critical.

  • Job schedulers like Kubernetes with GPU support, SLURM (common in research), and managed services (AWS SageMaker, Google Vertex AI, Azure ML) allocate GPU resources to training jobs, queue jobs when resources are busy, and preempt lower-priority jobs when higher-priority ones arrive.
  • Spot/preemptible instances reduce GPU costs by 60-90% but can be reclaimed at any time. Training frameworks need checkpointing: periodically saving the model state to persistent storage so that if the instance is reclaimed, training can resume from the last checkpoint rather than restarting from scratch.
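The checkpoint-and-resume pattern described above can be sketched with a simple training loop. The JSON checkpoint format and the stop_after preemption simulation are illustrative, not how any particular framework does it:

```python
# Checkpointing sketch: save training state periodically so a preempted spot
# instance can resume from the last checkpoint instead of restarting.
import json
import os
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.json")

def save_checkpoint(step, state):
    with open(CKPT, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "state": 0.0}

def train(total_steps, stop_after=None):
    """Run (or resume) training; stop_after simulates a spot preemption."""
    ckpt = load_checkpoint()
    step, state = ckpt["step"], ckpt["state"]
    while step < total_steps:
        state += 1.0          # stand-in for one optimization step
        step += 1
        if step % 10 == 0:    # checkpoint every 10 steps
            save_checkpoint(step, state)
        if stop_after is not None and step >= stop_after:
            return step, state  # "instance reclaimed"
    save_checkpoint(step, state)
    return step, state

if os.path.exists(CKPT):
    os.remove(CKPT)
train(100, stop_after=37)  # preempted at step 37; last checkpoint was step 30
step, state = train(100)   # resumes from step 30, not from zero
print(step, state)
```

The trade-off is checkpoint frequency: more frequent checkpoints waste less work on preemption but spend more time on storage I/O.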

Model Serving: Batch vs. Real-Time Inference

Once a model is trained, it needs to make predictions on new data.

How predictions are served depends on the latency requirements of the use case.

Batch Inference

Batch inference runs the model over a large dataset periodically and stores the predictions.

A recommendation system might score all users against all products overnight and store the results in a database.

When a user opens the app, their precomputed recommendations are read from the database with no model execution at request time.

Batch inference is simple and cost-effective.

The model runs on a schedule (hourly, daily), uses cheap compute resources (spot instances), and predictions are served with database-level latency (milliseconds).

The trade-off is freshness: predictions reflect the state of the data at the time of the batch run, not the current moment.

Real-Time Inference

Real-time inference executes the model on each request, producing predictions in milliseconds.

A fraud detection system scores each transaction as it arrives.

A search ranking model scores results for each query in real time.

A pricing model computes dynamic prices based on current demand.

Real-time inference requires a model serving infrastructure that is fast (predictions under 50ms for user-facing applications), scalable (handling thousands of requests per second), and reliable (always available with fallbacks if the model service is down).
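The fallback requirement can be sketched as a wrapper around the model call: if the model fails, the caller still gets a safe default instead of an error. The model function and the 0.5 default score here are hypothetical stand-ins:

```python
# Real-time serving sketch with a fallback: if the model call fails, serve a
# safe default rather than failing the request. The model is a stand-in.

def model_predict(features):
    if features.get("amount") is None:
        raise ValueError("missing feature")   # simulate a model/service failure
    return 0.9 if features["amount"] > 1000 else 0.1

def predict_with_fallback(features, default_score=0.5):
    """Never fail the caller: degrade to a default score on model errors."""
    try:
        return model_predict(features), "model"
    except Exception:
        return default_score, "fallback"

print(predict_with_fallback({"amount": 2500}))   # (0.9, 'model')
print(predict_with_fallback({"amount": None}))   # (0.5, 'fallback')
```

Production systems layer timeouts and circuit breakers on top of this, and the returned source tag ("model" vs. "fallback") feeds monitoring.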

Model serving frameworks include TensorFlow Serving (Google, serves TensorFlow models over gRPC/REST), TorchServe (PyTorch's serving solution), Triton Inference Server (NVIDIA, supports multiple frameworks, optimized for GPU inference), Seldon Core (Kubernetes-native model serving), and BentoML (framework-agnostic, packages models as containers).

Choosing Between Batch and Real-Time

Aspect               | Batch Inference                                 | Real-Time Inference
When predictions run | On a schedule (hourly/daily)                    | On each request
Latency              | Database read (milliseconds)                    | Model execution (10-100ms)
Freshness            | Hours old                                       | Current
Cost                 | Lower (scheduled compute)                       | Higher (always-on serving)
Complexity           | Lower                                           | Higher (serving infrastructure)
Best for             | Recommendations, reports, email personalization | Fraud detection, search ranking, dynamic pricing

Many systems use both.

Batch inference generates baseline recommendations (cheap, pre-computed).

Real-time inference adjusts the recommendations based on the user's current session behavior (responsive, current).

ML Model Versioning and Experiment Tracking (MLflow, Weights & Biases)

A production ML system produces dozens of model versions over its lifetime.

Each version used different data, different features, different hyperparameters, and different code.

Without a system for tracking these experiments and their outcomes, reproducing a model or understanding why it behaves differently from the previous version is nearly impossible.

What Experiment Tracking Records

For each training run, the tracking system records the code version (git commit hash), the dataset version (or a hash of the training data), all hyperparameters (learning rate, batch size, number of layers, regularization), the resulting metrics (accuracy, precision, recall, AUC, loss), the model artifact (the serialized model file), and any metadata (training duration, GPU type, notes).
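A run record like the one described above can be sketched as a plain dictionary; hashing the training data makes silent data changes detectable. The field names are illustrative, not MLflow's actual schema:

```python
# Experiment-tracking sketch: the record a tracker persists for one training
# run. Field names are illustrative, not any real tool's schema.
import hashlib
import json

def log_run(code_commit, data, params, metrics):
    data_hash = hashlib.sha256(
        json.dumps(data, sort_keys=True).encode()
    ).hexdigest()[:12]
    return {
        "code_version": code_commit,   # git commit hash
        "dataset_hash": data_hash,     # same data => same hash; drift is visible
        "params": params,
        "metrics": metrics,
    }

run = log_run(
    code_commit="a1b2c3d",
    data=[{"x": 1, "y": 0}, {"x": 2, "y": 1}],
    params={"learning_rate": 0.01, "batch_size": 32},
    metrics={"auc": 0.91, "loss": 0.23},
)
print(run["dataset_hash"])
```

Reproducing a model then means checking out the recorded commit, re-fetching the data that matches the hash, and rerunning with the recorded parameters.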

MLflow

MLflow (open source, by Databricks) is the most widely adopted experiment tracking platform.

It provides four components: Tracking (logging parameters, metrics, and artifacts for each run), Projects (packaging code for reproducible runs), Models (a standard format for packaging models with their dependencies), and Model Registry (managing model lifecycle stages: staging, production, archived).

A typical workflow: a data scientist runs an experiment locally. MLflow logs the parameters and metrics automatically.

The data scientist reviews the results in the MLflow UI, compares multiple runs side by side, picks the best model, and promotes it to the "production" stage in the Model Registry.

The serving infrastructure pulls the latest production model from the registry and deploys it.

Weights & Biases (W&B)

Weights & Biases is a managed platform for experiment tracking, visualization, and collaboration. It is particularly strong in visualization: interactive dashboards that show training curves, hyperparameter sweeps, and model comparisons.

W&B also provides dataset versioning, model evaluation tables, and integration with Jupyter notebooks.

W&B is popular in research and startups for its ease of use.

MLflow is more common in larger organizations that prefer open-source, self-hosted solutions.

Tool             | Hosting                             | Strengths                                   | Best For
MLflow           | Self-hosted or managed (Databricks) | Open source, model registry, broad adoption | Production ML systems, enterprise
Weights & Biases | Managed (cloud)                     | Visualization, collaboration, ease of use   | Research teams, startups

A/B Testing and Shadow Scoring for ML Models

Deploying a new ML model is not like deploying a new API endpoint.

A new model can subtly degrade predictions in ways that are not caught by offline metrics.

A model that scores 2% higher on accuracy in an offline evaluation might perform worse in production because of distributional differences between the evaluation dataset and real traffic.

A/B Testing for Models

A/B testing applies directly to ML models.

Deploy the new model alongside the current production model.

Route a percentage of traffic (5-10%) to the new model and the rest to the current model.

Measure the business outcome (click-through rate, conversion rate, revenue per user) for each group over a statistically significant period.

If the new model produces better outcomes, gradually roll it out to 100%.
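The traffic-routing step can be sketched with a common hash-based scheme: hashing the user ID gives each user a stable bucket, so the same user always sees the same variant across requests. The 10% treatment share is the example percentage from above:

```python
# A/B assignment sketch: hash the user ID into a bucket so each user is
# consistently routed to the same model variant across requests.
import hashlib

def assign_variant(user_id, treatment_pct=10):
    """Deterministically route treatment_pct% of users to the new model."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "new_model" if bucket < treatment_pct else "current_model"

assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share = assignments.count("new_model") / len(assignments)
print(f"{share:.1%} of users see the new model")  # close to 10%
```

Sticky assignment matters: if a user flipped between models on every request, their experience would be inconsistent and the measured outcomes would be contaminated.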

A/B testing measures the real business impact of a model change, not just technical metrics like accuracy or AUC.

A model might have better accuracy but worse user experience because it produces less diverse recommendations.

Shadow Scoring

Shadow scoring (also called shadow mode or dark launching) runs the new model on production traffic without serving its predictions to users. Both the current model and the new model receive the same inputs.

The current model's predictions are served to users. The new model's predictions are logged and compared to the current model's predictions offline.

Shadow scoring lets you evaluate the new model under real production conditions (real data distributions, real feature values, real traffic patterns) without any risk of degrading user experience.

You analyze the logged predictions to understand: how often the two models disagree, whether the new model's predictions are better (by replaying user outcomes), and whether the new model has unexpected behaviors (like always predicting the same category).
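The disagreement analysis can be sketched directly: both models scored the same requests, the shadow model's scores were only logged, and offline you count how often the two would have made different decisions. The scores and 0.5 threshold are toy values:

```python
# Shadow-scoring analysis sketch: compare logged shadow predictions against
# the predictions that were actually served.

def disagreement_rate(current_preds, shadow_preds, threshold=0.5):
    """Fraction of requests where the two models make different decisions."""
    flips = sum(
        (a >= threshold) != (b >= threshold)
        for a, b in zip(current_preds, shadow_preds)
    )
    return flips / len(current_preds)

current = [0.2, 0.8, 0.6, 0.1, 0.9]   # served to users
shadow  = [0.3, 0.7, 0.4, 0.1, 0.9]   # logged only, never shown
print(disagreement_rate(current, shadow))  # 0.2: they flip on one of five
```

A high disagreement rate is not automatically bad, but it tells you exactly which requests to inspect and replay against user outcomes.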

After shadow scoring confirms the new model is performing well, you transition to an A/B test for the final business impact validation, and then full deployment.

Recommendation System Architecture

Recommendation systems are one of the most common ML applications in production.

Every major platform uses them: product recommendations (Amazon), content recommendations (Netflix, YouTube), social connections (LinkedIn, Facebook), and music recommendations (Spotify).

Two-Stage Architecture

Production recommendation systems almost always use a two-stage pipeline: candidate generation followed by ranking.

Candidate generation

Candidate generation reduces the full catalog (millions of items) to a manageable set of candidates (hundreds).

This stage uses fast, simple models or heuristics: collaborative filtering (users who bought X also bought Y), content-based filtering (items similar to what you have interacted with), popularity-based candidates (trending items), and embedding-based retrieval (find items whose vector embeddings are nearest to the user's embedding in a vector database).

Candidate generation prioritizes recall (do not miss any good recommendations) over precision (it is okay to include some mediocre ones).

Speed matters: this stage must process millions of items in milliseconds using approximate nearest neighbor search (FAISS, Annoy, ScaNN) or precomputed candidate lists.
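Embedding-based retrieval can be sketched as exact top-k cosine similarity over a tiny catalog; FAISS, Annoy, and ScaNN answer the same query approximately over millions of items. The embeddings below are toy two-dimensional values:

```python
# Candidate-generation sketch: exact nearest-neighbor retrieval by cosine
# similarity. ANN indexes (FAISS, Annoy, ScaNN) approximate this at scale.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical item embeddings; in practice these come from a trained model.
item_embeddings = {
    "laptop":   [0.9, 0.1],
    "mouse":    [0.8, 0.2],
    "cookbook": [0.1, 0.9],
}

def top_k_candidates(user_embedding, k=2):
    scored = sorted(item_embeddings.items(),
                    key=lambda kv: cosine(user_embedding, kv[1]),
                    reverse=True)
    return [item for item, _ in scored[:k]]

# A user whose embedding points toward electronics retrieves electronics.
print(top_k_candidates([1.0, 0.0]))  # ['laptop', 'mouse']
```

This exhaustive scan is O(catalog size) per query, which is exactly why production systems trade a little recall for approximate indexes.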

Ranking

Ranking takes the candidate set (hundreds of items) and scores each one with a more sophisticated model.

A deep neural network might consider the user's full interaction history, the item's features, contextual signals (time of day, device type, location), and cross-features (how this user segment interacts with this item category).

The ranking model produces a score for each candidate, and the top-N highest-scored items are shown to the user.

Ranking prioritizes precision (the items shown should be genuinely relevant).

Since it only scores hundreds of candidates (not millions), it can afford to use a computationally expensive model.

Architecture Components

A production recommendation system includes a user profile store (user features, interaction history, preferences stored in a low-latency store like Redis or DynamoDB), an item catalog (item features, embeddings, metadata), a feature store (precomputed features for both users and items), a candidate generation service (retrieves candidates from multiple sources, deduplicates, filters unavailable items), a ranking service (runs the ranking model on candidates, applies business rules like diversity and freshness), and a re-ranking layer (applies final business rules: suppress items the user already purchased, ensure category diversity, mix in sponsored content).

Recommendation Pipeline


Search Ranking and Personalization Pipelines

Search ranking has an ML dimension: learning to rank.

Instead of relying solely on BM25 text relevance, production search systems use ML models that combine text relevance with dozens of additional signals to produce a personalized ranking.

Learning to Rank

Learning to rank (LTR) trains a model to order search results by relevance.

The training data consists of queries, candidate results, and relevance labels (either explicit ratings or implicit signals like click-through rates and dwell time).

Three approaches exist.

Pointwise predicts a relevance score for each query-document pair independently.

Pairwise predicts which of two documents is more relevant for a query.

Listwise optimizes the entire ranking of results for a query.

Listwise approaches (like LambdaMART and neural LTR models) generally produce the best results because they directly optimize the ranking metric (NDCG, MRR) rather than individual scores or pairs.
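NDCG, the metric listwise methods optimize, is small enough to compute by hand: DCG discounts each result's relevance by its rank position, and NDCG normalizes by the DCG of the ideal ordering. The relevance labels below are toy values:

```python
# NDCG sketch: discounted cumulative gain, normalized by the ideal ordering.
import math

def dcg(relevances):
    # Rank positions are 0-indexed here, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance labels of the results in the order the model ranked them.
print(round(ndcg([3, 2, 0, 1]), 3))  # close to 1, but penalized for the swap
print(ndcg([3, 2, 1, 0]))            # 1.0: the model produced the ideal order
```

The logarithmic discount encodes the intuition that getting position 1 right matters far more than getting position 10 right.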

Personalization

Personalization adjusts results based on who is searching.

A search for "apple" returns electronics for a tech enthusiast and recipes for a cooking enthusiast.

Personalization features include past search history, purchase history, browsing behavior, demographic information, and contextual signals (device, location, time).

The personalization pipeline typically adds user features to the ranking model's input.

The same ranking model scores documents, but the user features shift the scores based on individual preferences.

A user who frequently purchases premium products sees premium results ranked higher.

Offline/Online Pipeline

Personalization features are computed through a combination of offline pipelines (batch jobs that compute long-term user preferences, item popularity, and collaborative filtering scores) and online pipelines (real-time computation of session-level features like recently viewed items and current location).

The feature store bridges both, providing precomputed offline features and continuously updated online features to the ranking model at serving time.

Data Labeling and Annotation Platforms

Supervised machine learning requires labeled data.

A fraud detection model needs millions of transactions labeled as "fraud" or "not fraud."

An image classification model needs images labeled with their contents.

A sentiment analysis model needs text labeled as "positive," "negative," or "neutral."

The Labeling Challenge

High-quality labeled data is expensive and time-consuming to produce.

A single image might require a human annotator to draw bounding boxes around every object, taking 30 seconds to several minutes per image.

Labeling a million images for a computer vision project can cost hundreds of thousands of dollars.

Labeling Approaches

Manual Labeling

Manual labeling uses human annotators (internal teams or crowdsourced workers) to label data.

Crowdsourcing platforms like Amazon Mechanical Turk, Scale AI, and Labelbox provide workforces of annotators.

Quality control mechanisms include multiple annotators labeling the same item (majority vote determines the label), gold standard items (items with known correct labels mixed in to measure annotator accuracy), and inter-annotator agreement metrics (measuring how consistently different annotators label the same items).
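Two of those quality-control mechanisms, majority vote and gold-standard accuracy, can be sketched in a few lines. The labels and gold answers below are hypothetical:

```python
# Label quality-control sketch: majority vote across annotators, plus accuracy
# against gold-standard items to flag unreliable annotators.
from collections import Counter

def majority_label(labels):
    """The label most annotators chose for one item."""
    return Counter(labels).most_common(1)[0][0]

def annotator_accuracy(annotations, gold):
    """annotations: {item_id: label} from one annotator; gold: known answers."""
    scored = [item for item in gold if item in annotations]
    if not scored:
        return None
    correct = sum(annotations[item] == gold[item] for item in scored)
    return correct / len(scored)

print(majority_label(["fraud", "not_fraud", "fraud"]))        # fraud
print(annotator_accuracy({"a": "fraud", "b": "not_fraud"},
                         {"a": "fraud", "b": "fraud"}))       # 0.5
```

Annotators whose gold-standard accuracy falls below a threshold are retrained or removed, and their votes can be down-weighted in the majority.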

Semi-supervised Labeling

Semi-supervised labeling uses a model to pre-label data, then human annotators review and correct the labels.

This is dramatically faster than labeling from scratch because correcting a label takes seconds while creating one takes minutes.

As the model improves, the pre-labels become more accurate and human correction effort decreases.

Active Learning

Active learning selects the most informative unlabeled examples for human annotation.

Instead of labeling random samples, the system identifies examples where the model is most uncertain (near the decision boundary) and prioritizes those for human review.

This gets more model improvement per labeled example, reducing the total labeling cost.
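Uncertainty sampling, the most common active-learning strategy, can be sketched for a binary classifier: pick the unlabeled examples whose predicted probability sits closest to the 0.5 decision boundary. The predictions below are hypothetical:

```python
# Active-learning sketch: uncertainty sampling picks the unlabeled examples
# whose predicted probability is closest to the 0.5 decision boundary.

def most_uncertain(predictions, k=2):
    """predictions: {example_id: P(positive)}. Return the k least-certain ids."""
    return sorted(predictions, key=lambda ex: abs(predictions[ex] - 0.5))[:k]

preds = {"ex1": 0.98, "ex2": 0.52, "ex3": 0.10, "ex4": 0.45}
print(most_uncertain(preds))  # ['ex2', 'ex4']: send these for human labeling
```

The confidently classified examples (ex1, ex3) would teach the model little; the ambiguous ones near the boundary move it the most per label.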

Weak Supervision

Weak supervision generates noisy labels programmatically using heuristics, rules, and existing knowledge bases.

The Snorkel framework formalizes this approach using "labeling functions" that each provide a noisy label.

The framework combines multiple noisy labels into a single probabilistic label that is more accurate than any individual labeling function.

Annotation Platforms

Platform                      | Type               | Strengths                                          | Best For
Scale AI                      | Managed service    | High-quality labels, ML-assisted annotation        | Production ML, autonomous vehicles, NLP
Labelbox                      | Platform           | Collaboration, workflow management, model-assisted | Computer vision, document processing
Amazon SageMaker Ground Truth | AWS-managed        | AWS integration, active learning, crowdsourcing    | AWS-native ML workflows
Label Studio                  | Open source        | Self-hosted, customizable, multi-data-type         | Teams wanting full control, budget-conscious
Prodigy                       | Commercial (SpaCy) | Fast annotation, active learning, NLP-focused      | NLP tasks, small teams

Beginner Mistake to Avoid

New ML engineers sometimes focus entirely on model architecture and training techniques while neglecting data quality.

A sophisticated deep learning model trained on poorly labeled data will be outperformed by a simple logistic regression model trained on clean, accurately labeled data.

Invest in labeling quality, data validation, and feature engineering before investing in model complexity.

The saying "garbage in, garbage out" applies more forcefully to ML than to any other area of software engineering.

Interview-Style Question

Q: You are designing a product recommendation system for an e-commerce platform with 10 million products and 50 million users. Walk through your high-level architecture.

A: Two-stage architecture with offline and online components.

Offline: a nightly batch pipeline computes user embeddings and item embeddings using collaborative filtering on interaction data (views, purchases, ratings). Item features (category, price, brand, popularity) and user features (purchase history, browsing patterns, demographic segment) are computed and stored in a feature store.

Online: when a user opens the app, the candidate generation service retrieves candidates from four sources in parallel: nearest-neighbor lookup in the embedding space (using a vector database like Pinecone querying against the user's embedding), "users who bought X also bought Y" collaborative filtering (precomputed, stored in Redis), items from categories the user frequently browses (content-based, from the feature store), and trending items (popularity, refreshed hourly). The merged candidate set (~500 items after deduplication) passes to the ranking service, which runs a deep ranking model that combines user features, item features, and contextual features (time of day, device, location) to score each candidate. The top 20 items are re-ranked with business rules: suppress already-purchased items, ensure category diversity, and mix in sponsored products. Total serving latency target: under 200ms.

Lifecycle: the ranking model is retrained weekly using the latest interaction data. New model versions are validated through shadow scoring for 48 hours, then A/B tested for 1 to 2 weeks before full deployment. Monitoring tracks click-through rate, conversion rate, and coverage (percentage of catalog that gets recommended) to detect model degradation.

KEY TAKEAWAYS

  • The ML lifecycle is a continuous loop: data collection, training, serving, monitoring, and back to data when drift is detected. The model is the smallest piece; the infrastructure around it is the real engineering challenge.

  • Feature stores centralize feature computation, ensuring consistency between training (offline store) and serving (online store) and eliminating duplicated feature engineering across teams.

  • Distributed training (data parallelism, model parallelism) enables training on datasets and models that exceed single-machine capacity. GPU costs demand efficient scheduling and checkpointing.

  • Batch inference is cheap and simple for use cases that tolerate hours-old predictions. Real-time inference is necessary for latency-sensitive applications like fraud detection and search ranking.

  • Experiment tracking (MLflow, W&B) records every training run's parameters, metrics, and artifacts. It enables reproducibility and informed model selection.

  • Shadow scoring tests new models under real production conditions without user impact. A/B testing measures actual business impact before full deployment.

  • Recommendation systems use a two-stage pipeline: fast candidate generation (recall-oriented) followed by precise ranking (precision-oriented).

  • Data labeling quality matters more than model sophistication. Invest in labeling infrastructure, quality control, and active learning before investing in complex model architectures.