LLM Serving Architecture: GPU Management, Batching, KV Cache
Serving a large language model is fundamentally different from serving a traditional ML model.
A classification model takes an input and returns a prediction in one shot.
An LLM generates text token by token, with each token depending on all the tokens before it.
This sequential generation process is computationally expensive, memory-intensive, and introduces unique infrastructure challenges.
GPU Management
LLMs run on GPUs because the matrix multiplication operations at the core of transformer models are massively parallelizable on GPU architectures.
A 7-billion-parameter model requires approximately 14 GB of GPU memory in half-precision (FP16) format just to hold the model weights.
A 70-billion-parameter model needs about 140 GB, which exceeds the memory of any single GPU (the largest NVIDIA A100 has 80 GB).
Larger models must be split across multiple GPUs using tensor parallelism (splitting individual layers across GPUs) or pipeline parallelism (assigning different layers to different GPUs).
GPU utilization is the critical efficiency metric.
GPUs are expensive ($2-4 per hour for an A100 on cloud platforms), and an idle GPU is wasted money.
The challenge is that LLM inference involves alternating phases: the initial prompt processing phase (compute-heavy, high GPU utilization) and the token generation phase (memory-bandwidth-heavy, lower GPU utilization because each step generates only one token). Keeping GPUs busy across these phases requires batching.
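The memory figures above follow from a simple back-of-the-envelope calculation. This sketch covers only the weights; real deployments also need memory for activations and the KV cache:

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2) -> float:
    """Approximate GPU memory needed just to hold model weights.

    bytes_per_param: 2 for FP16/BF16, 1 for INT8, 0.5 for INT4.
    """
    return num_params * bytes_per_param / 1e9

# The figures from the text: 7B and 70B models in FP16.
print(weight_memory_gb(7e9))   # 14.0
print(weight_memory_gb(70e9))  # 140.0
```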
Batching Strategies
Without batching, the GPU processes one request at a time. While generating tokens for request A, the GPU sits partially idle because a single request does not fully utilize its compute capacity. Batching groups multiple requests together so the GPU processes them simultaneously.
- Static batching collects requests into fixed-size batches and processes them together. The problem is that different requests have different lengths. A batch waits until the longest request finishes generating, while shorter requests sit idle with their completed outputs. This wastes both GPU time and user-perceived latency for the shorter requests.
- Continuous batching (also called dynamic batching or iteration-level batching) solves this by inserting new requests into the batch as soon as a slot opens. When request A finishes generating, request D takes its place in the batch immediately without waiting for requests B and C to finish. The GPU stays busy, and no request waits longer than necessary.
- vLLM (developed at UC Berkeley) implements continuous batching along with PagedAttention, a memory management technique that eliminates wasted memory from fragmented KV caches. It is the most widely used open-source LLM serving engine. TensorRT-LLM (NVIDIA) optimizes inference specifically for NVIDIA GPUs with kernel fusion and quantization. Text Generation Inference (TGI) by Hugging Face provides a production-ready serving solution with batching and streaming support.
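The difference between static and continuous batching can be shown with a toy iteration-level scheduler. This is a deliberate simplification: it ignores the prefill phase, KV cache memory limits, and everything else a real engine must handle.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy iteration-level scheduler.

    Each request is (request_id, n_tokens) and needs n_tokens decode steps.
    New requests are admitted as soon as a slot frees up, instead of
    waiting for the whole batch to drain (static batching).
    Returns the step at which each request finished.
    """
    waiting = deque(requests)
    active = {}    # request_id -> tokens still to generate
    finished = {}
    step = 0
    while waiting or active:
        # Admit new requests into any free slots (the "continuous" part).
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        step += 1
        # One decode iteration: every active request generates one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished[rid] = step
    return finished

done = continuous_batching([("A", 2), ("B", 5), ("C", 5), ("D", 3)], max_batch=3)
print(done)  # D enters the batch as soon as A finishes at step 2
```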
KV Cache
During text generation, the transformer model computes attention over all previous tokens at each step.
Without optimization, generating the 100th token would recompute the attention keys and values for all 99 previous tokens, which is wasteful because those values have not changed.
The KV (Key-Value) cache stores the attention key and value tensors for all previously generated tokens.
When generating a new token, the model only computes the attention for the new token and looks up the cached values for all previous tokens.
This reduces the computation from O(n^2) to O(n) per token.
The trade-off is memory.
The KV cache grows with sequence length and batch size.
For a large model serving long sequences to many concurrent users, the KV cache can consume more GPU memory than the model weights themselves.
Managing KV cache memory efficiently is one of the primary engineering challenges in LLM serving.
PagedAttention (used by vLLM) addresses this by managing KV cache memory in pages, similar to how operating systems manage virtual memory, reducing fragmentation and waste.
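The claim that the KV cache can dwarf the weights follows from its size formula: two tensors (K and V) per layer, per attention head, per token, per sequence. The sketch below uses approximately Llama-2-7B-shaped dimensions as an illustrative assumption:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per token, per sequence."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem) / 1e9

# Roughly a 7B model (32 layers, 32 KV heads of dim 128) in FP16,
# serving 64 concurrent sequences of 4096 tokens:
print(round(kv_cache_gb(32, 32, 128, 4096, 64), 1))  # 137.4 GB, far more than the ~14 GB of weights
```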
| Serving Engine | Batching | KV Cache Optimization | Strengths |
|---|---|---|---|
| vLLM | Continuous batching | PagedAttention | High throughput, memory efficient, open source |
| TensorRT-LLM | In-flight (continuous) batching | Paged KV cache | NVIDIA-optimized, low latency |
| TGI (Hugging Face) | Continuous batching | Standard | Easy setup, Hugging Face integration |
RAG (Retrieval-Augmented Generation) Pipelines
LLMs have a knowledge cutoff date. They cannot answer questions about events that happened after their training data was collected. They also cannot access private data: your company's documentation, your customer records, or your product catalog.
RAG solves this by retrieving relevant information from external sources and injecting it into the LLM's prompt before generation.
How RAG Works
A RAG pipeline has three steps.
- Indexing (offline): your documents (knowledge base articles, product descriptions, internal wikis, PDF manuals) are split into chunks, each chunk is converted into a vector embedding using an embedding model, and the embeddings are stored in a vector database.
- Retrieval (at query time): when a user asks a question, the query is converted into an embedding using the same embedding model. The vector database performs a similarity search, finding the document chunks whose embeddings are closest to the query embedding. The top-K most relevant chunks are retrieved.
- Generation: the retrieved chunks are inserted into the LLM's prompt as context. The prompt might look like: "Based on the following information: [chunk 1] [chunk 2] [chunk 3]. Answer the user's question: [user query]." The LLM generates a response grounded in the provided context rather than relying solely on its training data.
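The query-time path can be sketched in a few lines. The `embed`, `vector_db`, and `llm` parameters here are hypothetical stand-ins for a real embedding model, vector database client, and LLM API:

```python
def answer_with_rag(query: str, embed, vector_db, llm, top_k: int = 3) -> str:
    """Retrieval step then generation step, as described above."""
    # Retrieval: embed the query with the same model used at indexing time.
    query_vec = embed(query)
    chunks = vector_db.search(query_vec, top_k=top_k)

    # Generation: inject the retrieved chunks as context.
    context = "\n".join(f"[chunk {i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        f"Based on the following information:\n{context}\n"
        f"Answer the user's question: {query}"
    )
    return llm(prompt)
```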
RAG Architecture Components
A production RAG system includes:
- A document ingestion pipeline: watching for new or updated documents, chunking them, generating embeddings, and storing them in the vector database.
- A chunk processing layer: cleaning text and splitting it into appropriately sized chunks, with overlap to preserve context across chunk boundaries.
- An embedding service: calling an embedding model like OpenAI's text-embedding-3-small, Cohere's embed, or a self-hosted model.
- A vector database (Pinecone, Weaviate, Milvus, pgvector, Qdrant) for storing and searching embeddings.
- A retrieval service: query embedding, similarity search, and optional re-ranking of results.
- An LLM service: constructing the prompt with retrieved context and generating the response.
RAG Challenges
Chunking strategy directly affects retrieval quality.
Chunks that are too small lose context.
Chunks that are too large dilute the relevant information with noise.
Typical chunk sizes range from 256 to 1024 tokens with 10-20% overlap between adjacent chunks.
Some systems use semantic chunking (splitting at paragraph or section boundaries) rather than fixed-size splits.
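Fixed-size chunking with overlap is simple enough to sketch directly. This version operates on a pre-tokenized list; a real pipeline would tokenize with the embedding model's own tokenizer:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that overlap by `overlap` tokens."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already covers the tail
    return chunks

# Tiny example: 10 "tokens", chunks of 4 with 1 token of overlap.
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```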
Retrieval quality determines the answer quality. If the retriever returns irrelevant chunks, the LLM generates hallucinated or incorrect answers confidently. Hybrid retrieval (combining vector similarity search with keyword search like BM25) often produces better results than either approach alone.
Context window management becomes critical when the retrieved context is large. Stuffing too many chunks into the prompt exceeds the LLM's context window or degrades its ability to focus on the most relevant information. A re-ranking step (using a cross-encoder model) scores each retrieved chunk against the query and selects only the most relevant ones for the prompt.
Vector Search and Embedding Infrastructure
Vector search is the retrieval engine that powers RAG, recommendation systems, similarity search, and semantic search.
It finds items (documents, products, images) that are semantically similar to a query, even when they share no keywords.
Embeddings
An embedding is a dense numerical vector (typically 256 to 1536 dimensions) that captures the semantic meaning of a piece of content.
Similar content produces similar vectors. "How to reset my password" and "I forgot my login credentials" have different words but their embeddings are close in vector space because they express the same intent.
Embedding models include OpenAI text-embedding-3-small/large, Cohere embed, Google Gecko, and open-source models like BGE, E5, and GTE (which can be self-hosted for cost control and data privacy).
Choosing an embedding model involves trade-offs between quality (how well the embeddings capture semantic nuance), dimensionality (higher dimensions capture more information but consume more storage and memory), latency (how quickly the model produces an embedding), and cost (API-based models charge per token, self-hosted models have infrastructure costs).
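"Close in vector space" is typically measured with cosine similarity, which compares vector directions regardless of magnitude. A minimal implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings": the first pair point the same way.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # close to 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```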
Vector Databases
Vector databases are optimized for storing millions to billions of vectors and performing approximate nearest neighbor (ANN) search in milliseconds.
ANN algorithms (HNSW, IVF, ScaNN) trade a small amount of recall accuracy for dramatic speed improvements over exact nearest neighbor search.
- Pinecone is a fully managed vector database. You send vectors, query for nearest neighbors, and Pinecone handles indexing, scaling, and infrastructure. It is the simplest option for teams that want zero operational overhead.
- Weaviate is an open-source vector database with built-in modules for generating embeddings, hybrid search (combining vector and keyword search), and multi-modal data (text, images). It can be self-hosted or used as a managed service.
- Milvus is an open-source vector database designed for massive scale (billions of vectors). It supports multiple ANN index types and can be deployed on Kubernetes. Zilliz Cloud offers a managed version.
- pgvector is a PostgreSQL extension that adds vector storage and similarity search to an existing PostgreSQL database. It is the most practical option when your vectors need to live alongside relational data and the vector count is in the low millions. Beyond that scale, a dedicated vector database performs better.
- Qdrant is an open-source vector database written in Rust, offering high performance and a rich filtering API for combining vector search with metadata filters.
| Database | Type | Scale | Strengths | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Billions | Zero ops, simple API | Teams wanting managed infrastructure |
| Weaviate | Open source / managed | Millions to billions | Hybrid search, multi-modal | Full-featured search applications |
| Milvus | Open source / managed | Billions | Massive scale, multiple index types | Large-scale similarity search |
| pgvector | PostgreSQL extension | Low millions | No new infrastructure, SQL integration | Small-to-medium RAG, relational+vector |
| Qdrant | Open source / managed | Billions | Filtering, Rust performance | Filtered vector search, high performance |
Prompt Engineering and Prompt Management Systems
The prompt is the interface between your application and the LLM.
A well-crafted prompt can mean the difference between a useful, accurate response and a vague, incorrect one.
In production systems, prompts are not ad hoc strings. They are managed artifacts with versioning, testing, and deployment processes.
Prompt Engineering Techniques
System prompts set the LLM's behavior, persona, and constraints. "You are a customer service agent for an electronics company. Answer questions about products. If you do not know the answer, say so. Never discuss competitor products."
Few-shot examples include sample inputs and desired outputs in the prompt, teaching the LLM the expected format and behavior by demonstration. This is especially effective for structured outputs, classification tasks, and domain-specific formatting.
Chain of thought asks the LLM to reason step by step before giving a final answer. "Think through this step by step before providing your answer." This improves accuracy on complex reasoning tasks.
Output formatting instructions specify the exact response format. "Respond in JSON with fields: answer (string), confidence (float 0-1), sources (array of strings)." Structured output is essential when the LLM's response is consumed by application code rather than displayed to a user.
Prompt Management Systems
In production, prompts are not hardcoded in application code. They are managed as separate artifacts with their own lifecycle.
A prompt management system stores prompt templates with variables (e.g., "Answer the following question based on this context: {{context}}
Question: {{question}}"), versions each template so you can roll back to a previous version if a new prompt performs worse, associates each version with evaluation results (accuracy, hallucination rate, latency), enables A/B testing between prompt versions, and provides an interface for non-engineers (product managers, domain experts) to iterate on prompts without code deployments.
Tools for prompt management include LangSmith (by LangChain), PromptLayer, Humanloop, and Weights & Biases Prompts.
Many teams build internal prompt management systems using a simple database of versioned prompt templates with an evaluation pipeline.
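The core of such an internal system is small. This is a minimal in-memory sketch of versioned templates with variable substitution; real systems (and tools like LangSmith or PromptLayer) add persistence, evaluation results, A/B testing, and access control:

```python
class PromptStore:
    """Versioned prompt templates with {{variable}} substitution."""

    def __init__(self):
        self._versions = {}  # name -> list of templates (index = version - 1)

    def publish(self, name: str, template: str) -> int:
        """Store a new version of a template; returns its 1-based version number."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])

    def render(self, name: str, version: int = -1, **variables) -> str:
        """Render a version (-1 = latest), filling in {{key}} placeholders."""
        templates = self._versions[name]
        template = templates[version if version == -1 else version - 1]
        for key, value in variables.items():
            template = template.replace("{{" + key + "}}", value)
        return template

store = PromptStore()
store.publish("qa", "Answer based on this context: {{context}}\nQuestion: {{question}}")
prompt = store.render("qa", context="The sky is blue.", question="What color is the sky?")
```

Because versions are never overwritten, rolling back a bad prompt is just rendering an earlier version number.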
Fine-Tuning Infrastructure and RLHF Pipelines
Base LLMs are general-purpose. They know a lot about many topics but are not specialized for any particular domain or task.
Fine-tuning adapts a general model to your specific needs.
Fine-Tuning Approaches
Full fine-tuning updates all model parameters on your dataset. This produces the most customized model but requires significant GPU resources (equivalent to training the model from scratch on your data) and risks catastrophic forgetting (the model loses general capabilities while learning specific ones).
LoRA (Low-Rank Adaptation) adds small trainable matrices to the model's attention layers while keeping the original weights frozen. Only the LoRA matrices are updated during training. This reduces GPU memory requirements by 60-80% and training time proportionally. The LoRA weights are small (often just a few hundred MB) and can be swapped in and out at serving time, enabling one base model to serve multiple fine-tuned variants.
QLoRA combines LoRA with quantization (reducing model weight precision from 16-bit to 4-bit). This further reduces memory requirements, enabling fine-tuning of a 70-billion-parameter model on a single GPU.
RLHF (Reinforcement Learning from Human Feedback)
RLHF is the technique that aligns LLMs with human preferences. It was a key ingredient in making ChatGPT significantly more helpful and less harmful than the base GPT model.
The RLHF pipeline has three stages.
- First, supervised fine-tuning (SFT): the base model is fine-tuned on a dataset of high-quality prompt-response pairs.
- Second, reward model training: human evaluators rank multiple model responses to the same prompt from best to worst. A reward model learns to predict which responses humans prefer.
- Third, reinforcement learning: the SFT model generates responses, the reward model scores them, and an RL algorithm (typically PPO) updates the model to produce responses that score higher. DPO (Direct Preference Optimization) is a widely used alternative that skips the reward model and RL loop entirely, optimizing the model directly on the preference pairs.
RLHF infrastructure requires a human evaluation pipeline (collecting preference data from annotators), reward model training infrastructure (a separate model trained on preference data), RL training infrastructure (more complex than standard fine-tuning because it involves generating responses, scoring them, and updating the model in a loop), and quality assurance (monitoring for reward hacking, where the model learns to exploit the reward model rather than genuinely improving).
LLM Gateway and Model Routing
As organizations adopt multiple LLMs (GPT-4 for complex reasoning, a smaller model for simple tasks, a fine-tuned model for domain-specific queries, an open-source model for cost-sensitive workloads), they need a layer that routes requests to the right model.
What an LLM Gateway Does
An LLM gateway sits between your application and the LLM providers.
It handles routing (sending requests to the appropriate model based on task type, cost constraints, or model availability), failover (if one provider is down or slow, automatically routing to a backup), rate limiting (enforcing per-team or per-application usage quotas), cost tracking (logging tokens consumed per request, per team, per model), caching (storing responses for identical prompts to avoid redundant API calls), and observability (logging prompts, responses, latencies, and token counts for debugging and analysis).
Model Routing Strategies
Task-based routing sends different types of requests to different models. Simple classification or summarization tasks go to a smaller, cheaper model (GPT-4o-mini, Claude Haiku). Complex multi-step reasoning goes to a larger model (GPT-4, Claude Opus). The router classifies the task and selects the model accordingly.
Cost-optimized routing starts with the cheapest model and escalates to more expensive ones only if the cheap model's confidence is low or the response quality is insufficient. A request first goes to a small model. If the response meets quality thresholds, it is returned. If not, the request is forwarded to a larger model.
Latency-based routing sends time-sensitive requests to the fastest available model and tolerates higher latency for batch or background tasks.
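The escalation pattern behind cost-optimized routing fits in a few lines. Here `models` is an ordered list of (name, callable) pairs from cheapest to most expensive, and `good_enough` is a hypothetical quality check (in practice a confidence score or a classifier):

```python
def route(prompt, models, good_enough):
    """Try models cheapest-first; escalate while responses fail the quality check."""
    for name, call in models[:-1]:
        response = call(prompt)
        if good_enough(response):
            return name, response
    # Fall through to the most capable (and most expensive) model.
    name, call = models[-1]
    return name, call(prompt)

models = [
    ("small", lambda p: "maybe?"),             # cheap model, weak answer here
    ("large", lambda p: "a detailed answer"),  # expensive fallback
]
name, response = route("explain the policy", models, good_enough=lambda r: len(r) > 10)
print(name)  # escalates to "large" because the cheap response fails the check
```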
Tools like LiteLLM, Portkey, and Martian provide LLM gateway functionality. Many teams build custom gateways using a simple proxy that wraps provider APIs with routing logic, caching, and monitoring.
Cost Optimization for LLM Inference
LLM inference is expensive. A GPT-4-class model can cost $10-30 per million input tokens and $30-60 per million output tokens. At scale (millions of requests per day), this becomes a significant operational cost that requires active optimization.
Optimization Techniques
Model selection: Use the smallest model that meets quality requirements. If 80% of your requests can be handled adequately by a model that costs 10x less than GPT-4, route those requests to the cheaper model (see the model routing strategies above).
Prompt optimization: Shorter prompts cost fewer tokens. Remove unnecessary instructions, reduce verbose system prompts, and compress context. A RAG prompt that includes 5 chunks instead of 10 costs half the input tokens with potentially equivalent quality if the top 5 chunks are sufficiently relevant.
Caching: Identical or semantically similar prompts should not be sent to the LLM repeatedly. Cache responses for exact prompt matches (simple key-value cache). For semantically similar queries, use embedding-based caching: if a new query's embedding is very close to a cached query's embedding, return the cached response.
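Embedding-based caching can be sketched with a linear scan over cached entries. The `embed` parameter is a hypothetical embedding function; a production version would call a real embedding model and use an ANN index instead of a scan:

```python
import math

class SemanticCache:
    """Return a cached response when a query embedding is close enough to a cached one."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    def get(self, query):
        vec = self.embed(query)
        for cached_vec, response in self.entries:
            if self._cosine(vec, cached_vec) >= self.threshold:
                return response  # cache hit: the LLM call is skipped
        return None  # cache miss: caller queries the LLM, then calls put()

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Toy embedding function for illustration only.
embed_stub = lambda s: [1.0, 0.0] if "password" in s else [0.0, 1.0]
cache = SemanticCache(embed_stub)
cache.put("how do I reset my password", "Go to Settings > Security.")
print(cache.get("forgot my password"))  # hit: similar query, cached response returned
```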
Quantization: For self-hosted models, reducing weight precision from FP16 to INT8 or INT4 halves or quarters the memory requirements and increases throughput, often with minimal quality degradation. GPTQ, AWQ, and bitsandbytes are popular quantization methods.
Batching: Processing multiple requests in a single batch increases GPU utilization and reduces per-request cost (covered in the serving architecture section above).
Output length control: Set max_tokens to the minimum necessary for the use case. A classification task needs 10 tokens, not the default 4096. Shorter outputs generate faster and cost less.
| Technique | Savings | Effort | Quality Impact |
|---|---|---|---|
| Model selection (smaller model) | 5-20x | Low (routing logic) | Moderate (task-dependent) |
| Prompt shortening | 20-50% input cost | Low | Minimal if done carefully |
| Response caching | Up to 90% for repeated queries | Medium | None (exact match) to low (semantic) |
| Quantization (self-hosted) | 2-4x compute efficiency | Medium | Low for INT8, moderate for INT4 |
| Output length limits | Proportional to reduction | Low | None if limits are appropriate |
AI Agent Architectures and Tool-Use Systems
An AI agent is an LLM-powered system that can take actions, not just generate text.
Instead of answering "What is the weather in Tokyo?" with text from training data, an agent calls a weather API, gets the current temperature, and returns a factual, up-to-date answer.
How Tool Use Works
The LLM receives a list of available tools (functions) with descriptions of what each tool does and what parameters it accepts.
When the LLM determines it needs external information or needs to perform an action, it generates a structured tool call (specifying the tool name and parameters) instead of generating text.
The system executes the tool call, returns the result to the LLM, and the LLM generates its final response incorporating the tool's output.
A customer service agent might have tools for looking up order status (input: order ID, output: status and tracking number), initiating a return (input: order ID, reason), checking inventory (input: product ID, output: stock count), and escalating to a human agent (input: conversation summary).
Agent Architectures
ReAct (Reasoning + Acting) is the most common agent pattern. The LLM alternates between reasoning (thinking about what to do next) and acting (calling a tool). The loop continues until the LLM has enough information to provide a final answer.
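The ReAct loop reduces to: ask the LLM what to do, execute any requested tool, feed the observation back, repeat. In this sketch `llm_step` is a hypothetical function returning either a tool call or a final answer (here scripted with canned outputs rather than a real model):

```python
def react_loop(question, llm_step, tools, max_steps=5):
    """Alternate reasoning (llm_step) and acting (tool execution) until an answer."""
    history = [("question", question)]
    for _ in range(max_steps):
        decision = llm_step(history)
        if "answer" in decision:
            return decision["answer"]
        # Act: execute the requested tool and feed the observation back.
        result = tools[decision["tool"]](**decision["args"])
        history.append(("observation", result))
    raise RuntimeError("agent did not converge within max_steps")

# Scripted demo: the "LLM" first requests a tool, then answers.
_script = iter([
    {"tool": "get_weather", "args": {"city": "Tokyo"}},
    {"answer": "It is currently 18C in Tokyo."},
])
answer = react_loop(
    "What is the weather in Tokyo?",
    llm_step=lambda history: next(_script),
    tools={"get_weather": lambda city: f"{city}: 18C"},
)
print(answer)
```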
Multi-agent systems use multiple specialized LLM agents that collaborate. A research agent gathers information. An analysis agent evaluates it. A writing agent produces the final output. Each agent has its own system prompt, tools, and expertise. A coordinator routes tasks between agents.
Plan-and-execute agents separate planning from execution. The planner LLM creates a step-by-step plan. An executor processes each step (calling tools, making sub-queries). This produces more reliable results for complex, multi-step tasks because the full plan is visible and can be reviewed before execution begins.
Agent Frameworks
LangChain and LlamaIndex provide abstractions for building agents with tool use, RAG, and chaining. CrewAI focuses on multi-agent collaboration. AutoGen (Microsoft) provides a framework for conversational agents. These frameworks handle the mechanics of tool calling, result parsing, and multi-step reasoning loops.
Guardrails, Safety, and Content Filtering
LLMs can generate harmful, biased, incorrect, or off-topic content.
In production systems, guardrails prevent the model from producing outputs that violate your policies, harm users, or expose the company to legal risk.
Input Guardrails
Input guardrails filter or modify user inputs before they reach the LLM.
They detect and block prompt injection attacks (where a user crafts an input that tries to override the system prompt), filter personally identifiable information (redacting credit card numbers, social security numbers, or addresses from prompts), reject off-topic inputs (a customer service bot should not answer questions about building explosives), and enforce input length limits (preventing excessively long prompts that increase cost and latency).
Output Guardrails
Output guardrails inspect the LLM's response before it reaches the user.
They detect and filter toxic, offensive, or harmful content, verify factual claims against a knowledge base (reducing hallucinations), enforce response format compliance (ensuring JSON responses are valid, ensuring the model does not include information it was instructed to withhold), block PII leakage (ensuring the model does not expose sensitive data from its context window), and apply brand and tone guidelines (ensuring responses match the company's voice).
Implementation Approaches
Classifier-based guardrails use a separate ML model to classify inputs and outputs. An input classifier detects harmful prompts. An output classifier detects toxic responses. These classifiers run in milliseconds and add minimal latency.
Rule-based guardrails use regex patterns, keyword lists, and structural checks. They are fast and deterministic but brittle (easily circumvented by creative prompting).
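A couple of illustrative rule-based filters make both the strength (fast, deterministic) and the brittleness concrete. These regexes are simplified examples; production PII detection needs far more comprehensive patterns:

```python
import re

PII_PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # 13-16 digits, optional separators
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # US SSN format
}

def redact_pii(text: str) -> str:
    """Replace any matched PII pattern with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text

print(redact_pii("My card is 4111 1111 1111 1111 and SSN 123-45-6789."))
```

A user writing digits as words ("four one one one...") slips straight past these rules, which is exactly why rule-based checks are layered with classifiers.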
LLM-based guardrails use a second, cheaper LLM to evaluate the primary LLM's output. "Does this response contain harmful content? Does it answer a question outside the allowed scope? Does it reveal system prompt instructions?" This is more nuanced than classifiers but adds latency and cost.
NVIDIA NeMo Guardrails, Guardrails AI, and LangChain's output parsers provide frameworks for implementing guardrails. Many teams combine all three approaches: rule-based filters for obvious violations, classifiers for nuanced content detection, and LLM-based checks for complex policy enforcement.
Evaluation and Benchmarking Systems for LLMs
How do you know if your LLM system is working well?
Traditional ML has clear metrics (accuracy, precision, recall).
LLM evaluation is harder because "good" is subjective and task-dependent.
A response might be factually correct but poorly formatted, or well-written but slightly off-topic.
Evaluation Approaches
Automated metrics provide quick, scalable evaluation. BLEU and ROUGE scores measure text similarity between generated responses and reference answers (useful for translation and summarization but not for open-ended generation). Perplexity measures how "surprised" the model is by text (lower is better, but low perplexity does not guarantee useful responses). Task-specific metrics (exact match for Q&A, code execution pass rate for coding) are more meaningful when applicable.
LLM-as-judge uses a strong LLM (like GPT-4) to evaluate the output of the system being tested. You provide the judge with the prompt, the generated response, and evaluation criteria ("Rate this response from 1-5 on accuracy, helpfulness, and clarity"). The judge produces a score and explanation. This scales better than human evaluation and correlates reasonably well with human judgments for many tasks.
Human evaluation remains the gold standard. Human evaluators rate responses on quality dimensions relevant to the application: accuracy, helpfulness, safety, fluency, and adherence to instructions. Human evaluation is expensive and slow but catches nuances that automated metrics and LLM judges miss.
Evaluation Pipeline
A production evaluation pipeline runs automatically when a prompt changes, a model is updated, or a RAG knowledge base is modified.
It maintains a curated set of test cases (queries with expected or reference answers), runs the system against all test cases, computes automated metrics and LLM-judge scores, compares results against the previous version, and flags regressions for human review.
This is the LLM equivalent of a regression test suite.
Every change to any component (prompt, model, retrieval, guardrails) is evaluated against the same test cases to detect degradation before it reaches users.
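The regression-detection logic above can be sketched directly. Here `system` and `judge` are hypothetical callables: `system(query)` returns a response and `judge(query, response, reference)` returns a score in [0, 1] (in practice an LLM-as-judge call):

```python
def evaluate(system, judge, test_cases, baseline_scores=None, tolerance=0.05):
    """Score every test case and flag cases that dropped versus the previous version."""
    scores = {}
    regressions = []
    for case in test_cases:
        response = system(case["query"])
        score = judge(case["query"], response, case["reference"])
        scores[case["id"]] = score
        if baseline_scores and score < baseline_scores.get(case["id"], 0) - tolerance:
            regressions.append(case["id"])
    return scores, regressions

# Demo with stub components: the new system answers q1 correctly but
# regresses on q2 relative to the previous version's scores.
cases = [
    {"id": "q1", "query": "a", "reference": "A"},
    {"id": "q2", "query": "b", "reference": "B"},
]
scores, regressions = evaluate(
    system=lambda q: "A",  # stub system that always answers "A"
    judge=lambda q, r, ref: 1.0 if r == ref else 0.2,
    test_cases=cases,
    baseline_scores={"q1": 1.0, "q2": 1.0},
)
print(regressions)  # flags q2 for human review
```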
Benchmarking
Public benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA) measure general model capabilities and enable comparison across models.
But public benchmarks do not tell you how a model performs on your specific task with your specific data.
Build your own benchmark dataset that reflects your actual use cases: real user queries, edge cases discovered in production, and adversarial examples designed to test guardrails.
Beginner Mistake to Avoid
New engineers sometimes evaluate their LLM system using a handful of manually tested queries and declare it "good enough."
This is like testing a web application by visiting three pages and shipping to production. LLM behavior is highly variable.
A model that handles 50 test queries perfectly might fail on the 51st in a way that is embarrassing or harmful.
Build a comprehensive evaluation dataset (200+ queries covering normal cases, edge cases, and adversarial inputs), automate the evaluation, and run it on every change.
Interview-Style Question
Q: You are building a customer support chatbot for a financial services company. The chatbot should answer questions about account balances, transactions, and policies using the company's internal documentation. How would you design this system?
A: RAG architecture with strong guardrails.
- Knowledge base: index the company's policy documents, FAQ articles, and product guides in a vector database (Pinecone or pgvector). Use a chunking strategy that respects document structure (split at section boundaries, not mid-sentence).
- Retrieval: when a user asks a question, embed the query, retrieve the top 5 relevant chunks using hybrid search (vector + keyword), and re-rank with a cross-encoder.
- Serving: construct a prompt with the system instructions ("You are a financial services assistant. Answer only based on the provided context. If the context does not contain the answer, say you do not know. Never provide specific financial advice."), the retrieved context, and the user's query. Route to an appropriate model: simple FAQ questions go to a smaller model (cost-efficient), complex policy questions go to a larger model (better reasoning).
- Guardrails: input guardrails filter PII from user messages and block prompt injection attempts. Output guardrails verify the response does not contain financial advice (legal risk), does not hallucinate information absent from the retrieved context, and does not reveal system prompt instructions.
- Account data: for account-specific queries (balance, transactions), the agent uses tool calls to the banking API (authenticated with the user's session token), never exposing raw database data in the prompt.
- Evaluation and monitoring: a dataset of 500 test queries covering common questions, edge cases, and adversarial inputs. Automated evaluation uses LLM-as-judge for quality and a classifier for safety violations. Human evaluation reviews a sample weekly. Monitoring tracks response quality scores, hallucination rate, user satisfaction (thumbs up/down), and escalation rate to human agents. The knowledge base is re-indexed when new documentation is added, and the full evaluation suite runs whenever prompts or models change.
KEY TAKEAWAYS
- LLM serving requires GPU management, continuous batching, and KV cache optimization. vLLM and TensorRT-LLM are the leading serving engines for throughput-efficient inference.
- RAG grounds LLM responses in external knowledge by retrieving relevant document chunks and injecting them into the prompt. Chunking strategy and retrieval quality determine answer quality.
- Vector databases (Pinecone, Weaviate, Milvus, pgvector) power the similarity search behind RAG, recommendations, and semantic search.
- Prompt management treats prompts as versioned artifacts with testing, evaluation, and deployment processes, not as hardcoded strings.
- LoRA and QLoRA enable efficient fine-tuning without full model retraining. RLHF aligns models with human preferences through a reward model and reinforcement learning.
- LLM gateways route requests to the right model based on task complexity, cost, and latency requirements. They handle failover, caching, and cost tracking.
- Cost optimization combines model selection, prompt shortening, caching, quantization, and output length control. Using the smallest adequate model for each task is the highest-impact optimization.
- AI agents extend LLMs with tool use, enabling real actions (API calls, database queries) alongside text generation. ReAct and plan-and-execute are the primary agent patterns.
- Guardrails (input and output) prevent harmful, incorrect, or off-policy content. Layer rule-based, classifier-based, and LLM-based checks for comprehensive safety.
- Evaluation combines automated metrics, LLM-as-judge, and human evaluation. Build a comprehensive test dataset and run evaluations automatically on every system change.