A comprehensive glossary of system design terms referenced throughout this handbook.
Terms are organized alphabetically.
For detailed explanations, see the referenced chapters.
A/B Testing. Running controlled experiments with two (or more) variants to determine which performs better on a target metric. (Chapter VI)
ABAC (Attribute-Based Access Control). Authorization model that evaluates permissions based on attributes of the user, resource, action, and environment. (Chapter IV)
ACID. Atomicity, Consistency, Isolation, Durability. The four guarantees provided by relational database transactions. (Chapter II)
Active-Active. Redundancy configuration where all instances handle traffic simultaneously. (Chapter III)
Active-Passive. Redundancy configuration where one instance handles traffic while others stand by as backups. (Chapter III)
Adaptive Bitrate Streaming (ABR). Video delivery technique where the player switches quality levels based on current bandwidth. (Chapter VII)
Ambassador Pattern. A proxy that handles outbound communication complexity for a service. (Chapter IV)
Anti-Corruption Layer (ACL). A translation boundary between your system and an external system's data model. (Chapter IV)
API Gateway. A single entry point for all client requests that handles routing, authentication, rate limiting, and transformation. (Chapter II)
Auto-Scaling. Automatically adjusting the number of server instances based on real-time metrics like CPU utilization or request rate. (Chapter III)
Availability. The percentage of time a system is operational and accessible. Commonly expressed in "nines" (for example, 99.9% or 99.99%). (Chapter III)
Backpressure. Mechanisms that prevent a system from being overwhelmed by slowing down producers when consumers cannot keep up. (Chapter II)
BASE. Basically Available, Soft state, Eventually consistent. The consistency model favored by many NoSQL databases, which relax strict consistency in exchange for availability. (Chapter II)
BFT (Byzantine Fault Tolerance). Consensus that tolerates fewer than one-third of participants being faulty or malicious. (Chapter VII)
Bloom Filter. A probabilistic data structure that can tell you "definitely not in the set" or "probably in the set." Used for cache penetration prevention. (Chapter II)
Blue-Green Deployment. Maintaining two identical production environments and switching traffic between them for zero-downtime deployments. (Chapter VI)
BM25. The standard relevance scoring algorithm for full-text search, improving on TF-IDF with term frequency saturation and document length normalization. (Chapter IV)
Bounded Context. A Domain-Driven Design (DDD) concept defining a boundary within which a specific model applies. Used to determine microservice boundaries. (Chapter III)
Bulkhead Pattern. Isolating critical resources into separate partitions so that failure in one does not exhaust resources used by others. (Chapter III)
Cache Aside (Lazy Loading). Caching strategy where the application checks the cache, falls back to the database on miss, and populates the cache afterward. (Chapter II)
Cache Stampede. When a popular cache entry expires and many requests simultaneously hit the database to rebuild it. (Chapter II)
Canary Deployment. Routing a small percentage of traffic to a new version to detect issues before full rollout. (Chapter VI)
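CAP Theorem. A distributed system can provide at most two of three: Consistency, Availability, Partition Tolerance. Because network partitions cannot be ruled out, the practical choice is between consistency and availability while a partition lasts. (Chapter III)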
CDC (Change Data Capture). Capturing database changes from the transaction log and publishing them as events. (Chapter VII)
CDN (Content Delivery Network). A geographically distributed network of servers that caches content close to users. (Chapter II)
Chaos Engineering. Deliberately injecting failures into a system to discover weaknesses. (Chapter III)
Circuit Breaker. A pattern that stops calling a failing service, returning fallback responses until the service recovers. (Chapter III)
Consistent Hashing. A hashing technique where adding or removing nodes only redistributes a small fraction of keys. (Chapter II)
CQRS (Command Query Responsibility Segregation). Using separate models for read and write operations. (Chapter III)
CRDT (Conflict-free Replicated Data Type). Data structures that merge automatically without conflicts across replicas. (Chapter IV)
Data Lake. A centralized repository storing raw data in any format on cheap object storage. (Chapter VII)
Data Lakehouse. Architecture combining data lake flexibility with data warehouse performance using open table formats. (Chapter VII)
Dead Letter Queue (DLQ). A secondary queue where unprocessable messages are sent after failed processing attempts. (Chapter II)
Denormalization. Deliberately introducing data redundancy to improve read performance. (Chapter II)
Distributed Lock. A mechanism ensuring exclusive access to a shared resource across multiple nodes. (Chapter III)
DNS (Domain Name System). Translates domain names into IP addresses. (Chapter II)
Edge Function. Code that executes at CDN edge servers, close to the user. (Chapter II)
Embedding. A dense numerical vector capturing the semantic meaning of content, used in ML and search. (Chapter VII)
Error Budget. The amount of unreliability (downtime or failed requests) that your SLO permits, that is, 100% minus the SLO target. Balances reliability investment with feature development velocity. (Chapter III)
ETL (Extract, Transform, Load). A data pipeline pattern for moving data between systems with transformation. (Chapter IV)
Event Sourcing. Storing every state change as an immutable event, deriving current state by replaying events. (Chapter II)
Eventual Consistency. A consistency model where all replicas converge to the same value over time, but may differ briefly after a write. (Chapter III)
Failover. Redirecting traffic from a failed component to a healthy backup. (Chapter II)
Fan-Out. Distributing a single event to many recipients. Fan-out on write pushes the event to every recipient at write time; fan-out on read gathers it when a recipient reads. (Chapter VIII)
Feature Flag. A toggle that enables or disables a feature in production without deployment. (Chapter IV)
Feature Store. A centralized system for computing, storing, and serving ML features consistently for training and serving. (Chapter VI)
Geohashing. Encoding geographic coordinates into a string that groups nearby locations with similar prefixes. (Chapter VIII)
GitOps. Using Git as the single source of truth for deployment configuration, with agents ensuring the cluster matches the declared state. (Chapter VI)
Gossip Protocol. Spreading information through random peer-to-peer communication, like rumors through a social group. (Chapter IV)
gRPC. A high-performance RPC framework using Protocol Buffers and HTTP/2. (Chapter II)
GSLB (Global Server Load Balancing). Distributing traffic across multiple geographic regions based on user location and health. (Chapter II)
Guardrails (LLM). Input and output filters that prevent harmful, incorrect, or off-policy content from LLMs. (Chapter VII)
HDFS (Hadoop Distributed File System). A distributed file system for storing large datasets across a cluster of machines. (Chapter II)
Health Check. Periodic verification that a service instance is functioning correctly. (Chapter II)
HLS (HTTP Live Streaming). Apple's protocol for delivering live and on-demand video via HTTP. (Chapter VII)
Hot Key. A single cache or database key receiving disproportionate traffic. (Chapter II)
Idempotency. The property of an operation that produces the same result whether it is executed once or multiple times. (Chapter II)
Inverted Index. A data structure mapping terms to the documents containing them, enabling fast full-text search. (Chapter IV)
IoC (Inversion of Control). A design principle where object creation and dependency management are handled externally rather than internally. (Chapter V)
JWT (JSON Web Token). A self-contained, stateless token carrying user identity and permissions claims. (Chapter IV)
Kafka. A distributed event streaming platform that stores events in an immutable, partitioned log. (Chapter II)
KV Cache (LLM). Cached attention key and value tensors from previous tokens, avoiding redundant computation during text generation. (Chapter VII)
Lambda Architecture. Running parallel batch and stream processing paths with a serving layer that merges results. (Chapter IV)
Latency. The time between a request being sent and a response being received. (Chapter I)
Leader Election. Selecting one node among multiple to perform a specific responsibility. (Chapter III)
Linearizability. A consistency guarantee where operations appear to take effect instantaneously at some point between their start and completion. (Chapter III)
Load Balancer. Distributes incoming requests across multiple servers. (Chapter II)
LoRA (Low-Rank Adaptation). An efficient fine-tuning technique that adds small trainable matrices to frozen model weights. (Chapter VII)
LRU (Least Recently Used). A cache eviction policy that removes the entry accessed least recently. (Chapter II)
Merkle Tree. A binary hash tree where each parent is the hash of its children, enabling efficient verification. (Chapter VII)
Message Queue. Infrastructure that stores messages between producers and consumers for asynchronous communication. (Chapter II)
Microservices. Architecture of small, independently deployable services each owning a specific business capability. (Chapter III)
Monolith. A single application where all functionality lives in one codebase and deploys as one unit. (Chapter III)
MQTT. A lightweight pub/sub protocol designed for IoT devices and constrained networks. (Chapter VII)
Mutual TLS (mTLS). TLS where both client and server verify each other's identity. Standard in service meshes. (Chapter II)
Normalization. Organizing database tables to minimize redundancy (1NF, 2NF, 3NF, BCNF). (Chapter II)
OLAP (Online Analytical Processing). Database workload pattern involving aggregation queries over large datasets. Typically served by columnar storage. (Chapter VII)
OLTP (Online Transaction Processing). Database workload pattern involving high-concurrency, low-latency operations on individual records. Typically served by row-oriented storage. (Chapter VII)
Operational Transform (OT). An algorithm for merging concurrent edits in real-time collaborative editing. (Chapter VII)
Outbox Pattern. Writing events to a database table within the same transaction as the data change, then publishing them separately, so that an event is recorded if and only if its data change commits. (Chapter IV)
PACELC Theorem. Extends CAP: if Partition, choose Availability or Consistency; Else, choose Latency or Consistency. (Chapter III)
Partitioning (Sharding). Splitting data across multiple database servers by a partition key. (Chapter II)
PII (Personally Identifiable Information). Data that can identify an individual: names, emails, phone numbers, SSNs. (Chapter IV)
Proof of Stake (PoS). Blockchain consensus where validators stake cryptocurrency as collateral. (Chapter VII)
Proof of Work (PoW). Blockchain consensus requiring computational puzzle-solving. (Chapter VII)
Pub/Sub (Publish/Subscribe). Messaging pattern where publishers send events to topics and subscribers receive copies. (Chapter II)
Quadtree. A spatial data structure that recursively divides space into four quadrants for efficient geographic queries. (Chapter VIII)
Quorum. The minimum number of nodes that must agree for a read or write to be considered successful. (Chapter IV)
Raft. A consensus algorithm that ensures distributed nodes agree on a value through leader election and log replication. (Chapter III)
RAG (Retrieval-Augmented Generation). Grounding LLM responses in external knowledge by retrieving relevant documents and injecting them into the prompt. (Chapter VII)
Rate Limiting. Controlling the number of requests a client can make within a time period. (Chapter IV)
RBAC (Role-Based Access Control). Authorization model where permissions are assigned to roles, and roles to users. (Chapter IV)
Read Replica. A database copy that serves read queries, offloading the primary. (Chapter II)
Reverse Proxy. A server that sits in front of backend servers, handling security, SSL, caching, and routing. (Chapter II)
RLHF (Reinforcement Learning from Human Feedback). Training technique that aligns LLMs with human preferences using reward models. (Chapter VII)
RPO (Recovery Point Objective). Maximum acceptable data loss, expressed as the window of time before a failure whose writes may be lost. (Chapter III)
RTO (Recovery Time Objective). Maximum acceptable time to restore service after a failure. (Chapter III)
Saga Pattern. Breaking distributed transactions into local transactions with compensating actions for rollback. (Chapter III)
Serialization. Converting data structures into a format for storage or transmission (JSON, Protocol Buffers, Avro). (Chapter II)
Service Discovery. Maintaining a registry of available service instances and their addresses. (Chapter IV)
Service Mesh. Infrastructure layer of sidecar proxies managed by a control plane for service-to-service communication. (Chapter II)
Sidecar Pattern. An auxiliary process running alongside a service handling cross-cutting concerns like logging, TLS, and observability. (Chapter IV)
SLA (Service Level Agreement). A contract with customers defining availability and performance guarantees. (Chapter III)
SLI (Service Level Indicator). A measurable metric (latency, error rate, throughput) used to evaluate service health. (Chapter III)
SLO (Service Level Objective). An internal target for an SLI, typically set tighter than the SLA. (Chapter III)
Smart Contract. A program stored on a blockchain that executes automatically when conditions are met. (Chapter VII)
Snowflake ID. A 64-bit unique ID composed of timestamp, machine ID, and sequence number. (Chapter IV)
SOLID. Five design principles: Single Responsibility, Open/Closed, Liskov Substitution, Interface Segregation, Dependency Inversion. (Chapter V)
SPOF (Single Point of Failure). A component whose failure brings down the entire system. (Chapter III)
SSE (Server-Sent Events). A protocol for one-way server-to-client push over HTTP. (Chapter II)
Strangler Fig Pattern. Incrementally migrating from an old system to a new one by routing traffic progressively. (Chapter IV)
Strong Consistency. Every read sees the most recent write across all nodes. (Chapter III)
Throughput. The number of requests or operations a system can handle per unit of time. (Chapter I)
TLS (Transport Layer Security). Protocol for encrypting data in transit over a network. (Chapter IV)
Tokenization (Payment). Replacing sensitive data with a non-sensitive token that maps back through a secure lookup. (Chapter IV)
Trie (Prefix Tree). A tree data structure for efficient prefix-based lookups, used in autocomplete systems. (Chapter IV)
TTL (Time to Live). Duration before a cached or DNS entry expires and must be refreshed. (Chapter II)
ULID (Universally Unique Lexicographically Sortable Identifier). A 128-bit sortable unique ID with a timestamp prefix. (Chapter IV)
UUID (Universally Unique Identifier). A 128-bit identifier generated without coordination. v4 is random, v7 is time-sorted. (Chapter IV)
Vector Clock. A mechanism for tracking causal relationships between events in a distributed system. (Chapter IV)
Vector Database. A database optimized for storing and searching high-dimensional vectors (embeddings). (Chapter II)
VPC (Virtual Private Cloud). An isolated network within a cloud provider for resource security. (Chapter IV)
WAF (Web Application Firewall). A firewall that inspects HTTP traffic and blocks known attack patterns. (Chapter IV)
WebSocket. A protocol for persistent, bidirectional, real-time communication between client and server. (Chapter II)
Write-Ahead Log (WAL). A log where database changes are recorded before being applied, ensuring durability and enabling CDC. (Chapter VII)
Zero-Trust Architecture. Security model where nothing is trusted by default, regardless of network location. Every request is authenticated and authorized. (Chapter IV)
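
Worked example: the Availability, SLO, and Error Budget entries above are linked by simple arithmetic, which a short sketch can make concrete. The function name error_budget_seconds and the 30-day default window are assumptions chosen for illustration, not conventions taken from this handbook.

```python
def error_budget_seconds(slo_percent: float, window_days: float = 30.0) -> float:
    """Return the downtime (in seconds) that an availability SLO allows.

    The error budget is the window length multiplied by (100% - SLO).
    The 30-day default window is an illustrative assumption.
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = error_budget_seconds(slo) / 60
        print(f"{slo}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, "three nines" (99.9%) over 30 days leaves roughly 43.2 minutes of downtime, and each additional nine shrinks the budget by a factor of ten.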