Chapter 6: Infrastructure & DevOps

6.1 Cloud & Infrastructure

## Cloud Computing Models: IaaS, PaaS, SaaS, FaaS

Cloud computing eliminates the need to buy, rack, and maintain physical servers.

Instead, you rent computing resources from a provider who manages the hardware.

What varies between cloud models is how much of the stack the provider manages versus how much you manage yourself.

**IaaS (Infrastructure as a Service)**

IaaS gives you virtual machines, storage, and networking. You manage everything above the hardware: the operating system, runtime, application code, and data.

The provider handles the physical servers, power, cooling, and network infrastructure.

AWS EC2, Google Compute Engine, and Azure Virtual Machines are IaaS.

Use IaaS when you need full control over your environment or when you are running software that requires specific OS configurations.

**PaaS (Platform as a Service)**

PaaS adds the operating system, runtime, and middleware to what the provider manages.

You deploy your application code, and the platform handles scaling, patching, load balancing, and server management.

AWS Elastic Beanstalk, Google App Engine, Heroku, and Azure App Service are PaaS.

Use PaaS when you want to focus on application code without managing infrastructure.

**SaaS (Software as a Service)**

SaaS is a complete application delivered over the internet. You configure it and use it but do not manage any infrastructure or code.

Gmail, Salesforce, Slack, and Datadog are SaaS. In system design, you consume SaaS products as dependencies (sending emails through SendGrid, monitoring with Datadog) rather than building them yourself.

**FaaS (Function as a Service)**

FaaS runs individual functions in response to events. You write a function, deploy it, and the provider handles everything: provisioning, scaling, and execution.

AWS Lambda, Azure Functions, and Google Cloud Functions are FaaS. Use FaaS for event-driven, short-lived, stateless workloads.
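A FaaS function is just a handler the platform invokes per event. As a minimal sketch, here is a function following AWS Lambda's Python handler convention; the event shape (an API Gateway-style request) and the greeting logic are illustrative assumptions:

```python
import json

def handler(event, context):
    """Entry point the FaaS platform invokes once per event.

    `event` carries the trigger payload (assumed here to be an API
    Gateway-style HTTP request); `context` carries runtime metadata.
    The function is stateless: everything it needs arrives in the event.
    """
    # Query parameters may be absent entirely, so default defensively.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

The provider provisions the runtime, invokes the handler concurrently as events arrive, and scales to zero when idle; none of that appears in the code.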

| Model | You Manage | Provider Manages | Examples |
| --- | --- | --- | --- |
| IaaS | OS, runtime, application, data | Hardware, virtualization, networking | EC2, Compute Engine, Azure VMs |
| PaaS | Application code, data | OS, runtime, scaling, patching | Elastic Beanstalk, App Engine, Heroku |
| SaaS | Configuration, usage | Everything | Gmail, Salesforce, Datadog |
| FaaS | Function code | Everything else | Lambda, Azure Functions, Cloud Functions |

The industry trend is to move up this stack.

Teams start with IaaS for maximum control, then migrate components to PaaS or FaaS as they realize the operational overhead of managing infrastructure does not justify the control it provides.

Most production systems use a mix: IaaS for workloads that need custom environments, PaaS for standard web applications, FaaS for event-driven processing, and SaaS for capabilities that are not core to the business (email delivery, monitoring, authentication).

## AWS, Azure, GCP: Core Services Overview

The three major cloud providers offer hundreds of services each, but a small set of core services covers the vast majority of system design needs.

Knowing the equivalent services across providers helps you design cloud-agnostic architectures and communicate in interviews regardless of which cloud the interviewer's company uses.

**Compute**

| Capability | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Virtual machines | EC2 | Virtual Machines | Compute Engine |
| Containers (managed) | ECS, EKS (Kubernetes) | AKS (Kubernetes) | GKE (Kubernetes) |
| Serverless functions | Lambda | Azure Functions | Cloud Functions |
| Serverless containers | Fargate | Container Apps | Cloud Run |

**Storage**

| Capability | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Object storage | S3 | Blob Storage | Cloud Storage |
| Block storage | EBS | Managed Disks | Persistent Disk |
| File storage | EFS | Azure Files | Filestore |
| Archive storage | S3 Glacier | Blob Storage archive tier | Cloud Storage Archive class |

**Databases**

| Capability | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Relational (managed) | RDS, Aurora | Azure SQL, Azure Database for PostgreSQL | Cloud SQL, AlloyDB |
| NoSQL (document) | DynamoDB | Cosmos DB | Firestore |
| In-memory cache | ElastiCache (Redis/Memcached) | Azure Cache for Redis | Memorystore |
| Data warehouse | Redshift | Synapse Analytics | BigQuery |

**Networking**

| Capability | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Virtual network | VPC | Virtual Network (VNet) | VPC |
| Load balancer | ALB, NLB | Azure Load Balancer | Cloud Load Balancing |
| CDN | CloudFront | Azure CDN | Cloud CDN |
| DNS | Route 53 | Azure DNS | Cloud DNS |

**Messaging & Streaming**

| Capability | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Message queue | SQS | Service Bus Queues | Cloud Tasks |
| Pub/sub | SNS | Service Bus Topics | Pub/Sub |
| Event streaming | Kinesis, MSK (Kafka) | Event Hubs | Pub/Sub |

You do not need to memorize every service.

The pattern is consistent: each provider has equivalent services for compute, storage, databases, networking, and messaging.

In interviews, saying "an object store like S3" or "a managed Kubernetes service like EKS" demonstrates knowledge without coupling your design to a specific vendor.

## Virtual Machines vs. Containers

Understanding the difference between VMs and containers is fundamental to modern infrastructure decisions.

Both isolate workloads, but they do so at different levels and with different trade-offs.

**Virtual Machines**

A virtual machine runs a complete operating system on virtualized hardware.

A hypervisor (like VMware ESXi, KVM, or Hyper-V) sits between the physical hardware and the VMs, allocating CPU, memory, and storage to each VM.

Each VM includes its own kernel, its own OS libraries, and the application running on top.

VMs provide strong isolation. Each VM is a fully independent computer.

A crash or security compromise in one VM does not affect others. This isolation makes VMs suitable for running untrusted workloads and for workloads that require different operating systems on the same physical host.

The cost is overhead. Each VM includes a full OS installation (often 1-2 GB), its own kernel processes consuming CPU and memory, and a boot time measured in minutes.

Running 50 microservices as 50 VMs wastes significant resources on 50 copies of the OS.

**Containers**

A container packages an application and its dependencies (libraries, runtime, configuration) into a single lightweight unit.

Containers share the host operating system's kernel.

They use kernel features (namespaces for isolation, cgroups for resource limits) to create isolated environments without running a separate OS.

Because containers share the kernel, they start in seconds (not minutes) and consume megabytes of overhead (not gigabytes); a single host that would support only a dozen VMs can run hundreds of containers.

| Aspect | Virtual Machines | Containers |
| --- | --- | --- |
| Isolation level | Full OS-level (separate kernel) | Process-level (shared kernel) |
| Size | Gigabytes (includes full OS) | Megabytes (app + dependencies) |
| Startup time | Minutes | Seconds |
| Density | Dozens per host | Hundreds per host |
| Security isolation | Strong (separate kernels) | Moderate (shared kernel, namespace isolation) |
| Best for | Multi-OS workloads, strong isolation needs | Microservices, CI/CD, rapid scaling |

Most modern applications run on containers.

VMs are still used for workloads that require different OS types, for legacy applications that cannot be containerized, and for situations requiring the strongest possible isolation (multi-tenant platforms where tenants must be fully isolated).

**Containerization: Docker Fundamentals**

Docker is the tool that made containers practical and mainstream.

Before Docker, containerization existed (Linux LXC) but was complex to configure.

Docker simplified the process to: write a Dockerfile, build an image, and run a container.

**Core Concepts**

A Dockerfile is a text file with instructions for building a container image. Each instruction adds a layer: install the runtime, copy application code, install dependencies, define the startup command. A simple Node.js Dockerfile might be:

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install --production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
```

A Docker image is the built artifact from a Dockerfile. It is a read-only template containing the application and all its dependencies.

Images are versioned and stored in a container registry (Docker Hub, AWS ECR, Google Container Registry, GitHub Container Registry).

A container is a running instance of an image. You can run multiple containers from the same image, each isolated from the others.

Containers are ephemeral by default.

When a container stops, any data written inside it is lost. Persistent data must be stored in volumes (mounted directories from the host or external storage).

A Docker Compose file defines multi-container applications.

If your application needs a web server, a database, and a Redis cache, a `docker-compose.yml` file describes all three services, their networking, and their storage.

A single `docker-compose up` command starts the entire stack.
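The web-plus-database-plus-cache stack just described can be sketched as a Compose file. The service names, image tags, and credentials below are illustrative assumptions:

```yaml
# docker-compose.yml — three-service stack: app, PostgreSQL, Redis.
services:
  web:
    build: .                  # built from the Dockerfile in this directory
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgres://app:secret@db:5432/app
      REDIS_URL: redis://cache:6379
    depends_on: [db, cache]
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret   # illustrative only; use a secrets store in production
      POSTGRES_DB: app
    volumes:
      - db-data:/var/lib/postgresql/data   # named volume: data survives container restarts
  cache:
    image: redis:7

volumes:
  db-data:
```

Compose puts all three services on a shared network, so the app reaches the database at the hostname `db` and Redis at `cache`, with no IP addresses hard-coded.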

**Why Docker Matters for System Design**

Docker solves the "it works on my machine" problem.

The container includes everything the application needs: the right runtime version, the right library versions, the right OS dependencies.

If it works in the container, it works in production.

Docker images are the deployment artifact in modern CI/CD pipelines. You build the image once, test it, push it to a registry, and deploy the same image to staging and production.

No more differences between environments.

Docker also enables microservices architecture practically. Each microservice is a separate container with its own dependencies.

Different services can use different language runtimes without conflicts.

One service runs Python 3.12 while another runs Java 21 on the same host.

## Container Orchestration: Kubernetes, ECS, Nomad

Running one container is easy. Running hundreds of containers across dozens of machines, ensuring they stay healthy, scale with traffic, and recover from failures, is the problem container orchestration solves.

**Kubernetes**

Kubernetes (K8s) is the dominant container orchestration platform.

Originally built by Google and now maintained by the Cloud Native Computing Foundation (CNCF), it runs in production at most large-scale software companies.

Kubernetes organizes containers into Pods (the smallest deployable unit, typically one container per pod).

Pods are managed by Deployments (which ensure a specified number of pod replicas are running), exposed through Services (stable network endpoints that route traffic to pods), and configured through ConfigMaps and Secrets.

Key Kubernetes capabilities include:

* Automatic scaling: the Horizontal Pod Autoscaler adds or removes pods based on CPU, memory, or custom metrics.
* Self-healing: if a pod crashes, Kubernetes restarts it; if a node fails, its pods are rescheduled to healthy nodes.
* Rolling updates: deploy a new version by gradually replacing old pods with new ones, with automatic rollback if health checks fail.
* Service discovery: pods find each other through DNS and Kubernetes Services.
* Resource management: CPU and memory limits per pod prevent one workload from starving others.
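A Deployment and Service for a stateless web app can be sketched in a few lines of YAML; the names, image, and port below are illustrative assumptions:

```yaml
# deployment.yaml — keeps three replicas of the app running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.4.2   # illustrative image reference
          ports:
            - containerPort: 3000
          resources:
            requests: {cpu: 100m, memory: 128Mi}  # scheduling guarantees
            limits: {cpu: 500m, memory: 256Mi}    # hard caps per pod
---
# service.yaml — stable endpoint load-balancing across the pods.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector: {app: web}
  ports:
    - port: 80
      targetPort: 3000
```

Applying these manifests (`kubectl apply -f .`) declares the desired state; Kubernetes continuously reconciles reality toward it, restarting or rescheduling pods as needed.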

Kubernetes has a significant learning curve. Its configuration is verbose (YAML files for every resource), its networking model is complex, and operating a Kubernetes cluster requires dedicated expertise.

Managed Kubernetes services (EKS on AWS, GKE on GCP, AKS on Azure) reduce the operational burden by handling the control plane (API server, scheduler, etcd), but you still manage the worker nodes and application configuration.

**Amazon ECS**

ECS (Elastic Container Service) is AWS's proprietary container orchestration service. It is simpler than Kubernetes but tightly coupled to the AWS ecosystem.

ECS manages tasks (equivalent to pods) on a cluster of EC2 instances or on Fargate (serverless containers where AWS manages the underlying compute).

ECS is a good choice for teams already deep in the AWS ecosystem who want container orchestration without the complexity of Kubernetes. It integrates natively with ALB, IAM, CloudWatch, and other AWS services with minimal configuration.

**HashiCorp Nomad**

Nomad is a simpler, more flexible orchestrator that handles not just containers but also VMs, Java applications, and raw binaries. Its single-binary architecture (one binary for both client and server) makes it easier to deploy and operate than Kubernetes.

Nomad is popular in organizations that need orchestration for heterogeneous workloads (not just containers) or that find Kubernetes too complex for their scale. It integrates with other HashiCorp tools (Consul for service discovery, Vault for secrets management).

| Orchestrator | Complexity | Ecosystem Lock-in | Strengths | Best For |
| --- | --- | --- | --- | --- |
| Kubernetes | High | Cloud-agnostic | Industry standard, massive ecosystem | Large-scale, multi-cloud, complex workloads |
| ECS | Medium | AWS-specific | Simple, deep AWS integration | AWS-native teams, moderate complexity |
| Nomad | Low-Medium | Cloud-agnostic | Simple, multi-workload support | Smaller teams, heterogeneous workloads |

## Infrastructure as Code: Terraform, CloudFormation, Pulumi

Infrastructure as Code (IaC) means defining your infrastructure (servers, databases, networks, load balancers) in code files that are version-controlled, reviewed, and deployed like application code.

Instead of clicking through a cloud console to create a database, you write a configuration file that declares the database's type, size, and settings.

A tool reads the file and creates the infrastructure.

IaC provides:

* Repeatability: the same code creates identical infrastructure every time.
* Version history: every infrastructure change is a commit with a diff, author, and message.
* Review process: infrastructure changes go through pull requests and code review.
* Disaster recovery: recreate the entire infrastructure from code if a region is lost.
* Drift detection: compare the actual infrastructure to the code and identify manual changes that were not committed.

**Terraform**

Terraform (by HashiCorp) is the most widely used IaC tool. It works across every major cloud provider (AWS, Azure, GCP) and hundreds of other services (Cloudflare, Datadog, GitHub) through its provider plugin architecture.

You write configuration in HCL (HashiCorp Configuration Language), run `terraform plan` to preview changes, and `terraform apply` to execute them.

Terraform maintains a state file that records the current infrastructure.

When you change the configuration and apply it, Terraform compares the desired state (your code) with the current state (the state file) and makes only the necessary changes.

Terraform's strength is its cloud-agnostic nature.

You can manage AWS, Azure, and GCP resources in the same codebase. Its weakness is the state file management: the state file must be stored securely and consistently (typically in a remote backend like S3 with locking via DynamoDB).
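A minimal Terraform configuration following this pattern might look like the sketch below. The resource names, instance sizes, and backend bucket are illustrative assumptions:

```hcl
# main.tf — a managed PostgreSQL instance with remote state.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"  # remote state storage (assumed bucket)
    key            = "prod/db.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"          # state locking to prevent concurrent applies
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_db_instance" "app" {
  identifier        = "app-db"
  engine            = "postgres"
  engine_version    = "16.3"
  instance_class    = "db.t3.medium"
  allocated_storage = 50
  username          = "app"
  password          = var.db_password  # supplied at apply time, never committed
}

variable "db_password" {
  type      = string
  sensitive = true
}
```

Running `terraform plan` shows exactly what would change (create the instance, resize it, or nothing), and `terraform apply` executes only that diff.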

**CloudFormation**

CloudFormation is AWS's native IaC service. It uses JSON or YAML templates to define AWS resources.

CloudFormation manages state internally (no external state file), integrates deeply with AWS services, and supports features like drift detection and rollback.

CloudFormation's advantage is zero setup for AWS-only infrastructure and tight integration with IAM, service limits, and resource dependencies. Its limitation is that it only works with AWS.

**Pulumi**

Pulumi takes a different approach: you write infrastructure definitions in real programming languages (TypeScript, Python, Go, Java, C#) instead of domain-specific languages like HCL or YAML.

This means you can use loops, conditionals, functions, and all the abstractions of a general-purpose language.

Pulumi appeals to teams that want to use the same language for infrastructure and application code, reducing the number of tools and languages in the stack.

| Tool | Language | Cloud Support | State Management | Strengths |
| --- | --- | --- | --- | --- |
| Terraform | HCL | Multi-cloud | External (S3, etc.) | Cloud-agnostic, largest ecosystem |
| CloudFormation | JSON/YAML | AWS only | Managed by AWS | Native AWS integration, no state management |
| Pulumi | TypeScript, Python, Go, etc. | Multi-cloud | Managed or self-hosted | Real programming languages, strong typing |

## Service Mesh: Istio, Linkerd, Envoy

Service mesh was covered in detail in Part II, Lesson 6 (Sidecar Proxy Pattern and Service Mesh). Here is a focused summary of the three major implementations and when each makes sense.

A service mesh provides mutual TLS, traffic management, observability, and resilience (retries, circuit breaking) for service-to-service communication. It does this by deploying a sidecar proxy alongside every service instance.

The proxies form the data plane. A control plane manages their configuration.

**Istio**

Istio is the most feature-rich service mesh.

It uses Envoy as its sidecar proxy and provides sophisticated traffic management (canary deployments, traffic mirroring, fault injection), security (mutual TLS, authorization policies), and observability (integrated with Prometheus, Grafana, Jaeger). Istio's complexity is both its strength and its weakness. It can do almost anything, but configuring and operating it requires significant expertise.

Choose Istio when you need advanced traffic management and your team has the capacity to manage the complexity.

**Linkerd**

Linkerd is the simplest service mesh. It uses its own Rust-based proxy (linkerd2-proxy) that is lighter and faster than Envoy.

Linkerd focuses on the most common use cases: mutual TLS, basic traffic metrics, and retries. It installs in minutes and has a minimal operational footprint.

Choose Linkerd when you need the core benefits of a service mesh (mTLS, observability) without the configuration overhead of Istio.

**Envoy**

Envoy is not a service mesh by itself but the proxy that Istio and many custom meshes are built on.

Some organizations deploy Envoy sidecars without a full mesh control plane, configuring them through static files or a lightweight xDS server.

This gives you Envoy's traffic management and observability without the control plane overhead.

Choose standalone Envoy when you need advanced proxy features but do not want a full mesh.
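As a hedged sketch of what a standalone deployment involves, here is a minimal static Envoy configuration that routes all HTTP traffic to one upstream cluster with a simple retry policy. Field names follow Envoy's v3 API, but the listener port, cluster name, and upstream address are illustrative assumptions:

```yaml
# envoy.yaml — one listener, one upstream cluster, no control plane.
static_resources:
  listeners:
    - name: ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: app
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route:
                            cluster: app
                            retry_policy:       # resilience added in the proxy,
                              retry_on: "5xx"   # not in application code
                              num_retries: 2
  clusters:
    - name: app
      type: STRICT_DNS
      load_assignment:
        cluster_name: app
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: app.internal, port_value: 3000 }
```

A full mesh replaces this static file with dynamic configuration pushed from a control plane over xDS, but the proxy behavior is the same.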

| Mesh | Proxy | Complexity | Resource Overhead | Best For |
| --- | --- | --- | --- | --- |
| Istio | Envoy | High | Higher | Advanced traffic management, large-scale |
| Linkerd | linkerd2-proxy (Rust) | Low | Lower | Core mesh features, simplicity |
| Envoy (standalone) | Envoy | Medium | Medium | Custom mesh, teams with Envoy expertise |

## Cloud-Native Application Architecture

Cloud-native is not just "running on the cloud." It is an architecture and development approach designed to fully exploit the capabilities of cloud platforms: elastic scaling, managed services, global distribution, and automated operations.

The Cloud Native Computing Foundation (CNCF) defines it as using containers, service meshes, microservices, immutable infrastructure, and declarative APIs.

**Principles of Cloud-Native Architecture**

* Designed for automation: Every aspect of the application lifecycle is automated: building, testing, deploying, scaling, and recovering. Manual operations are the exception, not the norm. Infrastructure as Code, CI/CD pipelines, and auto-scaling are foundational.

* Microservices-based: The application is decomposed into independently deployable services. Each service is small enough for one team to own, has its own data store, and communicates through well-defined APIs or events.

* Containerized: Applications run in containers for consistency across environments, rapid startup, and efficient resource utilization. Container images are the standard deployment artifact.

* Resilient by design: Cloud-native applications expect failure. Servers crash. Networks partition. Services slow down. The application handles these failures gracefully through redundancy, circuit breakers, retries, and graceful degradation.

* Observable: Every service emits metrics, logs, and traces. Centralized monitoring and alerting provide real-time visibility into system health. Observability is built into the application from day one, not added as an afterthought.

* Stateless where possible: Application instances do not store state locally. State lives in external data stores (databases, caches, object stores). This enables free horizontal scaling and zero-impact instance replacement.

**The Twelve-Factor App**

The Twelve-Factor App methodology, published by Heroku engineers, is a practical checklist for building cloud-native applications.

Key factors include:

* Store configuration in environment variables, not in code.
* Treat backing services (databases, caches, queues) as attached resources that can be swapped without code changes.
* Build stateless processes that share nothing.
* Export services via port binding.
* Scale out via the process model: more instances, not bigger instances.
* Maximize dev/prod parity: development, staging, and production should be as similar as possible.
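The first of these, configuration in the environment, can be sketched in a few lines of Python; the variable names and default values are illustrative assumptions:

```python
import os

def load_config():
    """Read configuration from environment variables (twelve-factor style).

    Backing services are attached resources: swapping the database means
    changing DATABASE_URL in the environment, not changing the code.
    """
    return {
        # Local defaults for development; production sets real values.
        "database_url": os.environ.get("DATABASE_URL", "postgres://localhost:5432/dev"),
        "redis_url": os.environ.get("REDIS_URL", "redis://localhost:6379"),
        "port": int(os.environ.get("PORT", "3000")),
    }
```

The same image then runs unchanged in development, staging, and production; only the environment differs, which is exactly the dev/prod parity the methodology asks for.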

Not every factor applies to every application, but the methodology provides a solid baseline for cloud-native design.

Applications that follow these principles are easier to deploy, scale, and operate on any cloud platform.

**Beginner Mistake to Avoid**

New engineers sometimes confuse "cloud-hosted" with "cloud-native." Running a traditional monolithic application on an EC2 instance is cloud-hosted.

It does not take advantage of elastic scaling, managed services, or automated operations.

Converting that monolith to cloud-native means containerizing it, decomposing it into services, externalizing state, automating deployment, and building in observability and resilience.

Simply moving a VM to the cloud gives you flexibility in provisioning but does not give you the operational advantages that cloud-native architecture provides.

**Interview-Style Question**

> Q: Your company currently runs a monolithic application on VMs. Leadership wants to move to "the cloud." They think this means renting VMs from AWS instead of running their own servers. What would you recommend, and what are the stages of a realistic migration?

> A: Moving VMs to EC2 (lift and shift) is a valid first step because it eliminates data center management and provides flexibility, but it captures only a fraction of the cloud's value. A realistic migration has stages:
>
> 1. Lift and shift: move the existing application to EC2 instances, set up VPC networking, and configure backups to S3. This gives you immediate benefits from managed hardware.
> 2. Adopt managed services: replace self-managed PostgreSQL with RDS, replace self-managed Redis with ElastiCache, and use S3 for file storage instead of local disks. This reduces operational burden.
> 3. Containerize: package the application in Docker, deploy using ECS or EKS, and set up a CI/CD pipeline for automated deployments. This gives you faster deployments and environment consistency.
> 4. Decompose into services: identify the modules that benefit most from independent scaling or independent deployment and extract them as microservices.
> 5. Cloud-native: add auto-scaling, implement observability (Prometheus, Grafana, Jaeger), introduce infrastructure as code (Terraform), and adopt FaaS for event-driven workloads.
>
> Each stage delivers value independently. You do not need to reach stage 5 to benefit from the cloud.

_Cloud Migration_

### KEY TAKEAWAYS

* Cloud models (IaaS, PaaS, SaaS, FaaS) differ in how much you manage vs. the provider. Most systems use a mix.

* AWS, Azure, and GCP offer equivalent services for compute, storage, databases, networking, and messaging. Design for capabilities, not specific vendors.

* Containers are lighter, faster, and denser than VMs. Use VMs when you need strong isolation or different OS types. Use containers for microservices and modern workloads.

* Docker packages applications with their dependencies into portable images. Container images are the standard deployment artifact in modern CI/CD.

* Kubernetes is the industry standard for container orchestration. ECS is simpler for AWS-only teams. Nomad is simplest for heterogeneous workloads.

* Infrastructure as Code (Terraform, CloudFormation, Pulumi) makes infrastructure repeatable, reviewable, and recoverable. Treat infrastructure changes like code changes.

* Service meshes (Istio, Linkerd) provide mTLS, observability, and traffic management for service-to-service communication. Linkerd for simplicity, Istio for advanced features.

* Cloud-native means more than running on the cloud. It means designing for automation, resilience, observability, and elastic scaling. The Twelve-Factor App methodology is a practical starting checklist.