## Performance Optimization
Performance optimization is not about making everything as fast as possible. It is about identifying the specific bottleneck that limits your system's throughput or latency and removing it.
Optimizing a component that is not the bottleneck wastes engineering time and produces no measurable improvement.
The discipline is in measuring first and optimizing second.
**Profiling and Benchmarking: Identifying Bottlenecks**
Before you can optimize anything, you need to know where the time is being spent.
Profiling and benchmarking give you that information.
Profiling examines a running application to find where it spends its time and resources. A CPU profile shows which functions consume the most processor time. A memory profile shows which objects allocate the most memory and where leaks might exist. A database profile shows which queries take the longest and which are called most frequently.
Language-specific profiling tools include pprof (Go), cProfile and py-spy (Python), async-profiler (Java), and Chrome DevTools Performance panel (JavaScript/Node.js).
For distributed systems, distributed tracing (covered in Chapter IV) serves as a system-wide profiler, showing you which service in a request chain contributes the most latency.
Benchmarking measures the performance of a specific component or operation under controlled conditions. You run a load test against an API endpoint, gradually increasing the request rate until performance degrades. The point where latency starts rising sharply or error rates increase is the system's capacity under current conditions.
Load testing tools include k6, Locust, JMeter, and wrk.
For meaningful benchmarks, use realistic data, realistic request patterns, and a production-like environment.
A benchmark against a local development machine with a single user tells you nothing about how the system performs with 10,000 concurrent users.
The profiling workflow is straightforward:

1. Measure current performance.
2. Identify the slowest component (the bottleneck).
3. Optimize that component.
4. Measure again to verify the improvement.
5. Repeat.
Do not optimize components that are not bottlenecks.
A function that takes 1ms in a request chain where the database query takes 500ms is not worth touching.
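As a minimal sketch of the measure-first workflow, Python's built-in `cProfile` can show where time actually goes. The two lookup functions here are hypothetical stand-ins for "bottleneck" and "non-bottleneck" code:

```python
import cProfile
import io
import pstats

def slow_lookup(n):
    # Deliberately inefficient: membership tests on a list are O(n).
    data = list(range(n))
    return sum(1 for i in range(1000) if i in data)

def fast_lookup(n):
    # Same logic with a set: membership tests are O(1).
    data = set(range(n))
    return sum(1 for i in range(1000) if i in data)

profiler = cProfile.Profile()
profiler.enable()
slow_lookup(50_000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # the report shows which function dominates the time
```

The profile output, not intuition, tells you which function to rewrite; the same measure-fix-remeasure loop applies whether the tool is `cProfile`, pprof, or a distributed tracer.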
**Latency Optimization Techniques**
Latency is the time between a request being sent and a response being received.
Reducing latency improves user experience directly.
Users perceive applications as "fast" when responses arrive in under 200ms and "instant" when they arrive in under 100ms.
Reduce network hops: Every network call in a request chain adds latency. If Service A calls Service B, which calls Service C, which queries the database, that is three network round trips. Eliminating one hop by having Service A call the database directly (if architecturally appropriate) saves the round trip to B. CDNs reduce latency by moving content closer to users. Caching reduces latency by eliminating network calls to slower data sources entirely.
Move computation closer to the data: If your application fetches 1 million records from a database and filters them in application code, you are transferring 1 million records over the network when you could have sent a WHERE clause and transferred only the 100 matching records. Push filtering, aggregation, and transformation to the data layer whenever possible.
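A small sketch of this idea using an in-memory SQLite table (the schema and threshold are illustrative, not from the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, float(i % 500)) for i in range(10_000)],
)

# Anti-pattern: transfer every row, then filter in application code.
all_rows = conn.execute("SELECT id, amount FROM orders").fetchall()
big_app_side = [row for row in all_rows if row[1] > 495]

# Better: push the filter into the WHERE clause; only matching rows move.
big_db_side = conn.execute(
    "SELECT id, amount FROM orders WHERE amount > 495"
).fetchall()

assert big_app_side == big_db_side
print(f"transferred {len(all_rows)} rows vs {len(big_db_side)} rows")
```

Both paths compute the same answer, but the second transfers 80 rows instead of 10,000; over a real network that difference is the latency win.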
Prefetch and precompute: If you know a user is likely to request a specific piece of data next (like the next page of results), fetch it before they ask. Precomputation materializes expensive calculations (like aggregating a month's analytics into a summary table) during off-peak hours so that real-time queries read from the precomputed result instead of calculating from scratch.
Use efficient serialization: JSON is human-readable but verbose. A JSON payload that is 10 KB might be 3 KB in Protocol Buffers or MessagePack. For internal service-to-service communication at high volume, the serialization format directly affects both latency (smaller payloads transfer faster) and CPU usage (binary formats parse faster than text).
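The size difference is easy to demonstrate with the standard library alone. Here `struct` stands in for a schema-driven binary format like Protocol Buffers (the record shape is hypothetical):

```python
import json
import struct

# 1,000 (id, price) records.
records = [(i, i * 0.5) for i in range(1000)]

json_bytes = json.dumps(records).encode("utf-8")

# Binary encoding: each record packed as a 4-byte int + 8-byte double,
# standing in for a schema-driven format like Protocol Buffers.
binary_bytes = b"".join(struct.pack("<id", i, p) for i, p in records)

print(len(json_bytes), "JSON bytes vs", len(binary_bytes), "binary bytes")
assert len(binary_bytes) < len(json_bytes)
```

The binary form is a fixed 12 bytes per record, while JSON spends extra bytes on brackets, commas, and decimal text, which is exactly the overhead formats like Protobuf and MessagePack eliminate.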
**Connection Pooling**
Opening a new database connection involves a TCP handshake, TLS negotiation (if encrypted), authentication, and session initialization.
This process takes 20 to 100ms depending on network distance and the database.
If every request opens a new connection and closes it when done, you add that overhead to every request and stress the database with constant connection churn.
Connection pooling maintains a set of pre-established connections that are reused across requests.
When a request needs a database connection, it borrows one from the pool. When done, it returns the connection to the pool instead of closing it.
The next request gets the same connection instantly, skipping the entire setup overhead.
Connection pooling applies to databases (PgBouncer for PostgreSQL, ProxySQL for MySQL, HikariCP for Java applications), HTTP clients (most HTTP libraries maintain internal connection pools for outbound calls), Redis connections (most Redis client libraries pool connections by default), and gRPC channels (gRPC multiplexes multiple requests over a single HTTP/2 connection, effectively pooling at the protocol level).
Key configuration parameters include maximum pool size (too small causes requests to queue waiting for a connection; too large overwhelms the database with connections), minimum idle connections (keeping a few connections warm prevents cold-start latency after quiet periods), connection timeout (how long a request waits for a pool connection before failing), and idle timeout (how long an unused connection stays in the pool before being closed).
A common problem: the database supports 500 maximum connections. You have 20 application servers, each with a pool of 30 connections.
That is 600 connections, exceeding the database limit.
Either reduce the pool size per server, add PgBouncer between the application and database (PgBouncer can multiplex thousands of application connections into a smaller number of database connections), or scale the database.
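The borrow-and-return mechanic can be sketched in a few lines with a thread-safe queue. This is a simplified illustration (SQLite connections stand in for expensive network connections; production code would use a library like HikariCP or SQLAlchemy's pool):

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """A minimal fixed-size pool: connections are created once and reused."""

    def __init__(self, size, timeout=5.0):
        self._pool = queue.Queue(maxsize=size)
        self._timeout = timeout
        for _ in range(size):
            # In production this would be an expensive network connection
            # (TCP handshake, TLS, authentication) made once up front.
            self._pool.put(sqlite3.connect(":memory:", check_same_thread=False))

    @contextmanager
    def connection(self):
        # Borrow a connection; block up to `timeout` if the pool is exhausted.
        conn = self._pool.get(timeout=self._timeout)
        try:
            yield conn
        finally:
            self._pool.put(conn)  # return it to the pool instead of closing

pool = ConnectionPool(size=3)
with pool.connection() as conn:
    assert conn.execute("SELECT 1").fetchone() == (1,)
```

The `timeout` on `get` is the "connection timeout" parameter from the list above: it bounds how long a request waits for a free connection before failing fast.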
**Data Compression (gzip, Brotli, zstd)**
Compressing data before sending it over the network reduces bandwidth usage and transfer time. A 50 KB JSON API response compressed to 8 KB with gzip transfers 6x faster on a bandwidth-constrained connection.
gzip is the oldest and most universally supported compression algorithm. Every browser, every CDN, and every HTTP client supports it. Compression ratios for text data are typically 70-85% (a 100 KB file compresses to 15-30 KB). gzip is the safe default when compatibility matters most.
Brotli was developed by Google and produces 15-25% smaller files than gzip at similar compression speeds. It is supported by all modern browsers and most CDNs. Brotli is especially effective for static assets (JavaScript, CSS, HTML) and is the preferred algorithm for web content when legacy clients that only support gzip are not a concern.
zstd (Zstandard) was developed by Facebook and offers the best balance of compression ratio and speed. It can be tuned to compress as well as Brotli at speeds approaching gzip. zstd is increasingly used for internal data transfer (inter-service communication, log compression, database page compression) where the client is controlled and universal browser support is not required.
| Algorithm | Compression Ratio | Speed | Browser Support | Best For |
|---|---|---|---|---|
| gzip | Good | Good | Universal | Broad compatibility, fallback |
| Brotli | Better (15-25% smaller) | Good | All modern browsers | Static web assets, CDN delivery |
| zstd | Best at comparable speeds | Fastest | Limited (internal use) | Internal services, logs, data pipelines |
Compression is typically handled by the reverse proxy or CDN (Chapter II), not the application. Nginx, Cloudflare, and CloudFront all support gzip and Brotli compression automatically based on the client's `Accept-Encoding` header.
For very small payloads (under 1 KB), compression can actually increase the size due to the overhead of the compression headers. Skip compression for payloads smaller than 1-2 KB.
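Both effects, the large win on big text payloads and the size increase on tiny ones, can be checked directly with Python's `gzip` module (the payloads are illustrative):

```python
import gzip
import json

# A large, repetitive JSON payload compresses very well.
large = json.dumps(
    [{"user_id": i, "status": "active"} for i in range(2000)]
).encode("utf-8")
large_gz = gzip.compress(large)
print(len(large), "->", len(large_gz))

# A tiny payload can grow: gzip's ~18 bytes of header/trailer
# overhead dominates when there is almost nothing to compress.
tiny = b'{"ok":true}'
tiny_gz = gzip.compress(tiny)
print(len(tiny), "->", len(tiny_gz))
assert len(tiny_gz) > len(tiny)
```

This is why proxies like Nginx expose a minimum-length threshold (`gzip_min_length`) below which responses are sent uncompressed.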
**Lazy Loading and Eager Loading Trade-offs**
Eager loading fetches all related data upfront, in a single query or at initialization time. When loading a user's profile page, eager loading fetches the user record, their recent orders, their notification count, and their profile picture URL all in one call. The initial load is slower (more data), but subsequent interactions are instant because all data is already available.
Lazy loading fetches data only when it is actually needed. The profile page initially loads just the user record. The recent orders are fetched only if the user clicks the "Orders" tab. The notification count is loaded only when the notification bell becomes visible on screen. The initial load is fast, but each interaction triggers a new data fetch with its own latency.
| Approach | Initial Load | Subsequent Access | Data Transfer | Best For |
|---|---|---|---|---|
| Eager loading | Slower (fetches everything) | Instant (data already loaded) | More upfront | Data that will definitely be used |
| Lazy loading | Fast (fetches minimum) | Slower (on-demand fetch) | Less upfront | Data that may not be needed |
The right choice depends on usage probability. If 95% of users who view a profile also view their orders, eager load the orders.
If only 5% click the orders tab, lazy load them.
In database queries, this trade-off appears as the N+1 query problem.
Lazy loading a list of 50 orders, each with a product name, results in 1 query for the orders plus 50 individual product queries, one per order. Eager loading fetches the orders and products in a single query with a join.
ORMs (like Django ORM, ActiveRecord, Hibernate) provide mechanisms to switch between lazy and eager loading per query.
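The N+1 pattern and its eager-loading fix can be shown with raw SQL against SQLite (the schema is hypothetical; an ORM would hide these queries behind something like `select_related` or `includes`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id INTEGER);
""")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(i, f"product-{i}") for i in range(10)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 10) for i in range(50)])

# N+1 pattern: 1 query for orders, then one query per order for its product.
orders = conn.execute("SELECT id, product_id FROM orders ORDER BY id").fetchall()
lazy = [
    (oid, conn.execute("SELECT name FROM products WHERE id = ?",
                       (pid,)).fetchone()[0])
    for oid, pid in orders
]  # 51 queries in total

# Eager pattern: one JOIN fetches everything in a single round trip.
eager = conn.execute("""
    SELECT o.id, p.name
    FROM orders o JOIN products p ON p.id = o.product_id
    ORDER BY o.id
""").fetchall()

assert lazy == eager
```

Both produce identical results; the difference is 51 round trips versus 1, which is invisible locally but dominates latency when each query crosses a network.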
**Batching and Coalescing Requests**
Instead of sending requests one at a time, batching groups multiple requests into a single network call.
Instead of 100 individual database queries (one per item), a batch query fetches all 100 items in one round trip.
Batching reduces network overhead (one connection setup instead of 100), reduces per-request processing overhead on the server (one query plan evaluation instead of 100), and often enables the server to optimize the combined operation (a bulk insert is faster than 100 individual inserts).
Request coalescing goes further by combining duplicate requests that arrive within a short window.
If 500 requests for the same cache key arrive within 10ms (after a cache expiration), coalescing sends one request to the database and shares the result with all 500 callers. This is the same concept as the lock-based cache stampede prevention from Chapter II, Lesson 3.
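A minimal thread-based sketch of this "single-flight" pattern (class and function names are illustrative; Go's `golang.org/x/sync/singleflight` package implements the same idea):

```python
import threading
import time

class SingleFlight:
    """Coalesce concurrent calls for the same key into one underlying fetch."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (Event, result holder)

    def do(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        event, holder = entry
        if leader:
            try:
                holder["value"] = fetch()  # only the leader hits the backend
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
        else:
            event.wait()  # followers wait and share the leader's result
        return holder["value"]

calls = []
sf = SingleFlight()

def fetch_user():
    calls.append(1)          # count how many real backend fetches happen
    time.sleep(0.05)         # simulate a slow database read
    return {"id": 42}

threads = [threading.Thread(target=lambda: sf.do("user:42", fetch_user))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(calls)} backend call(s) served 50 concurrent requests")
```

When the 50 requests overlap, only the first performs the fetch; the rest block briefly and reuse its result instead of stampeding the database.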
Batching examples include database operations (INSERT ... VALUES (a), (b), (c) instead of three separate inserts), API calls (GraphQL's batched queries fetch multiple resources in one request), message publishing (Kafka producers batch messages and send them together for throughput), and analytics events (buffer events on the client for 5 seconds, then send them as a batch instead of individually).
The trade-off is latency.
Batching requires waiting until you have enough items to batch or until a timeout expires.
A batch that collects items for 100ms before sending adds 100ms of latency to each individual item. Tune the batch size and timeout based on your latency tolerance.
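The size-or-timeout flush logic can be sketched as a small buffering class (names and thresholds are illustrative; real producers like Kafka clients implement the same policy with background threads):

```python
import time

class Batcher:
    """Buffer items and flush when the batch is full or a timeout elapses."""

    def __init__(self, max_size=10, max_wait=0.1, flush_fn=print):
        self.max_size = max_size
        self.max_wait = max_wait       # latency bound: oldest item waits this long
        self.flush_fn = flush_fn       # stand-in for one bulk network call
        self._buffer = []
        self._first_at = None

    def add(self, item):
        if not self._buffer:
            self._first_at = time.monotonic()
        self._buffer.append(item)
        if (len(self._buffer) >= self.max_size
                or time.monotonic() - self._first_at >= self.max_wait):
            self.flush()

    def flush(self):
        if self._buffer:
            self.flush_fn(self._buffer)  # one call carries many items
            self._buffer = []

flushes = []
b = Batcher(max_size=3, flush_fn=flushes.append)
for event in range(7):
    b.add(event)
b.flush()  # drain the partial final batch
assert flushes == [[0, 1, 2], [3, 4, 5], [6]]
```

`max_size` trades throughput against memory, and `max_wait` is exactly the added latency discussed above: the first item in a batch waits up to that long before being sent.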
**Concurrency and Parallelism**
Concurrency means multiple tasks make progress during overlapping time periods.
On a single CPU core, concurrency is achieved through task switching: the CPU works on task A, pauses, works on task B, pauses, returns to task A.
Both tasks progress, but only one executes at any given instant. Concurrency is about structure and dealing with multiple things at once.
Parallelism means multiple tasks execute simultaneously on different CPU cores or different machines. Four cores processing four requests at the same time is parallelism.
Parallelism is about execution and doing multiple things at once.
For I/O-bound workloads (web servers waiting for database responses, API calls to external services), concurrency provides enormous gains.
While one request waits for the database, the server handles other requests.
Asynchronous I/O (Node.js event loop, Python asyncio, Go goroutines, Java virtual threads) enables a single thread to manage thousands of concurrent I/O operations.
For CPU-bound workloads (image processing, video transcoding, data compression, machine learning inference), parallelism is the answer.
Distributing the computation across multiple cores or multiple machines reduces wall-clock time linearly with the number of cores (up to the point where coordination overhead exceeds the parallelism benefit).
Most production systems need both.
The web server uses concurrency to handle thousands of simultaneous connections (I/O-bound).
The image processing worker uses parallelism to resize images across all available CPU cores (CPU-bound). Using parallelism for I/O-bound work or concurrency for CPU-bound work is a common source of underperformance.
**Interview-Style Question**
> Q: Your API endpoint fetches data from three downstream services (user profile, order history, recommendations). Each call takes 50-100ms. The total endpoint latency is 250ms. How do you reduce it?
> A: The three downstream calls are currently sequential (50 + 100 + 100 = 250ms). Make them concurrent. Since the three calls are independent (none depends on the output of another), fire all three in parallel and wait for all to complete. The total latency becomes the duration of the slowest call (roughly 100ms) instead of the sum. In Node.js, use `Promise.all()`. In Python, use `asyncio.gather()`. In Java, use CompletableFuture. In Go, use goroutines with a WaitGroup. This single change cuts latency from 250ms to ~100ms without any backend optimization. If one service is occasionally slow, add a timeout (200ms) and return a partial response with a fallback for the slow service (like generic recommendations instead of personalized ones) so the user is not penalized by one slow dependency.
_Sequential vs. Parallel Downstream Calls_
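The sequential-versus-parallel difference can be sketched with `asyncio.gather()`; the service names and delays are stand-ins for real downstream HTTP calls:

```python
import asyncio
import time

async def fetch(name, delay):
    # Stand-in for a downstream HTTP call to another service.
    await asyncio.sleep(delay)
    return {name: "ok"}

async def endpoint():
    # Fire all three independent calls at once; total time ≈ slowest call.
    profile, orders, recs = await asyncio.gather(
        fetch("profile", 0.05),
        fetch("orders", 0.10),
        fetch("recommendations", 0.10),
    )
    return {**profile, **orders, **recs}

start = time.perf_counter()
result = asyncio.run(endpoint())
elapsed = time.perf_counter() - start
print(f"{elapsed * 1000:.0f}ms")  # near the 100ms slowest call, not the 250ms sum
```

Awaiting each `fetch` in sequence instead would take the sum of the delays; `gather` overlaps the waits, which is exactly the concurrency win for I/O-bound work.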
### KEY TAKEAWAYS
* Profile before optimizing. Identify the actual bottleneck through measurement, not intuition. Optimizing a non-bottleneck wastes time.
* Reduce network hops, move computation closer to data, prefetch anticipated data, and use efficient serialization to cut latency.
* Connection pooling eliminates the overhead of establishing new connections per request. It applies to databases, HTTP clients, and Redis.
* Compress responses with Brotli for web clients and zstd for internal services. Skip compression for payloads under 1-2 KB.
* Eager load data that will definitely be used. Lazy load data that may not be needed. Watch for the N+1 query problem with lazy loading.
* Batch multiple operations into single network calls to reduce overhead. Coalesce duplicate requests to avoid redundant work.
* Use concurrency for I/O-bound work (async I/O, event loops). Use parallelism for CPU-bound work (multiple cores, distributed computation). Most systems need both.
## Cost Optimization
Performance optimization makes your system faster.
Cost optimization makes your system cheaper without making it unacceptably slower.
In a cloud environment where every server, every gigabyte of storage, and every network transfer has a price, cost awareness is an engineering skill, not just a finance concern.
**Cost vs. Performance Trade-offs**
Every performance improvement has a cost, and every cost reduction has a performance impact.
The engineering challenge is finding the sweet spot where you are not wasting money on performance nobody notices and not degrading performance below what users tolerate.
Adding a Redis cache improves read latency by 50x but adds the cost of running and maintaining a Redis cluster. Is it worth it?
If your database is struggling under read load and users are experiencing slow page loads, absolutely.
If your database handles the load fine and reads are already fast, the cache is wasted money.
Adding a CDN reduces latency for global users but costs money per GB transferred.
If 80% of your users are in one country and your servers are in that country, CDN value is limited.
If users are spread across six continents, the CDN pays for itself in user experience improvements.
The practical framework: quantify the performance problem in business terms (how many users are affected, what is the impact on conversion or retention), estimate the cost of the solution (monthly infrastructure cost plus engineering time to implement), and compare the two.
If slow page loads cost you $50,000/month in lost sales and a CDN costs $2,000/month, the CDN is a clear win.
If the CDN costs $10,000/month and the performance improvement is barely perceptible, reconsider.
**Reserved vs. On-Demand vs. Spot Instances**
Cloud providers offer different pricing models for compute resources, and choosing the right model for each workload can reduce costs by 30-70%.
On-demand instances are the default. You pay by the hour or second with no commitment. You can start and stop them at any time. On-demand is the most flexible and the most expensive per unit of compute. Use it for unpredictable workloads, short-lived tasks, and development environments.
Reserved instances (or savings plans) offer 30-60% discounts in exchange for committing to 1 or 3 years of usage. If you know your baseline compute needs (your application always runs at least 10 servers), reserving those 10 servers saves significant money. The risk is overcommitting: if your needs decrease, you still pay for the reservation.
Spot instances (AWS) or preemptible VMs (GCP) offer 60-90% discounts on spare cloud capacity. The catch is that the cloud provider can reclaim them with a 2-minute warning when demand for that capacity rises. Spot instances are ideal for fault-tolerant, interruptible workloads: batch processing, data analysis, CI/CD builds, and stateless workers that can be replaced without data loss.
| Pricing Model | Discount | Commitment | Interruption Risk | Best For |
|---|---|---|---|---|
| On-demand | None (full price) | None | None | Variable workloads, dev/test |
| Reserved | 30-60% | 1 or 3 years | None | Stable baseline capacity |
| Spot/Preemptible | 60-90% | None | Can be reclaimed | Batch processing, CI/CD, stateless workers |
A well-optimized production system uses all three. Reserved instances cover the baseline (the minimum compute you always need).
On-demand handles auto-scaling above the baseline (the variable load). Spot handles interruptible batch workloads (nightly data processing, report generation).
**Right-Sizing Infrastructure**
Right-sizing means matching your infrastructure resources to your actual workload. It sounds obvious, but it is one of the most common sources of wasted cloud spending.
The typical pattern: an engineer provisions a large instance "just in case," the application uses 15% of the CPU and 20% of the RAM, and nobody revisits the sizing.
Across 50 such instances, roughly 80% of the compute spend is wasted on capacity that is never used.
Right-sizing starts with measurement. Monitor CPU, memory, disk, and network utilization for each instance over a representative period (at least 2 weeks, including peak traffic).
If an instance consistently runs at 20% CPU, it is oversized. Downsize to a smaller instance type.
Cloud providers offer right-sizing recommendations. AWS Compute Optimizer, GCP Recommender, and Azure Advisor analyze your utilization data and suggest cheaper instance types that still meet your performance needs.
Right-sizing is not a one-time exercise. Workloads change.
A service that needed a large instance six months ago might need a smaller one now (or a larger one). Review sizing quarterly as part of your cost management routine.
Watch for a common trap: right-sizing based only on average utilization.
If a service averages 30% CPU but spikes to 95% during peak hours, downsizing based on the average will cause performance problems during peaks. Size for peak, and use auto-scaling to reduce capacity during off-peak hours.
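The mean-versus-peak gap is easy to see with a hypothetical utilization trace (the sample values are illustrative, not from real telemetry):

```python
import statistics

# Hypothetical per-minute CPU utilization samples (percent) over one day:
# 22 quiet hours around 30%, plus a 2-hour peak at 95%.
samples = [30] * 1320 + [95] * 120

mean = statistics.mean(samples)
p99 = statistics.quantiles(samples, n=100)[98]  # 99th percentile

print(f"mean={mean:.0f}%  p99={p99:.0f}%")
# Sizing on the mean suggests downsizing; the p99 shows the peak
# would saturate a smaller instance.
```

This is why right-sizing recommendations should be read against high percentiles (p95/p99) of utilization, not the average line on a dashboard.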
**Data Lifecycle Management and Tiered Storage**
Data does not maintain the same value or access frequency over its lifetime.
Yesterday's transaction logs are accessed hundreds of times.
Last month's logs are accessed occasionally.
Last year's logs are accessed only during audits. Storing all of this data on the same high-performance (and expensive) storage is wasteful.
Tiered storage assigns data to different storage classes based on access frequency and business value.
Hot storage (SSDs, in-memory databases, high-performance block storage) holds data that is accessed frequently and requires low latency: active user sessions, recent orders, current metrics. This is the most expensive tier.
Warm storage (standard SSDs, regular block storage) holds data accessed occasionally: last month's orders, recent reports, medium-term logs. Moderate cost, moderate performance.
Cold storage (S3 Standard-Infrequent Access, Azure Cool Blob, GCS Nearline) holds data rarely accessed but needed for operational purposes: quarterly reports, last year's logs, compliance archives. Low cost, higher retrieval latency.
Archive storage (S3 Glacier, Azure Archive, GCS Archive) holds data kept for compliance or legal reasons that is almost never accessed. Very low storage cost but high retrieval latency (hours) and retrieval cost.
| Tier | Access Frequency | Latency | Cost | Examples |
|---|---|---|---|---|
| Hot | Constant | Milliseconds | Highest | Active sessions, current data |
| Warm | Weekly/monthly | Milliseconds to seconds | Moderate | Recent reports, 30-day logs |
| Cold | Quarterly/yearly | Seconds to minutes | Low | Annual archives, old backups |
| Archive | Rarely (compliance) | Hours | Lowest (storage), high (retrieval) | Legal holds, regulatory retention |
Data lifecycle policies automate the movement of data between tiers. AWS S3 Lifecycle rules can automatically transition objects from S3 Standard to S3 Infrequent Access after 30 days, to S3 Glacier after 90 days, and delete them after 7 years.
You configure the policy once, and data flows through the tiers automatically.
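The policy just described can be expressed as a rule document in the format S3's lifecycle API accepts (the rule ID and `logs/` prefix are hypothetical; with boto3 this dict would be passed to `put_bucket_lifecycle_configuration`):

```python
import json

# Lifecycle policy: Standard -> Standard-IA at 30 days,
# -> Glacier at 90 days, delete after ~7 years.
lifecycle_policy = {
    "Rules": [
        {
            "ID": "age-out-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 2555},  # ~7 years, then delete
        }
    ]
}

print(json.dumps(lifecycle_policy, indent=2))
```

Once attached to a bucket, the rule runs server-side: objects under the prefix migrate through the tiers on schedule with no application code involved.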
Database-level lifecycle management works similarly.
Partition time-series data by date.
Keep the last 30 days in the primary database.
Move older partitions to a cheaper database or export them to S3 for archival.
Delete data that exceeds the retention policy.
**Software Cost Estimation Techniques**
Estimating the infrastructure cost of a system design is a valuable skill in interviews and in real engineering. It demonstrates that you think about the financial sustainability of your designs, not just their technical elegance.
Compute costs. Estimate the number of servers needed based on your QPS and capacity-per-server estimates (from Part I, Lesson 2). Multiply by the hourly cost of the instance type. For example: 20 servers × $0.10/hour × 730 hours/month = $1,460/month for on-demand. With reserved instances at a 40% discount: $876/month.
Storage costs. Estimate total data volume (from the capacity planning section in Chapter 1). Multiply by the per-GB-month cost of the storage tier. For example: 10 TB of S3 Standard at $0.023/GB = $230/month. 100 TB of S3 Glacier at $0.004/GB = $400/month.
Network costs. Data transfer between services in the same region is usually free or cheap. Data transfer out to the internet is typically $0.05-0.09/GB. If your system serves 100 TB of data per month to users, network costs can be $5,000-9,000/month. CDNs often offer cheaper egress rates than direct cloud provider data transfer.
Database costs. Managed databases (RDS, Cloud SQL) charge for instance size, storage, I/O operations, and backups. A production PostgreSQL instance with 8 vCPUs, 32 GB RAM, and 1 TB of storage on AWS RDS costs roughly $800-1,200/month. Add read replicas and the cost multiplies.
Total cost of ownership includes not just infrastructure but also engineering time to build and maintain, monitoring and observability tooling, and support and on-call costs. A system that costs $500/month in infrastructure but requires 2 engineers to maintain full-time is not cheap.
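The back-of-the-envelope arithmetic above can be wrapped in a small estimator; the default figures mirror the examples in this section, and the function name and parameters are illustrative:

```python
def monthly_cost_estimate(
    servers=20, hourly_rate=0.10, reserved_discount=0.40,
    storage_tb=10, storage_per_gb=0.023,
    egress_tb=100, egress_per_gb=0.07,
):
    """Back-of-the-envelope monthly infrastructure estimate (USD)."""
    hours_per_month = 730
    compute = servers * hourly_rate * hours_per_month * (1 - reserved_discount)
    storage = storage_tb * 1000 * storage_per_gb   # TB -> GB
    network = egress_tb * 1000 * egress_per_gb
    return {
        "compute": compute,
        "storage": storage,
        "network": network,
        "total": compute + storage + network,
    }

estimate = monthly_cost_estimate()
print(estimate)
```

With the defaults, compute comes out at the $876/month reserved figure from above, and it is immediately obvious that egress, not compute, dominates this particular design, which is the kind of insight the estimate exists to surface.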
**Beginner Mistake to Avoid**
New engineers sometimes design systems with only performance in mind and are shocked by the cloud bill.
Running 50 large EC2 instances 24/7 "for redundancy" when traffic is only high for 4 hours per day wastes 83% of the compute budget. Before finalizing a design, estimate the monthly cost.
If the number is surprisingly high, look for optimization opportunities: auto-scaling to reduce capacity during quiet hours, spot instances for batch workloads, tiered storage for aging data, and reserved instances for baseline capacity.
Cost awareness is an engineering skill, and demonstrating it in an interview signals senior-level thinking.
**Interview-Style Question**
> Q: Your company's monthly cloud bill has grown from $20,000 to $80,000 over the past year. The CTO asks you to reduce it without degrading user experience. How do you approach this?
> A: Start with visibility. Get a detailed cost breakdown by service, resource type, and tag. Identify the top 5 cost categories (typically compute, database, storage, data transfer, and third-party services). For each, apply targeted optimizations. Compute: analyze utilization metrics. Right-size instances that are over-provisioned (most teams find 30-40% of instances are oversized). Switch stable baseline instances to reserved pricing (30-60% savings). Enable auto-scaling to reduce off-peak capacity. Move batch workloads to spot instances. Database: evaluate if the instance type is oversized. Check for expensive queries that could be optimized with indexes or caching (reducing the database load might let you downsize). Add a Redis cache for frequent queries if one does not exist. Storage: implement lifecycle policies to move infrequently accessed data from hot to cold storage. Delete orphaned EBS volumes, old snapshots, and unused S3 objects. Data transfer: verify a CDN is serving static assets (CDN egress is typically cheaper than direct cloud egress). Minimize cross-region data transfer. Compress inter-service payloads. The typical outcome of a first optimization pass is 30-50% cost reduction without any impact on user experience.
### KEY TAKEAWAYS
* Every performance improvement has a cost, and every cost reduction has a performance impact. Quantify both and find the sweet spot for your business.
* Use reserved instances for stable baseline capacity (30-60% savings), on-demand for variable scaling, and spot instances for interruptible workloads (60-90% savings).
* Right-size infrastructure by measuring actual utilization. Size for peak load, not average, and use auto-scaling to reduce capacity during quiet periods.
* Implement data lifecycle policies to automatically move aging data from hot storage to warm, cold, and archive tiers. Delete data that exceeds retention requirements.
* Estimate infrastructure costs during system design: compute, storage, network, and database. Cost awareness is an engineering skill that signals senior-level thinking.
* A first optimization pass on a cloud bill typically yields 30-50% savings through right-sizing, reserved pricing, lifecycle policies, and eliminating waste.