Chapter 6: Infrastructure & DevOps

6.3 Testing at Scale

Unit Testing, Integration Testing, System Testing

Testing exists in layers, and each layer catches a different category of bugs.

Skipping a layer does not save time. It transfers the cost of finding bugs to a more expensive stage: production.

Unit Testing

A unit test verifies a single function, method, or class in isolation. It does not touch databases, network services, file systems, or any external dependency.

External dependencies are replaced with mocks or stubs so the test runs entirely in memory, in milliseconds.

A unit test for a calculateDiscount(price, tier) function verifies that a premium user gets 20% off, a standard user gets 10% off, a negative price returns an error, and a null tier returns zero discount.

Each test case exercises one behavior of one function.
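The discount example above can be sketched as runnable unit tests. This is a minimal, hypothetical implementation: the function name, tier names, and rates are assumptions for illustration, not the text's actual code.

```python
# Hypothetical implementation of the discount function from the text,
# plus one unit test per behavior. No database, no network -- pure memory.

def calculate_discount(price, tier):
    """Return the discount amount for a given price and customer tier."""
    if price < 0:
        raise ValueError("price must be non-negative")
    if tier is None:
        return 0.0                      # null tier -> zero discount
    rates = {"premium": 0.20, "standard": 0.10}   # assumed rates
    return price * rates.get(tier, 0.0)

def test_premium_gets_20_percent():
    assert calculate_discount(100.0, "premium") == 20.0

def test_standard_gets_10_percent():
    assert calculate_discount(100.0, "standard") == 10.0

def test_negative_price_is_an_error():
    try:
        calculate_discount(-1.0, "standard")
        assert False, "expected ValueError"
    except ValueError:
        pass

def test_null_tier_gets_no_discount():
    assert calculate_discount(100.0, None) == 0.0
```

Because every dependency is in memory, thousands of tests like these run in seconds.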

Unit tests are fast (thousands run in seconds), deterministic (same inputs always produce same outputs because there are no external dependencies), and cheap to write. They catch logic errors, boundary conditions, and regressions in individual components.

The limitation of unit tests is that they verify components in isolation.

A function that works perfectly alone might fail when integrated with another component because of incorrect assumptions about data formats, error handling, or timing.

Integration Testing

Integration tests verify that multiple components work together correctly. They test the boundaries between components: the API controller talking to the service layer, the service layer talking to the database, the application talking to an external API.

An integration test for an order system might create a real database (often a disposable Docker container running PostgreSQL), call the createOrder API endpoint, verify that the order is stored correctly in the database, and check that an event was published to the message queue.
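A sketch of the database half of that test follows. In a real suite, the database would be a disposable PostgreSQL container (for example, via Testcontainers); an in-memory SQLite database stands in here so the sketch is self-contained, and `create_order` and its schema are assumptions for illustration.

```python
# Integration-test sketch: exercise the service layer against a real
# storage engine (sqlite3 as a stand-in for a Dockerized PostgreSQL),
# then verify the row actually landed in the database -- no mocks.
import sqlite3

def create_order(db, user_id, items):
    """Hypothetical service-layer function under test: persists an order."""
    cur = db.execute(
        "INSERT INTO orders (user_id, item_count) VALUES (?, ?)",
        (user_id, len(items)),
    )
    db.commit()
    return cur.lastrowid

def test_create_order_persists_to_database():
    db = sqlite3.connect(":memory:")    # disposable database per test
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id TEXT, item_count INTEGER)"
    )

    order_id = create_order(db, "user-42", ["sku-1", "sku-2"])

    # Read back through the real storage layer.
    row = db.execute(
        "SELECT user_id, item_count FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    assert row == ("user-42", 2)
```

The message-queue assertion from the example would follow the same pattern: publish through the real client, then consume and verify the event.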

Integration tests are slower than unit tests (seconds to minutes each) because they involve real infrastructure. They are more fragile because they depend on external services being available and correctly configured.

But they catch the bugs that unit tests miss: serialization mismatches, incorrect SQL queries, misconfigured connections, and wrong assumptions about API contracts.

System Testing (End-to-End Testing)

System tests verify the entire application from the user's perspective. They simulate real user workflows: sign up, browse products, add to cart, checkout, receive confirmation email. System tests exercise the full stack: frontend, backend, databases, caches, message queues, and external integrations.

System tests are the most realistic but also the slowest, most fragile, and most expensive to maintain.

A test that drives a browser through a checkout flow might take 30 seconds and fail because a CSS selector changed, a third-party payment sandbox is slow, or a test data dependency was not set up correctly.

Most teams follow the testing pyramid: many unit tests at the base (fast, cheap, numerous), fewer integration tests in the middle (slower, more valuable per test), and a small number of system tests at the top (slowest, most realistic). Inverting the pyramid (many system tests, few unit tests) creates a slow, fragile test suite that discourages frequent testing.

Test Type    | Scope                 | Speed              | Dependencies         | Catches
Unit         | Single function/class | Milliseconds       | None (mocked)        | Logic errors, boundary conditions
Integration  | Multiple components   | Seconds to minutes | Real databases, APIs | Interface mismatches, configuration errors
System (E2E) | Full application flow | Seconds to minutes | Everything           | Workflow bugs, user-facing regressions

Interview-Style Question

Q: Your team has a 2-hour test suite that runs before every deployment. Developers avoid running tests locally because they are too slow. Deployments are limited to once per day. How do you fix this?

A: The test suite is likely top-heavy: too many slow integration and end-to-end tests, not enough fast unit tests. Restructure using the testing pyramid. First, identify business logic that is tested through integration tests but could be tested faster as unit tests with mocked dependencies. Move those to unit tests. Second, parallelize the remaining integration tests across multiple machines or containers (most CI platforms support parallelism). Third, split the test suite into stages: unit tests run first (2 minutes), then integration tests (15 minutes), then a small set of critical end-to-end tests (10 minutes). Fast tests gate the pipeline early, so obvious failures are caught in minutes, not hours. Fourth, introduce selective testing: only run tests related to the changed code on feature branches, and run the full suite on the main branch merge. The target is a pipeline that completes in under 20 minutes, enabling multiple deployments per day.

Contract Testing for Microservices

In a monolithic application, integration between components is verified at compile time or through shared tests.

In a microservices architecture, services communicate over the network through APIs and events.

When the Order Service changes its API response format, does the Payment Service still handle it correctly?

Integration tests can verify this, but they require both services to be running simultaneously in a test environment.

As the number of services grows, maintaining a fully integrated test environment becomes increasingly difficult.

Contract testing solves this without requiring all services to run together.

How Contract Testing Works

A contract is a formal agreement between a consumer (the service making the call) and a provider (the service being called).

The contract specifies what the consumer expects: "I will send a POST to /orders with a body containing userId and items, and I expect a 201 response with an orderId field."

The consumer writes the contract based on how it uses the provider's API.

The provider runs the contract against its actual implementation.

If the provider satisfies the contract (returns the expected response for the expected request), the contract passes.

If a code change breaks the contract, the provider's build fails before it is deployed.
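The consumer/provider handshake can be illustrated without a framework. In the sketch below, the contract is a plain dict and the provider is a hypothetical handler function; Pact automates exactly this replay-and-verify step, with the contract generated from the consumer's tests.

```python
# Framework-free sketch of a contract check. The contract records what the
# consumer expects; the provider replays it against its own implementation.
# All names here (provider_handle, field names) are illustrative.

contract = {
    "request": {"method": "POST", "path": "/orders",
                "body": {"userId": "u1", "items": ["sku-1"]}},
    "response": {"status": 201, "required_fields": ["orderId"]},
}

def provider_handle(method, path, body):
    """Hypothetical provider implementation of POST /orders."""
    if method == "POST" and path == "/orders" and "userId" in body and "items" in body:
        return 201, {"orderId": "ord-123", "status": "created"}
    return 400, {"error": "bad request"}

def verify_contract(contract, handler):
    """Replay the contract's request and check the provider's response."""
    req, expected = contract["request"], contract["response"]
    status, body = handler(req["method"], req["path"], req["body"])
    if status != expected["status"]:
        return False
    return all(field in body for field in expected["required_fields"])

# If the provider renamed orderId, this check would fail in the provider's
# CI pipeline -- before the breaking change is deployed.
assert verify_contract(contract, provider_handle)
```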

Pact is the most widely used contract testing framework.

The consumer generates a "pact file" (the contract) during its unit tests.

The pact file is shared with the provider (through a Pact Broker or artifact repository). The provider runs the pact file against its implementation in its own CI pipeline. Both sides can release independently: as long as both pass the shared contracts, they are compatible.

Why It Matters

Contract testing provides fast feedback without the operational overhead of running all services together. Each service's CI pipeline independently verifies its contracts in seconds, not the minutes or hours needed for full integration tests.

It also creates explicit documentation of service dependencies.

The contract file shows exactly what the consumer expects from the provider.

When a provider team wants to change an API, they check which contracts depend on the affected fields.

If the change breaks a contract, they coordinate with the consumer team before deploying.

Testing Approach         | Requires All Services Running?          | Feedback Speed          | Maintenance Cost
Full integration testing | Yes                                     | Slow (minutes to hours) | High (test environment management)
Contract testing         | No (each service tested independently)  | Fast (seconds)          | Low (contracts are simple specifications)

Load Testing and Stress Testing

Functional tests verify that your system produces correct results.

Load and stress tests verify that it continues to produce correct results under pressure.

Load Testing

Load testing applies a realistic traffic pattern to your system and measures how it performs.

The goal is to answer: can the system handle the expected production load while meeting latency and throughput requirements?

A load test for an e-commerce platform might simulate 10,000 concurrent users browsing products, 1,000 adding items to carts, and 200 checking out simultaneously. You measure response times (p50, p95, p99), error rates, and resource utilization (CPU, memory, database connections) under this load.
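The measurement side of such a test can be sketched in a few lines: fire concurrent requests, collect latencies, and compute p50/p95/p99. The `fake_request` stub below stands in for a real HTTP call; tools like k6 and Locust do this at far larger scale and with richer reporting.

```python
# Minimal load-test harness sketch: concurrent requests + percentile report.
import time
import random
from concurrent.futures import ThreadPoolExecutor

def fake_request():
    """Stand-in for an HTTP request; sleeps a random 'latency'."""
    latency = random.uniform(0.001, 0.010)
    time.sleep(latency)
    return latency

def percentile(sorted_values, p):
    """Nearest-rank percentile of a pre-sorted list."""
    idx = max(0, min(len(sorted_values) - 1, int(p / 100 * len(sorted_values))))
    return sorted_values[idx]

def run_load_test(request_fn, concurrency=50, total_requests=500):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: request_fn(), range(total_requests)))
    return {"p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99)}

results = run_load_test(fake_request)
print(results)   # e.g. {'p50': 0.005..., 'p95': 0.009..., 'p99': 0.009...}
```

Reporting percentiles rather than averages matters: a healthy average can hide a p99 that has doubled.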

The test reveals bottlenecks that do not appear under light traffic: a database query that runs fine with 10 concurrent users but locks under 1,000, a connection pool that exhausts at 500 connections, or a cache that starts evicting entries when the working set grows beyond memory.

Load testing tools include k6 (developer-friendly, scriptable in JavaScript), Locust (Python-based, distributed), JMeter (mature, GUI-based, widely used in enterprise), Gatling (Scala-based, detailed reports), and wrk (lightweight, high-performance HTTP benchmarking).

Stress Testing

Stress testing pushes the system beyond its expected capacity to find its breaking point.

If your system is designed for 10,000 requests per second, stress testing ramps up to 20,000, 50,000, and 100,000 to observe what happens.

Does the system degrade gracefully (response times increase but no errors)?

Does it shed load intelligently (rate limiting kicks in, returning 429 responses)?

Or does it collapse catastrophically (all requests fail, servers crash, data is lost)?
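The "shed load intelligently" outcome can be sketched with a token-bucket rate limiter that rejects excess requests with 429 instead of letting them overwhelm the backend. Capacity and refill rate below are illustrative.

```python
# Sketch of graceful load shedding under stress: a token bucket that
# returns 429 Too Many Requests once the burst budget is exhausted.
import time

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def handle_request(bucket):
    return 200 if bucket.allow() else 429

bucket = TokenBucket(capacity=100, refill_per_sec=100)
statuses = [handle_request(bucket) for _ in range(300)]
# Under an instantaneous burst of 300 requests, roughly the first 100
# succeed and the rest are shed with 429 -- degradation, not collapse.
```

A stress test that observes this behavior (429s under overload, full recovery when load drops) confirms the system fails the way it was designed to.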

Stress testing reveals how your system fails, which is as important as how it performs.

A system that returns 429 errors under extreme load and recovers when load decreases is well-designed.

A system that crashes and requires manual intervention to restart is not.

Best Practices

Test against a production-like environment with realistic data volumes, not a staging environment with 100 records in the database.

Run load tests regularly (weekly or as part of the CI pipeline) to catch performance regressions early, not just before major launches.

Start with a baseline test at current production traffic levels, then gradually increase to identify the capacity ceiling.

Set pass/fail criteria based on your SLOs (p99 latency below 500ms, error rate below 0.1%) so load tests can be automated and gate deployments.
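An automated gate on those SLOs can be as simple as the sketch below: compare measured metrics against thresholds and fail the build on any violation. The thresholds are the examples from the text; the metric names are illustrative.

```python
# Sketch of an SLO-based pass/fail gate for automated load tests.
SLO = {"p99_latency_ms": 500, "error_rate": 0.001}   # p99 < 500ms, errors < 0.1%

def gate(measured):
    """Return (passed, violations) for a dict of measured metrics."""
    violations = [name for name, limit in SLO.items()
                  if measured.get(name, float("inf")) > limit]
    return (len(violations) == 0, violations)

ok, why = gate({"p99_latency_ms": 420, "error_rate": 0.0004})
assert ok                                    # within SLO: deploy proceeds

ok, why = gate({"p99_latency_ms": 750, "error_rate": 0.0004})
assert not ok and why == ["p99_latency_ms"]  # gate fails, build is blocked
```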

[Figure: Load Test]

Chaos Engineering: Chaos Monkey, LitmusChaos

Chaos engineering was introduced in Chapter 3 as a practice for verifying that your system handles failures gracefully.

The Methodology

Chaos engineering follows a scientific method.

Define a steady state hypothesis ("our system handles 5,000 RPS with p99 latency under 300ms").

Introduce a failure (kill a server, add network latency, fill a disk).

Observe whether the system maintains steady state.

If it does, your resilience works.

If it does not, you have found a weakness to fix before it causes a real incident.

The experiment must be controlled.

Define the blast radius (which components are affected), the duration (how long the failure lasts), the abort conditions (stop immediately if user-facing impact exceeds X%), and the rollback plan (how to end the experiment and restore normal operation).
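That controlled loop can be sketched as code. Every hook below (steady-state check, failure injection, impact measurement) is a hypothetical placeholder a real platform such as LitmusChaos or Gremlin would implement; the structure is what matters.

```python
# Sketch of a controlled chaos experiment: steady state -> inject ->
# observe -> abort or pass, with the rollback guaranteed by finally.

def run_chaos_experiment(check_steady_state, inject_failure, stop_failure,
                         measure_impact, max_impact=0.05, checks=5):
    if not check_steady_state():
        return "skipped: system not in steady state"
    inject_failure()
    try:
        for _ in range(checks):
            if measure_impact() > max_impact:          # abort condition
                return "aborted: blast radius exceeded"
        return "passed" if check_steady_state() else "failed: steady state lost"
    finally:
        stop_failure()                                 # rollback plan always runs

# Toy run: the injected failure causes no user-facing impact, so it passes.
result = run_chaos_experiment(
    check_steady_state=lambda: True,
    inject_failure=lambda: None,
    stop_failure=lambda: None,
    measure_impact=lambda: 0.0,
)
print(result)   # passed
```

The `finally` block is the point: no matter how the experiment ends, the failure injection is removed.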

Chaos Monkey and the Netflix Simian Army

Netflix pioneered chaos engineering with Chaos Monkey, a tool that randomly terminates production EC2 instances during business hours.

The philosophy: if your system cannot survive the loss of a single instance, it is not resilient enough for production.

Netflix expanded Chaos Monkey into the Simian Army, a collection of tools that inject different types of failures.

Latency Monkey injects network delays. Conformity Monkey checks for instances that do not follow best practices.

Chaos Gorilla simulates the failure of an entire AWS availability zone.

LitmusChaos

LitmusChaos is an open-source chaos engineering platform designed for Kubernetes environments. It provides pre-built chaos experiments as ChaosEngine resources that you apply to your cluster.

Experiments include pod deletion (kill random pods), network chaos (add latency, partition pods), disk chaos (fill storage), CPU and memory stress (exhaust resources), and DNS chaos (make DNS resolution fail).

LitmusChaos integrates with CI/CD pipelines so chaos experiments can run automatically as part of your deployment process.

A deployment passes only if the chaos experiment confirms the system's resilience.

Other Chaos Tools

Gremlin is a commercial chaos engineering platform that provides a curated set of failure injection types, a safe abort mechanism, and built-in monitoring. It supports both Kubernetes and traditional infrastructure.

AWS Fault Injection Simulator (FIS) is Amazon's managed chaos service. It integrates natively with AWS resources and can terminate EC2 instances, throttle API calls, inject network latency between services, and simulate AZ outages.

Toxiproxy (by Shopify) is a lightweight proxy for simulating network conditions. You route traffic through Toxiproxy and configure it to add latency, drop connections, limit bandwidth, or return errors. It is especially useful for testing how your application handles degraded network conditions to downstream services.

When to Start

Chaos engineering requires mature monitoring and observability.

If you cannot observe the impact of a failure injection in real time, you cannot run safe experiments.

Start with chaos engineering after you have solid metrics, logging, and alerting.

Begin with non-production environments and game days (scheduled exercises).

Graduate to production only when you have confidence in your monitoring, your abort mechanisms, and your team's incident response skills.

Performance Regression Testing

Performance regression testing detects when a code change makes the system slower.

A new feature that adds 50ms of latency to the checkout flow might pass all functional tests while degrading the user experience for millions of users.

How It Works

Establish a performance baseline by running a standardized load test against a stable version of the system.

Record key metrics: p50 latency, p95 latency, p99 latency, throughput, error rate, and resource utilization for each critical endpoint.

On every code change (or every deployment candidate), run the same load test against the new version and compare the results to the baseline.

If p99 latency increases by more than 10%, or throughput drops by more than 5%, the change is flagged as a performance regression.
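The comparison step can be sketched directly from those thresholds. The 10% latency and 5% throughput limits match the text; the metric names are illustrative.

```python
# Sketch of baseline-vs-candidate regression detection for a CI gate.

def detect_regressions(baseline, candidate,
                       max_latency_increase=0.10, max_throughput_drop=0.05):
    """Return the list of regressed metrics (empty list = no regression)."""
    regressions = []
    if candidate["p99_ms"] > baseline["p99_ms"] * (1 + max_latency_increase):
        regressions.append("p99 latency")
    if candidate["rps"] < baseline["rps"] * (1 - max_throughput_drop):
        regressions.append("throughput")
    return regressions

baseline  = {"p99_ms": 200.0, "rps": 5000.0}
candidate = {"p99_ms": 260.0, "rps": 4900.0}   # +30% latency, -2% throughput
assert detect_regressions(baseline, candidate) == ["p99 latency"]
```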

Integrating Into CI/CD

Performance regression tests should run automatically in the CI pipeline, not as a manual step that someone remembers to perform before a release.

The pipeline provisions a performance test environment (a dedicated cluster that matches production configuration), deploys the new build, runs the load test, compares results against the baseline, and fails the build if regressions exceed thresholds.

This is more expensive than unit or integration tests (it requires a production-like environment and takes minutes to hours), so many teams run performance tests on a schedule (nightly) rather than on every commit.

Critical-path endpoints (checkout, search, login) might have performance tests that run on every merge to the main branch.

Avoiding False Positives

Performance test results are inherently variable. Network conditions, garbage collection timing, and noisy neighbor effects in cloud environments cause measurements to fluctuate between runs.

A single run showing 5% higher latency might be noise, not a regression.

Mitigations include running the test multiple times and comparing averages, using dedicated (non-shared) test infrastructure to reduce variability, comparing distributions (not just averages) using statistical methods, and setting thresholds that account for natural variance (flag a regression at 15% increase, not 2%).

Testing in Production: Canary Analysis, Synthetic Monitoring

Testing environments, no matter how production-like, can never fully replicate production.

Different traffic patterns, different data volumes, different network conditions, and different user behaviors mean that some bugs only manifest in production.

Testing in production is the practice of using your live environment as a testing ground, safely.

Canary Analysis

Canary analysis is the most common form of testing in production. A new version is deployed to a small subset of production traffic (the canary).

Automated monitoring compares the canary's metrics against the baseline (the remaining production instances running the old version).

Automated canary analysis tools like Kayenta (originally from Netflix, now part of Spinnaker) compare metrics between the canary and the baseline using statistical methods.

If the canary's error rate is statistically significantly higher than the baseline's, the analysis fails and the deployment is rolled back automatically.
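One way to sketch that statistical comparison is a two-proportion z-test on error rates, which is in the spirit of what tools like Kayenta do (real tools compare many metrics with more robust methods). The sample counts below are illustrative.

```python
# Sketch of a canary-vs-baseline error-rate comparison: one-sided
# two-proportion z-test, ~99% confidence at z_threshold=2.58.
import math

def error_rate_differs(canary_errors, canary_total,
                       base_errors, base_total, z_threshold=2.58):
    """Return True if the canary's error rate is significantly higher."""
    p1 = canary_errors / canary_total
    p2 = base_errors / base_total
    # Pooled proportion and standard error under the null hypothesis.
    p = (canary_errors + base_errors) / (canary_total + base_total)
    se = math.sqrt(p * (1 - p) * (1 / canary_total + 1 / base_total))
    if se == 0:
        return False
    return (p1 - p2) / se > z_threshold

# Canary: 60 errors in 10,000 requests; baseline: 400 in 100,000.
assert error_rate_differs(60, 10_000, 400, 100_000)       # 0.6% vs 0.4%: roll back
assert not error_rate_differs(42, 10_000, 400, 100_000)   # within noise: continue
```

The statistical framing matters: a raw comparison of 0.42% versus 0.40% would flag noise as a failure, while a genuinely elevated rate is caught even on modest sample sizes.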

Key metrics for canary analysis include error rate (the most sensitive indicator of a broken deployment), latency percentiles (a new code path might add latency that does not show up as errors), resource utilization (a memory leak might not manifest in a 30-minute test but shows as rising memory during canary observation), and business metrics (conversion rate, cart abandonment rate, or any metric specific to the affected feature).

The observation period should be long enough to capture representative traffic patterns.

A 15-minute canary during a low-traffic period might miss issues that only appear under peak load.

One to two hours with traffic that includes at least one peak period provides better coverage.

Synthetic Monitoring

Synthetic monitoring runs automated tests against your production system continuously, simulating user interactions from external locations.

Unlike real user monitoring (which passively observes real users), synthetic monitoring actively generates traffic and checks results.

A synthetic monitor for an e-commerce site might run every 5 minutes from 10 global locations: load the homepage (verify response under 2 seconds), search for a product (verify results appear), add a product to the cart (verify cart updates), and begin checkout (verify checkout page loads).

If any step fails or exceeds the latency threshold, the monitor triggers an alert.
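The monitor's loop can be sketched as follows. Each workflow step is a hypothetical check function returning a success flag and a latency; a real monitor would drive a headless browser or send HTTP requests from multiple locations.

```python
# Sketch of a synthetic monitor run: execute workflow steps in order,
# alerting on any failure or latency-budget violation.

def run_synthetic_check(steps, latency_budget_s=2.0):
    """Run (name, check) steps in order; return names of failed steps."""
    failures = []
    for name, check in steps:
        try:
            ok, latency = check()
        except Exception:
            ok, latency = False, 0.0
        if not ok or latency > latency_budget_s:
            failures.append(name)
            break   # later steps depend on earlier ones, so stop here
    return failures

# Toy workflow where the add-to-cart step exceeds the latency budget.
steps = [
    ("load_homepage",  lambda: (True, 0.8)),
    ("search_product", lambda: (True, 0.5)),
    ("add_to_cart",    lambda: (True, 3.1)),   # too slow -> alert
    ("begin_checkout", lambda: (True, 0.6)),
]
failed = run_synthetic_check(steps)
assert failed == ["add_to_cart"]   # the monitor would page on this step
```

Running this every few minutes from several regions turns it into the continuous probe described above.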

Synthetic monitoring catches problems that real user monitoring might miss or detect too slowly.

If a deployment breaks the checkout flow at 3 AM when traffic is low, real user monitoring might not surface the issue until morning when users start complaining.

Synthetic monitoring detects it in 5 minutes regardless of traffic volume.

Synthetic monitoring tools include Datadog Synthetics, New Relic Synthetics, Pingdom, Checkly, and Grafana Synthetic Monitoring. Most support browser-based tests (using headless Chrome to simulate real user interactions) and API-based tests (sending HTTP requests and validating responses).

Feature Observation

Beyond canary analysis and synthetic monitoring, testing in production includes observing new features through monitoring and analytics.

When a new feature is rolled out behind a feature flag, you monitor its specific metrics: how many users interact with it, what errors it produces, how it affects downstream services, and whether it achieves its intended outcome.

This is not traditional testing (there is no pass/fail gate). It is continuous observation that informs the decision to continue rolling out, pause, or roll back the feature.

The monitoring infrastructure provides the data. The feature flag system provides the control.

Together, they turn production into a controlled experiment environment.

Beginner Mistake to Avoid

New engineers sometimes hear "testing in production" and think it means skipping the test suite and deploying untested code. That is the opposite of what testing in production means.

Testing in production is an additional layer of verification on top of a comprehensive automated test suite.

Code that reaches production has already passed unit tests, integration tests, security scans, and performance benchmarks.

Testing in production catches the problems that pre-production testing cannot: real traffic patterns, real data volumes, real network conditions, and real user behavior. It is the final safety net, not a replacement for the nets below it.

Interview-Style Question

Q: Your team is deploying a major rewrite of the payment processing flow. The rewrite changes the database queries, the external payment API integration, and the error handling logic. How do you test this safely?

A: Layer every type of testing covered in this chapter. Start with comprehensive unit tests for the new payment logic (mocked dependencies, fast, catch logic errors). Add integration tests that verify the new database queries against a real database and the new API integration against a sandbox. Run contract tests to verify the payment service still satisfies the contracts expected by the order service and the refund service. Run a load test to verify the new implementation handles production-level throughput without latency regression. Deploy using a canary strategy: route 2% of payment traffic to the new version while 98% continues on the old version. Run automated canary analysis comparing error rates, latency, and success rates between the canary and baseline. Monitor for at least 2 hours including a peak traffic period. If canary metrics are healthy, gradually increase to 10%, 25%, 50%, 100%. Simultaneously, run synthetic monitors that execute the full checkout flow every 5 minutes to catch any user-facing regression. If anything looks wrong at any stage, the canary traffic is routed back to the old version within seconds. The old version remains deployed and ready until the new version has run at 100% for a stability period of one to two weeks.

[Figure: Types of Tests]

KEY TAKEAWAYS

  • Follow the testing pyramid: many fast unit tests, fewer integration tests, a small number of end-to-end tests. Inverting the pyramid creates a slow, fragile suite.

  • Contract testing verifies service compatibility without running all services together. Each service independently tests its contracts in its CI pipeline.

  • Load testing measures performance under expected traffic. Stress testing finds the breaking point. Both should run regularly, not just before launches.

  • Chaos engineering injects real failures to verify resilience. Start in non-production environments. Graduate to production only with mature monitoring and abort mechanisms.

  • Performance regression testing catches code changes that make the system slower. Integrate into CI/CD with statistical comparison against baselines.

  • Testing in production (canary analysis, synthetic monitoring) catches problems that pre-production environments cannot reproduce. It is the final safety net on top of a comprehensive automated test suite, not a replacement for it.

  • Synthetic monitoring runs automated user workflows against production continuously, catching issues even during low-traffic periods when real users might not surface problems.
