## Availability
Availability measures how much of the time your system is operational and accessible to users.
A system that is down for an hour every day has poor availability.
A system that goes down for 5 minutes per year has excellent availability.
The difference between those two is what separates hobby projects from production systems that businesses depend on.
**Measuring Availability: Nines (99.9%, 99.99%, 99.999%)**
Availability is expressed as a percentage of uptime over a given period, usually a year. Engineers shorthand these percentages as "nines" because the number of nines in the percentage is what distinguishes each tier.
| Availability | Common Name | Downtime Per Year | Downtime Per Month | Downtime Per Day |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours | 14.4 minutes |
| 99.9% | Three nines | 8.77 hours | 43.8 minutes | 1.44 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.38 minutes | 8.6 seconds |
| 99.999% | Five nines | 5.26 minutes | 26.3 seconds | 0.86 seconds |
The jump from three nines to four nines looks small on paper (99.9% to 99.99%), but it cuts your allowed downtime from 8.7 hours per year to 52 minutes.
That is a fundamentally different engineering challenge.
Three nines means you can afford a few short outages per month.
Four nines means a single 10-minute outage burns 20% of your annual downtime budget.
Five nines means you have 5 minutes for the entire year, which is barely enough time for one deployment rollback.
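The downtime numbers in the table above follow directly from the availability percentage. A minimal sketch of the conversion (the function name is illustrative):

```python
def downtime_per_year(availability_pct: float) -> float:
    """Return allowed downtime in minutes per year for a given availability %."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

# Three nines allows ~525.6 minutes (~8.76 hours) per year;
# four nines allows ~52.56 minutes.
print(round(downtime_per_year(99.9), 1))
print(round(downtime_per_year(99.99), 2))
```

Running the same calculation for five nines gives roughly 5.26 minutes, matching the table.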
Each additional nine costs roughly 10x more in engineering effort and infrastructure.
A system achieving 99% availability might run on a single server with manual restarts.
A system achieving 99.99% needs automated failover, redundancy at every layer, multi-zone deployment, and zero-downtime deployment processes.
A system achieving 99.999% needs all of that plus active-active multi-region architecture, exhaustive failure testing, and an on-call team that can resolve issues in under a minute.
Most consumer web applications target 99.9% to 99.99%.
Payment systems and critical infrastructure target 99.99% to 99.999%.
Internal tools and batch processing systems might settle for 99% or lower.
**Availability in Sequence vs. in Parallel**
When your system has multiple components, overall availability depends on how those components are arranged.
Components in sequence (where each component must work for the system to function) multiply their individual availabilities.
If Service A has 99.9% availability and Service B has 99.9% availability, and both must be up for the system to work, the overall availability is 99.9% × 99.9% = 99.8%.
Each additional sequential dependency lowers overall availability.
A request that passes through five services each at 99.9% availability gives you a system availability of roughly 99.5%, which is only two and a half nines.
This math explains why long synchronous call chains in microservices architectures are dangerous for availability. Every hop adds another potential failure point.
Components in parallel (where the system works as long as at least one component is functional) improve availability.
If you run two instances of a service, each at 99.9% availability, the probability that both fail simultaneously is 0.1% × 0.1% = 0.0001%.
So the combined availability is 99.9999%.
Redundancy through parallel components is how you push availability higher than any single component can achieve.
| Arrangement | Formula | Example (two components at 99.9% each) | Result |
|---|---|---|---|
| Sequential (both must work) | A × B | 99.9% × 99.9% | 99.8% |
| Parallel (one must work) | 1 - (1-A) × (1-B) | 1 - (0.001 × 0.001) | 99.9999% |
The practical lesson: minimize sequential dependencies.
Maximize redundancy through parallel components. Every time you add a synchronous dependency, you are mathematically lowering your availability ceiling.
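The sequential and parallel formulas from the table can be sketched as two small helper functions (names are illustrative):

```python
def series(*availabilities: float) -> float:
    """All components must be up: multiply individual availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities: float) -> float:
    """At least one component must be up: 1 - product of failure probabilities."""
    failure = 1.0
    for a in availabilities:
        failure *= (1 - a)
    return 1 - failure

# Five sequential services at 99.9% each -> ~99.5% overall
print(f"{series(*[0.999] * 5):.4f}")
# Two parallel instances at 99.9% each -> 99.9999%
print(f"{parallel(0.999, 0.999):.6f}")
```

Composing the two reproduces the math in this section: a long synchronous chain drags the product down, while each redundant replica multiplies the failure probability toward zero.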
**SLAs, SLOs, and SLIs**
These three terms define the contractual and operational framework for availability commitments. They are distinct concepts that work together.
SLI (Service Level Indicator) is the metric you actually measure. It is a concrete, quantifiable signal: request latency at the 99th percentile, error rate, throughput, or uptime percentage. SLIs are facts derived from your monitoring data. "Our p99 latency last month was 230ms" is an SLI.
SLO (Service Level Objective) is the target you set for an SLI. It is an internal goal that your engineering team commits to. "Our p99 latency should stay below 200ms" is an SLO. SLOs are set tighter than SLAs to give you a buffer. If your SLA promises 99.9% uptime, your internal SLO might be 99.95% so you catch problems before they breach the customer-facing promise.
SLA (Service Level Agreement) is the contract with your customers. It defines the availability or performance guarantee and the consequences (usually financial credits) if the guarantee is violated. "We guarantee 99.9% uptime. If we miss it, affected customers receive a 10% service credit" is an SLA. SLAs are business commitments, not engineering goals.
The relationship flows upward: SLIs feed into SLOs, and SLOs support SLAs.
One concept that ties these together is the error budget.
If your SLO is 99.9% uptime over a month, you have a budget of 43.8 minutes of downtime.
As long as you stay within that budget, you can deploy new features, run experiments, and accept some risk.
If you exhaust the error budget, you freeze deployments and focus on reliability improvements.
Error budgets give engineering teams a quantifiable way to balance velocity (shipping new features) with reliability (keeping things stable).
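The error budget arithmetic is simple enough to sketch directly. This toy example assumes an "average month" of 43,800 minutes (525,600 minutes per year divided by 12), which is how the 43.8-minute figure above is derived; the function names are illustrative:

```python
AVG_MONTH_MINUTES = 365 * 24 * 60 / 12  # 43,800 minutes in an average month

def error_budget_minutes(slo_pct: float, window_minutes: float = AVG_MONTH_MINUTES) -> float:
    """Total allowed downtime (in minutes) over the window for a given SLO."""
    return window_minutes * (1 - slo_pct / 100)

def budget_remaining(slo_pct: float, downtime_so_far: float) -> float:
    """Minutes of downtime left before the SLO is breached."""
    return error_budget_minutes(slo_pct) - downtime_so_far

print(round(error_budget_minutes(99.9), 1))        # monthly budget for a 99.9% SLO
print(round(budget_remaining(99.9, 30.0), 1))      # budget left after 30 min of downtime
```

When `budget_remaining` approaches zero, the policy described above kicks in: freeze risky deployments and spend the remaining time on reliability work.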
**High Availability Architecture Patterns**
Several architectural patterns push availability beyond what a single server or single data center can provide.
Redundant pairs: Every critical component runs at least two instances. Two application servers, two load balancers, two database replicas. If one fails, the other continues serving traffic. This handles single-machine failures.
Multi-AZ (Availability Zone) deployment: Cloud providers like AWS organize their infrastructure into availability zones: physically separate data centers within the same region, connected by low-latency links. Deploying across multiple AZs protects against data center-level failures (power outages, cooling failures, network partitions within a zone). If AZ-a goes down, AZ-b and AZ-c continue operating.
Multi-region deployment: Deploying your system across geographically separate regions (US East, EU West, Asia Pacific) protects against region-level failures and natural disasters. Each region runs a complete copy of your stack. GSLB routes users to the nearest healthy region. This is the highest tier of availability but also the most complex and expensive because data synchronization across regions introduces latency and consistency challenges.
N+1 redundancy: Instead of running exactly the number of instances you need, run one extra. If you need 4 instances to handle your peak load, run 5. If any one fails, the remaining 4 handle the load without degradation. N+2 provides even more margin.
_Three-tier Availability Architecture_
Interview-Style Question
> Q: Your system runs across two availability zones in a single region. Each zone has 5 application servers and the system needs at least 6 to handle peak traffic. What happens if one AZ goes down?
> A: If one AZ goes down, you lose 5 servers, leaving only 5 in the remaining AZ. But you need 6 to handle peak traffic. Your system is degraded and may drop requests or respond slowly. The fix is to size each AZ so it can carry the full peak load on its own: run at least 6 servers per AZ (12 total), or 7 per AZ for N+1 margin within the surviving zone. The principle is: design each AZ to survive the loss of the other, not just to share the load during normal operation.
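The capacity rule behind this answer can be sketched as a small sizing function (the function name and `spare` parameter are illustrative):

```python
import math

def instances_per_az(peak_required: int, num_azs: int, spare: int = 1) -> int:
    """Instances to run in EACH AZ so that losing any one AZ still leaves
    enough capacity for peak load, plus `spare` extra instances (N+1)."""
    surviving_azs = num_azs - 1
    return math.ceil((peak_required + spare) / surviving_azs)

# Peak needs 6 servers across 2 AZs: run 7 per AZ (14 total) for N+1.
print(instances_per_az(peak_required=6, num_azs=2))
# Spreading across 3 AZs cuts the per-AZ requirement to 4 (12 total).
print(instances_per_az(peak_required=6, num_azs=3))
```

Note how adding a third AZ reduces the over-provisioning cost: the more zones share the load, the smaller the fraction of capacity each one must hold in reserve.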
For detailed system design interview prep, check out the Grokking the System Design Interview course by DesignGurus.io.
### KEY TAKEAWAYS
* Availability is measured in nines. Each additional nine roughly costs 10x more engineering effort. Choose your target based on business criticality.
* Sequential dependencies multiply and reduce availability. Parallel redundancy dramatically increases it. Minimize synchronous chains and maximize redundancy.
* SLIs are what you measure. SLOs are what you target. SLAs are what you promise customers. Error budgets quantify the trade-off between reliability and feature velocity.
* High availability requires redundancy at every layer: redundant instances, multi-AZ deployment, and multi-region deployment for the highest tiers.
* Design so that losing any single component (server, AZ, or even an entire region) does not bring down the system.
**Reliability & Fault Tolerance**
Availability tells you whether the system is up.
Reliability tells you whether it does the right thing while it is up.
A system can be available (it responds to requests) but unreliable (it returns wrong answers or corrupts data).
Both properties matter, and the engineering approaches for each are different.
**Reliability vs. Availability: Subtle Differences**
A system is available if it responds to requests.
A system is reliable if it responds correctly and consistently over time.
Consider a banking application.
If it goes down for 10 minutes, that is an availability problem. Users cannot access their accounts, but no data is corrupted. When it comes back, everything is correct.
Now consider a scenario where the application stays up 100% of the time but occasionally calculates interest payments incorrectly. It is perfectly available but unreliable.
The second scenario is often more damaging because the errors might not be discovered for weeks.
Reliability encompasses correctness (the system produces the right results), consistency (the system produces the same results under the same conditions), and durability (data that is committed stays committed, even after failures).
In practice, availability and reliability reinforce each other.
The techniques that improve reliability (redundancy, failover, data replication) also tend to improve availability.
But they are not the same thing, and it is possible to optimize for one at the expense of the other.
A system that aggressively fails over to a warm standby might maintain high availability but introduce a brief window where the standby serves slightly stale data, reducing reliability.
**Redundancy: Active-Passive, Active-Active**
Redundancy is the foundation of both availability and fault tolerance.
If you have only one of something, its failure means total loss.
If you have two or more, a failure in one is absorbed by the others.
Active-passive redundancy runs one primary component (active) and one or more standby components (passive). The standby does no useful work during normal operation. It monitors the primary's health, and if the primary fails, the standby takes over.
Active-passive is simpler to implement because there are no concerns about data consistency between active nodes or traffic splitting.
The downside is wasted capacity: the passive node sits idle most of the time, consuming resources without contributing to throughput.
Active-active redundancy runs all components simultaneously, each handling a share of the traffic. If one fails, the remaining nodes absorb its traffic.
Active-active uses resources more efficiently and provides higher total throughput.
The downside is complexity: data must be synchronized across active nodes, and the system must handle the scenario where two active nodes receive conflicting writes simultaneously.
| Aspect | Active-Passive | Active-Active |
|---|---|---|
| Normal operation | One active, others idle | All instances handle traffic |
| Resource utilization | Low (standby is wasted) | High (all instances contribute) |
| Failover mechanism | Standby detects failure and takes over | Traffic redistributed to remaining nodes |
| Data consistency | Simple (one writer) | Complex (multiple writers, conflict resolution) |
| Failover time | Seconds to minutes (standby promotion) | Near-instant (traffic redistribution) |
| Best for | Databases, stateful components | Stateless services, web servers, load balancers |
Most production systems use active-active for stateless components (application servers, load balancers) and active-passive for stateful components (databases) where data consistency is paramount.
**Failover Mechanisms: Hot Standby, Warm Standby, Cold Standby**
When the primary fails, how quickly the standby can take over depends on how "ready" it is.
Hot standby is a replica that is fully running, receiving all data in real time (through replication), and ready to take over within seconds. It has the same data, the same configuration, and the same connections established. Failover is essentially flipping a switch. Database hot standbys (like PostgreSQL synchronous replication) can take over in under 10 seconds. The cost is that the hot standby consumes nearly the same resources as the primary.
Warm standby is a system that is running but not fully in sync. It might receive periodic data snapshots rather than real-time replication. Failover takes minutes because the warm standby needs to catch up on recent changes, rebuild caches, or establish connections. Warm standbys cost less than hot standbys because they can run on smaller hardware since they do not handle production traffic.
Cold standby is a system that is not running at all. It is a server that exists in inventory (or a cloud template) that can be provisioned and started when needed. Failover takes tens of minutes to hours because the cold standby must boot, install updates, restore data from backups, and warm up. Cold standbys are the cheapest option but provide the slowest recovery.
| Standby Type | State During Normal Operation | Failover Time | Cost | Best For |
|---|---|---|---|---|
| Hot | Fully running, real-time data sync | Seconds | High | Mission-critical systems, databases |
| Warm | Running, periodic data sync | Minutes | Medium | Important systems with moderate RTO |
| Cold | Powered off, data in backups | Minutes to hours | Low | Non-critical systems, disaster recovery |
**Graceful Degradation and Feature Flagging**
A reliable system does not just survive failures. It degrades gracefully, maintaining its most critical functions even when non-critical components fail.
Graceful degradation means that when a component fails, the system reduces functionality rather than going completely offline.
If the recommendation engine goes down on an e-commerce site, users still see products, they can still search, and they can still check out. They just see generic "popular items" instead of personalized recommendations. The core shopping experience is preserved.
Feature flagging enables graceful degradation by letting you turn individual features on and off in production without deploying new code.
Each feature is wrapped in a conditional check: if the feature flag is enabled, run the feature; if not, skip it or serve a fallback.
When a non-critical service starts failing, you flip its feature flag off, and the application gracefully omits that feature until the service recovers.
Feature flags also serve a broader purpose beyond failure handling.
They enable gradual rollouts (enable a new feature for 5% of users, then 20%, then 100%), A/B testing (show feature variant A to half of users and variant B to the other half), and kill switches (instantly disable a feature that is causing problems in production without a full deployment).
The combination of graceful degradation and feature flags is one of the most practical reliability tools available. It turns a binary outcome (everything works or everything fails) into a spectrum (everything works, most things work, or only essential things work).
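A minimal sketch of a feature-flag check with a graceful fallback. The in-memory `FLAGS` dict, the function names, and the simulated outage are all illustrative; real systems back flags with a service like LaunchDarkly or Unleash so they can be flipped without a deploy:

```python
# Hypothetical in-memory flag store; production systems use a flag service
# so operators can flip flags at runtime without deploying code.
FLAGS = {"personalized_recommendations": True}

def fetch_personalized(user_id: str) -> list[str]:
    # Simulated outage of the recommendation service for this example.
    raise TimeoutError("recommendation service down")

def get_recommendations(user_id: str) -> list[str]:
    if FLAGS.get("personalized_recommendations", False):
        try:
            return fetch_personalized(user_id)   # may fail or time out
        except Exception:
            pass                                 # fall through to the fallback
    return ["popular-item-1", "popular-item-2"]  # generic "popular items"

# Flag is on but the service is failing: users still see generic items.
print(get_recommendations("u123"))
```

Flipping `FLAGS["personalized_recommendations"]` to `False` skips the failing call entirely, which is exactly the "turn the feature off until the service recovers" behavior described above.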
**Chaos Engineering and Failure Injection Testing**
You cannot be confident in your system's fault tolerance unless you have actually tested it under failure conditions.
Chaos engineering is the practice of deliberately introducing failures into your production (or production-like) environment to discover weaknesses before they cause real outages.
Netflix pioneered this approach with Chaos Monkey, a tool that randomly terminates production instances during business hours.
If the system handles the termination without user-visible impact, the redundancy and failover mechanisms are working. If users see errors, the team has found a weakness to fix.
Chaos engineering goes beyond killing individual servers.
Teams inject network latency between services to test timeout handling. They simulate disk failures to verify data durability. They block network traffic between specific services to test circuit breaker behavior. They even simulate entire availability zone failures to validate multi-AZ resilience.
The key principle is that you run these experiments in a controlled way with a clear hypothesis. "We believe that if we terminate 30% of our application servers, the remaining servers will absorb the traffic and p99 latency will stay below 500ms."
You run the experiment, observe the results, and fix whatever broke.
Chaos engineering requires mature monitoring and observability.
You need to see the impact of injected failures in real time so you can stop the experiment if things go worse than expected.
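One of the simpler failure injections mentioned above, added latency, can be sketched as a wrapper around any outbound call. The function name and parameters here are illustrative; dedicated tools like Chaos Monkey or a service mesh fault filter do this at the infrastructure level:

```python
import random
import time

def inject_latency(func, probability: float = 0.2, delay_seconds: float = 0.5):
    """Wrap a callable so a fraction of invocations get extra latency,
    simulating a slow dependency in a controlled chaos experiment."""
    def wrapped(*args, **kwargs):
        if random.random() < probability:
            time.sleep(delay_seconds)  # injected delay; result is unchanged
        return func(*args, **kwargs)
    return wrapped

# Hypothesis: with 20% of calls delayed by 500ms, p99 latency stays in budget.
slow_lookup = inject_latency(lambda user_id: {"id": user_id}, 0.2, 0.5)
print(slow_lookup("u123"))
```

The wrapper never changes the result, only the timing, so the experiment measures exactly one variable: whether timeouts and fallbacks hold up under slow responses.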
**Bulkhead Pattern: Isolating Failures**
The bulkhead pattern borrows from ship design. Ships are divided into watertight compartments so that a breach in one compartment does not flood the entire vessel.
The same principle applies to software systems.
In a bulkhead architecture, critical resources (thread pools, connection pools, service instances) are partitioned so that a failure in one partition does not exhaust resources used by other partitions.
Without bulkheads, your application might use a single thread pool for all outbound service calls.
If the payment service becomes slow, threads waiting for payment responses pile up and consume the entire pool.
Now the user profile service and the product catalog service cannot get threads either, even though they are perfectly healthy. One slow dependency takes down everything.
With bulkheads, you create separate thread pools for each downstream service. The payment service gets 20 threads. The user profile service gets 20 threads. The product catalog gets 20 threads.
If the payment service becomes slow and its 20 threads are exhausted, the user profile and product catalog still have their own dedicated threads and continue operating normally. The failure is contained.
Bulkheads apply to connection pools (separate database connection pools per use case), worker pools (separate processing queues for different priority levels, covered in Part II, Lesson 7), and even infrastructure (running different services in separate containers or separate Kubernetes namespaces with resource limits so one service cannot consume all the CPU on a node).
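A minimal sketch of the thread-pool bulkhead described above, using one semaphore per downstream dependency. The pool sizes, names, and `BulkheadFull` exception are illustrative:

```python
import threading

# One semaphore per downstream dependency: a slow payment service can
# exhaust only its own 20 permits, never the profiles' or catalog's.
BULKHEADS = {
    "payments": threading.Semaphore(20),
    "profiles": threading.Semaphore(20),
    "catalog": threading.Semaphore(20),
}

class BulkheadFull(Exception):
    pass

def call_with_bulkhead(service: str, func, *args):
    sem = BULKHEADS[service]
    if not sem.acquire(blocking=False):  # reject fast instead of queueing forever
        raise BulkheadFull(f"{service} bulkhead is full")
    try:
        return func(*args)
    finally:
        sem.release()  # always return the permit, even if the call failed

print(call_with_bulkhead("catalog", lambda: "ok"))
```

The key design choice is `blocking=False`: when a partition's permits are gone, new callers fail immediately with a `BulkheadFull` error (which can feed a fallback), rather than piling up and dragging healthy dependencies down with them.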
**Circuit Breaker Pattern**
A circuit breaker prevents your system from repeatedly calling a service that is failing. Without one, a failing downstream service causes cascading failures as upstream services pile up requests, exhaust their resources, and fail in turn.
The circuit breaker has three states.
Closed (normal operation): requests flow through to the downstream service while the circuit breaker monitors the error rate.
Open (failure detected): when the error rate exceeds a threshold (say, 50% of requests fail within a 30-second window), the circuit breaker trips open. All requests immediately return an error or a fallback response without calling the downstream service. This protects both your system (no wasted resources on doomed calls) and the failing service (no additional load that prevents recovery).
Half-open (recovery test): after a configured timeout (say, 30 seconds), the circuit breaker allows a small number of test requests through to the downstream service. If they succeed, the circuit closes and normal operation resumes. If they fail, the circuit stays open for another timeout period.
The circuit breaker works hand-in-hand with graceful degradation.
When the recommendation service circuit is open, the application shows generic recommendations instead of personalized ones.
When the circuit closes, personalization resumes automatically.
Libraries like Hystrix (Java, now in maintenance mode), Resilience4j (Java), Polly (.NET), and Envoy's built-in circuit breaking make this pattern straightforward to implement.
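To make the state machine concrete, here is a stripped-down sketch of the pattern. It trips on consecutive failures rather than an error-rate window, which is a simplification; the class name, parameters, and thresholds are illustrative, and production code should use one of the libraries above:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after `max_failures` consecutive
    failures, fails fast while open, and probes again after `reset_after`."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # open: skip the doomed call entirely
            self.opened_at = None      # half-open: let one probe through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip open
            return fallback()
        self.failures = 0              # a successful probe closes the circuit
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=30.0)

def failing_reviews():
    raise ConnectionError("reviews service down")

for _ in range(3):
    print(breaker.call(failing_reviews, fallback=lambda: "reviews unavailable"))
```

After the second failure the circuit is open, so the third call returns the fallback without ever touching the failing service, which is the cascading-failure protection described above.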
Interview-Style Question
> Q: Your e-commerce platform has a product page that calls four services: product details, pricing, reviews, and recommendations. The reviews service starts timing out. How do you prevent this from affecting the rest of the page?
> A: Implement circuit breakers and bulkheads for each downstream call. Give each service its own connection pool and thread pool (bulkhead) so that the reviews timeout does not exhaust resources used by the other three services. Set a short timeout on the reviews call (200ms if normal response time is 50ms). After a threshold of failures, the circuit breaker for reviews trips open. The product page renders with product details, pricing, and recommendations, but shows "Reviews temporarily unavailable" or hides the reviews section entirely (graceful degradation via a feature flag). When the reviews service recovers, the circuit breaker enters half-open state, test requests succeed, and the circuit closes. Reviews reappear on the page automatically. The user experience degrades minimally rather than the entire page failing.
_Circuit Breaker State Diagram_
### KEY TAKEAWAYS
* Reliability means the system produces correct results consistently. Availability means the system is reachable. A system can be available but unreliable, and that is often worse than being temporarily unavailable.
* Active-passive redundancy is simpler and suits stateful components like databases. Active-active is more efficient and suits stateless components like application servers.
* Hot standbys fail over in seconds but cost as much as the primary. Cold standbys are cheap but take minutes to hours to activate. Choose based on your recovery time requirements.
* Graceful degradation keeps core functionality running when non-critical components fail. Feature flags let you control degradation dynamically without deployments.
* Chaos engineering tests your fault tolerance by injecting real failures. You cannot trust failover mechanisms you have never tested.
* Bulkheads isolate failures by partitioning resources so one failing dependency cannot exhaust resources used by healthy dependencies.
* Circuit breakers stop cascading failures by cutting off calls to failing services and returning fallback responses until the service recovers.
**Disaster Recovery**
Availability engineering handles individual component failures: a server crashes, a network link drops, an AZ goes offline.
Disaster recovery handles scenarios where entire regions, data centers, or systems are lost. A data center fire.
A catastrophic software bug that corrupts the primary database.
A ransomware attack that encrypts all production systems. These events are rare, but when they happen, the question is not whether you lose anything. It is how much you lose and how quickly you recover.
**RPO (Recovery Point Objective) and RTO (Recovery Time Objective)**
These two metrics define your disaster recovery targets.
They are the most critical numbers in any DR plan.
RPO (Recovery Point Objective) answers: how much data can you afford to lose? It is measured in time. An RPO of 1 hour means your system can lose at most 1 hour of data. If the disaster happens at 3:00 PM, you can restore to at least 2:00 PM. An RPO of zero means no data loss is acceptable, which requires synchronous replication to a secondary site.
RTO (Recovery Time Objective) answers: how long can the system be down? It is measured in time. An RTO of 4 hours means the system must be fully operational within 4 hours of the disaster. An RTO of 30 seconds means you need automated failover with hot standbys in a secondary region.
| Metric | Question It Answers | Example | Implication |
|---|---|---|---|
| RPO | How much data can we lose? | RPO = 1 hour | Back up at least every hour |
| RTO | How long can we be down? | RTO \= 4 hours | Must restore within 4 hours |
RPO and RTO are business decisions, not engineering decisions. The engineering team implements whatever RPO and RTO the business requires. A lower RPO (less data loss) costs more because it requires more frequent backups or real-time replication. A lower RTO (faster recovery) costs more because it requires hot standbys and automated failover instead of cold standbys and manual procedures.
A payment processing system might need an RPO of zero and an RTO of 30 seconds. An internal reporting dashboard might tolerate an RPO of 24 hours and an RTO of 8 hours. The DR strategy and cost for each are vastly different.
**Data Backup Strategies: Full, Incremental, Differential**
Backups are the foundation of disaster recovery. Without them, data loss from any cause (hardware failure, human error, ransomware) is permanent.
Full backup copies all data every time. It is the simplest to restore because everything is in one backup file. The downside is time and storage: if your database is 500 GB, every full backup takes 500 GB of storage and however long it takes to copy 500 GB. Running full backups frequently is expensive.
Incremental backup copies only the data that changed since the last backup of any type. Monday's full backup copies everything. Tuesday's incremental copies only Tuesday's changes. Wednesday's incremental copies only Wednesday's changes. Incremental backups are fast and small, but restoring requires replaying the full backup plus every subsequent incremental in order. If Wednesday's incremental is corrupted, you lose Wednesday's and potentially Thursday's data too.
Differential backup copies all data that changed since the last full backup. Monday's full copies everything. Tuesday's differential copies Tuesday's changes. Wednesday's differential copies both Tuesday's and Wednesday's changes. Differentials are larger than incrementals but easier to restore: you need only the last full backup plus the latest differential.
| Strategy | Backup Size | Backup Speed | Restore Speed | Restore Complexity |
|---|---|---|---|---|
| Full | Largest | Slowest | Fastest | Simple (one file) |
| Incremental | Smallest | Fastest | Slowest | Complex (full + all incrementals) |
| Differential | Medium | Medium | Medium | Moderate (full \+ latest differential) |
Most production systems combine strategies. Run a full backup weekly (Sunday night), differential backups daily (Monday through Saturday), and consider incremental backups hourly for systems with tight RPO requirements. Store backups in a different region or cloud provider than your primary data. A backup stored in the same data center as your database is useless if the data center burns down.
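The restore-complexity difference between the three strategies can be sketched by computing which backups a restore must apply, in order. The catalog format and function name here are illustrative:

```python
def restore_plan(backups: list[dict]) -> list[str]:
    """Given a chronological backup catalog, return the minimal ordered
    list of backups needed to restore to the latest point in time."""
    plan: list[str] = []
    for b in backups:
        if b["type"] == "full":
            plan = [b["name"]]           # a full backup resets the chain
        elif b["type"] == "incremental":
            plan.append(b["name"])       # every incremental must be replayed
        elif b["type"] == "differential":
            plan = [plan[0], b["name"]]  # only the full + latest differential
    return plan

differential_week = [
    {"name": "sun-full", "type": "full"},
    {"name": "mon-diff", "type": "differential"},
    {"name": "tue-diff", "type": "differential"},
]
print(restore_plan(differential_week))  # the Monday differential is skipped
```

With an incremental catalog the plan grows one entry per day, which is why a single corrupted incremental breaks the chain, while the differential plan always stays at two entries.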
**Multi-Region and Multi-Datacenter Deployment**
Multi-region deployment is the most robust disaster recovery strategy. Your system runs complete, independent stacks in two or more geographic regions. Each region can serve traffic on its own. If an entire region is lost, the remaining regions continue operating.
There are three common configurations.
Active-passive multi-region: One region (primary) handles all traffic. The secondary region receives replicated data and runs infrastructure that is ready to activate but does not serve user traffic. If the primary goes down, DNS or GSLB redirects traffic to the secondary. The failover takes seconds to minutes depending on automation. The secondary needs to handle the full traffic load, which means either provisioning it at full capacity (expensive) or accepting a degraded experience while it scales up.
Active-active multi-region: All regions handle traffic simultaneously. Users are routed to the nearest region by GSLB. Each region can absorb additional traffic if another region fails. Data is replicated bidirectionally between regions. This provides the best latency (users hit nearby servers), the best availability (any region can absorb any other's traffic), and the most complexity (bidirectional replication, conflict resolution, cross-region consistency).
Pilot light: The secondary region runs only the minimum infrastructure needed to receive replicated data: database replicas and perhaps a few core services. Everything else is shut down. In a disaster, you spin up the remaining infrastructure from saved configurations and templates. Recovery takes 30 minutes to a few hours depending on how much needs to start. Pilot light is cheaper than full active-passive because most infrastructure is not running, but recovery is slower.
| Configuration | Normal Operation | Failover Time | Cost | Complexity |
|---|---|---|---|---|
| Active-passive | One region serves traffic | Minutes | High (full standby infrastructure) | Medium |
| Active-active | All regions serve traffic | Seconds (redistribute traffic) | Highest (full infrastructure everywhere) | Highest |
| Pilot light | Secondary receives data only | 30 min to hours | Lower (minimal standby) | Medium |
**Regional Failover and Data Center Failover**
Regional failover redirects all traffic from a failed region to a healthy one. The mechanics depend on your routing layer.
DNS-based failover uses health checks on your DNS provider (Route 53, Cloudflare DNS). When health checks for the primary region fail, DNS stops returning the primary's IP addresses and returns the secondary's instead. Failover time depends on DNS TTL. With a 60-second TTL, most users will be redirected within 1 to 2 minutes. Some clients and ISP resolvers cache DNS records longer than the TTL specifies, so a small percentage of users may take longer to redirect.
GSLB failover is similar but operates at a more sophisticated level, considering real-time health, latency, and capacity when routing decisions are made. GSLB can redirect traffic within seconds because it does not depend on client-side DNS caching in the same way.
Anycast failover uses BGP routing so that the same IP address is announced from multiple regions. Network routing automatically directs traffic to the nearest healthy region. If a region goes offline, its BGP announcement is withdrawn, and traffic reroutes to the next closest region within seconds. This is how Cloudflare and other CDN providers handle regional failures transparently.
Data center failover within a region (across availability zones) is typically handled by the load balancer and health checks. When instances in one AZ fail health checks, the load balancer stops routing to them and sends traffic to healthy instances in other AZs. This is usually automatic and takes seconds.
**Disaster Recovery Plans and Runbooks**
A disaster recovery plan is a documented process that specifies exactly what happens when a disaster occurs. It answers questions that nobody wants to think about during the chaos of an actual outage: who makes the decision to fail over? What steps do they follow? In what order? How do we verify the recovery is successful? How do we communicate status to customers?
A runbook is the specific step-by-step procedure for executing a DR action. "Failover to secondary region" is a runbook with detailed instructions: which DNS records to update, which services to start, which health checks to verify, what monitoring dashboards to watch, and what the rollback procedure is if the failover causes new problems.
The most critical property of a DR plan is that it has been tested.
A plan that exists only as a document is not a plan. It is a hope.
Regular DR drills (quarterly for most organizations, monthly for critical systems) verify that the plan works, that team members know their roles, and that the recovery time matches the RTO target.
DR drills frequently reveal surprises: a database backup that is corrupted because the backup process has been silently failing for weeks, a secondary region that cannot handle the full traffic load because its auto-scaling configuration was never updated, or a runbook step that references a tool that no longer exists.
Each drill that finds a problem is a drill that prevented an actual disaster from being worse than it needed to be.
Components of a complete DR plan include a risk assessment (what disasters are most likely and most impactful), the RPO and RTO for each critical system, the failover procedure with detailed runbooks, a communication plan (who notifies customers, what channels are used, what updates are sent), roles and responsibilities (who has authority to initiate a failover, who executes each step), a testing schedule, and a post-incident review process.
**Beginner Mistake to Avoid**
New engineers often set up database backups and consider disaster recovery handled. Backups are necessary but not sufficient.
A backup that has never been tested might be corrupted.
A backup restoration process that takes 6 hours does not help if your RTO is 1 hour.
And backups alone do not address the time between the last backup and the disaster (your RPO). Test your backups regularly by actually restoring them.
Time the restoration process. Verify the restored data is complete and correct.
A backup you have never restored is a backup you cannot trust.
Interview-Style Question
> Q: Your company runs a SaaS application from a single AWS region. The CEO asks you to design a disaster recovery strategy with an RPO of 1 hour and an RTO of 4 hours. How do you approach this?
> A: With a 1-hour RPO, data must be replicated or backed up at least hourly. Set up continuous database replication (asynchronous) to a secondary AWS region, which gives an RPO well under 1 hour (typically seconds to minutes of lag). For object storage (S3), enable cross-region replication. For the RTO of 4 hours, a pilot light approach works. In the secondary region, run database read replicas receiving continuous replication, but keep application infrastructure as pre-built CloudFormation or Terraform templates rather than running instances. If the primary region fails, promote the database replica to primary, deploy the application stack from templates (which takes 20 to 30 minutes with pre-built container images), update DNS to point to the secondary region (with a 60-second TTL set in advance), and verify health checks. The total recovery time should be well under 4 hours. Schedule quarterly DR drills where you actually execute this failover to verify timing and catch configuration drift.
_AWS Disaster Recovery Architecture_
### KEY TAKEAWAYS
* RPO defines how much data you can lose. RTO defines how long you can be down. Both are business decisions that drive engineering investment.
* Combine backup strategies: weekly full backups, daily differential backups, and hourly incremental backups for tight RPO requirements. Store backups in a different region.
* Multi-region deployment is the strongest DR strategy. Active-active provides the best availability and fastest failover. Pilot light is a cost-effective compromise for longer RTO targets.
* Regional failover uses DNS, GSLB, or anycast routing to redirect traffic from a failed region to a healthy one. Pre-set short DNS TTLs before planned failovers.
* A disaster recovery plan is only as good as its last drill. Test failover procedures regularly. Time the recovery. Fix what breaks during the drill, not during the real disaster.
* Backups you have never restored are backups you cannot trust. Test restoration regularly and verify data completeness.