## Consistency Models
When your data lives on a single database server, consistency is simple.
A write happens, and the next read sees it.
There is one copy of the data and one source of truth. But the moment you replicate data across multiple servers, regions, or data centers, a fundamental question arises: when a write happens on one server, how quickly must every other server reflect that change?
The answer depends on your consistency model.
Different models make different promises about when readers will see updated data, and each one trades something to deliver that promise.
**Strong Consistency**
Strong consistency guarantees that after a write completes, every subsequent read from any node in the system returns the updated value.
There is no window where stale data is visible.
If you update your profile picture on server A, a read from server B one millisecond later returns the new picture.
Achieving strong consistency in a distributed system requires coordination between nodes on every write.
The write is not acknowledged until all replicas confirm they have the updated data (or at least a majority, depending on the protocol). This coordination takes time because it involves network round trips between nodes, which adds latency to every write.
Strong consistency is essential for systems where correctness is non-negotiable: bank account balances, inventory counts during checkout, airline seat reservations, and medical records.
If two users see different available inventory and both purchase the last item, someone is getting a disappointing email.
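The coordination described above can be sketched in a few lines. This is a minimal in-memory model, not a real replication protocol: the `Replica` class, replica names, and failure behavior are illustrative assumptions. The key idea is that the write is acknowledged only after a majority of replicas confirm it.

```python
# Minimal sketch of a majority-acknowledged write: the write succeeds
# only once more than half of the replicas confirm it. Replica names
# and failure behavior here are illustrative assumptions.

class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def apply(self, key, value):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        self.data[key] = value

def strong_write(replicas, key, value):
    """Acknowledge only after a majority of replicas confirm."""
    needed = len(replicas) // 2 + 1
    acks = 0
    for r in replicas:
        try:
            r.apply(key, value)
            acks += 1
        except ConnectionError:
            continue
    if acks < needed:
        raise RuntimeError("write failed: no majority reachable")
    return acks

replicas = [Replica("a"), Replica("b"), Replica("c", healthy=False)]
acks = strong_write(replicas, "profile_pic", "v2")  # 2 of 3 acks: success
```

Note that the write succeeds even with one replica down, because a majority (2 of 3) is enough; this is why strongly consistent systems are typically deployed with an odd number of replicas.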
**Eventual Consistency**
Eventual consistency guarantees that if no new writes occur, all replicas will eventually converge to the same value.
The word "eventually" is doing significant work in that sentence.
After a write, there is a window (typically milliseconds to seconds, but potentially longer during network issues) where different replicas may return different values.
Eventual consistency is the default model for most NoSQL databases and distributed systems because it enables much higher availability and lower latency than strong consistency. Writes are acknowledged as soon as one node records them, without waiting for all replicas to synchronize.
For many applications, eventual consistency is perfectly acceptable.
A social media post that takes 2 seconds to appear on a friend's feed is not a problem.
A product rating that takes a few seconds to reflect a new review is fine.
The system is faster and more available because it does not wait for global agreement on every write.
The risk is reading stale data during the convergence window. If your system cannot tolerate any staleness, eventual consistency is not sufficient.
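Convergence itself can be sketched with a last-write-wins (LWW) merge, one common (if lossy) conflict-resolution strategy. The per-key timestamps below are illustrative logical values, and real systems run this merge as a background anti-entropy process rather than a direct function call.

```python
# Sketch of eventual convergence via last-write-wins (LWW): each
# replica stores (timestamp, value) per key, and merges keep the
# newest write. Timestamps are illustrative logical values.

def lww_merge(local, remote):
    """Merge two replica states; the higher timestamp wins per key."""
    merged = dict(local)
    for key, (ts, val) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return merged

replica1 = {"avatar": (10, "old.png")}
replica2 = {"avatar": (12, "new.png")}

# During the staleness window, the replicas disagree. After
# anti-entropy runs in both directions, they converge:
replica1 = lww_merge(replica1, replica2)
replica2 = lww_merge(replica2, replica1)
assert replica1 == replica2  # both now hold the timestamp-12 value
```

The hidden cost of LWW is that a concurrent write with a lower timestamp is silently discarded, which is exactly the kind of behavior that makes eventual consistency unsuitable for inventory counts or balances.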
**Weak Consistency**
Weak consistency provides no guarantee that a read will ever see a specific write. After a write completes, reads might or might not return the updated value. The system makes a best-effort attempt but provides no formal promise about when (or even if) all replicas will converge.
This sounds unreliable, and it is, for data that matters. But weak consistency is appropriate for use cases where losing some data is acceptable. Real-time multiplayer game state, voice and video call data, and live sensor readings are all scenarios where slightly stale or missing data is preferable to the latency cost of stronger guarantees. If a frame of video is lost, the stream continues. Nobody rewinds to recover it.
**Causal Consistency**
Causal consistency sits between strong and eventual consistency. It guarantees that operations that are causally related are seen by all nodes in the correct order. Operations that are not causally related can be seen in any order.
Two operations are causally related if one depends on the other. If user A posts a message and user B replies to that message, the reply is causally dependent on the original post. Causal consistency guarantees that every node sees the post before the reply. But two unrelated posts from different users can appear in any order on different nodes.
Causal consistency is useful for collaborative applications, comment threads, and messaging systems where the order of related events matters but global ordering of all events is unnecessary and expensive.
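One way to picture causal delivery is as dependency-aware buffering: a node holds back an operation until everything it causally depends on has been applied. The sketch below uses explicit dependency IDs for clarity; real systems typically track causality with vector clocks or session tokens rather than named dependencies.

```python
# Sketch of causal delivery: a node buffers an operation until every
# operation it causally depends on has been applied. The explicit
# dependency IDs are an illustrative simplification.

def deliver_causally(ops):
    """ops: list of (op_id, depends_on, payload). Returns apply order."""
    applied, order = set(), []
    pending = list(ops)
    while pending:
        progressed = False
        for op in list(pending):
            op_id, deps, payload = op
            if all(d in applied for d in deps):
                applied.add(op_id)
                order.append(payload)
                pending.remove(op)
                progressed = True
        if not progressed:
            raise RuntimeError("dependency never arrived; still waiting")
    return order

# The reply arrives before the post it answers, but is held back
# until the post has been applied:
ops = [("reply1", ["post1"], "nice post!"),
       ("post1", [], "hello world")]
assert deliver_causally(ops) == ["hello world", "nice post!"]
```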
**Read-Your-Writes Consistency**
Read-your-writes consistency guarantees that after you write data, your own subsequent reads will always see that write. Other users might see stale data temporarily (eventual consistency), but you always see your own updates.
This is a practical middle ground that eliminates the most jarring user experience problem of eventual consistency. When a user updates their profile and the page refreshes showing the old profile, they think the system is broken. Read-your-writes consistency prevents this by ensuring the user who made the change always sees it immediately, even if the change has not propagated to all replicas yet.
Implementation typically involves routing the user's reads to the same node that processed their write for a short window after the write, or tracking a write timestamp and ensuring reads come from a replica that has caught up past that timestamp.
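The timestamp-tracking approach can be sketched as follows. The replica structure and timestamps are illustrative assumptions; the point is only the selection rule: a client read is served from any replica whose replication position has caught up past the client's last write.

```python
# Sketch of read-your-writes via write timestamps: after a client
# writes, it remembers the write's timestamp and reads only from
# replicas that have replayed at least that far. Names and timestamp
# values are illustrative.

class TSReplica:
    def __init__(self, applied_ts, data):
        self.applied_ts = applied_ts  # how far this replica has caught up
        self.data = data

def read_your_writes(client_last_write_ts, replicas, key):
    """Pick any replica that has caught up past the client's last write."""
    for r in replicas:
        if r.applied_ts >= client_last_write_ts:
            return r.data[key]
    raise RuntimeError("no sufficiently fresh replica; fall back to primary")

replicas = [
    TSReplica(applied_ts=90, data={"bio": "old"}),   # lagging replica
    TSReplica(applied_ts=120, data={"bio": "new"}),  # caught up
]
assert read_your_writes(100, replicas, "bio") == "new"
```

Other users with no recent writes have `client_last_write_ts` of zero, so any replica (including the lagging one) can serve them, which is what keeps this model cheap.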
**Linearizability vs. Serializability**
These two terms sound interchangeable but describe different guarantees, and confusing them is a common mistake.
Linearizability is a consistency guarantee about individual operations on a single object. It says that once a write completes, all subsequent reads (from any client, on any node) must return that write's value or a later one. Operations appear to take effect at a single point in time between their start and completion. Linearizability is essentially what people mean by "strong consistency."
Serializability is an isolation guarantee about transactions involving multiple operations across potentially multiple objects. It says that the result of executing multiple transactions concurrently is the same as if they had executed in some sequential order. Serializability is the strongest isolation level in databases (covered in Part II, Lesson 2).
A system can be serializable without being linearizable (transactions are ordered consistently but individual reads might not reflect the latest write), and linearizable without being serializable (individual reads are always current but concurrent multi-object transactions might interleave incorrectly).
The strongest possible guarantee is strict serializability (also called linearizable serializability), which provides both: transactions execute as if in some sequential order, and that order respects real-time ordering. Google's Spanner database achieves this through synchronized clocks across data centers, but it is exceptionally difficult and expensive to implement.
| Model | Guarantee | Latency Impact | Best For |
|---|---|---|---|
| Strong / Linearizable | All reads see the latest write | Highest (coordination required) | Financial transactions, inventory, reservations |
| Eventual | All replicas converge over time | Lowest (no coordination on write) | Social feeds, content platforms, DNS |
| Weak | No convergence guarantee | Minimal | Real-time media, gaming, sensor streams |
| Causal | Related operations are ordered | Moderate | Messaging, comments, collaboration |
| Read-your-writes | You always see your own writes | Low to moderate | User profiles, settings, preferences |
**Interview-Style Question**
> Q: You are designing a collaborative document editor. Which consistency model would you choose and why?
> A: Causal consistency is the best fit. In a collaborative editor, the order of related operations matters. If user A types "Hello" and user B adds " World" after it, every viewer must see "Hello" before "Hello World," never the reverse. But two users editing unrelated sections of the document can have their changes applied in any order without causing confusion. Strong consistency would add too much latency for a real-time editing experience. Eventual consistency would risk showing edits in the wrong order, which confuses users. Causal consistency preserves the order of causally related edits while allowing unrelated edits to propagate freely, balancing correctness with the low latency that real-time collaboration demands.
**KEY TAKEAWAYS**
* Strong consistency guarantees every read sees the latest write. It requires coordination that adds latency but is essential for financial and inventory systems.
* Eventual consistency allows temporary staleness but enables high availability and low latency. It is the default for most distributed systems.
* Read-your-writes consistency eliminates the most frustrating user experience problem by ensuring users always see their own updates immediately.
* Causal consistency preserves the order of related operations without the cost of global ordering. It suits collaborative and messaging applications.
* Linearizability is about single-object consistency. Serializability is about multi-object transaction isolation. Strict serializability combines both but is extremely expensive to achieve.
* Choose your consistency model based on what your application cannot tolerate, not based on what sounds strongest.
**3.2 The CAP Theorem**
The CAP theorem is one of the most cited (and most misunderstood) principles in distributed systems. It defines a fundamental constraint that every distributed system must confront, and understanding it is essential for making informed architecture decisions.
It is also a concept you will encounter in every system design interview and throughout resources like Grokking the System Design Interview.
**Consistency, Availability, and Partition Tolerance**
The CAP theorem states that a distributed system can provide at most two of these three guarantees simultaneously:
Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. (This is linearizable consistency, not the "C" in ACID.)
Availability (A): Every request receives a non-error response, without guaranteeing that it contains the most recent write. The system always answers, even if the answer might be stale.
Partition Tolerance (P): The system continues to operate despite network partitions. A network partition occurs when communication between some nodes is lost while the nodes themselves are still running.
The critical insight is that network partitions are not optional in distributed systems. Networks fail. Cables get cut. Switches crash. Packets get lost. In any real distributed system, partitions will happen. Since you must tolerate partitions (P is not negotiable), the real choice is between consistency and availability during a partition.
During a network partition, you must choose: respond with potentially stale data (sacrificing consistency for availability) or refuse to respond until the partition heals and nodes resynchronize (sacrificing availability for consistency).
When no partition exists, you can have both consistency and availability. The CAP theorem only forces a trade-off during partitions.
**CP Systems vs. AP Systems: Real-World Examples**
CP systems prioritize consistency over availability during partitions. When a partition occurs, CP systems refuse to serve requests that might return stale data. They either return an error or wait for the partition to heal.
ZooKeeper is a CP system. It is used for distributed coordination (leader election, configuration management, distributed locks) where correctness is more important than availability. If a ZooKeeper node is partitioned from the majority, it stops serving reads rather than risk returning stale coordination data, because a stale lock or stale leader information could cause catastrophic errors in the systems depending on it.
MongoDB (with majority read concern and majority write concern) behaves as a CP system. Writes are acknowledged only when the majority of replicas confirm, and reads return data that has been committed to a majority. During a partition where the majority is unreachable, the minority partition cannot process writes.
etcd and Consul are also CP systems, using Raft consensus to ensure all nodes agree on the state before responding.
AP systems prioritize availability over consistency during partitions. When a partition occurs, AP systems continue serving requests, but different nodes may return different (stale) data.
Cassandra is a classic AP system. Every node can accept reads and writes, even during partitions. If a partition splits the cluster, nodes on both sides continue operating independently. When the partition heals, Cassandra reconciles the conflicting data using configurable conflict resolution strategies (last-write-wins by default). Cassandra can be tuned toward consistency (by requiring reads and writes from a quorum of nodes), but its default and strongest mode is high availability.
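The "tunable" part of Cassandra's consistency comes down to simple quorum arithmetic: with N replicas, a read from R nodes and a write to W nodes must overlap in at least one up-to-date node whenever R + W > N. The function below is just that inequality, with the consistency-level names in the comments as illustrative mappings.

```python
# Sketch of quorum-based tunable consistency: with N replicas, reads
# from R nodes and writes to W nodes overlap in at least one
# up-to-date node whenever R + W > N.

def read_overlaps_write(n, r, w):
    """True if every read quorum must intersect every write quorum."""
    return r + w > n

assert read_overlaps_write(n=3, r=2, w=2)      # QUORUM read + QUORUM write
assert not read_overlaps_write(n=3, r=1, w=1)  # ONE/ONE: stale reads possible
assert read_overlaps_write(n=3, r=1, w=3)      # write ALL, then read ONE
```

The trade-off is visible directly in the parameters: raising W makes writes slower and less available, raising R does the same for reads, and lowering both buys speed at the cost of the overlap guarantee.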
DynamoDB defaults to eventual consistency (AP behavior) but offers a "strongly consistent read" option that behaves more like CP for individual read operations.
DNS is an AP system. If a DNS server cannot reach the authoritative name server, it serves its cached (potentially stale) answer rather than returning an error. Stale DNS data is better than no DNS resolution at all.
| System | CAP Classification | Consistency Behavior | Availability Behavior |
|---|---|---|---|
| ZooKeeper | CP | Always consistent (majority quorum) | Unavailable during minority partition |
| etcd | CP | Raft consensus, linearizable reads | Unavailable without majority |
| MongoDB (majority) | CP | Majority write/read concern | Minority partition cannot write |
| Cassandra | AP | Eventual (tunable) | Always available, even during partitions |
| DynamoDB (default) | AP | Eventual consistency | Always available |
| DNS | AP | Stale data possible | Always responds |
**PACELC Theorem: Latency Extension of CAP**
The CAP theorem only describes behavior during partitions. But most of the time, your system is not partitioned. The PACELC theorem extends CAP by asking: when there is no partition, what trade-off does the system make between latency and consistency?
PACELC reads as: "if Partition, choose Availability or Consistency; Else, choose Latency or Consistency."
PA/EL systems (like Cassandra and DynamoDB) choose availability during partitions and low latency during normal operation. Writes are fast because they do not wait for all replicas to acknowledge. This gives you the fastest, most available system but with weaker consistency guarantees all the time, not just during partitions.
PC/EC systems (like ZooKeeper and traditional relational databases with synchronous replication) choose consistency during partitions and consistency during normal operation. Writes wait for confirmation from replicas even when there is no partition, which adds latency but ensures every read returns the latest write.
PA/EC systems (like MongoDB with certain configurations) choose availability during partitions but consistency during normal operation. When everything is healthy, reads and writes go through the majority and are consistent. When a partition occurs, the system favors availability.
PACELC is more useful than CAP for making real design decisions because it captures the trade-off you face every day (latency vs. consistency), not just the trade-off during rare partition events.
| Classification | During Partition | During Normal Operation | Example Systems |
|---|---|---|---|
| PA/EL | Availability | Low latency | Cassandra, DynamoDB |
| PC/EC | Consistency | Consistency (higher latency) | ZooKeeper, Spanner |
| PA/EC | Availability | Consistency | MongoDB (majority concern) |
**Interview-Style Question**
> Q: You are building a global e-commerce platform. Product catalog reads need to be fast worldwide. Inventory counts need to be accurate during checkout. How do you reconcile these conflicting requirements with the CAP theorem?
> A: Use different consistency models for different operations. The product catalog is a read-heavy, latency-sensitive workload where eventual consistency is acceptable. Serve it from an AP system (like Cassandra or DynamoDB) with replicas in every region. A product description that is 2 seconds stale is harmless. Inventory counts during checkout are a correctness-critical operation where strong consistency is required. Use a CP approach: check inventory against a single source of truth (the primary database) at checkout time, not against a replicated cache. Accept the slightly higher latency of a cross-region read for this one critical operation. The rest of the browsing experience stays fast and available. This is the standard pattern: different consistency models for different parts of the same system, chosen based on the cost of being wrong.
**KEY TAKEAWAYS**
* The CAP theorem says a distributed system can guarantee at most two of three: Consistency, Availability, and Partition Tolerance. Since partitions are inevitable, the real choice is between C and A during a partition.
* CP systems (ZooKeeper, etcd, MongoDB with majority concern) refuse to serve potentially stale data during partitions. AP systems (Cassandra, DynamoDB, DNS) continue serving requests but may return stale data.
* PACELC extends CAP to include the latency-vs-consistency trade-off during normal operation. PA/EL systems are the fastest. PC/EC systems are the most consistent.
* Most production systems use different consistency models for different operations within the same application. Optimize for speed where staleness is harmless, and for correctness where it is not.
**3.3 Distributed Consensus**
When multiple nodes in a distributed system need to agree on a value, an order of operations, or which node is the leader, they need a consensus algorithm. Consensus is the mechanism that makes strong consistency possible in a distributed system, and getting it right is one of the hardest problems in computer science.
**The Problem of Distributed Agreement**
Imagine three database nodes that need to agree on which write to apply. Node A receives a write setting the price to $10. Node B receives a write setting it to $15. Node C needs to agree with one of them.
If A and B cannot communicate directly (partition), how do they agree on the correct price? What if A crashes before telling anyone about its write? What if messages between nodes arrive in a different order?
These are not hypothetical edge cases. They happen daily in any distributed system. Without a consensus protocol, nodes can diverge permanently, leading to data corruption and split-brain scenarios where two nodes both believe they are the leader and both accept conflicting writes.
Consensus algorithms solve this by establishing a protocol that guarantees all functioning nodes agree on the same value, even in the presence of failures and network delays. The classic requirements are agreement (all non-faulty nodes decide on the same value), validity (the decided value was proposed by some node), and termination (all non-faulty nodes eventually reach a decision).
**Paxos Algorithm**
Paxos, devised by Leslie Lamport in the late 1980s (the paper circulated for years before its formal publication in 1998), was the first consensus algorithm with a rigorous correctness proof. It guarantees that a group of nodes can agree on a single value even if some nodes fail or messages are delayed.
Paxos works in two phases. In the first phase (prepare), a node that wants to propose a value (the proposer) sends a prepare request with a unique proposal number to a majority of nodes (acceptors).
If an acceptor has not seen a higher proposal number, it promises not to accept any older proposals and replies with any value it has already accepted. In the second phase (accept), the proposer sends the actual value to the acceptors. If a majority accepts, consensus is reached and the value is committed.
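The acceptor's side of both phases can be sketched as a small state machine. This is single-decree Paxos only, and the proposal numbers and values are illustrative; a real implementation must also persist `promised` and `accepted` to stable storage before replying.

```python
# Sketch of a single-decree Paxos acceptor, the state machine at the
# heart of both protocol phases. A real acceptor must persist its
# state durably before replying; this in-memory version is a sketch.

class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised
        self.accepted = None  # (number, value) accepted so far, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"

a = Acceptor()
assert a.prepare(1) == ("promise", None)
assert a.accept(1, "price=$10") == "accepted"
assert a.prepare(0) == ("reject", None)         # stale proposal ignored
assert a.accept(2, "price=$15") == "accepted"   # higher number wins
```

The subtle part of Paxos lives in the proposer: when a promise returns an already-accepted value, the proposer must propose that value rather than its own, which is how the protocol converges on a single decision.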
Paxos is provably correct, which is why it is studied extensively. It is also notoriously difficult to understand, implement, and optimize. Lamport himself titled his original paper "The Part-Time Parliament" and used an allegory about Greek legislators, which did not help clarity.
The algorithm's complexity and the difficulty of building practical systems on top of it led to the development of simpler alternatives.
Google's Chubby lock service and their internal systems used Paxos-based consensus for years. But most engineers today work with Raft instead, which achieves the same guarantees with a more understandable design.
**Raft Consensus Algorithm**
Raft was designed explicitly to be understandable. Its creators, Diego Ongaro and John Ousterhout, published it in 2014 as an alternative to Paxos that is equivalent in safety and efficiency but far easier to implement correctly.
Raft divides consensus into three sub-problems: leader election, log replication, and safety.
Leader election. One node is elected leader, and all client requests go through the leader. If the leader fails (detected by a timeout on heartbeat messages), the remaining nodes hold an election. Each node can vote for one candidate per term. The candidate that receives votes from a majority becomes the new leader. Terms are numbered sequentially so nodes can detect stale leaders.
Log replication. The leader receives client writes and appends them to its log. It then replicates the log entry to follower nodes. Once a majority of nodes have confirmed they have the entry, the leader commits it and responds to the client. Followers apply committed entries to their state machines in log order.
Safety. Raft guarantees that once a log entry is committed, it will be present in the logs of all future leaders. A candidate can only win an election if its log is at least as up-to-date as the majority of nodes. This prevents a node with a stale log from becoming leader and overwriting committed data.
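Raft's election safety rule reduces to a two-part comparison: a voter grants its vote only if the candidate's log is at least as up-to-date as its own, comparing the last entry's term first and the log length second. The sketch below isolates just that check; the term and index values are illustrative.

```python
# Sketch of Raft's "at least as up-to-date" vote check: compare the
# last log entry's term first, then the log length. Term and index
# values are illustrative.

def candidate_log_ok(candidate_last_term, candidate_last_index,
                     voter_last_term, voter_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index

# A candidate missing committed entries from term 3 is rejected:
assert not candidate_log_ok(2, 10, 3, 7)
# A candidate with the same last term but a longer log is accepted:
assert candidate_log_ok(3, 8, 3, 7)
```

Because a committed entry is on a majority of nodes, any candidate that wins a majority of votes must have passed this check against at least one node holding that entry, which is what prevents committed data from being overwritten.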
Raft is the consensus algorithm behind etcd (used by Kubernetes for cluster coordination), Consul (HashiCorp's service discovery and configuration tool), and CockroachDB (a distributed SQL database). If you study one consensus algorithm for system design interviews and for resources like Grokking the System Design Interview, make it Raft.
**Zab (ZooKeeper Atomic Broadcast)**
Zab is the consensus protocol used by Apache ZooKeeper. It is similar to Raft in structure but was developed independently and predates Raft by several years.
Zab also uses a leader-based approach.
The leader receives all write requests, assigns them a sequential transaction ID, and broadcasts them to followers.
Once a majority of followers acknowledge, the transaction is committed. If the leader fails, the remaining nodes elect a new leader and synchronize their state to ensure no committed transactions are lost.
The key difference between Zab and Raft is in recovery. Zab has a dedicated recovery phase after a leader election where the new leader synchronizes with all followers to ensure they agree on the committed history before any new transactions are accepted. Raft handles this through its log matching properties and does not have a separate recovery phase.
In practice, you will interact with Zab indirectly through ZooKeeper rather than implementing it yourself. ZooKeeper is widely used for distributed coordination in Hadoop, Kafka (older versions), and other big data systems.
**Leader Election Patterns**
Leader election is a fundamental building block that uses consensus.
When multiple instances of a service are running and only one should perform a specific action (like running a scheduled job, managing a shared resource, or coordinating a workflow), they need to elect a leader.
The simplest pattern uses an external coordination service.
All instances attempt to create a lock (an ephemeral node in ZooKeeper, a key with a TTL in etcd, or a lock key in Redis).
The instance that succeeds becomes the leader. Other instances watch the lock and attempt to acquire it if the leader releases it or fails.
ZooKeeper uses ephemeral nodes that automatically disappear when the creating client's session expires (due to crash or network loss). Other clients watch the node and compete for leadership when it disappears.
etcd uses lease-based keys. A leader creates a key with a lease (TTL). It must periodically refresh the lease. If the leader crashes and stops refreshing, the lease expires, the key disappears, and another instance acquires leadership.
Redis can serve as a simpler (but less robust) leader election mechanism using SET with NX (set if not exists) and EX (expiration). The Redlock algorithm attempts to make Redis-based distributed locks safer by requiring locks across multiple independent Redis instances.
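The SET NX EX pattern can be modeled in a few lines. The sketch below uses an in-memory store (so it runs without a Redis server) and explicit `now` timestamps for determinism; a real implementation would also store a random token per holder and release via a compare-and-delete script to avoid removing another instance's lock.

```python
import time

# Sketch of the SET NX EX locking pattern, modeled with an in-memory
# store so it runs without a Redis server. acquire() succeeds only if
# the key is absent or its TTL has expired. The explicit `now`
# parameter exists for deterministic illustration.

class TTLLockStore:
    def __init__(self):
        self.locks = {}  # name -> (owner, expires_at)

    def acquire(self, name, owner, ttl_seconds, now=None):
        now = time.monotonic() if now is None else now
        holder = self.locks.get(name)
        if holder is None or holder[1] <= now:   # free, or TTL expired
            self.locks[name] = (owner, now + ttl_seconds)
            return True
        return False

    def release(self, name, owner):
        # Only the current holder may release: this guards against a
        # slow former holder deleting someone else's lock.
        if self.locks.get(name, (None, 0))[0] == owner:
            del self.locks[name]

store = TTLLockStore()
assert store.acquire("billing-job", "instance-a", 30, now=0.0)
assert not store.acquire("billing-job", "instance-b", 30, now=10.0)
assert store.acquire("billing-job", "instance-b", 30, now=31.0)  # expired
```

The TTL is what prevents a crashed holder from deadlocking the system, but it also creates the failure mode fencing tokens exist to solve: a paused holder whose TTL expired may still believe it holds the lock.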
The critical property of any leader election mechanism is that at most one leader exists at any given time.
If two nodes both believe they are the leader (split-brain), they can make conflicting decisions that corrupt data. Fencing tokens (monotonically increasing numbers assigned to each leader) help detect stale leaders: if a node presents a fencing token lower than the latest one, its actions are rejected.
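The fencing check itself is tiny: the resource remembers the highest token it has seen and rejects anything older. The sketch below is illustrative; in practice the token comes from the lock service (for example, ZooKeeper's znode version or an etcd lease revision) and the check lives in the storage layer.

```python
# Sketch of fencing tokens: the lock service hands each new leader a
# strictly increasing token, and the protected resource rejects any
# request carrying a token older than the newest one it has seen.
# The resource and token values here are illustrative.

class FencedResource:
    def __init__(self):
        self.highest_token = -1

    def write(self, token, value):
        if token < self.highest_token:
            raise PermissionError("stale leader: token fenced off")
        self.highest_token = token
        return f"wrote {value}"

resource = FencedResource()
assert resource.write(token=1, value="x") == "wrote x"
assert resource.write(token=2, value="y") == "wrote y"  # new leader
try:
    resource.write(token=1, value="z")  # old leader wakes up late
    raise AssertionError("stale write should have been rejected")
except PermissionError:
    pass  # correctly fenced off
```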
**Distributed Locks and Coordination (ZooKeeper, etcd, Consul)**
Distributed locks extend leader election to general-purpose resource coordination. Any time multiple processes need exclusive access to a shared resource (a database record, a file, a billing run), a distributed lock ensures only one process proceeds at a time.
ZooKeeper is the most battle-tested coordination service. It provides primitives for distributed locks, leader election, configuration management, and group membership. Its ephemeral and sequential nodes make implementing locks and queues relatively straightforward. The downside is operational complexity: ZooKeeper requires a dedicated cluster (typically 3 or 5 nodes) with its own maintenance and monitoring.
etcd is a simpler, more modern alternative. It provides a key-value store with strong consistency (Raft-based), leases for TTL-based locks, and a watch API for change notifications. etcd is the coordination backbone of Kubernetes and has a smaller operational footprint than ZooKeeper.
Consul by HashiCorp combines coordination with service discovery and health checking. It provides a key-value store, distributed locks (sessions), and service mesh capabilities. Consul is a good choice when you need coordination and service discovery in one tool.
| Tool | Consensus Protocol | Primary Use | Strengths |
|---|---|---|---|
| ZooKeeper | Zab | Coordination, locks, leader election | Mature, proven at massive scale (Kafka, Hadoop) |
| etcd | Raft | Kubernetes coordination, config, locks | Simpler operations, modern API, Kubernetes-native |
| Consul | Raft | Service discovery, coordination, mesh | Multi-datacenter, combined discovery + coordination |
**Interview-Style Question**
> Q: You have a distributed system where a scheduled billing job must run exactly once per day. Multiple instances of the billing service are running for redundancy. How do you ensure only one instance runs the job?
> A: Use leader election through a distributed lock. Each billing service instance attempts to acquire a lock in etcd (or ZooKeeper) when the scheduled time arrives. The instance that acquires the lock becomes the leader and runs the billing job. Other instances fail to acquire the lock and skip the run. The lock has a TTL slightly longer than the expected job duration. If the leader crashes mid-job, the lock expires, and another instance can acquire it on the next scheduled run. To prevent duplicate processing if the leader crashes partway through, make the billing job idempotent: track which accounts have been billed for the current period and skip any that are already processed. After the job completes, the leader releases the lock explicitly.
**KEY TAKEAWAYS**
* Distributed consensus ensures multiple nodes agree on a value or leader even in the presence of failures. It is the mechanism behind strong consistency.
* Paxos was the first consensus algorithm but is notoriously difficult to implement. Raft achieves the same guarantees with a clearer design and is the standard for modern systems.
* Zab powers ZooKeeper and follows a leader-based approach similar to Raft with a dedicated recovery phase.
* Leader election ensures exactly one node performs a specific action. Use ZooKeeper, etcd, or Consul for production leader election. Guard against split-brain with fencing tokens.
* Distributed locks provide exclusive access to shared resources. Always set TTLs on locks to prevent deadlocks from crashed holders, and make the protected operation idempotent to handle edge cases.
**3.4 Distributed Transactions**
A transaction in a single database is straightforward: group operations together, and either all succeed or all roll back (ACID). But when a single logical operation spans multiple databases, multiple services, or multiple microservices, you need a distributed transaction that coordinates commits across independent systems.
Distributed transactions are hard. Network failures, node crashes, and timing issues can leave the system in a state where some participants committed and others did not. The patterns in this section address that challenge with different trade-offs between consistency, availability, and complexity.
**Two-Phase Commit (2PC)**
Two-phase commit is the classic protocol for distributed transactions. It uses a coordinator that orchestrates the transaction across all participants.
Phase 1 (Prepare). The coordinator sends a prepare request to every participant. Each participant executes the transaction locally, writes the changes to a log (but does not commit yet), and responds with either "yes, I can commit" or "no, I cannot."
Phase 2 (Commit or Abort). If all participants voted yes, the coordinator sends a commit message. Each participant finalizes the transaction and acknowledges. If any participant voted no, the coordinator sends an abort message, and all participants roll back.
2PC guarantees atomicity across all participants: either everyone commits or everyone aborts. But it has a critical flaw. If the coordinator crashes after Phase 1 but before sending the Phase 2 decision, all participants are stuck in a "prepared" state. They have locked resources (rows, tables) and cannot release them until they learn the coordinator's decision. This blocking behavior can freeze parts of your system indefinitely until the coordinator recovers.
2PC also creates a single point of failure (the coordinator), adds significant latency (two round trips across all participants), and reduces throughput because resources are locked during the entire protocol. For these reasons, 2PC is rarely used across microservices. It is primarily used within tightly coupled databases that support distributed transactions natively (like PostgreSQL with prepared transactions or XA transactions in Java).
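The protocol's two phases map directly onto code. The sketch below uses in-memory participants with illustrative names; a real coordinator and real participants must write each decision to a durable log before acting on it, which is precisely what makes recovery (and the blocking problem) so delicate.

```python
# Sketch of a two-phase commit coordinator over in-memory participants.
# Participant names and behavior are illustrative; real systems log
# each step durably before acting on it.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "init"

    def prepare(self):
        """Phase 1: do the work locally, then vote yes or no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes; any "no" aborts the whole transaction.
    if all(p.prepare() for p in participants):
        for p in participants:       # Phase 2: commit everywhere
            p.commit()
        return "committed"
    for p in participants:           # Phase 2: abort everywhere
        p.rollback()
    return "aborted"

ok = [Participant("orders"), Participant("payments")]
assert two_phase_commit(ok) == "committed"

bad = [Participant("orders"), Participant("payments", can_commit=False)]
assert two_phase_commit(bad) == "aborted"
```

The blocking flaw is visible by inspection: if the coordinator dies between the `all(...)` vote collection and the Phase 2 loop, every participant sits in the "prepared" state, holding locks, with no way to decide on its own.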
**Three-Phase Commit (3PC)**
Three-phase commit adds an extra phase to address 2PC's blocking problem. Between the prepare and commit phases, 3PC inserts a pre-commit phase where participants acknowledge they are ready to commit. If the coordinator fails, participants can examine their state and decide independently: if they received a pre-commit message, they know the coordinator intended to commit. If they only received a prepare message, they know the coordinator had not decided yet and can safely abort.
In theory, 3PC is non-blocking. In practice, it adds complexity and additional round trips, and it still fails in certain edge cases involving network partitions. 3PC is rarely used in production systems. It is more of an academic improvement over 2PC than a practical solution. Most modern systems use the Saga pattern instead.
**Saga Pattern: Choreography vs. Orchestration**
The Saga pattern is the practical answer to distributed transactions in microservices. Instead of trying to make multiple services commit atomically (which is fragile and slow), a Saga breaks the transaction into a sequence of local transactions, each owned by a single service. If one step fails, the Saga executes compensating transactions to undo the work of previous steps.
Consider an order placement that involves three services: the payment service charges the customer, the inventory service reserves the item, and the shipping service schedules delivery.
In a Saga, each step is a local transaction. The payment service charges the card. If that succeeds, the inventory service reserves the item. If that succeeds, the shipping service schedules delivery. If the shipping step fails, the Saga runs compensating transactions: unreserve the item in inventory and refund the payment. Each compensation undoes the effect of a previous step.
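The step-then-compensate sequence above can be expressed generically. The sketch below is a simplified illustration of a saga runner, assuming each step is a pair of callables (the local transaction and its compensation), not a production framework:

```python
# Simplified saga runner: run steps in order; on failure, compensate
# the completed steps in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) callable pairs."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            # Undo previous steps in reverse: unreserve, then refund.
            for undo in reversed(completed):
                undo()
            return "compensated"
    return "completed"
```

For the order example, the steps would be (charge, refund), (reserve, unreserve), and (schedule, cancel); a failure at the shipping step runs unreserve and then refund.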
Sagas can be coordinated in two ways.
Choreography (event-driven). Each service publishes an event when its local transaction completes. The next service listens for that event and performs its step. There is no central coordinator. The payment service charges the card and publishes "PaymentCompleted." The inventory service hears "PaymentCompleted" and reserves the item, publishing "ItemReserved." The shipping service hears "ItemReserved" and schedules delivery.
Choreography is decentralized and loosely coupled. No service knows about the others directly. The downside is that the flow is implicit. Understanding the complete transaction requires tracing events across multiple services and their event handlers. As the number of steps grows, the event flow becomes harder to reason about and debug.
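The event flow above can be sketched with an assumed in-memory event bus; a real system would use a message broker such as Kafka or RabbitMQ, and the handler names here are hypothetical:

```python
# Choreography sketch: services react to events, with no central coordinator.

class EventBus:
    def __init__(self):
        self.handlers = {}

    def subscribe(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def publish(self, event, payload):
        # Deliver the event to every subscribed handler.
        for handler in self.handlers.get(event, []):
            handler(payload)


bus = EventBus()
steps = []

# Inventory service: reacts to PaymentCompleted, then emits ItemReserved.
def reserve_item(order):
    steps.append("item_reserved")
    bus.publish("ItemReserved", order)

# Shipping service: reacts to ItemReserved.
def schedule_delivery(order):
    steps.append("delivery_scheduled")

bus.subscribe("PaymentCompleted", reserve_item)
bus.subscribe("ItemReserved", schedule_delivery)

# Payment service charges the card, then announces the fact.
bus.publish("PaymentCompleted", {"order_id": 42})
# steps is now ["item_reserved", "delivery_scheduled"]
```

Notice that no service calls another directly; the full transaction exists only as a chain of event subscriptions, which is exactly why the flow is hard to see in one place.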
Orchestration (coordinator-driven). A central orchestrator service defines the Saga's steps and tells each service what to do. The orchestrator calls the payment service, then the inventory service, then the shipping service. If a step fails, the orchestrator calls the compensating transactions in reverse order.
Orchestration makes the flow explicit and easy to understand. The orchestrator contains the complete transaction logic in one place. The downside is that the orchestrator is a single point that needs to be reliable, and it introduces coupling between the orchestrator and all participating services.
| Approach | Coordination | Coupling | Visibility | Complexity |
|---|---|---|---|---|
| Choreography | Decentralized (events) | Low (services independent) | Low (implicit flow across events) | Harder to trace and debug |
| Orchestration | Centralized (orchestrator) | Higher (orchestrator knows all services) | High (explicit flow in one place) | Easier to understand but orchestrator is critical |
Most teams start with orchestration for transactions that have a clear linear flow (order placement, booking, onboarding). Choreography works better for loosely related reactions where the order does not strictly matter (notifications, analytics, audit logging).
**Compensating Transactions**
Compensating transactions are the undo operations in a Saga. They reverse the effect of a previously completed step. This sounds simple, but designing correct compensations is one of the hardest parts of the Saga pattern.
Some operations are easy to compensate: a payment charge can be refunded, an inventory reservation can be released, a database insert can be deleted.
Other operations are difficult or impossible to compensate: a sent email cannot be unsent, a published notification cannot be recalled, and an external API call to a third-party service might not support reversal.
For operations that cannot be truly undone, compensating transactions often involve corrective actions instead: sending a follow-up email ("Sorry, your order was cancelled"), issuing a credit instead of a refund, or creating an adjustment record that offsets the original action.
Key design principles for compensations include making every compensating transaction idempotent (since it might be retried), logging every step and compensation for audit purposes, handling partial failures within compensations themselves (what if the refund call fails?), and setting a maximum retry count before escalating to manual intervention.
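These principles can be sketched together in one helper. The names (`compensation_id`, `alert-a-human` return value) are hypothetical; a real system would persist the processed-ID set rather than hold it in memory:

```python
# Sketch of an idempotent compensation with bounded retries and escalation.

import time

def compensate_with_retry(compensation_id, compensate, processed,
                          max_retries=3, base_delay=0.01):
    # Idempotency: a compensation that already ran is skipped on retry.
    if compensation_id in processed:
        return "skipped"
    for attempt in range(max_retries):
        try:
            compensate()
            processed.add(compensation_id)
            return "done"
        except Exception:
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * 2 ** attempt)
    # All retries exhausted: escalate to manual intervention.
    return "escalate"
```

The `processed` set is what makes a retried refund safe: the second delivery of the same compensation ID is a no-op instead of a double refund.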
The Saga pattern accepts eventual consistency rather than trying to force atomic consistency across services.
The system might be temporarily in an inconsistent state (payment charged but item not yet reserved) while the Saga is in progress. Designing each step and its compensation carefully ensures the system always eventually reaches a consistent state: either the fully completed transaction or a fully compensated rollback.
**Beginner Mistake to Avoid**
New engineers sometimes try to implement distributed transactions using 2PC across microservices, locking resources in multiple databases and coordinating commits over the network. This works in demos but fails in production.
Network timeouts cause indefinite locks.
Coordinator failures leave transactions in limbo. Latency makes the entire flow slow. If you are building microservices and need distributed transactions, use the Saga pattern.
If you need true atomic transactions, that is a signal that the data should probably live in a single database rather than being split across services.
**Interview-Style Question**
> Q: A user books a vacation package that involves three separate services: flights, hotels, and car rentals. The car rental service fails after flights and hotels are booked. How do you handle this?
> A: Use a Saga with an orchestrator. The orchestrator coordinates the booking flow: first book the flight, then the hotel, then the car rental. When the car rental fails, the orchestrator triggers compensating transactions in reverse order. It cancels the hotel reservation (calling the hotel service's cancel endpoint) and cancels the flight reservation (calling the flights service's cancel endpoint). Each compensation is idempotent so retries are safe. The orchestrator logs every step and compensation for audit purposes. If a compensation itself fails (the hotel cancel API is down), the orchestrator retries with exponential backoff and eventually alerts a human operator if all retries are exhausted. The user sees a clear message: "We could not complete your booking. Any charges have been reversed." The system is always in a consistent state, either fully booked or fully cancelled.
_Saga Orchestration Flow for a Vacation Booking_
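That orchestration flow can be sketched as follows, assuming hypothetical `book_*` and `cancel_*` service calls standing in for real endpoints (retry and alerting logic from the answer above is omitted for brevity):

```python
# Orchestrator sketch for the vacation-booking saga (illustrative names).

def book_vacation(services):
    """services: dict of callables keyed by 'book_<step>' / 'cancel_<step>'."""
    done = []  # steps completed so far, in order
    for step in ("flight", "hotel", "car"):
        try:
            services[f"book_{step}"]()
            done.append(step)
        except Exception:
            # Compensate in reverse order: cancel hotel, then flight.
            for prev in reversed(done):
                services[f"cancel_{prev}"]()
            return "cancelled"
    return "booked"
```

When the car rental call fails, `done` is `["flight", "hotel"]`, so the orchestrator cancels the hotel first and the flight second, leaving the system fully rolled back.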
### KEY TAKEAWAYS
* Two-phase commit (2PC) provides atomic distributed transactions but is blocking, slow, and fragile. It suits tightly coupled databases, not microservices.
* Three-phase commit (3PC) reduces blocking but adds complexity and is rarely used in production.
* The Saga pattern breaks distributed transactions into local transactions with compensating actions for rollback. It is the standard approach for microservices.
* Choreography uses events for decentralized coordination. Orchestration uses a central coordinator. Choose orchestration for clear linear flows and choreography for loosely coupled reactions.
* Compensating transactions must be idempotent, carefully designed for operations that cannot be truly undone, and equipped with retry and escalation logic.
* If you need atomic transactions across multiple data stores, reconsider whether that data should be split across services in the first place.