Saga

A business transaction that spans multiple services cannot use a database transaction - there is no single database. Without coordination, partial failures leave the system in an inconsistent state with no automated recovery path.

Rickvian Aldi · Software engineer

Problem

Consider an e-commerce order flow: reserve inventory, charge the customer, notify the warehouse, send a confirmation email. In a monolith these four steps can live in one database transaction - either all succeed or all roll back. In a microservices architecture each step is owned by a different service with its own database, and there is no global transaction coordinator. If the charge succeeds but the warehouse notification fails, the customer has paid and inventory is reserved for an order that will never be fulfilled. Nothing rolls back automatically: an engineer must discover the inconsistency and patch it by hand.

This is the distributed transaction problem. Two-phase commit (2PC) is the classical solution, but it requires all participants to be synchronously available, introduces a coordinator that becomes a single point of failure, and makes every participant hold its locks until the coordinator decides, which limits throughput as participants and network latency grow. Most modern distributed systems need a better answer.

Forces

  • No global lock. When the network partitions, a system must choose between consistency and availability (see CAP theorem), and most microservice architectures choose availability. A mechanism that requires all services to be up and responsive simultaneously is fragile.
  • Failure is partial and unpredictable. A service might crash after sending a response but before the caller receives it. A network timeout doesn't tell you whether the remote operation succeeded or failed. Any coordination mechanism must work in the face of this ambiguity.
  • Business operations have natural compensation. "Unreserve inventory," "issue a refund," and "cancel a shipment" are real business operations, not database artifacts. Modeling them explicitly is semantically honest and operationally useful.
  • Eventual consistency is often good enough. Users can tolerate a brief window where "order placed" and "inventory updated" are not simultaneously true, provided the system converges quickly and reliably.
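The ambiguity described in the second force can be made explicit in the type of a remote call's result. The sketch below is illustrative only: `Outcome` and `callWithTimeout` are hypothetical names, not part of any library. It shows that a timeout yields a third state, "unknown", which is neither success nor failure.

```typescript
// Outcome of a remote call as seen by the caller. "unknown" is what a timeout
// produces: the remote side may or may not have applied the operation.
type Outcome<T> =
  | { kind: "succeeded"; value: T }
  | { kind: "failed"; error: unknown }
  | { kind: "unknown" };

// Hypothetical helper: races a remote call against a timer.
async function callWithTimeout<T>(
  call: () => Promise<T>,
  timeoutMs: number,
): Promise<Outcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Outcome<T>>((resolve) => {
    timer = setTimeout(() => resolve({ kind: "unknown" }), timeoutMs);
  });
  const attempt = call().then(
    (value): Outcome<T> => ({ kind: "succeeded", value }),
    (error): Outcome<T> => ({ kind: "failed", error }),
  );
  try {
    return await Promise.race([attempt, timeout]);
  } finally {
    clearTimeout(timer); // no-op if the timer already fired
  }
}
```

A caller that receives "unknown" cannot safely assume either result, which is why the retries and compensations discussed later need to be safe to apply more than once.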

Solution

A Saga decomposes a long-running business transaction into a sequence of local transactions, each of which publishes an event or sends a command to trigger the next step. If any step fails, compensating transactions are executed in reverse order to undo the steps that have already completed.
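As a minimal sketch of this definition, each step can be modeled as a pair of an action and its compensation, with a runner that undoes completed steps in reverse order when one fails. `SagaStep`, `runSaga`, and the demo steps are hypothetical names for illustration, and the "services" are in-memory stand-ins.

```typescript
// A saga step pairs a local transaction with the business operation that undoes it.
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order. On failure, compensates the completed steps in reverse
// order and reports which step failed.
async function runSaga(
  steps: SagaStep[],
): Promise<{ ok: boolean; failedStep?: string }> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch {
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      return { ok: false, failedStep: step.name };
    }
  }
  return { ok: true };
}

// Demo: the third step fails, so the first two are compensated.
const sagaLog: string[] = [];
const demoSteps: SagaStep[] = [
  {
    name: "reserve-inventory",
    action: async () => { sagaLog.push("reserve"); },
    compensate: async () => { sagaLog.push("release"); },
  },
  {
    name: "charge-payment",
    action: async () => { sagaLog.push("charge"); },
    compensate: async () => { sagaLog.push("refund"); },
  },
  {
    name: "notify-warehouse",
    action: async () => { throw new Error("warehouse unavailable"); },
    compensate: async () => { sagaLog.push("cancel"); },
  },
];
```

Calling `runSaga(demoSteps)` leaves `sagaLog` as reserve, charge, refund, release: the failed step is not compensated, and the completed ones are undone newest first.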

There are two implementation styles:

Choreography (event-driven): Each service listens for events and reacts by executing its local transaction and publishing its own event. There is no central coordinator.

Choreography saga - each service publishes events that trigger the next step; no central coordinator
// inventory-service: listens for OrderCreated
eventBus.on('order.created', async (event: OrderCreated) => {
  const reservation = await db.reserveInventory(event.orderId, event.items);
  if (reservation.success) {
    await eventBus.publish('inventory.reserved', { orderId: event.orderId });
  } else {
    await eventBus.publish('inventory.reservation.failed', { orderId: event.orderId });
  }
});

// inventory-service: compensates when a later step fails
// (db.releaseInventory is an assumed counterpart to db.reserveInventory)
eventBus.on('payment.failed', async (event: PaymentFailed) => {
  await db.releaseInventory(event.orderId);
  await eventBus.publish('inventory.released', { orderId: event.orderId });
});

// payment-service: listens for InventoryReserved
eventBus.on('inventory.reserved', async (event: InventoryReserved) => {
  // paymentGateway is a placeholder wrapper around the payment provider's client
  const charge = await paymentGateway.charge(event.orderId);
  if (charge.success) {
    await eventBus.publish('payment.charged', { orderId: event.orderId });
  } else {
    // Trigger compensation: the inventory service releases the reservation
    await eventBus.publish('payment.failed', { orderId: event.orderId });
  }
});
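To see the choreography run end to end, including the compensation path, the handlers can be wired to an in-memory bus. This is a sketch under assumptions: `InMemoryEventBus` is a hypothetical stand-in for a real broker, the `inventory.released` topic is invented for the demo, and the payment handler declines every charge so that the reservation gets released.

```typescript
type Handler = (payload: { orderId: string }) => Promise<void>;

// Minimal in-memory event bus; a real deployment would use a message broker.
class InMemoryEventBus {
  private handlers = new Map<string, Handler[]>();
  readonly published: string[] = [];

  on(topic: string, handler: Handler): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }

  // Records the topic, then delivers to each subscriber in turn.
  async publish(topic: string, payload: { orderId: string }): Promise<void> {
    this.published.push(topic);
    for (const handler of this.handlers.get(topic) ?? []) {
      await handler(payload);
    }
  }
}

const bus = new InMemoryEventBus();
const reservedOrders = new Set<string>(); // stand-in for the inventory database

// inventory-service: forward step
bus.on("order.created", async ({ orderId }) => {
  reservedOrders.add(orderId);
  await bus.publish("inventory.reserved", { orderId });
});

// inventory-service: compensation step
bus.on("payment.failed", async ({ orderId }) => {
  reservedOrders.delete(orderId);
  await bus.publish("inventory.released", { orderId });
});

// payment-service: declines every charge in this demo to exercise compensation
bus.on("inventory.reserved", async ({ orderId }) => {
  await bus.publish("payment.failed", { orderId });
});
```

Publishing `order.created` for an order produces the topic sequence order.created, inventory.reserved, payment.failed, inventory.released, and the reservation is gone afterwards. No component in the sketch knows the whole flow, which is the property that makes choreography hard to observe.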

Orchestration (command-driven): A central saga orchestrator sends commands to participant services and tracks state. The orchestrator handles failures by issuing compensation commands.

Orchestration saga - the orchestrator drives each step and issues compensation commands on failure
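class OrderSaga {
  async execute(order: Order): Promise<void> {
    const sagaId = order.id;
    // Compensations for the steps that have completed so far
    const compensations: Array<() => Promise<void>> = [];

    try {
      await this.inventoryService.reserve(sagaId, order.items);
      compensations.push(() => this.inventoryService.release(sagaId));
      await this.paymentService.charge(sagaId, order.total);
      compensations.push(() => this.paymentService.refund(sagaId));
      await this.warehouseService.fulfill(sagaId, order.items);
      compensations.push(() => this.warehouseService.cancel(sagaId));
      await this.notificationService.confirm(sagaId, order.email);
    } catch (err) {
      // Compensation: undo only the completed steps, in reverse order
      for (const compensate of compensations.reverse()) {
        // Log instead of swallowing: a failed compensation needs a retry or manual follow-up
        await compensate().catch((e) =>
          console.error(`saga ${sagaId}: compensation failed`, e));
      }
      throw err;
    }
  }
}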

A saga is not a transaction. It is a protocol for maintaining consistency in a world where transactions don't span service boundaries.

Choreography vs. orchestration tradeoffs:

  • Choreography is simpler to deploy (no extra service) but harder to observe - the saga state is distributed across multiple services' logs.
  • Orchestration makes the happy path and failure paths explicit in one place and is easier to debug, but the orchestrator becomes a coupling point.
  • As a rule of thumb, once a saga has more than three steps, orchestration tends to pay for itself in debuggability.

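State tracking: The orchestrator must persist saga state so it can resume after a crash. In a choreographed saga there is no coordinator, so each participant persists the state of its own step. Store { sagaId, step, status, compensated } in a database table, and update it in a local transaction before sending each command and after receiving each reply, so that a restarted orchestrator knows which step to retry or compensate.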
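A sketch of what resuming from persisted state could look like, under simplifying assumptions: a `Map` stands in for the database table, `runOrResume` is a hypothetical function name, and only the forward path is shown. The record is written after every step, so a restart continues from the recorded step rather than from the beginning.

```typescript
type SagaStatus = "running" | "completed";

interface SagaRecord {
  sagaId: string;
  step: number; // index of the next step to run
  status: SagaStatus;
  compensated: boolean;
}

// Stand-in for a database table keyed by sagaId.
const sagaTable = new Map<string, SagaRecord>();

type Step = (sagaId: string) => Promise<void>;

// Runs (or resumes) a saga, persisting progress after every step. Steps must be
// idempotent: a crash between a step and the write that records it re-runs that step.
async function runOrResume(sagaId: string, steps: Step[]): Promise<SagaRecord> {
  let record: SagaRecord =
    sagaTable.get(sagaId) ?? { sagaId, step: 0, status: "running", compensated: false };
  sagaTable.set(sagaId, record);

  while (record.step < steps.length) {
    await steps[record.step](sagaId); // a throw here models a crash mid-saga
    record = { ...record, step: record.step + 1 };
    sagaTable.set(sagaId, record);
  }

  record = { ...record, status: "completed" };
  sagaTable.set(sagaId, record);
  return record;
}
```

If the second of three steps throws on the first run, the table holds `step: 1`; calling `runOrResume` again with the same `sagaId` skips the first step and finishes the remaining two.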

When NOT to Use

  • Simple two-service integrations. If you only need to coordinate two services, a saga is heavy machinery. Consider a simpler request-response with an explicit retry and compensation call in the error handler.
  • When strong consistency is required. Sagas provide eventual consistency. If your domain requires that all participants are in a consistent state at every moment (financial ledgers, regulatory reporting), sagas alone are insufficient. You need additional constraints, reconciliation processes, or a rethink of service boundaries.
  • Monolithic applications. If the services involved share a database, use a database transaction. Sagas exist because database transactions can't span service boundaries - that's the only context where they add value.
  • Short-lived, low-stakes operations. If a failure in one step has no meaningful business consequence and the cost of compensation is higher than the cost of inconsistency, accept the inconsistency and clean it up with a reconciliation job.
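For the two-service case in the first bullet, the simpler alternative might look like the sketch below: a plain request-response call with a retry, and the compensation call in the error handler. The client interfaces, `withRetry`, and `placeOrder` are hypothetical names. The retry has no backoff, and it assumes `charge` is safe to repeat.

```typescript
// Hypothetical clients for the two services; real ones would make network calls.
interface InventoryClient {
  reserve(orderId: string): Promise<void>;
  release(orderId: string): Promise<void>;
}
interface PaymentClient {
  charge(orderId: string, amountCents: number): Promise<void>;
}

// Retries an operation a fixed number of times, rethrowing the last error.
async function withRetry<T>(op: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

async function placeOrder(
  inventory: InventoryClient,
  payment: PaymentClient,
  orderId: string,
  amountCents: number,
): Promise<void> {
  await inventory.reserve(orderId);
  try {
    await withRetry(() => payment.charge(orderId, amountCents), 3);
  } catch (err) {
    // Compensation in the error handler: release what was reserved.
    await withRetry(() => inventory.release(orderId), 3);
    throw err;
  }
}
```

There is no saga state, event bus, or orchestrator here. The trade-off is that a crash between `reserve` and the error handler leaves the reservation in place, which a periodic reconciliation job would have to catch.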

The Transactional Outbox pattern is essential for reliable saga event delivery: each saga step must publish its completion event atomically with its local database write, or the saga will stall on a broker failure. The Idempotency Key pattern ensures that compensation steps and retries are safe to apply multiple times - a saga orchestrator will retry failed steps, and each participant must handle duplicate commands gracefully.
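The two patterns fit together, as the following sketch tries to show with in-memory stand-ins. All names (`LocalDb`, `relayOutbox`, `PaymentHandler`) are hypothetical. The outbox row is written together with the business row, a relay publishes unsent rows with at-least-once delivery, and the consumer uses an idempotency key to ignore duplicates.

```typescript
interface OutboxRow {
  id: number;
  topic: string;
  payload: { orderId: string };
  sent: boolean;
}

// Stand-in for one service's database: business rows and outbox rows live
// together so a single local transaction can write both.
class LocalDb {
  reservations = new Set<string>();
  outbox: OutboxRow[] = [];
  private nextId = 1;

  // Models one atomic transaction: both writes happen or neither does.
  reserveAndEnqueue(orderId: string): void {
    this.reservations.add(orderId);
    this.outbox.push({
      id: this.nextId++,
      topic: "inventory.reserved",
      payload: { orderId },
      sent: false,
    });
  }
}

// Relay: publishes unsent rows and marks each one sent only after the publish
// succeeds. A crash between publish and mark re-sends the row later, so
// delivery is at-least-once.
async function relayOutbox(
  db: LocalDb,
  publish: (row: OutboxRow) => Promise<void>,
): Promise<void> {
  for (const row of db.outbox) {
    if (row.sent) continue;
    await publish(row);
    row.sent = true;
  }
}

// Idempotent consumer: remembers processed keys so duplicates are no-ops.
class PaymentHandler {
  private processed = new Set<string>();
  charges = 0;

  async handle(idempotencyKey: string): Promise<void> {
    if (this.processed.has(idempotencyKey)) return;
    this.charges += 1; // the side effect that must not be repeated
    this.processed.add(idempotencyKey);
  }
}
```

If the broker is down, `relayOutbox` rejects and the row stays unsent, so a later relay run picks it up again. In a real service the processed-key set would be stored in the consumer's database and written in the same transaction as the side effect.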

References

  • Garcia-Molina, H. and Salem, K. "Sagas." ACM SIGMOD 1987. The original paper introducing the pattern.
  • Richardson, Chris. Microservices Patterns. Manning, 2018. Chapter 4 (Managing Transactions with Sagas).
  • Richardson, Chris. "Pattern: Saga." microservices.io/patterns/data/saga.html
  • Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly, 2017. Chapter 9 (Consistency and Consensus).

Related patterns

Idempotency Key

Retrying a failed API request can trigger duplicate side effects - charging a card twice, creating two accounts, or sending the same email multiple times.

Transactional Outbox

Events published directly inside a database transaction can be lost if the broker is unavailable, leaving the database and downstream consumers permanently out of sync.
