
Transactional Outbox

Events published directly inside a database transaction can be lost if the broker is unavailable, leaving the database and downstream consumers permanently out of sync.

Rickvian Aldi · Software engineer · 5 min read

Problem

In any system that combines a relational database with an event bus, there is a dual-write problem hiding in plain sight. A service needs to write a row to its database and publish an event to notify other services, but these are two separate systems with no shared transaction boundary. If the event is published after the database commit and the broker is momentarily unreachable (or the service crashes between the two steps), the event is lost. If the event is published before the commit and the commit fails, the event fires for something that never happened. Either way the system ends up inconsistent.

This is not a theoretical concern. It surfaces in order processing (order created, payment not charged), inventory updates (stock decremented, warehouse not notified), and user lifecycle events (account created, welcome email never sent). The failure window is small but the impact is large, and it grows worse under load when retry storms amplify partial failures.
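The two failure orderings described above can be reproduced with a toy in-memory model. This is only an illustrative sketch: ToySystem, commitOrder, publish, and the two ordering functions are hypothetical names, not part of any real database driver or broker client.

```typescript
type OrderEvent = { type: string; orderId: string };

// Hypothetical stand-in for "a database plus a broker" with switchable outages.
interface ToySystem {
  orders: string[];        // rows the database committed
  delivered: OrderEvent[]; // events the broker accepted
  dbUp: boolean;
  brokerUp: boolean;
}

function commitOrder(sys: ToySystem, orderId: string): void {
  if (!sys.dbUp) throw new Error("commit failed");
  sys.orders.push(orderId);
}

function publish(sys: ToySystem, event: OrderEvent): void {
  if (!sys.brokerUp) throw new Error("broker unreachable");
  sys.delivered.push(event);
}

// Ordering A: commit first, publish second.
function commitThenPublish(sys: ToySystem, orderId: string): void {
  commitOrder(sys, orderId);
  try {
    publish(sys, { type: "order.created", orderId });
  } catch {
    // The row is already committed, so the event is simply lost.
  }
}

// Ordering B: publish first, commit second.
function publishThenCommit(sys: ToySystem, orderId: string): void {
  publish(sys, { type: "order.created", orderId });
  try {
    commitOrder(sys, orderId);
  } catch {
    // The event is already out and cannot be recalled.
  }
}

const lost: ToySystem = { orders: [], delivered: [], dbUp: true, brokerUp: false };
commitThenPublish(lost, "o-1");
// lost.orders has 1 entry, lost.delivered has 0: an order nobody hears about.

const phantom: ToySystem = { orders: [], delivered: [], dbUp: false, brokerUp: true };
publishThenCommit(phantom, "o-2");
// phantom.orders has 0 entries, phantom.delivered has 1: an event for an order that does not exist.
```

Neither ordering can be made safe by reordering the two calls, which is what motivates moving the event into the database transaction itself.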

Forces

Atomicity vs. reach. A single database transaction can guarantee "either both writes happen or neither does," but that guarantee extends only to systems that share the same transaction log. An external broker is by definition outside that boundary.
Broker unavailability is normal. Network partitions, broker restarts, and brief connection drops happen in production. A design that requires the broker to be up at commit time is fragile.
At-least-once is easier than exactly-once. Distributed systems can generally deliver at-least-once semantics without cross-system coordination. Exactly-once processing in practice still depends on idempotent consumers, so designing for at-least-once delivery and idempotent handlers is the pragmatic default.
Eventual consistency is acceptable. Most event-driven integrations tolerate a small lag between the database state change and the downstream notification. Strict real-time coupling is rarely a hard requirement.

Solution

Write the event to an outbox table in the same database transaction as the business record. A separate relay process - either a polling worker or a CDC (Change Data Capture) subscriber on the database's replication log - reads the outbox, publishes the event to the broker, and marks it as sent.

Schema pattern:

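-- gen_random_uuid() is built in from PostgreSQL 13; older versions need the pgcrypto extension.
CREATE TABLE outbox_events (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_id TEXT NOT NULL,
  event_type   TEXT NOT NULL,
  payload      JSONB NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  sent_at      TIMESTAMPTZ  -- NULL until the relay has published the event
);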

Within the application transaction:

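await db.transaction(async (tx) => {
  // The business write and the outbox write commit or roll back together.
  await tx.insert(orders).values(newOrder);
  await tx.insert(outboxEvents).values({
    aggregateId: newOrder.id,
    eventType: 'order.created',
    // Assuming the ORM's JSONB column mapping serializes values itself:
    // pass the object, since pre-stringifying would store a JSON string.
    payload: newOrder,
  });
});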

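The relay reads unsent rows on a short interval (or streams changes from Postgres logical replication / Debezium) and publishes them. On publish success it sets sent_at. On failure it leaves the row unsent and retries on the next pass. If the relay crashes after publishing but before setting sent_at, the event is published again, which is why delivery is at-least-once rather than exactly-once. The business transaction never touches the broker.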
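A polling relay along these lines might look like the following minimal sketch. It assumes an in-memory array standing in for the outbox_events table and an injected publish function standing in for a broker client; OutboxRow, Publish, relayOnce, and startRelay are hypothetical names chosen for illustration.

```typescript
// Row shape mirrors the outbox_events columns; timestamps are epoch millis.
interface OutboxRow {
  id: string;
  eventType: string;
  payload: unknown;
  createdAt: number;
  sentAt: number | null; // null means "not yet published"
}

// Hypothetical broker call: resolves on success, rejects on failure.
type Publish = (row: OutboxRow) => Promise<void>;

// One relay pass: publish unsent rows oldest-first and mark each row as sent
// only after the broker accepted it. Returns how many rows were published.
async function relayOnce(rows: OutboxRow[], publish: Publish, batchSize = 100): Promise<number> {
  const pending = rows
    .filter((r) => r.sentAt === null)
    .sort((a, b) => a.createdAt - b.createdAt)
    .slice(0, batchSize);

  let sent = 0;
  for (const row of pending) {
    try {
      await publish(row);
    } catch {
      // Stop at the first failure so later events do not overtake this one.
      // The row stays unsent and is retried on the next pass.
      break;
    }
    // A crash between publish and this line would cause a duplicate publish.
    row.sentAt = Date.now();
    sent += 1;
  }
  return sent;
}

// Polling driver: one pass per interval, never two passes at the same time.
// Returns a function that stops the relay.
function startRelay(rows: OutboxRow[], publish: Publish, intervalMs = 1000): () => void {
  let busy = false;
  const timer = setInterval(() => {
    if (busy) return;
    busy = true;
    const release = () => { busy = false; };
    void relayOnce(rows, publish).then(release, release);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Stopping at the first failed publish is a design choice that preserves ordering within a batch at the cost of head-of-line blocking. A relay that does not need ordering could skip the failed row and continue instead.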

Figure: Transactional Outbox. The event is written atomically with the business record; a relay process handles broker delivery separately.

Messages that are part of a transaction should be stored in the same place as the transaction itself.

CDC vs. polling: CDC (using tools like Debezium with Kafka Connect) reads the database's replication log, so it provides near-real-time relay without running polling queries against the outbox table. Polling is simpler to operate but introduces latency of up to one polling interval and adds read load to the primary. For most applications, polling on a 1–5 second interval is adequate.
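For the polling variant, the queries involved can be sketched as below. The SQL targets the outbox_events table from the schema above; the index name and the helper names (pollUnsent, markSent) are assumptions made for illustration. The partial index is one common way to keep the unsent-row lookup cheap as sent rows accumulate, which limits the read load polling puts on the primary.

```typescript
// A { text, values } pair; node-postgres accepts this shape as a query config,
// but nothing in this sketch depends on that library.
interface ParameterizedQuery {
  text: string;
  values: unknown[];
}

// Partial index: only rows that still need publishing are indexed.
const createUnsentIndex = `
  CREATE INDEX IF NOT EXISTS outbox_events_unsent_idx
    ON outbox_events (created_at)
    WHERE sent_at IS NULL`;

// Fetch the oldest unsent events, at most batchSize per polling pass.
function pollUnsent(batchSize: number): ParameterizedQuery {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  return {
    text: `
      SELECT id, aggregate_id, event_type, payload
        FROM outbox_events
       WHERE sent_at IS NULL
       ORDER BY created_at
       LIMIT $1`,
    values: [batchSize],
  };
}

// Mark one event as published after the broker has accepted it.
function markSent(id: string): ParameterizedQuery {
  return {
    text: "UPDATE outbox_events SET sent_at = now() WHERE id = $1",
    values: [id],
  };
}
```

With a single relay instance these queries are enough. Running several relay instances in parallel would additionally need some form of row claiming so two workers do not publish the same batch, which this sketch does not attempt.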

Consumer idempotency: Because the relay delivers at-least-once, consumers must handle duplicate events gracefully. Store each processed event_id in the consumer's own database, ideally in the same transaction as the handler's side effects, and skip any event whose event_id is already recorded.
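The deduplication step can be sketched as a small wrapper around the real handler. This is a simplified in-memory illustration: makeIdempotentHandler is a hypothetical helper, and the Set stands in for what would normally be a table with a unique constraint on event_id in the consumer's database.

```typescript
// Wraps a handler so that each eventId is applied at most once.
function makeIdempotentHandler<E extends { eventId: string }>(
  apply: (event: E) => void,
): (event: E) => "applied" | "duplicate" {
  // Stand-in for a processed_events table in the consumer's own database.
  const processed = new Set<string>();

  return (event: E) => {
    if (processed.has(event.eventId)) return "duplicate";
    apply(event);
    // Recorded only after apply succeeded, so a failed attempt is retried on
    // redelivery. With a real database, both steps belong in one transaction.
    processed.add(event.eventId);
    return "applied";
  };
}

// Usage: the welcome email goes out once even if the relay delivers twice.
interface AccountCreated {
  eventId: string;
  email: string;
}

const emailsSent: string[] = [];
const onAccountCreated = makeIdempotentHandler((e: AccountCreated) => {
  emailsSent.push(e.email);
});

const firstDelivery = onAccountCreated({ eventId: "evt-1", email: "a@example.com" });
const redelivery = onAccountCreated({ eventId: "evt-1", email: "a@example.com" });
// firstDelivery is "applied", redelivery is "duplicate", emailsSent has 1 entry.
```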

When NOT to Use

High-frequency, low-latency event streams. If you need sub-100ms end-to-end delivery and are publishing millions of events per second, the polling overhead and table growth of an outbox may become a bottleneck. Consider native broker transactions (Kafka exactly-once semantics) or a different architecture.
Simple, single-service deployments. If a service owns its data and has no downstream consumers, the outbox is unnecessary ceremony. Apply it at integration boundaries, not everywhere.
When your broker already provides transactional publish. Kafka's transactional producer (enable.idempotence=true plus a transactional.id) makes writes across multiple topics and partitions atomic, but the transaction covers Kafka only, not an external database. For Kafka-native architectures that keep their state in Kafka, for example a Kafka Streams topology, this can eliminate the outbox.
Read-heavy systems where writes are rare. The outbox adds write amplification (one extra row per business write) plus a relay process to operate. The extra row is usually cheap, but in nearly-read-only systems the added moving parts may be over-engineering.

The Outbox pattern pairs naturally with the Saga pattern: a saga orchestrates a multi-step distributed transaction and relies on reliable event delivery between steps - exactly what the outbox guarantees. It also works alongside the Idempotency Key pattern: the relay's at-least-once delivery makes idempotent consumers a requirement, not an option.

References

Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly, 2017. Chapter 11 (Stream Processing).
Richardson, Chris. "Pattern: Transactional Outbox." microservices.io/patterns/data/transactional-outbox.html
Debezium documentation. "Outbox Event Router." debezium.io/documentation/reference/transformations/outbox-event-router.html
Hohpe, G. and Woolf, B. Enterprise Integration Patterns. Addison-Wesley, 2003.

messaging distributed-systems consistency event-driven
