Feature Flag Kill Switch

Deploying code to production is irreversible in the short term - when something goes wrong, rolling back requires another deploy, which takes time and may have its own risks.

Rickvian Aldi·Software engineer·5 min read

Problem

Every deploy is a bet. You have tested in staging, reviewed the code, and run the canary, and still production finds the bugs that staging misses. When a newly deployed feature causes latency spikes, error surges, or data corruption, the right move is immediate rollback. But rollback means another deploy: cut a revert commit, merge it, wait for CI, deploy, wait for pods to cycle. Even with a fast pipeline this can take 10–30 minutes. In a payments or authentication path, 10 minutes of elevated errors is expensive.

The deeper problem is that code deployment and feature availability are treated as the same event. They don't have to be.

Forces

  • Deploys are slow and risky. Every deploy is an opportunity to introduce a new bug, and a rollback commit is just another change made at a bad moment.
  • Dark launching and feature launching carry different risks. Shipping code that is never executed is low risk. Flipping a flag to execute that code is higher risk but much faster to reverse.
  • Percentage rollouts. New features often need to be shown to 1%, then 5%, then 50%, then 100% of traffic - not deployed to 1% of servers. Flag-driven rollout gives you this control at runtime without infrastructure changes.
  • Cross-service coordination. Multiple services sometimes need to be updated together. Deploying them all with new code behind a flag, then flipping the flag when all services are ready, avoids the window where service A has the new behavior and service B does not.

Solution

Separate the act of deploying code from the act of enabling behavior. New code ships behind a flag check. A configuration store (a database table, a dedicated service like LaunchDarkly or Unleash, or a simple Redis key) determines whether the flag is enabled for a given user, cohort, or globally. To roll back, flip the flag - no deploy required.

Minimal implementation:

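// flags.ts - thin wrapper around your flag store.
// Assumes flagCache (a read-through cache), db (the flag repository), and
// murmurhash3 (a 32-bit string hash) are imported from elsewhere in your codebase.
export async function isEnabled(
  flag: string,
  context: { userId?: string; env?: string }
): Promise<boolean> {
  // Read through the cache; hit the database only on a miss.
  const rule = await flagCache.get(`flag:${flag}`, () =>
    db.flags.findByName(flag)
  );
  // Unknown or disabled flags are off for everyone (the kill switch case).
  if (!rule || !rule.enabled) return false;
  // Allow-listed users always get the feature; anonymous requests never match.
  if (context.userId && rule.allowList?.includes(context.userId)) return true;
  if (rule.percentage != null) {
    // Hash flag + user so each user keeps a stable bucket in [0, 100) per flag.
    // Note: all anonymous requests share the single 'anon' bucket.
    const hash = murmurhash3(`${flag}:${context.userId ?? 'anon'}`);
    // >>> 0 guards against hash implementations that return a signed integer.
    return ((hash >>> 0) % 100) < rule.percentage;
  }
  return true;
}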

At the call site:

async function processPayment(order: Order, user: User): Promise<PaymentResult> {
  if (await isEnabled('new-payment-flow', { userId: user.id })) {
    return newPaymentFlow(order, user);
  }
  return legacyPaymentFlow(order, user);
}
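
The percentage branch of isEnabled only works if each user lands in the same bucket on every request. Here is a minimal sketch of that bucketing, using FNV-1a as a dependency-free stand-in for murmurhash3. The names fnv1a, bucketFor, and inRollout are illustrative and not part of any library.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units. A stand-in for murmurhash3
// chosen only so this sketch has no dependencies.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, 32-bit wraparound
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a (flag, user) pair to a stable bucket in [0, 100).
function bucketFor(flag: string, userId: string): number {
  return fnv1a(`${flag}:${userId}`) % 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function inRollout(flag: string, userId: string, percentage: number): boolean {
  return bucketFor(flag, userId) < percentage;
}
```

Because a user's bucket never changes, raising the percentage from 5 to 50 only adds users; nobody who already had the feature loses it. Including the flag name in the hash input means different flags select different subsets of users.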

The kill switch variant is the simplest form: a single boolean flag that globally disables a feature. When an alert fires, an engineer flips the flag in the admin panel and the feature is off within seconds (bounded by the TTL of the cache layer).
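
A kill switch is only useful if the check itself cannot become the outage. One possible approach, sketched here as an assumption rather than a prescribed design, is to wrap the flag lookup so that a failing flag store falls back to a safe default (normally "off", which routes traffic to the legacy path). FlagLookup and withSafeDefault are hypothetical names.

```typescript
// A flag lookup is any async function from flag name to on/off.
type FlagLookup = (flag: string) => Promise<boolean>;

// Wrap a lookup so that errors from the flag store resolve to a safe default
// instead of propagating to the caller.
function withSafeDefault(lookup: FlagLookup, safeDefault: boolean = false): FlagLookup {
  return async (flag: string): Promise<boolean> => {
    try {
      return await lookup(flag);
    } catch {
      // Flag store unreachable or misbehaving: behave as if the flag had
      // the safe default value.
      return safeDefault;
    }
  };
}
```

Whether the safe default should be off or on depends on the feature. For a new code path, off is usually the conservative choice. For a flag that gates a protective behavior such as rate limiting, on may be safer.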

A feature flag gives you the ability to make a decision in production that you cannot make in code review. That is its real value.

Flag lifecycle discipline: Flags are technical debt. Every flag is a branch in every code path it touches. Agree upfront on an expiry plan: after the feature has rolled out to 100% and been stable for N weeks, delete the flag and its associated old code path. Accumulated flags make the codebase hard to reason about and tests expensive to maintain.
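
The expiry plan can be partly automated. The sketch below lists flags that are fully rolled out and have not been modified for a given number of weeks, so they can be reviewed for deletion. The FlagRecord shape and the removalCandidates function are assumptions made for illustration. The "unchanged for N weeks" test is only a proxy for "stable"; a real process would also consult error rates and flag owners.

```typescript
// Minimal view of a stored flag (hypothetical shape for this sketch).
interface FlagRecord {
  name: string;
  enabled: boolean;
  percentage: number | null; // null means no percentage gate
  updatedAt: Date;
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Return the names of flags that are on for everyone and have not been
// touched for at least `stableWeeks` weeks.
function removalCandidates(
  flags: FlagRecord[],
  stableWeeks: number,
  now: Date = new Date()
): string[] {
  const cutoff = now.getTime() - stableWeeks * WEEK_MS;
  return flags
    .filter((f) => f.enabled && (f.percentage === null || f.percentage === 100))
    .filter((f) => f.updatedAt.getTime() <= cutoff)
    .map((f) => f.name);
}
```

Running a check like this on a schedule and opening a ticket per candidate is one way to keep the number of live flags bounded.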

Schema pattern for a homegrown flag store:

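-- PostgreSQL syntax (TEXT[] and TIMESTAMPTZ are Postgres types).
CREATE TABLE feature_flags (
  name        TEXT PRIMARY KEY,
  enabled     BOOLEAN NOT NULL DEFAULT false,               -- master switch; false = off for everyone
  percentage  INTEGER CHECK (percentage BETWEEN 0 AND 100), -- NULL = no percentage gate
  allow_list  TEXT[],                                       -- user IDs that always get the feature
  updated_at  TIMESTAMPTZ DEFAULT now()                     -- set on insert; update it explicitly on change
);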
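
The table uses snake_case columns while the isEnabled wrapper reads camelCase fields such as allowList. A small mapping step can bridge the two. This sketch assumes the SQL driver returns rows shaped like FeatureFlagRow; the type and function names are hypothetical.

```typescript
// Row as an SQL driver might return it for feature_flags (assumed shape).
interface FeatureFlagRow {
  name: string;
  enabled: boolean;
  percentage: number | null;
  allow_list: string[] | null;
  updated_at: Date;
}

// Rule shape consumed by the flag evaluation code.
interface FlagRule {
  name: string;
  enabled: boolean;
  percentage?: number;
  allowList?: string[];
  updatedAt: Date;
}

// Convert a database row to a rule, turning SQL NULLs into absent fields.
function toFlagRule(row: FeatureFlagRow): FlagRule {
  return {
    name: row.name,
    enabled: row.enabled,
    percentage: row.percentage ?? undefined,
    allowList: row.allow_list ?? undefined,
    updatedAt: row.updated_at,
  };
}
```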

When NOT to Use

  • Permanent configuration. If the behavior controlled by the flag will never need to vary by user or be turned off after launch, it is not a flag - it is a constant. Use an environment variable or a config file.
  • High-cardinality decisions. Flags work well for decisions made once globally or once per user. Varying behavior per record across millions of database rows is better handled by a schema migration or a column on the record itself.
  • Secret or security-sensitive branching. Feature flags are often readable by engineers and operators. If the branch being controlled is security-sensitive (e.g., bypassing an auth check in an emergency), the flag store must have strong access controls and an audit log. Most lightweight flag stores do not provide this.
  • Performance-critical hot paths without caching. If your flag check is in the innermost loop of a high-throughput path and you haven't cached the flag evaluation, the flag lookup becomes a latency contributor. Always cache flag reads.

Read-Through Cache is essential alongside feature flags: flag evaluation must be fast enough to add to every request. Caching flag state with a short TTL (1–5 seconds) keeps evaluation sub-millisecond and avoids hammering the flag store. The two patterns together produce a safe, fast feature deployment system.
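
A minimal in-process sketch of such a cache, with the same get(key, loader) shape that the flagCache call in isEnabled assumes. TtlCache is a hypothetical name, and the injectable clock exists only to make the expiry behavior easy to test.

```typescript
type Loader<T> = () => Promise<T>;

// Read-through cache with a fixed TTL per entry.
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value if it is still fresh; otherwise call the loader,
  // store the result, and return it.
  async get(key: string, load: Loader<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;
    const value = await load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}
```

The TTL is the upper bound on how long a flipped flag takes to reach a given process, so a 2-second TTL means a kill switch takes effect within roughly 2 seconds. This sketch does not deduplicate concurrent loads for the same key; under heavy traffic a production cache would usually add that.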

References

  • Hodgson, Pete. "Feature Toggles (aka Feature Flags)." martinfowler.com/articles/feature-toggles.html
  • LaunchDarkly. "Feature Flag Best Practices." launchdarkly.com/blog/feature-flag-best-practices/
  • Accelerate: The Science of Lean Software and DevOps. Forsgren, Humble, Kim. IT Revolution, 2018. Chapter 4 (Technical Practices).

Related patterns

Read-Through Cache

Fetching the same data from a slow database on every request wastes latency and database capacity, but manually managing cache population across every call site is error-prone and hard to keep consistent.

caching · performance · database · latency · read-heavy
