Meta-Stable Failure: When Your System Is Up But Completely Down
The most dangerous distributed systems failures are the ones where everything looks fine, until it doesn't. Here's the failure mode that buries on-call engineers.
There's a failure mode that will ruin your week. Not because it's loud; it won't be. Your services stay up. Metrics look mostly reasonable. Alerts don't fire immediately. And yet your system is producing no useful output. Users are getting errors. Revenue is going to zero. And your dashboards are telling you everything is fine.
This is a meta-stable failure. It's the failure mode that keeps distributed systems engineers up at night precisely because it doesn't look like a failure.
This was the first time I had heard the term "meta-stable failure." It sparked my curiosity, so I dug deeper.
Up, But Down
The formal definition comes from the 2021 HotOS paper "Metastable Failures in Distributed Systems" by Bronson, Aghayev, Charapko, and Zhu: "Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed."
Read that last clause again. Even when the trigger is removed.
This is what separates meta-stable failures from ordinary outages. In a normal incident, you find the cause, fix it, and the system recovers. In a meta-stable failure, you fix the cause and the system stays broken. The failure has become self-sustaining. You're not fighting the original trigger anymore; you're fighting the system's own behavior.
The Sustaining Loop
To understand why, you need to understand what's actually happening mechanically.
Imagine a backend service under normal load. A routine deployment causes a brief latency spike. Clients, following best practices, retry their failed requests. But those retries arrive while the service is already struggling, which increases load further. The higher load causes more failures. More failures cause more retries. More retries cause more load.
The original trigger, the deployment, is long gone. The latency spike from the deploy resolved in 30 seconds. But the feedback loop it kicked off is still running, hours later, at full intensity. Your service is extremely busy processing retries for requests it will never successfully complete. It has high throughput. It has near-zero goodput, meaning useful work actually delivered to users.
The system is "up." In a meaningful sense, it's completely down.
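To make the loop concrete, here is a toy simulation in Python. It is a sketch under loud assumptions, not a model of any real service: every failed request is retried on the next tick with no cap, and once offered load exceeds capacity, goodput is assumed to fall as capacity squared divided by offered load (work gets spread across requests that time out anyway). The function name, parameters, and numbers are all made up for illustration.

```python
def simulate(ticks=60, arrivals=80.0, capacity=100.0,
             trigger=range(10, 13), degraded_capacity=40.0):
    """Return goodput per tick for a toy service whose clients retry every failure.

    The trigger is a short capacity dip (think: a deploy). All numbers are
    hypothetical and only illustrate the shape of the feedback loop.
    """
    backlog = 0.0            # failed requests that will be retried next tick
    goodput_history = []
    for t in range(ticks):
        cap = degraded_capacity if t in trigger else capacity
        offered = arrivals + backlog          # new traffic plus retries
        if offered <= cap:
            goodput = offered                 # healthy: everything succeeds
        else:
            # Assumed overload model: the service spreads its effort over all
            # requests, so fewer finish before timing out as load grows.
            goodput = cap * cap / offered
        backlog = offered - goodput           # every failure comes back as a retry
        goodput_history.append(goodput)
    return goodput_history


if __name__ == "__main__":
    history = simulate()
    print("goodput before trigger:", history[9])
    print("goodput long after trigger ended:", round(history[-1], 2))
```

In this model the capacity dip lasts three ticks, yet goodput keeps falling long after capacity is back to normal, because the retry backlog alone keeps offered load above the point where the service can catch up. Run it with trigger=range(0) and goodput sits at the arrival rate forever.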
Why Good Engineering Made It Worse
Here's the uncomfortable part. Every mechanism in that loop was put there deliberately, by thoughtful engineers, for good reasons.
Retries exist because transient failures are real. Networks drop packets. A request that fails once often succeeds on the second attempt. Without retries, users see errors from hiccups that the system would have handled fine with one more chance.
Timeouts exist because hanging requests are worse than failed ones. A request that waits forever holds resources (threads, connections, memory) that other requests need. Timeouts free those resources so the system can keep serving other traffic.
Backoff is supposed to prevent retry storms, and in mild overload scenarios it works well. But in deep overload, even exponential backoff with jitter doesn't reduce enough load to allow recovery. The system is so far from its recovery threshold that even well-behaved clients can't get it there.
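For reference, this is roughly what exponential backoff with full jitter looks like, along with a back-of-the-envelope estimate of why it stops helping in deep overload. It is a minimal sketch: backoff_delay and steady_retry_rate are hypothetical helper names, and the base and cap values are arbitrary. The estimate ignores request duration and assumes every stuck client has already reached the delay cap.

```python
import random


def backoff_delay(attempt, base=0.1, cap=30.0, rng=random):
    """Full-jitter delay in seconds before retry number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))   # exponential growth, capped
    return rng.uniform(0.0, ceiling)            # jitter spreads clients out


def steady_retry_rate(stuck_clients, cap=30.0):
    """Approximate retries per second once every client sits at the cap.

    The mean of uniform(0, cap) is cap / 2, so each client retries about
    once every cap / 2 seconds (ignoring how long each request takes).
    """
    return stuck_clients / (cap / 2.0)
```

With 10,000 stuck clients and a 30 second cap, that estimate is still about 667 retries per second on top of new traffic. Backoff lowers the retry rate, but it puts a floor under it too, and that floor can sit above what an overloaded service is able to absorb.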
Each of these was the right decision in isolation. The meta-stable failure emerges from their interaction under specific conditions.
The Trigger Is Not The Problem
This is the key insight that changes how you think about these failures, and the one that gets buried when teams do post-mortems.
A meta-stable failure has two distinct phases: a trigger that pushes the system into the bad state, and a sustaining mechanism that keeps it there. Engineers almost always focus their post-mortem on the trigger: the deployment, the traffic spike, the upstream timeout. They fix it and declare the incident closed.
But the trigger is almost incidental. A different trigger (a thundering herd, a cache invalidation, a brief network partition) would have caused exactly the same outcome. The system has a stability property that makes it vulnerable: under certain load conditions, a feedback loop can lock it into a degraded state it cannot escape from on its own.
Fix the trigger and you've prevented this incident. Fix the sustaining mechanism and you've changed the system's fundamental failure behavior.
Four Places Sustaining Loops Hide
The HotOS paper identifies common sources. These aren't exotic; they're patterns you've almost certainly shipped.
Retry logic without load shedding. Retries increase load during overload. If your service doesn't have a mechanism to start rejecting work (actively, deliberately, with a 503), it has no way to reduce its own load. The retry traffic fills every available slot. The only escape is an operator manually cutting traffic.
Caching with thundering herds. A cache miss causes a backend query. Under normal load, this is fine: misses are occasional and the backend handles them. But a cache eviction or flush can cause every node to miss simultaneously, sending a spike to the backend that exceeds its capacity, causing timeouts, causing retries, filling the queue. The cache can't repopulate if the backend can't respond, so the thundering herd sustains itself.
Slow error paths. Your fast path is highly optimized. Your error path, the code that runs when something goes wrong, was written once and never touched again. If error handling is significantly slower than normal processing, failures consume more resources per request than successes. At high error rates, this means the system is burning capacity on work that produces no value. The more it fails, the less capacity it has to succeed.
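The arithmetic behind the slow error path is worth spelling out. The sketch below assumes a fixed processing budget per second and a fixed cost per successful and per failed request; the function name and all numbers are hypothetical.

```python
def max_goodput(budget_ms, success_cost_ms, error_cost_ms, error_rate):
    """Successful requests per second a service can deliver.

    budget_ms:  processing time available per second (e.g. 1000 for one core)
    error_rate: fraction of requests that end up on the error path (0 to 1)
    """
    mean_cost = (1 - error_rate) * success_cost_ms + error_rate * error_cost_ms
    throughput = budget_ms / mean_cost        # total requests handled per second
    return throughput * (1 - error_rate)      # only the successes count
```

With a 1000 ms budget, a 1 ms success path, and a 10 ms error path, a 0% error rate gives 1000 successful requests per second. At a 50% error rate the same machine delivers roughly 91, because most of its time goes to handling failures.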
Autoscaling with lag. Cloud autoscaling is not instantaneous. There's a window, sometimes several minutes, between when load exceeds capacity and when new instances are ready. During that window, if the load is sustained by a feedback loop, the existing instances may saturate before help arrives. New instances that join an already-overloaded cluster can get immediately hammered and fail to start up successfully at all.
What Recovery Actually Looks Like
If you're in a meta-stable failure and you remove the trigger, the system won't recover. It's locked in the bad state. You need to do something that breaks the sustaining loop.
The most common effective interventions:
Reduce load below the recovery threshold. If retries are sustaining the loop, stop the retries. Cut client traffic at the load balancer. Drop the offered load far enough that the service can start clearing its backlog. This feels wrong because you're making the user-visible failure more severe, but it's the only way to let the system breathe.
Restart the sustaining mechanism. If cache state is part of the loop, a controlled warm-up sequence (where you rebuild cache before allowing full traffic) can break the cycle. If connection pool exhaustion is sustaining the failure, recycling connections on the server side can free the system.
Emergency throttling. A hard rate limit, not a soft one, that doesn't grow under retry pressure. Circuit breakers that stay open long enough for backend queues to drain, not just long enough to let one request through.
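As an illustration of "stays open long enough," here is a minimal circuit breaker sketch with a fixed cooldown. The class and parameter names are hypothetical, and it simplifies the half-open state: once the cooldown has passed, it lets requests through until one of them fails, where a production breaker would usually limit how many probes it admits.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and stays open for a full cooldown."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds      # how long the backend gets to drain
        self.clock = clock                    # injectable so it can be tested
        self.failures = 0
        self.opened_at = None                 # None means the breaker is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Open: reject everything until the cooldown has fully elapsed.
        return self.clock() - self.opened_at >= self.open_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None                 # close the breaker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()     # open, or restart the cooldown
```

The design choice that matters here is open_seconds. If it is shorter than the time the backend needs to drain its queues, the breaker closes into a system that is still overloaded and the loop resumes.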
How to Build Systems That Don't Get Trapped
The sustaining loop is the target, not the trigger. That reframes what resilience engineering actually means for this failure mode.
Give your services a way to shed load. Every service that accepts open-loop traffic needs a pressure valve. This means returning 503 when load exceeds a threshold, not queueing indefinitely. Queues that grow without bound under load are a meta-stable failure waiting to happen. They accept work the service will never complete, holding capacity that could be freed.
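A bounded queue is the simplest version of that pressure valve. The sketch below uses Python's standard queue module; SheddingServer, the max_queue value, and the choice of 202 for accepted work are assumptions for illustration, not a recommendation for a specific framework.

```python
import queue


class SheddingServer:
    """Accepts work into a bounded queue and rejects the rest immediately."""

    def __init__(self, max_queue=100):
        # A finite maxsize is the whole point: the queue cannot grow without bound.
        self.pending = queue.Queue(maxsize=max_queue)

    def submit(self, request):
        """Return an HTTP-style status: 202 if queued, 503 if shed."""
        try:
            self.pending.put_nowait(request)
        except queue.Full:
            return 503    # shed load instead of accepting work we can't finish
        return 202
```

A rejected request costs almost nothing to handle, which is what lets the service keep its capacity for work it can actually complete.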
Treat your error paths with the same care as your fast paths. Profile them. Load test them. If your error handling is two orders of magnitude slower than normal processing, it will become a resource sink at exactly the wrong time.
Design for controlled recovery. The hardest operational moment in a meta-stable failure is the transition back to normal. Load needs to ramp up slowly: ramp too fast and the recovering system re-enters the bad state. Traffic ramp-up procedures, cache warm-up sequences, and connection pre-warming aren't nice-to-haves; they're the mechanism that prevents re-entry.
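A ramp-up procedure can be as simple as a schedule of admitted-traffic fractions. This is a sketch with made-up defaults (start at 5%, double every 60 seconds); ramp_schedule is a hypothetical helper. A real procedure would also check health at each step before moving to the next one.

```python
def ramp_schedule(start_fraction=0.05, factor=2.0, hold_seconds=60):
    """Return (elapsed_seconds, admitted_fraction) steps ending at full traffic."""
    if not 0 < start_fraction <= 1 or factor <= 1:
        raise ValueError("need 0 < start_fraction <= 1 and factor > 1")
    steps = []
    fraction, elapsed = start_fraction, 0
    while fraction < 1.0:
        steps.append((elapsed, fraction))
        fraction *= factor            # grow admitted traffic geometrically
        elapsed += hold_seconds       # hold each level before stepping up
    steps.append((elapsed, 1.0))      # final step: all traffic admitted
    return steps
```

With the defaults this admits 5%, 10%, 20%, 40%, and 80% of traffic before reaching 100% after five minutes.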
Think in feedback loops, not component failures. Most distributed systems reliability work treats failure as: component X failed, what is our tolerance for X failing? Meta-stable failures are a different animal. They emerge from the interaction of functioning components. The question isn't "what happens when X fails" but "what happens to the system's dynamics when load pushes past a threshold." Control theory, not traditional reliability engineering, is the right mental model.
The Honest Thing About Post-Mortems
When you've been through a meta-stable failure and you're writing the post-mortem, the trigger makes for a satisfying story. There was a deploy. There was a traffic spike. Here's what we did wrong, here's the fix, here's the test to prevent it next time.
But if you stop there, you've documented a specific trigger while leaving the underlying failure mode intact. The next time, different trigger, same catastrophic loop, same hours of degradation.
The meta-stable failure asks a harder question: not "what went wrong" but "what property of this system makes it capable of getting trapped like this?" That question doesn't have a clean action item. It has architectural answers that take quarters to implement. It requires reasoning about your system's dynamics under load, not just its behavior under component failure.
That's the harder work. And it's the only work that actually changes the failure mode.
Reference video: https://www.youtube.com/watch?v=u3GjIXP9N0s