Thundering Herd: What a 10-Minute Video Taught Me About Retries
I thought I understood API retries. Then I watched Arpit Bhayani explain the thundering herd problem, and realized every retry I'd ever written was either part of the fix - or part of the fire.
I watched a ten-minute video this week that rearranged the way I think about retries. It was Arpit Bhayani's walk-through of the thundering herd problem - one of those talks that is short, unhurried, and pointed enough to leave a mark. I had written retry logic dozens of times. I had read the AWS blog post about jitter. I thought I understood it. I did not.
The thing that got me was not any single technical fact. It was the small realization that I had been holding two incompatible ideas in my head at the same time - "retries make systems more reliable" and "retries cause outages" - without ever forcing them into the same room.
This article is me forcing them into the same room.
The Setup That Made It Click
Arpit opens with a scene that almost every backend engineer has lived through, even if they did not realize it. A service has a brief hiccup. Maybe the deploy took a few seconds longer than expected. Maybe a database failed over. Maybe a downstream dependency got slow. The clients, behaving exactly as their engineers intended, notice the failures and retry.
That is where the trouble starts.
Each client is a single, well-behaved program. None of them are misbehaving. None of them are attacking your service. They are doing the correct thing - retrying a failed request so that the end user sees success instead of an error. And yet, in aggregate, they form a wave of traffic that hits your already-struggling service at exactly the moment it least wants to see it.
The image Arpit uses is the one the name implies: a herd of animals, all waking up at the same moment and stampeding in the same direction. Individually, each one is just running. Collectively, they trample whatever is in the way.
That framing was new to me. I had always thought of the thundering herd as something that happened to a system. Arpit's point is that it is something the system's own clients produce, following instructions the system's own engineers wrote.
Retries Are Not Free
The first insight I took from the video is one I had technically known but never really internalized: every retry is a request. That sounds obvious. It is not.
A retry feels, from the client's perspective, like a second chance. The first attempt failed, so we try again, and if the second succeeds we are whole. No harm done. But the server does not see it that way. The server sees two requests, both of which it must process, both of which consume CPU and memory and a connection slot, even if the first one failed quickly.
When you multiply that by thousands of clients, the math changes. A service running at 80% capacity can handle its load. The same service, if every client suddenly sends twice as many requests as normal, is being offered 160% of its capacity - and capacity does not stretch that way. Throughput under overload does not degrade linearly. Past some threshold, queues build up and everything gets slower, which causes more timeouts, which causes more retries.
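To make that arithmetic concrete, here is a minimal sketch. The function names are mine, and the model is deliberately crude: it assumes every attempt fails independently with the same probability, which is not true of a real overloaded service, so treat the numbers as illustration only.

```python
def attempts_per_request(failure_rate: float, max_attempts: int) -> float:
    """Expected attempts per logical request.

    Assumption: each attempt fails independently with probability
    `failure_rate`, and the client gives up after `max_attempts`.
    """
    # 1 + p + p^2 + ... + p^(max_attempts - 1)
    return sum(failure_rate ** k for k in range(max_attempts))


def offered_load(baseline_utilization: float, amplification: float) -> float:
    """Load offered to the server as a fraction of its capacity."""
    return baseline_utilization * amplification


# Every client sends each request twice: 80% utilization becomes 160% offered.
print(offered_load(0.8, 2))

# Half of all attempts fail and clients try up to 3 times.
print(attempts_per_request(0.5, 3))
```

The second function is the uncomfortable one: the amplification factor grows with the failure rate, so the worse the service is doing, the more work its clients hand it.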
I had read about this. I had not understood it as my retry logic's fault.
Exponential Backoff, and the Trap Inside It
The standard answer to naive retries is exponential backoff. Wait one second, then two, then four, then eight. The rationale is clean: each retry gives the server more time to recover, and a client that keeps failing will exponentially reduce the load it offers.
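As a sketch, the schedule described above is just a geometric sequence. The parameter names and the `cap` on the maximum delay are my additions for illustration; the cap is a common safeguard, not something taken from the video.

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_retries: int = 4, cap: float = 60.0) -> list[float]:
    """Delay in seconds before each retry: base, base*factor, base*factor^2, ...

    `cap` (an assumption of this sketch) keeps the delay from growing forever.
    """
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0]
```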
This is what I had implemented, many times. I thought of it as the solved version of the problem.
What Arpit's video pointed out - and the thing I want to make sure lands, because this is the piece I had genuinely missed - is that exponential backoff alone does not solve the thundering herd. It only moves it.
Here is why. Suppose a thousand clients all hit a failure at the same moment. They all wait one second. They all retry at the same moment. They all fail. They all wait two seconds. They all retry at the same moment. They all fail. The waves of traffic are farther apart now, but each wave is still synchronized. The herd is still thundering - it is just thundering at regular intervals instead of continuously.
If your service cannot handle a thousand simultaneous requests, it does not matter whether those requests arrive every second or every minute. Each wave will saturate it. Each saturation will produce more failures. Each failure will trigger the next synchronized retry.
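A tiny simulation shows the synchronization. This is a sketch under two simplifying assumptions: every client fails at time zero, and every retry fails as well. The helper names are hypothetical.

```python
from collections import Counter


def retry_times(first_failure_at: float, delays: list[float]) -> list[float]:
    """Absolute times at which one client retries, given its backoff delays."""
    times, now = [], first_failure_at
    for delay in delays:
        now += delay
        times.append(now)
    return times


def wave_sizes(num_clients: int, delays: list[float]) -> dict[float, int]:
    """Number of requests arriving at each instant, with no jitter."""
    counts: Counter = Counter()
    for _ in range(num_clients):
        # Assumption: all clients failed at t=0 and keep failing.
        for t in retry_times(0.0, delays):
            counts[t] += 1
    return dict(counts)


print(wave_sizes(1000, [1.0, 2.0, 4.0]))
# {1.0: 1000, 3.0: 1000, 7.0: 1000}
```

The gaps between waves grow, but the size of each wave never shrinks. That is the whole problem in one dictionary.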
The quiet, uncomfortable insight is that well-behaved clients obeying well-written retry logic can still collectively DDoS the service they are trying to talk to.
Jitter Is the Missing Piece
This is where jitter comes in, and why it is more than a small implementation detail.
Jitter means adding a random offset to the backoff delay. Instead of every client waiting exactly four seconds, each client waits somewhere between, say, two and six seconds. The randomness spreads the herd out in time. Instead of a wave, you get a stream. Instead of a thousand simultaneous retries, you get a thousand retries scattered across a four-second window, roughly 250 per second - a rate the service has a far better chance of absorbing.
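Here is a rough way to see the difference in numbers. The sketch compares a thousand clients that all wait exactly four seconds with a thousand that each wait a uniformly random time between two and six seconds, then counts the busiest 100 ms slice. The bucket size, the seed, and the function names are arbitrary choices of mine.

```python
import random
from collections import Counter


def jittered_delays(num_clients: int, low: float, high: float,
                    seed: int = 0) -> list[float]:
    """One random delay per client, drawn uniformly from [low, high]."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [rng.uniform(low, high) for _ in range(num_clients)]


def peak_per_bucket(times: list[float], bucket_seconds: float = 0.1) -> int:
    """Largest number of requests landing in any single time bucket."""
    buckets = Counter(int(t / bucket_seconds) for t in times)
    return max(buckets.values())


fixed = [4.0] * 1000                       # backoff without jitter
spread = jittered_delays(1000, 2.0, 6.0)   # backoff with jitter

print(peak_per_bucket(fixed))   # 1000: everyone lands in the same slice
print(peak_per_bucket(spread))  # a small fraction of that
```

The total number of retries is identical in both cases. Only the peak changes, and the peak is what the server has to survive.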
The service gets breathing room. The retries eventually succeed. The system recovers. Nobody notices.
What was interesting to me is how small this change is in code. It is literally one extra line - adding a random value to the sleep before the retry. The effect is enormous: the behavior of the whole system depends on a function call that looks, to anyone reading the code, like a minor detail.
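For reference, this is roughly what that one line looks like inside a retry loop. It is a minimal sketch, not a production client: `call_with_retries` and its parameters are hypothetical, the jitter range of 0.5x to 1.5x simply mirrors the two-to-six-seconds example above, and there are other jitter strategies I am not showing.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random):
    """Call operation(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # assumption: real code should catch only transient errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            backoff = min(max_delay, base_delay * 2 ** attempt)
            # The one extra line: scale the delay by a random factor in [0.5, 1.5].
            delay = backoff * rng.uniform(0.5, 1.5)
            sleep(delay)
```

The `sleep` and `rng` parameters exist only so the function can be exercised without real waiting or real randomness. Delete the `rng.uniform` line and the code still works, still passes review, and still stampedes.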
That asymmetry is, I think, the reason this problem keeps happening in real-world systems. The fix is trivial to implement. Recognizing that you need it - that backoff alone is not enough - is the hard part.
The Framing Shift
The part of the video that I keep coming back to is not any of the specific techniques. It is a framing shift in how I think about retries at all.
Before, I saw retries as a reliability mechanism - something I added to a client to protect users from transient failures. The mental model was: my client is defending itself against an unreliable network.
After, I see retries as a source of load - something I am producing that arrives at a server that may not be ready to receive it. The mental model is now closer to: my client is offering additional work to a service that may be in no state to accept it.
Both framings are true. But they lead to very different code. The first framing optimizes for "retry until success." The second framing optimizes for "retry in a way that does not make the failure worse."
I do not think I would have made that shift by reading a definition of the thundering herd problem. I had read definitions. What the video did was put it in a sequence - naive retry, then backoff, then backoff-plus-jitter - where each step visibly failed to solve the problem until the last one. Following the failure made the reasoning stick in a way that stating the conclusion did not.
What I'm Changing
Watching that video, I made a short list of things I wanted to audit in my own work. I am writing them down here partly because the list is useful, and partly because I suspect other engineers have some version of the same list waiting to be written.
Every retry loop I own needs jitter, not just backoff. This is the one-line change. I checked a handful of services after watching the video and found backoff-without-jitter in more places than I want to admit. None of those services had caused incidents yet. That is not the same as being safe.
Client-side retry policy is a system-design decision, not a defensive local habit. If every team independently picks a retry policy for the calls they make, the aggregate behavior is not anyone's responsibility - which means it is nobody's. I want the retry defaults to live in a shared client, with jitter baked in, so that individual teams cannot accidentally skip it.
Servers need to be robust to thundering herds from their own clients. Load shedding, server-side rate limiting, and fast rejection with 429 responses are not about hostile traffic. They are about surviving the helpful traffic produced by your own well-meaning client retry loops. I used to think of those mechanisms as paranoia. Now I think of them as the other half of the retry story.
The absence of an incident is not evidence of safety. Thundering herd failures are a threshold effect. The system is fine up to some load, and then it is not. The transition is sharp. You do not get a gentle warning. You get a clean graph and then a vertical line.
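The server-side item on that list is the one I had the least intuition for, so here is a minimal sketch of load shedding with fast 429 rejection. It assumes a simple cap on concurrent in-flight requests. The `LoadShedder` class and `handle` function are hypothetical and not tied to any web framework; real systems often use queue depth or latency as the signal instead.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; reject the rest."""

    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self._max:
                return False  # over the limit: shed this request
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle(shedder: LoadShedder, work):
    """Return (status, body). Rejects immediately with 429 when saturated."""
    if not shedder.try_acquire():
        return 429, "Too Many Requests"
    try:
        return 200, work()
    finally:
        shedder.release()
```

The point of rejecting early is that a 429 costs the server almost nothing, while an accepted request that later times out costs a connection slot, CPU, and usually another retry.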
The Short Version
If you retry without backoff, you make failures worse. If you back off without jitter, you delay the problem instead of solving it. If you back off with jitter, you give the service the one thing it needs to recover: uncorrelated traffic.
That is the entire story. It takes ten minutes to explain on video, about thirty seconds to implement in code, and - for a lot of us, me included - far too many years to actually internalize.
If you have not watched the video, watch it. If you have written retry logic recently, open that file and check for the jitter. It is almost always faster to fix now than to debug at 3 AM when a brief upstream hiccup turns into a multi-hour incident because every client in your fleet decided to retry on exactly the same second.