Retry Storms: The Silent System Killer
Retries are one of those things that feel obviously correct.
Something fails? Just retry.
Most of us have written code like this without thinking twice:
var err error
for i := 0; i < 3; i++ {
	if err = call(); err == nil {
		return nil
	}
}
return err // all three attempts failed
It works. It feels defensive. It ships.
And then one day, under real load, your system collapses — not because something failed, but because everything retried at once.
That’s a retry storm.
What is a retry storm?
A retry storm happens when:
- many requests fail around the same time
- each failure triggers retries
- retries hit the same degraded dependency
- load increases instead of decreases
- recovery becomes impossible
Nothing is “wrong” with retries individually.
The failure emerges collectively.
Retry storms rarely cause the initial failure. They prevent recovery.
The subtle trap: retries amplify load
Imagine this sequence:
- A downstream service slows down
- Requests start timing out
- Callers retry immediately
- Downstream gets more traffic
- Latency increases further
- More retries happen
You’ve now built a positive feedback loop.
The system isn’t broken. It’s overwhelmed.
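To put a rough number on that loop, here is a small sketch. It assumes each attempt fails independently with probability p (a simplification, since real failures are correlated) and that callers make at most a fixed number of attempts. The function name is made up for illustration.

```go
package main

import "fmt"

// expectedAttempts returns the average number of calls one request generates
// when each attempt fails independently with probability p and the caller
// makes at most maxAttempts attempts: 1 + p + p^2 + ... + p^(maxAttempts-1).
func expectedAttempts(p float64, maxAttempts int) float64 {
	total := 0.0
	reach := 1.0 // probability that attempt k is made at all
	for k := 0; k < maxAttempts; k++ {
		total += reach
		reach *= p
	}
	return total
}

func main() {
	for _, p := range []float64{0.01, 0.5, 0.9, 1.0} {
		fmt.Printf("failure rate %.2f: %.2fx load\n", p, expectedAttempts(p, 3))
	}
}
```

At a 1% failure rate, three attempts cost about 1% extra load. At a 100% failure rate, the same policy triples the load on a dependency that is already struggling.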
Why retry storms are so dangerous
Retry storms are hard to diagnose because:
- logs show “handled errors”
- metrics show high request throughput, even though much of it is retries
- no single component looks obviously broken
- everything is technically “working”
But latency keeps climbing. Error rates stay high. Pods restart. Connections churn.
The system is busy failing.
Common retry storm patterns
1. Immediate retries
for {
	if err := call(); err == nil {
		return
	}
}
This is the worst case:
- no delay
- no limit
- no cancellation
A single failure can melt your system.
2. Synchronized retries
Even with limits:
time.Sleep(1 * time.Second)
If thousands of callers retry with the same delay:
- they wake up together
- they hit the dependency together
- they fail together
This is called the thundering herd problem.
3. Retries without deadlines
If retries don’t respect context:
- they ignore timeouts
- they ignore shutdown
- they pile up during deploys
Retries must be cancellable.
Why retries feel safe (but aren’t)
Retries give psychological comfort:
- “We handled failure”
- “We didn’t give up”
- “The system is resilient”
But resilience is not about retrying harder.
Resilience is about knowing when to stop.
Designing retries that don’t kill you
1. Always bound retries
Retries without limits are not retries. They’re loops.
for i := 0; i < maxRetries; i++ {
...
}
Bound everything:
- retry count
- time per attempt
- total time spent across all attempts
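One way to bound both the count and the total time, as a minimal sketch. The names retryBounded and errDown are hypothetical, and the fixed pause is only there to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDown is a hypothetical failure used only for this demo.
var errDown = errors.New("dependency unavailable")

// retryBounded calls op until it succeeds, maxAttempts is reached, or the
// next pause would run past the total time budget. It returns the number
// of attempts made and the last error, if any.
func retryBounded(op func() error, maxAttempts int, budget, pause time.Duration) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	deadline := time.Now().Add(budget)
	attempt := 0
	for {
		attempt++
		err := op()
		if err == nil {
			return attempt, nil
		}
		outOfAttempts := attempt >= maxAttempts
		outOfTime := time.Now().Add(pause).After(deadline)
		if outOfAttempts || outOfTime {
			return attempt, fmt.Errorf("gave up after %d attempts: %w", attempt, err)
		}
		time.Sleep(pause)
	}
}

// demoAttemptLimit stops because the attempt limit is hit first.
func demoAttemptLimit() (int, error) {
	return retryBounded(func() error { return errDown }, 3, time.Second, time.Millisecond)
}

// demoTimeBudget stops because the time budget runs out first.
func demoTimeBudget() (int, error) {
	return retryBounded(func() error { return errDown }, 1000, 20*time.Millisecond, 5*time.Millisecond)
}

func main() {
	n, err := demoAttemptLimit()
	fmt.Println(n, err)
	n, err = demoTimeBudget()
	fmt.Println(n, err)
}
```

Whichever bound trips first ends the loop, so a generous attempt limit cannot hide behind a slow dependency.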
2. Add jitter — always
Never retry on a fixed schedule.
// base must be > 0; adds a random extra delay in [0, base).
sleep := base + time.Duration(rand.Int63n(int64(base)))
time.Sleep(sleep)
Jitter breaks synchronization. It gives systems room to breathe.
No jitter = herd behavior.
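A common way to combine jitter with growing delays is exponential backoff with "full jitter": double a ceiling on each attempt, cap it, then sleep a random amount below the ceiling. This is one variant among several, not the only correct one. The function names below are hypothetical, and the sketch assumes 0 < base <= maxDelay.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffCeiling returns min(maxDelay, base * 2^attempt) for attempt >= 0.
func backoffCeiling(base, maxDelay time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// jitteredDelay picks a uniformly random delay in [0, ceiling], so callers
// that failed at the same moment do not wake up at the same moment.
func jitteredDelay(base, maxDelay time.Duration, attempt int) time.Duration {
	ceiling := backoffCeiling(base, maxDelay, attempt)
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

func main() {
	base, maxDelay := 100*time.Millisecond, 2*time.Second
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: ceiling %v, sleeping %v\n",
			attempt,
			backoffCeiling(base, maxDelay, attempt),
			jitteredDelay(base, maxDelay, attempt))
	}
}
```

The cap matters as much as the jitter: without it, late attempts sleep so long that they are effectively abandoned without anyone deciding to abandon them.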
3. Respect context
Retries must stop when the work is no longer valid.
select {
case <-time.After(delay):
	retry()
case <-ctx.Done():
	return ctx.Err()
}
If the request is cancelled, retries should stop immediately.
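Here is how that select might sit inside a full retry loop. It is a sketch under assumptions: sleepCtx and retryCtx are hypothetical helpers, and the delay is fixed so the cancellation logic stays visible.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sleepCtx waits for d or until ctx is done, whichever comes first.
func sleepCtx(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop() // release the timer promptly if ctx wins
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// retryCtx retries op until it succeeds, maxAttempts is used up, or ctx
// is cancelled or times out while waiting between attempts.
func retryCtx(ctx context.Context, maxAttempts int, delay time.Duration, op func(context.Context) error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(ctx); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		if waitErr := sleepCtx(ctx, delay); waitErr != nil {
			return waitErr
		}
	}
	return err
}

// demo runs an always-failing operation under a 30ms deadline.
func demo() (calls int, deadlineHit bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Millisecond)
	defer cancel()
	err := retryCtx(ctx, 1000, 10*time.Millisecond, func(context.Context) error {
		calls++
		return errors.New("still failing")
	})
	return calls, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	calls, deadlineHit := demo()
	fmt.Printf("stopped after %d calls, deadline hit: %v\n", calls, deadlineHit)
}
```

The loop allows 1000 attempts, but the deadline ends it after a handful. The same path handles shutdown: cancel the context and every sleeping retry wakes up and exits.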
4. Separate retryable from non-retryable errors
Not every error deserves a retry.
Retry:
- timeouts
- transient network failures
- temporary overloads
Do not retry:
- validation errors
- authorization failures
- business rule violations
Retrying permanent errors just wastes capacity.
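In Go, that split can live in one small predicate. The sketch below is illustrative: the sentinel errors are hypothetical stand-ins for whatever your client returns, and the default of not retrying unknown errors is a design choice, not a rule.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// Hypothetical sentinel errors; a real client defines its own.
var (
	ErrOverloaded   = errors.New("dependency overloaded")
	ErrValidation   = errors.New("validation failed")
	ErrUnauthorized = errors.New("unauthorized")
)

// isRetryable reports whether err looks transient. It does not check whether
// the caller's own context has expired; that check belongs in the retry loop.
func isRetryable(err error) bool {
	switch {
	case err == nil:
		return false
	case errors.Is(err, context.Canceled):
		return false // the caller gave up; retrying helps nobody
	case errors.Is(err, ErrValidation), errors.Is(err, ErrUnauthorized):
		return false // permanent: the same input fails the same way
	case errors.Is(err, context.DeadlineExceeded), errors.Is(err, ErrOverloaded):
		return true
	}
	// Network timeouts implement net.Error and report Timeout() == true.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return false // unknown errors are not retried by default
}

func main() {
	wrapped := fmt.Errorf("fetch profile: %w", ErrOverloaded)
	for _, err := range []error{wrapped, ErrValidation, ErrUnauthorized, context.Canceled} {
		fmt.Printf("%-40v retry=%v\n", err, isRetryable(err))
	}
}
```

Because the checks use errors.Is and errors.As, they keep working when errors are wrapped with %w on the way up the stack.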
5. Combine retries with backpressure
Retries must compete for the same resources as normal work.
If your system is overloaded:
- retries should slow down
- or be dropped
Retries are not special.
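One way to make retries compete for capacity is a retry budget: first attempts earn credit, retries spend it, and a retry with no credit is dropped. The retryBudget type below is a hypothetical sketch of that idea with integer tokens, not a reference to any particular library.

```go
package main

import (
	"fmt"
	"sync"
)

// retryBudget is a hypothetical token bucket. Every first attempt adds one
// token (up to max); every retry costs retryCost tokens.
type retryBudget struct {
	mu        sync.Mutex
	tokens    int
	max       int
	retryCost int
}

func newRetryBudget(max, retryCost int) *retryBudget {
	return &retryBudget{tokens: max, max: max, retryCost: retryCost}
}

// recordRequest credits the budget for one first attempt.
func (b *retryBudget) recordRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < b.max {
		b.tokens++
	}
}

// allowRetry spends retryCost tokens if available; otherwise it returns
// false and the caller should fail fast instead of retrying.
func (b *retryBudget) allowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < b.retryCost {
		return false
	}
	b.tokens -= b.retryCost
	return true
}

// demo simulates 100 requests that all fail and all want one retry.
func demo() (allowed int) {
	budget := newRetryBudget(100, 10)
	for i := 0; i < 100; i++ {
		budget.recordRequest()
		if budget.allowRetry() {
			allowed++
		}
	}
	return allowed
}

func main() {
	fmt.Printf("retries allowed for 100 failing requests: %d\n", demo())
}
```

With these numbers the budget absorbs a short burst, then settles at roughly one retry per ten requests. During a full outage, retries add about 10% load instead of doubling it.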
Circuit breakers are not optional
At some point, retries must give up.
Circuit breakers:
- stop sending traffic to a failing dependency
- allow recovery time
- protect the rest of the system
Retries without circuit breakers just prolong failure.
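A deliberately small circuit breaker, to show the shape of the idea. It is a sketch under assumptions: the breaker type and its fields are hypothetical, and after the cooldown it lets all callers through again, whereas production breakers usually admit a single probe first.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errOpen = errors.New("circuit open")

// breaker opens after threshold consecutive failures and rejects calls
// until cooldown has passed.
type breaker struct {
	mu        sync.Mutex
	failures  int // consecutive failures
	threshold int
	cooldown  time.Duration
	openedAt  time.Time // zero value means the circuit is closed
}

func (b *breaker) do(op func() error) error {
	b.mu.Lock()
	if !b.openedAt.IsZero() && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return errOpen // fail fast without touching the dependency
	}
	b.mu.Unlock()

	err := op()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0
		b.openedAt = time.Time{} // close the circuit
		return nil
	}
	b.failures++
	if b.failures >= b.threshold {
		b.openedAt = time.Now() // open, or re-open after a failed trial
	}
	return err
}

// demo sends 10 calls at a dependency that always fails.
func demo() (reached, rejected int) {
	b := &breaker{threshold: 3, cooldown: time.Minute}
	failing := func() error {
		reached++
		return errors.New("dependency down")
	}
	for i := 0; i < 10; i++ {
		if err := b.do(failing); errors.Is(err, errOpen) {
			rejected++
		}
	}
	return reached, rejected
}

func main() {
	reached, rejected := demo()
	fmt.Printf("calls that reached the dependency: %d, rejected: %d\n", reached, rejected)
}
```

Only the first three calls reach the dependency. The rest fail locally, which is the breathing room the dependency needs.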
A quick Python comparison (only to clarify)
In async Python systems:
- retries often live in middleware
- tasks can silently pile up in the event loop
- cancellation is easy to forget
Different runtime. Same problem.
Retry storms are a design issue, not a language issue.
How retry storms show up in production
If you see:
- high CPU with low useful throughput
- databases at max connections
- repeated timeouts with no clear root cause
- services that never quite recover
Suspect retries.
Especially “helpful” ones.
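Suspicion is easier to confirm if retries are counted separately from first attempts. A minimal sketch, assuming a hypothetical callStats type that you would export through whatever metrics system you already use:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// callStats counts first attempts and retries separately so that retry
// traffic is visible instead of blending into overall throughput.
type callStats struct {
	firstAttempts atomic.Int64
	retries       atomic.Int64
}

// record notes one outgoing call; attempt is 0 for the first try.
func (s *callStats) record(attempt int) {
	if attempt == 0 {
		s.firstAttempts.Add(1)
		return
	}
	s.retries.Add(1)
}

// retryRatio returns retries per first attempt (0 when nothing was sent).
func (s *callStats) retryRatio() float64 {
	first := s.firstAttempts.Load()
	if first == 0 {
		return 0
	}
	return float64(s.retries.Load()) / float64(first)
}

func main() {
	var stats callStats
	for _, attempt := range []int{0, 1, 2, 0} {
		stats.record(attempt)
	}
	fmt.Printf("retry ratio: %.2f\n", stats.retryRatio())
}
```

A ratio that climbs while success rates fall is a strong hint that retries are feeding the problem.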
A simple mental model
Retries should behave like:
- pressure valves, not pumps
- shock absorbers, not amplifiers
If a dependency is failing:
- reduce pressure
- spread retries
- stop when necessary
The takeaway
Retries are powerful. They’re also dangerous.
The goal of retries is not to succeed at all costs. It’s to give the system a chance to recover.
Bound them. Jitter them. Cancel them. And know when to stop.
Because the most dangerous failures are not crashes.
They’re systems that refuse to let themselves heal.