⚡ Analogy
Resilience patterns are the electrical safety in your house. A timeout is not leaving the tap running forever. Retries with backoff are flipping a tripped switch — but waiting, and not slamming it on and off. The circuit breaker is exactly that: when a circuit keeps faulting, it trips and stays off so the fault can’t burn the house down, then lets you test it later. Bulkheads are separate fuses per room, so a short in the kitchen doesn’t kill the lights everywhere. None of it prevents faults — it stops one fault from becoming a fire.
The layered defenses
In a distributed system, dependencies will fail. Resilience is layering defenses so one failure stays contained:
graph TD
    CALL["remote call"] --> TO["1. timeout: bound every wait"]
    TO --> RT["2. retry (backoff + jitter): transient errors only"]
    RT --> CB["3. circuit breaker: fail fast when it's down"]
    CB --> BH["4. bulkhead: isolate per dependency"]
    BH --> FB["5. fallback: degrade gracefully"]
- Timeout — the foundation: every remote call gets a context deadline so a hang can’t tie up a goroutine.
- Retry with backoff + jitter — retry transient errors on idempotent ops, with growing, randomized delays.
- Circuit breaker — after repeated failures, stop calling a dead dependency and fail fast.
- Bulkhead — give each dependency its own bounded pool so one slow one can’t starve the rest.
- Fallback — when all else fails, degrade gracefully (cached value, default, partial response).
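The first and last layers in the list above can be sketched together. This is a minimal illustration, not a production client: `slowLookup` and `lookupWithFallback` are hypothetical names, the "remote call" is simulated with a timer, and the fallback is simply a cached string passed in by the caller.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowLookup is a hypothetical stand-in for a remote call. It answers after
// `latency`, but gives up as soon as ctx is cancelled or its deadline passes.
func slowLookup(ctx context.Context, latency time.Duration) (string, error) {
	select {
	case <-time.After(latency):
		return "fresh value", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// lookupWithFallback bounds the call with a deadline (layer 1) and degrades
// to a cached value when the call fails for any reason (layer 5).
func lookupWithFallback(latency, timeout time.Duration, cached string) string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // always release the timer behind the deadline

	v, err := slowLookup(ctx, latency)
	if err != nil {
		return cached + " (stale)"
	}
	return v
}

func main() {
	// Fast dependency: the call finishes well inside the deadline.
	fmt.Println(lookupWithFallback(5*time.Millisecond, 200*time.Millisecond, "cached value"))
	// Hung dependency: the deadline fires and the caller degrades.
	fmt.Println(lookupWithFallback(500*time.Millisecond, 20*time.Millisecond, "cached value"))
}
```

The point of the sketch is that the caller waits at most `timeout`, never as long as the dependency takes, and still returns something usable.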
See it: a circuit breaker
The breaker below opens after 3 consecutive failures (failing fast instead of calling a dead dependency), then half-opens to test recovery. Its output is deterministic:
package main

import (
	"errors"
	"fmt"
)

type State int

const (
	Closed   State = iota // calls flow normally
	Open                  // failing fast: dependency is down
	HalfOpen              // trial call to test recovery
)

type Breaker struct {
	state     State
	failures  int
	threshold int
}

var ErrOpen = errors.New("circuit open: failing fast")

// Call runs fn unless the circuit is open.
func (b *Breaker) Call(fn func() error) error {
	if b.state == Open {
		b.state = HalfOpen // (a real breaker uses a cooldown timer here)
		return ErrOpen
	}
	err := fn()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.state = Open
		}
		return err
	}
	b.failures = 0 // success resets
	b.state = Closed
	return nil
}

func main() {
	b := &Breaker{threshold: 3}
	dependencyDown := func() error { return errors.New("connection refused") }
	dependencyUp := func() error { return nil }
	// Three consecutive failures reach the threshold and trip the breaker.
	// States print as numbers: 0=Closed, 1=Open, 2=HalfOpen.
	for i := 1; i <= 3; i++ {
		err := b.Call(dependencyDown)
		fmt.Printf("call %d: state=%d err=%v\n", i, b.state, err)
	}
	// Circuit is now open → next call fails fast WITHOUT hitting the dependency
	// (and, in this simplified breaker, moves it to half-open).
	fmt.Println("open → fail fast:", b.Call(dependencyDown))
	// Half-open trial succeeds → circuit closes.
	fmt.Println("recovery:", b.Call(dependencyUp), "state now Closed")
}
After 3 failures the breaker opens and the next call returns ErrOpen without touching the dead dependency — protecting both sides. A successful trial in half-open closes it. Production breakers (e.g. sony/gobreaker) add timers, success ratios, and metrics.
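The comment in `Call` notes that a real breaker uses a cooldown timer rather than half-opening on the very next call. One way that could look is sketched below. `TimedBreaker`, its fields, and `demo` are hypothetical names, not the API of sony/gobreaker or any other library. The clock is injected so the example stays deterministic, and the type is not safe for concurrent use (a real one would need a mutex).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

// TimedBreaker is a hypothetical breaker that stays open for `cooldown`
// before allowing a trial call. Not safe for concurrent use.
type TimedBreaker struct {
	threshold int
	cooldown  time.Duration
	now       func() time.Time // injectable clock (assumption, for testability)

	failures int
	open     bool
	openedAt time.Time
}

func (b *TimedBreaker) Call(fn func() error) error {
	if b.open && b.now().Sub(b.openedAt) < b.cooldown {
		return ErrOpen // still cooling down: don't touch the dependency
	}
	// Either closed, or the cooldown elapsed and this is the half-open trial.
	err := fn()
	if err != nil {
		b.failures++
		if b.open || b.failures >= b.threshold {
			b.open = true
			b.openedAt = b.now() // start (or restart) the cooldown
		}
		return err
	}
	b.failures = 0
	b.open = false
	return nil
}

// demo drives the breaker with a fake clock and records each outcome.
func demo() []string {
	clock := time.Unix(0, 0)
	b := &TimedBreaker{
		threshold: 2,
		cooldown:  10 * time.Second,
		now:       func() time.Time { return clock },
	}
	down := func() error { return errors.New("connection refused") }
	up := func() error { return nil }

	var out []string
	record := func(label string, err error) {
		out = append(out, fmt.Sprintf("%s: %v", label, err))
	}
	record("fail 1", b.Call(down))
	record("fail 2", b.Call(down))        // threshold reached: breaker opens
	record("during cooldown", b.Call(up)) // fails fast even though the dependency recovered
	clock = clock.Add(11 * time.Second)
	record("after cooldown", b.Call(up)) // trial call succeeds: breaker closes
	return out
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

Restarting the cooldown after a failed trial is a design choice in this sketch; it keeps a still-dead dependency from being probed on every call.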
Safe retries
// Exponential backoff + jitter, transient errors only, bounded attempts.
var err error
for attempt := 0; attempt < maxRetries; attempt++ {
	err = callWithTimeout(ctx) // each attempt has its own deadline
	if err == nil || !isTransient(err) {
		return err // success, or a permanent error (4xx): don't retry
	}
	if attempt == maxRetries-1 {
		break // out of attempts: don't sleep before giving up
	}
	backoff := time.Duration(1<<attempt) * time.Second   // 1s, 2s, 4s
	jitter := time.Duration(rand.Int63n(int64(backoff))) // spread the herd
	select {
	case <-time.After(backoff/2 + jitter): // waits between 0.5x and 1.5x backoff
	case <-ctx.Done():
		return ctx.Err()
	}
}
return err // retries exhausted: surface the last transient error
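The loop above is a fragment, so here is one way it might be wrapped into a self-contained helper. Everything named here is an assumption for illustration: `retry`, `countCalls`, and the sentinel errors `errTransient` and `errPermanent` are hypothetical, and real code would classify errors from status codes or error types. The base delay is a parameter so the demo finishes in milliseconds, and it must be greater than zero.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Hypothetical sentinel errors standing in for real error classification.
var (
	errTransient = errors.New("transient")
	errPermanent = errors.New("bad request")
)

func isTransient(err error) bool { return errors.Is(err, errTransient) }

// retry calls op up to maxAttempts times, backing off exponentially with
// jitter between attempts. It stops early on success, on a permanent error,
// or when ctx is done. base must be > 0.
func retry(ctx context.Context, maxAttempts int, base time.Duration, op func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = op(ctx)
		if err == nil || !isTransient(err) {
			return err
		}
		if attempt == maxAttempts-1 {
			break // no point sleeping after the final attempt
		}
		backoff := base << attempt // base, 2*base, 4*base, ...
		jitter := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(backoff/2 + jitter):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// countCalls runs retry against a fake op that fails `failFirst` times with
// failWith and then succeeds; it reports how many times op was invoked.
func countCalls(failFirst, maxAttempts int, failWith error) (int, error) {
	calls := 0
	err := retry(context.Background(), maxAttempts, time.Millisecond, func(context.Context) error {
		calls++
		if calls <= failFirst {
			return failWith
		}
		return nil
	})
	return calls, err
}

func main() {
	fmt.Println(countCalls(2, 5, errTransient))  // two transient failures, then success
	fmt.Println(countCalls(10, 3, errTransient)) // attempts exhausted, last error returned
	fmt.Println(countCalls(10, 3, errPermanent)) // permanent error: returned after one call
}
```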
🐹 Timeouts first, then layer up — and retry only what's safe
Get timeouts on every remote call before anything else; without them no other pattern helps, because the failure mode (goroutines piling up on a hung dependency) is already fatal. Then add retries — but only for transient errors (timeouts, 503s) on idempotent operations, with backoff + jitter and a cap. Wrap flaky dependencies in a circuit breaker so you stop hammering a dead one, and isolate them with bulkheads (a per-dependency semaphore / pool). Go’s context, channels, and goroutines make all of these a few lines — the discipline is applying them to every external call, not the implementation.
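The per-dependency semaphore mentioned above can be as small as a buffered channel. The sketch below is illustrative only: `Bulkhead`, `NewBulkhead`, `ErrBulkheadFull`, and `demo` are hypothetical names, and it rejects immediately when full, where a real service might instead wait for a slot until the context deadline.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBulkheadFull = errors.New("bulkhead full: rejecting call")

// Bulkhead caps concurrent calls to one dependency, using a buffered channel
// as a counting semaphore.
type Bulkhead struct{ slots chan struct{} }

func NewBulkhead(limit int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, limit)}
}

// Do runs fn if a slot is free and rejects immediately otherwise.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}: // acquire a slot
		defer func() { <-b.slots }() // release it when fn returns
		return fn()
	default:
		return ErrBulkheadFull
	}
}

// demo saturates one bulkhead and shows that a second dependency with its own
// bulkhead is unaffected.
func demo() (sameDep, otherDep error) {
	payments, search := NewBulkhead(1), NewBulkhead(1)
	_ = payments.Do(func() error {
		// This call holds the only payments slot, so another payments call is rejected...
		sameDep = payments.Do(func() error { return nil })
		// ...while search has its own pool and still gets through.
		otherDep = search.Do(func() error { return nil })
		return nil
	})
	return sameDep, otherDep
}

func main() {
	sameDep, otherDep := demo()
	fmt.Println("payments while saturated:", sameDep)
	fmt.Println("search while payments is saturated:", otherDep)
}
```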
⚠️ Naive retries amplify an outage (retry storms)
Retries are the pattern most likely to cause the outage they’re meant to survive. When a dependency slows down, every caller retrying immediately multiplies the load on it — 1 request becomes 3, across thousands of clients, and the extra traffic keeps it down (a ‘retry storm’). Worse, retries stack across layers: A retries B which retries C, so one user request becomes dozens of backend calls. Defend with backoff + jitter, a strict retry cap, a circuit breaker to halt retries when a dependency is clearly down, and ideally retry budgets (cap retries as a fraction of total traffic). Retrying harder is not retrying smarter.
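A retry budget, as described above, can be approximated with two counters. This is a deliberately simplified sketch: `RetryBudget` and `allowedRetries` are hypothetical names, the counters never decay (real implementations usually track a sliding time window), and the type is not safe for concurrent use.

```go
package main

import "fmt"

// RetryBudget allows retries only while they stay within maxPercent of the
// requests seen so far. Simplified: no time window, not goroutine-safe.
type RetryBudget struct {
	maxPercent int
	requests   int
	retries    int
}

// RecordRequest counts one original (non-retry) request.
func (b *RetryBudget) RecordRequest() { b.requests++ }

// AllowRetry reports whether one more retry fits in the budget and, if so,
// spends it.
func (b *RetryBudget) AllowRetry() bool {
	if (b.retries+1)*100 > b.maxPercent*b.requests {
		return false
	}
	b.retries++
	return true
}

// allowedRetries records `requests` originals, then asks for `attempts`
// retries and reports how many the budget granted.
func allowedRetries(requests, attempts, maxPercent int) int {
	b := &RetryBudget{maxPercent: maxPercent}
	for i := 0; i < requests; i++ {
		b.RecordRequest()
	}
	granted := 0
	for i := 0; i < attempts; i++ {
		if b.AllowRetry() {
			granted++
		}
	}
	return granted
}

func main() {
	// With a 10% budget, 100 requests can fund at most 10 retries, however
	// many callers ask.
	fmt.Println(allowedRetries(100, 50, 10))
}
```

Unlike a per-request retry cap, the budget bounds the total extra load on the dependency, which is what matters during a retry storm.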
See also
- gRPC & service comms — the calls these patterns protect.
- Rate limiting & load shedding — protecting yourself from too much traffic.
- Distributed transactions — resilience for multi-step workflows.
- context (concurrency) — the timeouts and cancellation underneath.
Next: protecting a service from too much load — rate limiting & load shedding.
Related topics
- How Go services talk: gRPC vs REST, protobuf contracts and code generation, the four call types, and choosing synchronous calls vs asynchronous messaging.
- Rate Limiting & Load Shedding (resilience). Protecting a service from too much traffic: rate limiting vs load shedding vs backpressure, graceful degradation under overload, and why dropping some requests beats collapsing under all of them.
- Distributed Transactions & Sagas (resilience). Keeping data consistent across services without a global transaction: why two-phase commit doesn't fit microservices, the saga pattern with compensating actions, and orchestration vs choreography.
Check your understanding
1. Why is a timeout on every remote call the most fundamental resilience pattern?
An unbounded call is the root of cascading failure: one slow dependency causes callers to wait, their goroutines/connections pile up, they slow down, their callers pile up, and the outage propagates upstream. A context deadline on every remote call caps the wait, releases the goroutine/connection, and turns a hang into a fast, handleable error. Every other pattern assumes timeouts are already in place.
2. Why add exponential backoff AND jitter to retries?
Immediate, fixed retries hammer an already-struggling dependency and can keep it down. Exponential backoff (1s, 2s, 4s…) gives it room to recover. But if every client backs off on the same schedule, they retry in synchronized spikes — so add jitter (randomize the delay) to spread them out. Backoff + jitter + a retry cap is the standard safe retry.
3. What should you NOT retry?
Retry only transient failures (timeouts, 503s, connection resets) on idempotent operations. Retrying a 400/401/404 just wastes attempts — the input is wrong, not the timing. Retrying a non-idempotent write (a payment) without an idempotency key can charge twice. So: classify errors (retry transient, fail fast on permanent), and only retry writes that are idempotent.
4. What does a circuit breaker do?
A circuit breaker tracks failures to a dependency. Closed = calls flow normally. After a failure threshold it trips Open = calls fail fast immediately (no waiting on a known-dead service, giving it room to recover). After a cooldown it goes Half-Open = lets a trial call through; success closes it, failure re-opens it. It stops a dead dependency from tying up the caller and stops the caller from pummeling it.
5. What is the bulkhead pattern?
Named after a ship's watertight compartments: partition resources so a flood in one doesn't sink the ship. If all dependencies share one goroutine/connection pool, a slow one exhausts it and starves calls to healthy dependencies too. Give each dependency its own bounded pool / concurrency limit (a semaphore) so a failure stays contained to that one path.