Resilience · Cloud-Native · Advanced

Resilience Patterns

Keeping a service up when its dependencies fail — timeouts, retries with exponential backoff and jitter, the circuit breaker, bulkheads, and graceful fallback.


⚡ Analogy

Resilience patterns are the electrical safety in your house. A timeout is not leaving the tap running forever. Retries with backoff are flipping a tripped switch — but waiting, and not slamming it on and off. The circuit breaker is exactly that: when a circuit keeps faulting, it trips and stays off so the fault can’t burn the house down, then lets you test it later. Bulkheads are separate fuses per room, so a short in the kitchen doesn’t kill the lights everywhere. None of it prevents faults — it stops one fault from becoming a fire.

The layered defenses

In a distributed system, dependencies will fail. Resilience is layering defenses so one failure stays contained:

graph TD
CALL["remote call"] --> TO["1. timeout — bound every wait"]
TO --> RT["2. retry (backoff + jitter) — transient errors only"]
RT --> CB["3. circuit breaker — fail fast when it's down"]
CB --> BH["4. bulkhead — isolate per dependency"]
BH --> FB["5. fallback — degrade gracefully"]

Timeout — the foundation: every remote call gets a context deadline so a hang can’t tie up a goroutine.
Retry with backoff + jitter — retry transient errors on idempotent ops, with growing, randomized delays.
Circuit breaker — after repeated failures, stop calling a dead dependency and fail fast.
Bulkhead — give each dependency its own bounded pool so one slow one can’t starve the rest.
Fallback — when all else fails, degrade gracefully (cached value, default, partial response).

See it: a circuit breaker

The program below is a minimal breaker that opens after 3 consecutive failures (failing fast instead of calling a dead dependency), then half-opens to test recovery. Its output is deterministic:

breaker.go

package main

import (
"errors"
"fmt"
)

type State int

const (
Closed State = iota // calls flow normally
Open                // failing fast — dependency is down
HalfOpen            // trial call to test recovery
)

type Breaker struct {
state     State
failures  int
threshold int
}

var ErrOpen = errors.New("circuit open: failing fast")

// Call runs fn unless the circuit is open.
func (b *Breaker) Call(fn func() error) error {
if b.state == Open {
	b.state = HalfOpen // (a real breaker uses a cooldown timer here)
	return ErrOpen
}
err := fn()
if err != nil {
	b.failures++
	if b.failures >= b.threshold {
		b.state = Open
	}
	return err
}
b.failures = 0 // success resets
b.state = Closed
return nil
}

func main() {
b := &Breaker{threshold: 3}
dependencyDown := func() error { return errors.New("connection refused") }
dependencyUp := func() error { return nil }

// Three consecutive failures reach the threshold and trip the breaker.
for i := 1; i <= 3; i++ {
	err := b.Call(dependencyDown)
	fmt.Printf("call %d: state=%d err=%v\n", i, b.state, err)
}
// Circuit is now open → next call fails fast WITHOUT hitting the dependency
// (and, in this toy version, moves the breaker to half-open).
fmt.Println("open → fail fast:", b.Call(dependencyDown))
// Half-open trial succeeds → circuit closes.
fmt.Println("recovery:", b.Call(dependencyUp), "state now Closed")
}

After 3 failures the breaker opens and the next call returns ErrOpen without touching the dead dependency — protecting both sides. A successful trial in half-open closes it. Production breakers (e.g. sony/gobreaker) add timers, success ratios, and metrics.

Safe retries

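// Exponential backoff + jitter, transient errors only, bounded attempts.
// Imports: context, math/rand, time. callWithTimeout and isTransient are your own helpers.
func retry(ctx context.Context, maxRetries int) error {
	var err error
	for attempt := 0; attempt < maxRetries; attempt++ {
		err = callWithTimeout(ctx) // each attempt has its own deadline
		if err == nil || !isTransient(err) {
			return err // success, or a permanent error (4xx): don't retry
		}
		if attempt == maxRetries-1 {
			break // out of attempts: don't sleep before giving up
		}
		backoff := time.Duration(1<<attempt) * time.Second   // 1s, 2s, 4s
		jitter := time.Duration(rand.Int63n(int64(backoff))) // spread the herd
		select {
		case <-time.After(backoff/2 + jitter):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err // the last transient error, after maxRetries attempts
}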

🐹 Timeouts first, then layer up — and retry only what's safe

Get timeouts on every remote call before anything else; without them no other pattern helps, because the failure mode (goroutines piling up on a hung dependency) is already fatal. Then add retries — but only for transient errors (timeouts, 503s) on idempotent operations, with backoff + jitter and a cap. Wrap flaky dependencies in a circuit breaker so you stop hammering a dead one, and isolate them with bulkheads (a per-dependency semaphore / pool). Go’s context, channels, and goroutines make all of these a few lines — the discipline is applying them to every external call, not the implementation.
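
The per-dependency semaphore mentioned above can be sketched with a buffered channel. The bulkhead type and the payments and search names are hypothetical. Rejecting immediately when the compartment is full is one design choice; blocking until a slot frees up or the context expires is another.

```go
package main

import (
	"errors"
	"fmt"
)

var errBulkheadFull = errors.New("bulkhead full: rejecting call")

// bulkhead caps concurrent calls to one dependency, using a buffered
// channel as a counting semaphore.
type bulkhead struct{ slots chan struct{} }

func newBulkhead(limit int) *bulkhead {
	return &bulkhead{slots: make(chan struct{}, limit)}
}

// Do runs fn if a slot is free and rejects immediately otherwise.
func (b *bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}: // acquire a slot
		defer func() { <-b.slots }() // release it when fn returns
		return fn()
	default:
		return errBulkheadFull
	}
}

func main() {
	payments := newBulkhead(1) // hypothetical dependency, room for one call
	search := newBulkhead(1)   // separate compartment with its own slots

	err := payments.Do(func() error {
		// While payments is busy, a second payments call is rejected...
		fmt.Println("second payments call:", payments.Do(func() error { return nil }))
		// ...but search is unaffected.
		fmt.Println("search call:", search.Do(func() error { return nil }))
		return nil
	})
	fmt.Println("first payments call:", err)
}
```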

⚠️ Naive retries amplify an outage (retry storms)

Retries are the pattern most likely to cause the outage they’re meant to survive. When a dependency slows down, every caller retrying immediately multiplies the load on it — 1 request becomes 3, across thousands of clients, and the extra traffic keeps it down (a ‘retry storm’). Worse, retries stack across layers: A retries B which retries C, so one user request becomes dozens of backend calls. Defend with backoff + jitter, a strict retry cap, a circuit breaker to halt retries when a dependency is clearly down, and ideally retry budgets (cap retries as a fraction of total traffic). Retrying harder is not retrying smarter.
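
A retry budget can be as small as two counters. The sketch below is an assumption-laden illustration: retryBudget and its methods are hypothetical names, the budget is expressed as a whole-number percentage, and the counters never reset, where a production version would typically use a sliding time window.

```go
package main

import "fmt"

// retryBudget allows a retry only while total retries stay at or below
// `percent` percent of the requests recorded so far.
type retryBudget struct {
	percent  int
	requests int
	retries  int
}

// RecordRequest counts one first-attempt request.
func (b *retryBudget) RecordRequest() { b.requests++ }

// AllowRetry reports whether one more retry fits the budget, and spends it if so.
func (b *retryBudget) AllowRetry() bool {
	if (b.retries+1)*100 > b.percent*b.requests {
		return false // budget exhausted: fail instead of adding load
	}
	b.retries++
	return true
}

func main() {
	budget := &retryBudget{percent: 10}
	allowed := 0
	// Worst case: every one of 100 requests fails and wants a retry.
	for i := 0; i < 100; i++ {
		budget.RecordRequest()
		if budget.AllowRetry() {
			allowed++
		}
	}
	fmt.Printf("100 requests, %d retries allowed\n", allowed) // 10
}
```

Even when every request fails, the extra load on the dependency is capped at 10% rather than doubling or tripling.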

Check your understanding


1. Why is a timeout on every remote call the most fundamental resilience pattern?

An unbounded call is the root of cascading failure: one slow dependency causes callers to wait, their goroutines/connections pile up, they slow down, their callers pile up, and the outage propagates upstream. A context deadline on every remote call caps the wait, releases the goroutine/connection, and turns a hang into a fast, handleable error. Every other pattern assumes timeouts are already in place.

2. Why add exponential backoff AND jitter to retries?

Immediate, fixed retries hammer an already-struggling dependency and can keep it down. Exponential backoff (1s, 2s, 4s…) gives it room to recover. But if every client backs off on the same schedule, they retry in synchronized spikes — so add jitter (randomize the delay) to spread them out. Backoff + jitter + a retry cap is the standard safe retry.

3. What should you NOT retry?

Retry only transient failures (timeouts, 503s, connection resets) on idempotent operations. Retrying a 400/401/404 just wastes attempts — the input is wrong, not the timing. Retrying a non-idempotent write (a payment) without an idempotency key can charge twice. So: classify errors (retry transient, fail fast on permanent), and only retry writes that are idempotent.

4. What does a circuit breaker do?

A circuit breaker tracks failures to a dependency. Closed = calls flow normally. After a failure threshold it trips Open = calls fail fast immediately (no waiting on a known-dead service, giving it room to recover). After a cooldown it goes Half-Open = lets a trial call through; success closes it, failure re-opens it. It stops a dead dependency from tying up the caller and stops the caller from pummeling it.

5. What is the bulkhead pattern?

Named after a ship's watertight compartments: partition resources so a flood in one doesn't sink the ship. If all dependencies share one goroutine/connection pool, a slow one exhausts it and starves calls to healthy dependencies too. Give each dependency its own bounded pool / concurrency limit (a semaphore) so a failure stays contained to that one path.
