Rate Limiting & Load Shedding

Protecting a service from too much traffic — rate limiting vs load shedding vs backpressure, graceful degradation under overload, and why dropping some requests beats collapsing under all of them.

🛟 Analogy

A lifeboat holds twelve. With thirty people in the water, the humane-sounding instinct is to let everyone climb in — and then the boat swamps and everyone drowns. Load shedding is the hard, correct call: take twelve, keep them alive, and signal for more boats. Accepting load you can’t carry doesn’t help the extra people; it sinks the ones you could have saved. A service past capacity is that lifeboat — shedding excess is how you keep anyone afloat.

Three related tools

Rate limiting — a policy: each client gets N requests/sec, enforced always (the token bucket). Ensures fairness and protects you from any one client.
Load shedding — self-preservation: when the server is at capacity right now, reject excess (fast 429/503) based on live signals (in-flight count, latency, queue depth).
Backpressure — a signal upstream: tell producers to slow down rather than buffering unbounded work.

graph LR
REQ["incoming requests"] --> RL["rate limit<br/>(per-client fairness)"]
RL --> ADM{"at capacity?"}
ADM -->|yes| SHED["shed: fast 429/503<br/>(graceful degradation)"]
ADM -->|no| WORK["process within capacity<br/>(fast & successful)"]
WORK -.full queue → backpressure.-> RL

See it: load shedding with an admission limit

A semaphore caps concurrent work; requests over the limit are shed immediately instead of piling up. Each admitted request holds its slot until the whole burst has arrived, so the output is deterministic (served: 5, shed: 7):

shed.go

package main

import (
	"fmt"
	"sync"
)

func main() {
	const capacity = 5 // max concurrent requests we can handle well
	sem := make(chan struct{}, capacity)
	release := make(chan struct{}) // closed once the whole burst has arrived

	var served, shed int // only touched by main, so no mutex is needed
	var wg sync.WaitGroup

	// 12 requests arrive at once; we can only handle 5 concurrently.
	for i := 0; i < 12; i++ {
		select {
		case sem <- struct{}{}: // admitted: take a slot
			served++
			wg.Add(1)
			go func() {
				defer wg.Done()
				<-release // simulated work: hold the slot during the burst
				<-sem     // work done, release the slot
			}()
		default: // at capacity: shed immediately, don't queue
			shed++
		}
	}
	close(release) // let the admitted requests finish
	wg.Wait()
	fmt.Printf("served: %d, shed (fast 429): %d\n", served, shed)
	fmt.Println("-> accepted work stays within capacity and succeeds;")
	fmt.Println("   excess is rejected fast instead of collapsing the service.")
}

Excess requests get a fast rejection (a 429/503 to the client) rather than queueing until everything times out. The accepted ones stay within capacity and succeed — partial availability beats total collapse.

Shed early, degrade gracefully

Two principles make shedding effective:

Reject early. Shed at admission (the edge), before a request takes a DB connection or triggers downstream calls; otherwise you waste the very resources you’re protecting. Make the rejection itself cheap.
Degrade gracefully. Don’t fail everything equally; shed non-critical work first. Serve a stale cache instead of recomputing, skip recommendations to keep checkout alive, drop low-priority background jobs before user-facing ones. The user gets a reduced-but-working service.
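
One way to encode "shed non-critical work first" is a per-priority admission threshold. This is a sketch under assumptions: the three `Priority` classes and the 50% and 80% cut-offs are made up for illustration, and a real service would tune them from measurements.

```go
package main

import "fmt"

// Priority classes are hypothetical; the lesson only says to classify
// requests by importance ahead of time.
type Priority int

const (
	Background Priority = iota // e.g. batch or maintenance jobs
	Optional                   // e.g. recommendations
	Critical                   // e.g. checkout
)

// admit lets less important work in only while there is spare headroom.
// The thresholds are assumed values, expressed in integer percent.
func admit(p Priority, inFlight, capacity int) bool {
	switch p {
	case Critical:
		return inFlight < capacity // admitted until truly full
	case Optional:
		return inFlight*100 < capacity*80 // shed above 80% utilization
	default:
		return inFlight*100 < capacity*50 // shed above 50% utilization
	}
}

func main() {
	const capacity = 10
	for _, inFlight := range []int{4, 6, 9, 10} {
		fmt.Printf("in-flight %2d/%d: background=%v optional=%v critical=%v\n",
			inFlight, capacity,
			admit(Background, inFlight, capacity),
			admit(Optional, inFlight, capacity),
			admit(Critical, inFlight, capacity))
	}
}
```

As load rises, background work is refused first, then optional features, and critical requests are turned away only when the service is completely full.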

In Go, a bounded channel is natural backpressure — a full jobs channel blocks the producer, throttling the whole pipeline instead of buffering unboundedly.

🐹 Combine per-client limits with global shedding

The robust setup uses both axes: rate-limit per client (a token bucket keyed by API key/IP) so no single caller is unfair, and shed globally when total load exceeds capacity so a legitimate surge across many clients can’t sink you. Express capacity with a bounded semaphore/worker pool, set a hard cap on in-flight requests, return 429 with Retry-After so well-behaved clients back off, and classify requests by priority so degradation sheds the least-important work first. See rate limiting for the token-bucket and semaphore mechanics.
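
The two axes can be sketched as a single admission check. Everything here is an illustrative assumption: the `Gate` type is hypothetical, a fixed per-client quota stands in for a real token bucket, and the choice of 429 for per-client limits versus 503 for global shedding is one reasonable convention. The sketch is not safe for concurrent use.

```go
package main

import "fmt"

// Decision is the outcome of an admission check.
type Decision string

const (
	Admitted    Decision = "admitted"
	RateLimited Decision = "429 rate limited" // this client exceeded its own quota
	Shed        Decision = "503 shed"         // the service as a whole is at capacity
)

// Gate is a hypothetical admission gate combining both checks.
type Gate struct {
	quota       int            // requests each client may make per window (assumed)
	maxInFlight int            // global cap on concurrent requests
	used        map[string]int // quota consumed per client in the current window
	inFlight    int            // requests currently being processed
}

func NewGate(quota, maxInFlight int) *Gate {
	return &Gate{quota: quota, maxInFlight: maxInFlight, used: make(map[string]int)}
}

// Admit applies the per-client limit first, then the global capacity check.
func (g *Gate) Admit(client string) Decision {
	if g.used[client] >= g.quota {
		return RateLimited
	}
	if g.inFlight >= g.maxInFlight {
		return Shed
	}
	g.used[client]++
	g.inFlight++
	return Admitted
}

// Done marks one admitted request as finished, freeing global capacity.
func (g *Gate) Done() { g.inFlight-- }

func main() {
	g := NewGate(3, 4) // assumed: 3 requests per client, 4 in flight overall
	for i := 0; i < 5; i++ {
		fmt.Println("client a:", g.Admit("a"))
	}
	for i := 0; i < 2; i++ {
		fmt.Println("client b:", g.Admit("b"))
	}
	g.Done()
	fmt.Println("client b after a slot frees:", g.Admit("b"))
}
```

Client a is stopped by its own quota, while client b is within quota yet still shed once the global cap is reached, showing why neither check replaces the other.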

⚠️ Unbounded queues turn overload into collapse

The instinct to ‘just queue the extra requests’ is how overload becomes catastrophe. An unbounded queue under sustained overload grows without limit — latency climbs until every queued request has already timed out by the time it’s served (you do all the work for nothing), memory fills, and the process dies. This is congestion collapse. Bound every queue, and when it’s full, shed (reject) rather than accept — a fast failure the client can retry beats a slow failure that takes the whole service down. A queue is a shock absorber, not infinite storage; past its bound, the only safe answer is ‘no’.
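
A bounded queue that answers "no" when full can be sketched with a non-blocking send. The `tryEnqueue` helper and the bound of 3 are assumptions for the demo, and no consumer runs here, which models sustained overload.

```go
package main

import "fmt"

// tryEnqueue (hypothetical helper) accepts a job only if the bounded queue
// has room; it never blocks and never grows the queue.
func tryEnqueue(queue chan<- int, job int) bool {
	select {
	case queue <- job:
		return true
	default: // queue is full: reject instead of buffering more
		return false
	}
}

func main() {
	queue := make(chan int, 3) // the bound: at most 3 jobs may wait

	accepted, rejected := 0, 0
	for job := 0; job < 10; job++ { // nothing drains the queue: sustained overload
		if tryEnqueue(queue, job) {
			accepted++
		} else {
			rejected++
		}
	}
	fmt.Printf("accepted: %d, rejected: %d, queue depth: %d\n",
		accepted, rejected, len(queue))
}
```

Queue depth, and therefore worst-case waiting time and memory, stays capped at the bound however long the overload lasts.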

Check your understanding

1. What's the difference between rate limiting and load shedding?

Rate limiting is a proactive policy — 'each client gets N req/sec' — enforced regardless of current load (the token bucket). Load shedding is reactive self-preservation — 'I'm at capacity, so reject excess now' — based on real-time signals (in-flight count, queue depth, latency). You use both: rate-limit per client for fairness, and shed load globally to survive a surge that slips past the limits.

2. Why is shedding (rejecting) some requests better than accepting all of them under overload?

Past a server's capacity, accepting more work doesn't do more work — queues grow, latency explodes, memory fills, and eventually everything times out or the process dies (congestion collapse). Shedding the excess (fast 429/503) keeps the accepted requests within capacity and succeeding. Serving 80% well beats serving 100% so badly that all of it fails.

3. What is backpressure?

Backpressure propagates 'slow down' from an overloaded component back to its callers, so the system as a whole throttles instead of one stage buffering unboundedly until it bursts. In Go a bounded channel is natural backpressure (a full channel blocks the sender); at an API boundary it's a 429 with Retry-After; in messaging it's bounded queues and consumer lag signals. The opposite — unbounded buffering — just delays and worsens the collapse.

4. What does graceful degradation mean under overload?

Graceful degradation prioritizes: when overloaded, protect core functionality by shedding or simplifying non-essential work — serve stale cache instead of recomputing, skip recommendations to keep checkout working, drop low-priority background jobs first. The user gets a degraded but working service instead of a uniform outage. It requires classifying requests by importance ahead of time.

5. Where should you reject excess load — early or deep in the stack?

Shedding cheaply at admission (a fast 429/503 before allocating expensive resources) protects capacity; shedding after the request has already taken a DB connection and made downstream calls wastes exactly what's scarce and can still topple the service. So enforce limits and admission control at the edge, fail fast, and make the rejection itself cheap.
