🍽️ Analogy
A restaurant during a rush. Vertical scaling is giving your one chef a bigger stove — helps, but there’s a ceiling and you have to close the kitchen to install it. Horizontal scaling is calling in more cooks — no single-person limit, and if one quits the others carry on. The manager watches the ticket rail (the metric) and adds cooks when tickets pile up, sends them home when it’s quiet. But cooks take minutes to arrive — so during a sudden flood you also have to stop seating people (load shedding) until the extra hands show up.
Horizontal vs vertical
- Horizontal (scale out/in) — add/remove instances. Preferred for stateless services: no single-machine ceiling, built-in redundancy, matches load balancing.
- Vertical (scale up/down) — give each instance more CPU/memory. For things that can’t shard (some databases); hard upper limit, usually needs a restart.
A typical Go HTTP service is stateless → scale horizontally.
The HPA control loop
Kubernetes’ Horizontal Pod Autoscaler runs a periodic control loop (every 15 seconds by default): read the metric, compute the desired replica count, clamp to [min, max], and adjust.
graph LR
  M["observe metric<br/>(avg CPU / RPS / queue depth)"] --> C["desired = ceil(current × metric / target)"]
  C --> Cl["clamp to [min, max]"]
  Cl --> S["scale Deployment"]
  S --> M
See it: the HPA replica formula
The math is small and worth knowing — it’s the same proportional rule the HPA applies. This runs here:
package main

import (
	"fmt"
	"math"
)

// desiredReplicas implements the core HPA formula, clamped to [min,max].
func desiredReplicas(current int, metricNow, target float64, min, max int) int {
	raw := math.Ceil(float64(current) * metricNow / target)
	d := int(raw)
	if d < min {
		d = min
	}
	if d > max {
		d = max
	}
	return d
}

func main() {
	const target = 50.0 // target avg CPU %
	min, max := 2, 10

	// Load rising: 4 pods at 80% avg CPU -> needs more.
	fmt.Println("80% load:", desiredReplicas(4, 80, target, min, max)) // ceil(4*80/50)=7

	// At target: stable.
	fmt.Println("50% load:", desiredReplicas(4, 50, target, min, max)) // 4

	// Quiet: scales in, but not below min.
	fmt.Println("10% load:", desiredReplicas(4, 10, target, min, max)) // max(min, ceil(0.8))=2

	// Spike beyond capacity: clamped to max (then load-shed the rest).
	fmt.Println("500% load:", desiredReplicas(4, 500, target, min, max)) // min(max,40)=10
}
Note the last line: past max, the autoscaler can’t help, and that’s where rate limiting & load shedding take over.
🐹 Go is built for horizontal scale (and cheap cold starts)
Go services autoscale well: they are typically stateless and small, a static binary in a tiny image starts its process in milliseconds, and the runtime sizes its scheduler to the CPUs it sees. Two tips. First, set GOMAXPROCS to match the pod’s CPU limit (for example with go.uber.org/automaxprocs). Otherwise the runtime may size itself to every core on the node, so a pod capped at 1 CPU runs far more threads in parallel than its quota allows and gets throttled. Second, scale on the metric that reflects your bottleneck, such as RPS or queue depth via custom/external metrics (KEDA), not just CPU, since an I/O-bound Go service can be saturated while its CPU sits mostly idle. The fast process start also makes scale-to-zero practical, although the first request after idling still waits for the pod to be scheduled and for any image pull.
⚠️ Scaling is not instant — and not free of state
Autoscaling reacts over seconds-to-minutes (detect → schedule → pull image → start → pass readiness). A sharp spike will saturate current capacity before reinforcements arrive, so pair it with load shedding to survive the gap. Also watch for: a too-tight or missing stabilization window causing flapping (scale up/down repeatedly); scaling a stateful workload that can’t simply add replicas (sticky sessions, leader election); and the autoscaler fighting your deploy if both change replica counts at once. Set sane min/max, a cooldown, and scale on a metric that won’t oscillate.
See also
- Rate limiting & load shedding — surviving the spike before scaling catches up.
- Deployment strategies — rollouts that interact with replica counts.
- Health probes & graceful lifecycle — readiness gates new replicas; graceful stop drains removed ones.
- Kubernetes HPA docs & KEDA — the autoscalers.
Next: the infrastructure layer that handles service-to-service traffic, retries, and mTLS for you — service mesh.
Related topics
- Rate Limiting & Load Shedding: protecting a service from too much traffic (rate limiting vs load shedding vs backpressure), graceful degradation under overload, and why dropping some requests beats collapsing under all of them.
- Deployment Strategies: shipping a new version without dropping traffic (rolling, blue-green, and canary releases), the health-gated promotion that makes them safe, and how readiness probes tie in.
- Health Probes & Graceful Lifecycle: telling Kubernetes the truth about your service (liveness vs readiness vs startup probes), and graceful shutdown that drains in-flight requests so deploys never drop traffic.
Check your understanding
1. What's the difference between horizontal and vertical scaling?
Horizontal scaling (scale out/in) changes the number of instances; vertical scaling (scale up/down) changes the size of each instance. For stateless services — like a typical Go HTTP server — horizontal is preferred: it has no single-machine ceiling, gives redundancy, and matches how load balancers spread traffic. Vertical scaling suits things that can't easily shard (some databases) but has a hard upper limit and usually needs a restart.
2. How does the Kubernetes Horizontal Pod Autoscaler (HPA) decide the replica count?
The HPA periodically computes desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). If 4 pods average 80% CPU against a 50% target, it wants ceil(4 × 80/50) = ceil(6.4) = 7 pods. It's a proportional control loop aiming to keep the observed average at the target, clamped between configured min/max replicas, with stabilization windows to avoid flapping.
3. Why is scaling on a request-based metric (e.g. requests-per-second or queue depth) often better than scaling on CPU?
CPU utilization is a convenient default but a leaky proxy. An I/O-bound Go service (waiting on DB/HTTP calls) can be overwhelmed — queues growing, latency climbing — while CPU sits low, so a CPU-based HPA won't scale up. Scaling on the signal that reflects real demand (requests/sec, p99 latency, or message-queue backlog via custom/external metrics, e.g. KEDA) tracks load far better. Pick the metric that actually correlates with 'we need more capacity.'
4. What is 'scale to zero' and what's its main trade-off?
Scale-to-zero (Knative, KEDA, many serverless platforms) drops a service to zero replicas when idle, so you pay nothing while it's not serving. The trade-off is the cold start: the first request after idling waits for an instance to be scheduled and started. Go shines here — a small static binary starts in milliseconds, making its cold starts far cheaper than runtimes that must boot a VM/interpreter. Keep a warm minimum if even that latency is unacceptable.
5. Why must autoscaling be paired with load shedding / rate limiting?
Autoscaling reacts on the order of seconds-to-minutes (detect, schedule, pull image, start, pass readiness). A sudden traffic spike can overwhelm current capacity in that window, so without protection the existing pods collapse before reinforcements arrive. Rate limiting and load shedding cover that gap — bounding concurrent work and rejecting excess gracefully — so the service degrades instead of dying while the autoscaler catches up. They're complementary: scaling for sustained demand, shedding for the transient overshoot.