🚦 Analogy
Replacing a bridge while traffic flows. Rolling: close and rebuild one lane at a time — traffic always has a lane. Blue-green: build a whole second bridge alongside, then divert everyone over in one go (and keep the old one ready in case the new one wobbles). Canary: open the new bridge to a trickle of cars first, watch it hold, then let more across. All three avoid the one thing you must never do — shut the river crossing entirely while you work.
Three ways to ship without an outage
    graph TD
        subgraph Rolling["Rolling (default)"]
            R["replace pods in batches; both versions briefly coexist"]
        end
        subgraph BG["Blue-Green"]
            BGd["green deployed beside blue → switch all traffic at once → instant flip-back"]
        end
        subgraph Canary["Canary"]
            C["1% → 5% → 25% → 100%, gated on metrics each step"]
        end
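To make the Rolling box in the diagram concrete, here is a minimal simulation of a batch-by-batch replacement. It is a sketch under simplifying assumptions, not how any particular orchestrator is implemented: the names `rollingUpdate`, `step`, `replicas`, and `batch` are made up for illustration, and every new pod is assumed to pass readiness immediately.

```go
package main

import "fmt"

// step records how many old and new pods exist at one point in the roll.
type step struct{ Old, New int }

// rollingUpdate simulates replacing `replicas` old pods `batch` at a time.
// Each round starts new pods first, then retires the same number of old ones,
// so serving capacity never dips below `replicas`. All names are illustrative.
func rollingUpdate(replicas, batch int) []step {
	if batch < 1 {
		batch = 1 // guard against a loop that never makes progress
	}
	old, fresh := replicas, 0
	history := []step{{old, fresh}}
	for old > 0 {
		n := batch
		if n > old {
			n = old
		}
		fresh += n // start n new pods (assumed to pass readiness)
		history = append(history, step{old, fresh})
		old -= n // only then retire n old pods
		history = append(history, step{old, fresh})
	}
	return history
}

func main() {
	for _, s := range rollingUpdate(4, 2) {
		fmt.Printf("old=%d new=%d serving=%d\n", s.Old, s.New, s.Old+s.New)
	}
}
```

With 4 replicas and a batch of 2, the middle of the run shows 2 old and 2 new pods serving side by side, which is the coexistence window the diagram mentions.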
| Strategy | Extra infra | Rollback | Blast radius of a bad release | Best for |
|---|---|---|---|---|
| Rolling | none | slow (roll back) | medium (some pods) | the sane default |
| Blue-green | 2× during cutover | instant (flip) | all-or-nothing | fast rollback, big releases |
| Canary | a bit (routing) | fast (shift back) | tiny (1% of users) | risky changes, real-traffic validation |
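The "instant (flip)" rollback in the Blue-green row comes from the cutover being a single pointer change rather than a pod-by-pod replacement. The sketch below models that with an atomic pointer swap. `Router`, `Backend`, `Switch`, and `Route` are hypothetical names for illustration; in practice the switch would be a load-balancer or router change, and the code assumes Go 1.19 or later for `atomic.Pointer`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Backend stands in for a whole environment (blue or green).
type Backend struct{ Name string }

// Router sends every request to whichever backend is currently live.
type Router struct {
	live atomic.Pointer[Backend]
}

// NewRouter returns a router that starts out serving from `initial`.
func NewRouter(initial *Backend) *Router {
	r := &Router{}
	r.live.Store(initial)
	return r
}

// Switch atomically points all traffic at `to` and returns the previous
// backend, which stays around so a rollback is just another Switch.
func (r *Router) Switch(to *Backend) *Backend { return r.live.Swap(to) }

// Route reports which backend would serve a request right now.
func (r *Router) Route() string { return r.live.Load().Name }

func main() {
	blue := &Backend{Name: "blue (v1)"}
	green := &Backend{Name: "green (v2)"}

	r := NewRouter(blue)
	fmt.Println("serving:", r.Route())

	previous := r.Switch(green) // cutover: all traffic at once
	fmt.Println("serving:", r.Route())

	r.Switch(previous) // rollback: flip straight back
	fmt.Println("serving:", r.Route())
}
```

Because the old environment is kept rather than torn down, rolling back costs the same as rolling forward. That is also why the blast radius is all-or-nothing: there is no intermediate state.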
The safety mechanism: health-gated promotion
What makes any of these safe isn’t the traffic shape — it’s gating each step on health. A new instance receives traffic only after its readiness probe passes; a canary advances to the next percentage only if its error rate and latency stay within budget. See health probes & graceful lifecycle for the probe mechanics.
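The readiness half of that gate can be sketched as a filter over instances: only those whose probe currently passes are eligible for traffic. This is a simplified model, not Kubernetes' actual endpoint logic. `Instance`, `Ready`, and `inRotation` are hypothetical names, and the probe is assumed to be a cheap function call.

```go
package main

import "fmt"

// Instance is a hypothetical backend with a readiness check.
type Instance struct {
	Name  string
	Ready func() bool // readiness probe; assumed cheap and side-effect free
}

// inRotation returns the names of the instances whose readiness probe
// passes right now. Anything still warming up is left out.
func inRotation(all []Instance) []string {
	var out []string
	for _, in := range all {
		if in.Ready() {
			out = append(out, in.Name)
		}
	}
	return out
}

func main() {
	warm := false // flips once the new pod has wired its dependencies
	pods := []Instance{
		{Name: "v1-a", Ready: func() bool { return true }},
		{Name: "v2-a", Ready: func() bool { return warm }},
	}

	fmt.Println(inRotation(pods)) // the new pod is not ready yet
	warm = true
	fmt.Println(inRotation(pods)) // the new pod joins the rotation
}
```

The first call lists only `v1-a`; after `warm` flips, `v2-a` appears as well. The traffic shape did not change between the two calls, only the health signal did.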
See it: a canary promotion gate
The program below illustrates the decision logic; it is not a deployment tool you install. It shows how a canary controller observes the new version's metrics and decides to promote or abort (the hold case is omitted here for brevity). Real progressive-delivery controllers (Argo Rollouts, Flagger) run this same kind of logic against live Prometheus metrics.
package main

import "fmt"

type Metrics struct {
	ErrorRate float64 // 0..1
	P99ms     int
}

// Budgets the canary must stay within to be promoted.
const (
	maxErrRate = 0.02 // 2%
	maxP99ms   = 300
)

var steps = []int{1, 5, 25, 50, 100} // traffic % ladder

// decide promotes to the next step, aborts on a budget breach,
// or reports completion once the last step is reached.
func decide(step int, m Metrics) (next int, verdict string) {
	if m.ErrorRate > maxErrRate || m.P99ms > maxP99ms {
		return 0, "ABORT → shift traffic back to stable"
	}
	for i, s := range steps {
		if s == step && i+1 < len(steps) {
			return steps[i+1], "PROMOTE"
		}
	}
	return step, "DONE → canary is now stable"
}

func main() {
	// Healthy canary climbs the ladder.
	at := 1
	for _, m := range []Metrics{{0.005, 120}, {0.01, 180}, {0.008, 200}, {0.012, 260}} {
		next, v := decide(at, m)
		fmt.Printf("at %3d%% err=%.1f%% p99=%dms -> %s (next %d%%)\n",
			at, m.ErrorRate*100, m.P99ms, v, next)
		at = next
	}

	// A bad release trips the gate and rolls back.
	_, v := decide(25, Metrics{ErrorRate: 0.09, P99ms: 800})
	fmt.Println("bad release at 25%:", v)
}
The gate makes the rollback decision automatic and metric-driven, which is the job tools like Argo Rollouts or Flagger do against real Prometheus metrics.
🐹 Go services are easy to deploy this way — if they shut down gracefully
A small static Go binary in a tiny container starts in milliseconds, so rolling and canary steps are quick. The one thing your code must do is cooperate with the rollout: implement a readiness probe that only reports ready once dependencies are wired, and handle SIGTERM to drain in-flight requests before exiting (server.Shutdown(ctx)). Without graceful shutdown, every pod the rollout retires drops its current requests — turning a “zero-downtime” strategy into a steady drip of 502s on each deploy. The traffic strategy is the orchestrator’s job; being safe to stop is yours.
⚠️ Both versions run at once — so changes must be compatible
Every strategy here has the old and new versions live simultaneously, sharing one database and serving existing clients. A deploy that breaks the still-running version causes an outage during the rollout. Use the expand/contract (parallel-change) pattern for breaking changes: first deploy a version that adds the new element (a nullable column, a new field, a new endpoint) while still supporting the old one; then migrate data and clients; only then deploy a version that removes the old one. Never drop a column or API field that the currently running version still depends on. Database migrations especially must be backward-compatible with the version that is still running at each deploy step.
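As one illustration of the expand phase, the sketch below shows a message type partway through a rename. The `Order` type and its `customer` and `customer_id` fields are hypothetical, invented for this example. The reader accepts either field and the writer emits both, so the old and new versions can exchange messages during the rollout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Order is a hypothetical message being migrated from "customer" to
// "customer_id". Both fields exist during the expand phase.
type Order struct {
	CustomerID string `json:"customer_id,omitempty"` // new field (expand)
	Customer   string `json:"customer,omitempty"`    // old field, dropped in the contract step
}

// ID prefers the new field and falls back to the old one.
func (o Order) ID() string {
	if o.CustomerID != "" {
		return o.CustomerID
	}
	return o.Customer
}

// decodeCustomer reads a message written by either version.
func decodeCustomer(raw []byte) (string, error) {
	var o Order
	if err := json.Unmarshal(raw, &o); err != nil {
		return "", err
	}
	return o.ID(), nil
}

// encodeOrder writes both fields so readers on the old version keep working.
func encodeOrder(id string) ([]byte, error) {
	return json.Marshal(Order{CustomerID: id, Customer: id})
}

func main() {
	for _, raw := range []string{`{"customer":"c-1"}`, `{"customer_id":"c-2"}`} {
		id, err := decodeCustomer([]byte(raw))
		fmt.Println(id, err)
	}
	out, _ := encodeOrder("c-3")
	fmt.Println(string(out))
}
```

Only after every reader and writer runs this dual-format version would a later deploy remove the `customer` field, which is the contract step.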
See also
- CI/CD & GitOps — the pipeline that drives these rollouts.
- Health probes & graceful lifecycle — the readiness/drain mechanics that make them zero-downtime.
- Autoscaling — scaling the replicas a rollout manages.
- Argo Rollouts & Flagger — progressive-delivery controllers.
Next: letting the platform size your service to its load — autoscaling.
Related topics
- CI/CD & GitOps: Getting a Go service from commit to production safely and repeatably: the CI vs CD split, what a Go pipeline runs, and GitOps (Git as the single source of truth a controller reconciles toward).
- Health Probes & Graceful Lifecycle (containers): Telling Kubernetes the truth about your service: liveness vs readiness vs startup probes, and graceful shutdown that drains in-flight requests so deploys never drop traffic.
- Autoscaling (resilience): Letting the platform size your service to demand: horizontal vs vertical scaling, the Kubernetes HPA control loop and its replica formula, scaling on the right signal, and scale-to-zero.
Check your understanding
1. What is a rolling deployment?
A rolling deployment (Kubernetes' default) gradually replaces old pods with new ones in batches — spin up some new, wait for them to pass readiness, retire some old, repeat — so there's always enough healthy capacity serving traffic. It needs no extra infrastructure, but during the roll both versions run simultaneously (so changes must be backward-compatible), and rollback means rolling back, which takes time.
2. What characterizes a blue-green deployment?
Blue-green keeps two complete environments. The current (blue) serves production while you deploy and smoke-test the new version (green) in parallel. Then you cut traffic over to green in one switch (a load-balancer/router change). The big win is instant rollback — flip back to blue. The cost is running double the infrastructure during the cutover, and handling stateful concerns (DB migrations, in-flight sessions) across the switch.
3. What is a canary release?
A canary (named after canaries in coal mines) sends a small slice of production traffic — say 1%, then 5%, 25%, 100% — to the new version while monitoring its error rate, latency, and key metrics. If the canary stays healthy you promote it further; if it degrades you abort and shift traffic back. It limits the blast radius of a bad release to a small fraction of users and gives real-traffic signal that staging can't.
4. Why are readiness probes essential to safe rolling/canary deploys?
During a rollout, Kubernetes adds a new pod to the load-balancer rotation only after its readiness probe succeeds — so a pod that's still warming up (connecting to the DB, loading config) won't get traffic and cause errors. Combined with graceful shutdown on the old pods (drain in-flight requests), readiness probes are what make a rolling or canary deploy actually zero-downtime. Get them wrong and you drop requests on every deploy.
5. What requirement do ALL these strategies impose on your application changes?
Because rolling, blue-green, and canary all have both versions live at once (and talking to one database and existing clients), a deploy must not break the version still running. That means compatible changes: add a nullable column before writing to it, support both old and new message formats, never remove an API field clients still use until they've migrated. The 'expand/contract' (parallel-change) pattern — add the new, migrate, then remove the old across separate deploys — is how you ship breaking changes safely.