🧹 Analogy
Old garbage collectors shut the whole building to sweep — everyone stops working (a long stop-the-world pause). Go’s GC is a cleaning crew that works alongside you, only briefly asking everyone to freeze for a moment so it can note who’s standing where. And each goroutine’s desk (stack) is expandable: it starts tiny, and the instant you need more room the runtime swaps in a bigger desk and moves your papers over. Cheap goroutines and tiny pauses both come from this one idea — do the expensive work concurrently, freeze only for an instant.
Why this page exists
Two runtime features make idiomatic Go possible: a garbage collector tuned for low latency (not peak throughput) so you can write allocation-heavy code in a service that must answer in milliseconds, and growable stacks so you can spawn goroutines by the million. Both work by moving the costly part off the critical path — concurrent marking for the GC, lazy growth for stacks. Understanding them tells you why your service’s tail latency is good and where your allocations are coming from.
The concurrent tri-color mark-sweep collector
Go’s GC is a concurrent, tri-color, mark-sweep collector. Unpacking that:
- Mark-sweep: find everything reachable from the roots (mark), then reclaim everything else (sweep). No compaction — objects don’t move — which keeps pointers stable and avoids a moving phase.
- Tri-color: every object is white (unvisited, presumed garbage), grey (reachable, not yet scanned), or black (reachable, fully scanned). Marking starts by shading the roots grey, then repeatedly picks a grey object, shades its white pointees grey, and turns it black. When no grey objects remain, every white object is unreachable and therefore garbage.
- Concurrent: the marking runs while your goroutines run. This is the headline feature and the reason pauses are tiny.
graph LR
  R["roots: stacks + globals"] --> G1["grey: reachable, unscanned"]
  G1 -->|scan, shade pointees| B["black: reachable, scanned"]
  G1 -->|points to| W["white: presumed garbage"]
  W -.->|shaded grey when found| G1
  B -.->|nothing white reachable| SWEEP["white set = garbage → swept"]
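The tri-color loop can be sketched on a toy object graph. This is an illustrative model only, not the runtime's code: the `object` type, `mark`, and `sweep` are hypothetical names, and the real collector works on spans and bitmaps rather than a slice of structs.

```go
package main

import "fmt"

type color int

const (
	white color = iota // unvisited, presumed garbage
	grey               // reachable, not yet scanned
	black              // reachable, fully scanned
)

// object is a toy heap object: a name plus outgoing pointers.
type object struct {
	name string
	refs []*object
	c    color
}

// mark runs the tri-color loop from the roots and returns how many
// objects ended up black.
func mark(roots []*object) int {
	var work []*object // the grey set
	shade := func(o *object) {
		if o.c == white {
			o.c = grey
			work = append(work, o)
		}
	}
	for _, r := range roots {
		shade(r)
	}
	scanned := 0
	for len(work) > 0 {
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, p := range o.refs {
			shade(p) // white pointees become grey
		}
		o.c = black // all pointees seen: this object is done
		scanned++
	}
	return scanned
}

// sweep returns the names of objects that are still white after marking.
func sweep(heap []*object) []string {
	var garbage []string
	for _, o := range heap {
		if o.c == white {
			garbage = append(garbage, o.name)
		}
	}
	return garbage
}

func main() {
	a := &object{name: "A"}
	b := &object{name: "B"}
	c := &object{name: "C"}
	d := &object{name: "D"} // D and E point at each other but
	e := &object{name: "E"} // nothing reachable points at them
	a.refs = []*object{b, c}
	b.refs = []*object{c}
	d.refs = []*object{e}
	e.refs = []*object{d}
	heap := []*object{a, b, c, d, e}

	fmt.Println("marked:", mark([]*object{a})) // marked: 3
	fmt.Println("garbage:", sweep(heap))       // garbage: [D E]
}
```

D and E form a cycle, yet both are collected: reachability from the roots, not reference counts, decides what survives.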
A cycle has four phases; two brief stop-the-world (STW) pauses bracket the concurrent mark:
1. STW sweep termination / mark start: finish any leftover sweeping, turn on the write barrier, and queue the root-scanning jobs. Brief.
2. Concurrent mark: your code runs while the GC scans the roots (stacks and globals) and marks the heap. The bulk of the work, fully overlapped.
3. STW mark termination: finish marking and turn off the write barrier. Brief.
4. Concurrent sweep: reclaim white objects lazily as memory is requested.
Both STW phases are typically tens of microseconds to well under a millisecond, independent of heap size, because the heavy scanning happens in phase 2 alongside your program. That latency profile is exactly why Go suits low-latency network services.
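One way to check the pause claim on your own machine is to force a few cycles and read back the recorded STW pauses. The sketch below uses `runtime.GC` and `debug.ReadGCStats` from the standard library; `measurePauses` and `sink` are hypothetical names, and the durations it prints vary by machine and load, so no specific numbers are promised.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// sink keeps the allocations reachable so each cycle sees a live heap.
var sink [][]byte

// measurePauses builds a live heap of about liveMiB MiB, forces the given
// number of collections, and returns the runtime's GC statistics.
func measurePauses(liveMiB, cycles int) debug.GCStats {
	sink = make([][]byte, 0, liveMiB)
	for i := 0; i < liveMiB; i++ {
		sink = append(sink, make([]byte, 1<<20))
	}
	for i := 0; i < cycles; i++ {
		runtime.GC() // blocks until the cycle completes
	}
	var st debug.GCStats
	debug.ReadGCStats(&st)
	return st
}

func main() {
	st := measurePauses(64, 5)

	// st.Pause holds recent pause durations, most recent first.
	var worst time.Duration
	for _, p := range st.Pause {
		if p > worst {
			worst = p
		}
	}
	fmt.Println("completed cycles:", st.NumGC)
	fmt.Println("worst recent pause:", worst)
	fmt.Println("total pause time:", st.PauseTotal)
}
```

Rerunning with a larger `liveMiB` is a quick, informal way to see whether the pauses track heap size on your workload.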
The write barrier: keeping concurrent marking honest
Concurrency creates a hazard. Suppose the marker has already scanned object A (it’s black, “done”), and then your code stores into A a pointer to a still-white object C, while simultaneously deleting the only other reference to C. The marker has finished with A, so it will never revisit it — and C looks unreferenced. C gets swept while still live: a use-after-free. This is the lost-object problem.
The fix is the write barrier: a few instructions the compiler injects on pointer writes during a GC cycle. When you write a pointer, the barrier shades the relevant object grey so the marker is guaranteed to (re-)scan it. Go uses a hybrid (Dijkstra-style insertion + Yuasa-style deletion) barrier so stacks don’t need a re-scan STW.
sequenceDiagram
  participant Code as your goroutine
  participant WB as write barrier
  participant GC as marker
  Note over GC: A already black (scanned)
  Code->>WB: A.next = C (C is white)
  WB->>GC: shade C grey (re-scan it)
  GC->>GC: C scanned black → survives
  Note over Code,GC: no live object lost
The barrier costs a little on every pointer write while a cycle is active — a real but small tax you pay so marking can be concurrent. It’s off between cycles, so non-GC time is unaffected.
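The lost-object interleaving can be replayed on a toy model to see what the barrier buys. This is a simplified sketch under stated assumptions: `marker`, `writePointer`, and `race` are hypothetical names, and the barrier here always shades both the overwritten and the installed pointer, whereas the real hybrid barrier shades the installed pointer only while the writing goroutine's stack is still grey.

```go
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

type object struct {
	name string
	next *object
	c    color
}

// marker is a toy concurrent marker with an optional write barrier.
type marker struct {
	work    []*object // grey objects waiting to be scanned
	barrier bool
}

func (m *marker) shade(o *object) {
	if o != nil && o.c == white {
		o.c = grey
		m.work = append(m.work, o)
	}
}

// scanOne blackens one grey object; it reports false when none remain.
func (m *marker) scanOne() bool {
	if len(m.work) == 0 {
		return false
	}
	o := m.work[0]
	m.work = m.work[1:]
	m.shade(o.next)
	o.c = black
	return true
}

// writePointer models the mutator store *slot = ptr.
func (m *marker) writePointer(slot **object, ptr *object) {
	if m.barrier {
		m.shade(*slot) // deletion side: the pointer being overwritten
		m.shade(ptr)   // insertion side: the pointer being installed
	}
	*slot = ptr
}

// race replays the hazard and reports whether C was marked.
func race(barrier bool) bool {
	a := &object{name: "A"}
	b := &object{name: "B"}
	c := &object{name: "C"}
	b.next = c // at first, only B points to C

	m := &marker{barrier: barrier}
	m.shade(a) // A and B are roots
	m.shade(b)
	m.scanOne() // the marker finishes A: it is now black

	// The mutator runs mid-cycle.
	m.writePointer(&a.next, c)   // A.next = C, but A is already black
	m.writePointer(&b.next, nil) // drop the only other reference to C

	for m.scanOne() {
	}
	return c.c != white // C is still reachable through A
}

func main() {
	fmt.Println("without barrier, C survives:", race(false)) // false
	fmt.Println("with barrier, C survives:", race(true))     // true
}
```

Without the barrier, C stays white even though A points to it, which is exactly the use-after-free described above.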
Two knobs: GOGC and GOMEMLIMIT
The GC fires based on heap growth, controlled by two settings:
| Setting | Default | Meaning | Use it when |
|---|---|---|---|
| GOGC | 100 | Start a GC when live heap has grown 100% since the last cycle’s marked size. 200 = GC half as often, more memory; off = disable trigger. | Trading memory for less GC CPU on a throughput job. |
| GOMEMLIMIT | math.MaxInt64 (off) | A soft total-memory ceiling (Go 1.19+). As the heap nears it, the GC runs harder to stay under, ignoring the GOGC ratio. | Containers with a hard memory limit; set it just under the cgroup limit. |
GOGC alone is purely ratio-based, so a sudden allocation burst can overshoot a fixed memory limit before the next trigger — the classic container OOM. GOMEMLIMIT adds an absolute backstop. They compose: GOGC governs the steady-state frequency, GOMEMLIMIT caps the peak. It’s soft — it won’t crash you — but if live data genuinely exceeds the limit, the GC thrashes (a “death spiral”) instead of OOMing; set it from a real limit, not a guess.
GOGC=200 ./server # GC about half as often; heap goal ~3x live data instead of ~2x
GOMEMLIMIT=900MiB ./server # soft cap for a 1 GiB container
GODEBUG=gctrace=1 ./server # log each GC: pause, heap size, CPU%
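The same two knobs can be set from code, and the GOGC ratio is easy to work through by hand. In the sketch below, `debug.SetGCPercent` and `debug.SetMemoryLimit` (Go 1.19+) are the standard-library calls; `heapGoal` and `applyLimits` are hypothetical helpers, and `heapGoal` is a deliberately simplified model that ignores GC roots and the pacer's early start.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

const MiB = 1 << 20

// heapGoal approximates the heap size at which the next cycle is targeted:
// live + live*GOGC/100. Assumes gogc >= 0 (it does not model GOGC=off).
func heapGoal(liveBytes uint64, gogc int) uint64 {
	return liveBytes + liveBytes*uint64(gogc)/100
}

// applyLimits sets both knobs at run time. It returns the previous GOGC
// value and the memory limit now in effect.
func applyLimits(gogc int, limitBytes int64) (prevGOGC int, current int64) {
	prevGOGC = debug.SetGCPercent(gogc)
	debug.SetMemoryLimit(limitBytes)
	// A negative argument reads the limit without changing it.
	return prevGOGC, debug.SetMemoryLimit(-1)
}

func main() {
	// With 100 MiB of live data:
	fmt.Println(heapGoal(100*MiB, 100)/MiB, "MiB") // 200 MiB
	fmt.Println(heapGoal(100*MiB, 200)/MiB, "MiB") // 300 MiB
	fmt.Println(heapGoal(100*MiB, 50)/MiB, "MiB")  // 150 MiB

	prev, now := applyLimits(200, 900*MiB)
	fmt.Println("previous GOGC:", prev)
	fmt.Println("memory limit now:", now/MiB, "MiB") // 900 MiB
}
```

Environment variables are usually the better choice for deployment, since they need no rebuild; the programmatic form is handy when the limit is derived at startup, for example from a container's cgroup file.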
Growable, contiguous stacks
A goroutine starts with a ~2 KB stack — versus the ~1 MB an OS thread reserves. That 500× difference is why a million goroutines fit in memory. Stacks aren’t fixed: the function prologue contains a tiny stack-bounds check, and when a call would overflow, the runtime performs a stack copy:
- Allocate a new contiguous stack, roughly double the current size.
- Copy all existing frames into it.
- Fix up pointers that referenced the old stack (the runtime knows the layout from stack maps).
- Continue the call on the new, larger stack.
Because the new stack is one contiguous block (Go replaced segmented stacks, and their “hot split” problem, with contiguous stacks in 1.3), there’s no per-call boundary cost. Stacks also shrink: during GC, a goroutine using far less than its stack can be copied down to a smaller one, returning memory.
graph LR
  S1["new goroutine: ~2 KB"] -->|deeper calls overflow| S2["copy frames → ~4 KB"]
  S2 -->|deeper still| S3["copy → ~8 KB → ..."]
  S3 -.->|GC sees low usage, shrinks| S1
The tradeoff: a function on a hot path that repeatedly grows and shrinks its stack pays for the copies. Rare, but visible in profiles as runtime.morestack / runtime.copystack — usually fixed by avoiding huge stack-allocated arrays or extreme recursion depth.
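Stack growth is easy to observe indirectly: a goroutine that recurses far deeper than ~2 KB could ever hold still runs to completion. The sketch below is illustrative; `depthSum` and `doublings` are hypothetical helpers, and the padding array only exists to make each frame a little larger.

```go
package main

import "fmt"

// depthSum recurses n frames deep and returns n. Go does not eliminate
// tail calls, so every level occupies real stack space.
func depthSum(n int) int {
	if n == 0 {
		return 0
	}
	var pad [128]byte // padding to enlarge the frame
	pad[n%len(pad)] = 1
	return int(pad[n%len(pad)]) + depthSum(n-1)
}

// doublings counts how many times a stack of start bytes must double
// before it can hold need bytes, and returns the final size.
func doublings(start, need int) (n, final int) {
	size := start
	for size < need {
		size *= 2
		n++
	}
	return n, size
}

func main() {
	done := make(chan int)
	go func() { done <- depthSum(100_000) }() // starts on a tiny stack
	fmt.Println("recursion result:", <-done)  // recursion result: 100000

	n, final := doublings(2048, 1<<20)
	fmt.Printf("2 KiB -> %d bytes takes %d doublings\n", final, n)
	// 2 KiB -> 1048576 bytes takes 9 doublings
}
```

Because each copy doubles the size, reaching 1 MiB costs only nine copies, and the total bytes copied along the way stay below the final stack size.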
Escape analysis: stack is free, heap is not
Where a value lives is decided at compile time by escape analysis. If the compiler proves a value does not outlive its function, it lives on the stack — allocated by bumping the stack pointer, freed for free on return, invisible to the GC. If the value escapes, it’s heap-allocated and the GC must track it.
Common reasons a value escapes:
- Returned by pointer: `return &x` (the caller outlives the frame).
- Captured by a goroutine or closure that outlives the function.
- Stored in an interface: `var w io.Writer = &buf` often boxes onto the heap.
- Too large or dynamically sized to fit the stack safely (e.g. a slice whose size isn’t known).
- Referenced by something that escapes (transitivity).
go build -gcflags=-m ./... # "escapes to heap" / "does not escape" / "moved to heap"
// does NOT escape: sum lives and dies in addUp → stack, no allocation
func addUp(xs []int) int {
sum := 0
for _, x := range xs {
sum += x
}
return sum
}
// DOES escape: the returned pointer outlives the frame → heap allocation
func newCounter() *int {
c := 0
return &c // -m prints: moved to heap: c
}
Fewer escapes means fewer heap allocations means less GC work — which is why escape analysis, not the GC tuning knobs, is usually the first place to look when reducing allocation pressure.
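The compiler's verdict can also be confirmed at run time by counting allocations. The sketch below uses `testing.AllocsPerRun`, which works in an ordinary program; `measureAllocs` and `sink` are hypothetical names, and `//go:noinline` is there because an inlined `newCounter` might otherwise let the compiler keep `c` on the caller's stack.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the returned pointer observable so the allocation is not
// optimized away.
var sink *int

// addUp keeps everything on the stack.
func addUp(xs []int) int {
	sum := 0
	for _, x := range xs {
		sum += x
	}
	return sum
}

// newCounter returns a pointer to a local, so c is moved to the heap.
//
//go:noinline
func newCounter() *int {
	c := 0
	return &c
}

// measureAllocs reports the average heap allocations per call of each
// function.
func measureAllocs() (stackOnly, escaping float64) {
	xs := []int{1, 2, 3, 4}
	total := 0
	stackOnly = testing.AllocsPerRun(1000, func() { total += addUp(xs) })
	escaping = testing.AllocsPerRun(1000, func() { sink = newCounter() })
	_ = total
	return stackOnly, escaping
}

func main() {
	s, e := measureAllocs()
	fmt.Println("addUp allocations per call:", s)
	fmt.Println("newCounter allocations per call:", e)
}
```

On current toolchains this typically reports 0 for `addUp` and 1 for `newCounter`, matching what `-gcflags=-m` prints at build time.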
sync.Pool: reuse instead of re-allocate
When you must allocate repeatedly on a hot path (per-request buffers, encoders), sync.Pool lets goroutines reuse objects instead of minting new garbage each time, cutting GC pressure. The GC may drop pooled objects at any collection, so it’s a cache, not a guarantee, and Get may hand back a previously used object, so always reset state after Get. The program below runs both variants, and the allocation counts tell the whole story:
package main
import (
"bytes"
"fmt"
"runtime"
"sync"
)
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}
func mallocs() uint64 {
var m runtime.MemStats
runtime.ReadMemStats(&m)
return m.Mallocs // cumulative count of heap objects allocated
}
func main() {
const n = 100_000
// Without a pool: a fresh buffer every call → n allocations.
a := mallocs()
for i := 0; i < n; i++ {
buf := new(bytes.Buffer)
buf.WriteString("hello")
_ = buf.Len()
}
fmt.Printf("no pool: ~%d heap allocations\n", mallocs()-a)
// With a pool: reuse buffers → near-zero allocations after warmup.
for i := 0; i < 1000; i++ { // warm the pool
bufPool.Put(bufPool.Get())
}
c := mallocs()
for i := 0; i < n; i++ {
buf := bufPool.Get().(*bytes.Buffer)
buf.Reset() // pooled objects are dirty — reset!
buf.WriteString("hello")
_ = buf.Len()
bufPool.Put(buf) // return for the next caller
}
fmt.Printf("pooled: ~%d heap allocations (same %d calls)\n", mallocs()-c, n)
}
The pooled loop does the same work with near-zero allocations — that’s the GC pressure you’ve removed. See the sync package for the full API and when pooling actually pays off (it can hurt for cheap or rarely-reused objects).
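In request-handling code the pool usually hides behind a small helper so that callers cannot forget the reset or the Put. The following is a minimal sketch of that pattern, not a prescribed API: `greet`, `putBuf`, and the 64 KiB cutoff are illustrative assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// maxPooled is an assumed cutoff: buffers that grew past it are dropped
// rather than pooled, so one huge request does not pin memory.
const maxPooled = 64 << 10

// putBuf returns a buffer to the pool unless it has grown too large.
func putBuf(b *bytes.Buffer) {
	if b.Cap() <= maxPooled {
		bufPool.Put(b)
	}
}

// greet builds a string using a pooled scratch buffer.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old contents
	defer putBuf(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result outlives the Put
}

func main() {
	fmt.Println(greet("gopher")) // hello, gopher
	fmt.Println(greet("world"))  // hello, world
}
```

The key rule is that nothing referencing the buffer's memory may escape past the Put; returning `buf.Bytes()` here instead of `buf.String()` would hand out memory the next caller is about to overwrite.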
Watch the GC work — deterministically
runtime.GC() forces a full cycle, and NumGC (the count of completed cycles) only ever increases. We print booleans and a monotonic delta, not byte counts, so the output is identical on every machine.
package main
import (
"fmt"
"runtime"
)
func main() {
var before, after runtime.MemStats
runtime.ReadMemStats(&before)
// Allocate a few MiB of immediate garbage (unreferenced → collectable).
for i := 0; i < 16; i++ {
_ = make([]byte, 1<<20) // 1 MiB each, dropped immediately
}
runtime.GC() // force a full cycle and block until it completes
runtime.ReadMemStats(&after)
// NumGC is monotonic, so this delta is stable everywhere.
fmt.Println("at least one GC ran:", after.NumGC > before.NumGC)
// TotalAlloc is cumulative and never decreases — we allocated 16 MiB.
fmt.Println("allocated more after the loop:", after.TotalAlloc > before.TotalAlloc)
// GCCPUFraction is a fraction in [0,1] — a stable invariant.
fmt.Println("GC CPU fraction <= 1:", after.GCCPUFraction <= 1)
// After a forced GC, the next-GC heap goal is set (non-zero) — stable.
fmt.Println("next-GC goal is set:", after.NextGC > 0)
}
Every line prints a boolean that holds on any machine and any Go build — the deterministic-output rule for playgrounds. (Printing HeapAlloc in bytes would vary by platform and GC timing, so we avoid it.)
When this matters in practice
- Reduce allocations before tuning the GC. The biggest win is usually allocating less: keep values on the stack (mind escape analysis), preallocate slices with `make([]T, 0, n)`, reuse buffers with `sync.Pool`. Measure with `go test -bench . -benchmem` and `pprof` first.
- Set `GOMEMLIMIT` in memory-capped containers. Just under the cgroup limit; it prevents GOGC-overshoot OOMs. Pair with scheduler container awareness.
- Raise `GOGC` for batch/throughput jobs that have memory to spare and don’t care about pause frequency.
- Don’t micro-optimize stacks. Growable stacks are automatic; only deep recursion or giant stack arrays show up as `copystack` in a profile.
- `sync.Pool` is for hot, reused, resettable objects, not a general object cache; it can hurt otherwise.
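The slice-preallocation advice is the easiest of these to quantify. The sketch below counts allocations for a slice grown by `append` versus one sized up front; `grow`, `prealloc`, `compareAllocs`, and `sink` are hypothetical names, and the package-level sink forces the slice onto the heap in both cases so the comparison is fair.

```go
package main

import (
	"fmt"
	"testing"
)

// sink makes the slices escape, so neither variant is stack-allocated.
var sink []int

// grow starts from a nil slice and lets append reallocate as needed.
func grow(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// prealloc sizes the backing array once, up front.
func prealloc(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// compareAllocs reports average heap allocations per call of each variant.
func compareAllocs(n int) (grown, preallocated float64) {
	grown = testing.AllocsPerRun(100, func() { sink = grow(n) })
	preallocated = testing.AllocsPerRun(100, func() { sink = prealloc(n) })
	return grown, preallocated
}

func main() {
	g, p := compareAllocs(1000)
	fmt.Println("append from nil:", g, "allocations")
	fmt.Println("preallocated:", p, "allocations")
}
```

The preallocated version makes a single allocation, while the growing version reallocates and copies each time capacity runs out; the exact count for the latter depends on the runtime's growth policy.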
⚠️ GOMEMLIMIT is a soft cap, not an OOM killer
GOMEMLIMIT makes the GC try to stay under a byte ceiling; it does not refuse allocations or abort the program. If your live (reachable) data genuinely exceeds the limit, the GC runs back-to-back trying to claw back memory it can’t actually free, burning CPU in a death spiral instead of giving you a clean OOM. The runtime bounds the damage by capping the GC at roughly 50% of CPU time, at which point memory use simply grows past the limit. Set it from a real memory limit with headroom, and don’t combine GOMEMLIMIT with GOGC=off unless you’ve verified live data stays well below the cap.
🐹 The runtime is the reason everything here is cheap
Cheap goroutines, tiny stacks, a collector that mostly stays out of the way — these are why the whole concurrency track works. You spawn goroutines by the million, the scheduler spreads them across cores, and the GC reclaims their garbage without freezing your service. None of it requires manual memory management; all of it rewards allocating thoughtfully.
Next: the rules for what one goroutine is guaranteed to see of another’s writes — the Go memory model.
Related topics
- The M:N scheduler: G/M/P, local and global run queues, work-stealing, syscall handoff, the netpoller, preemption, and GOMAXPROCS.
- Goroutines: Go's lightweight, runtime-scheduled concurrent functions; the fork-join model, their tiny cost, M:N scheduling, and how to avoid leaks.
- The Go Memory Model: happens-before, the rules that decide when one goroutine is guaranteed to see another goroutine's writes.
Check your understanding
1. Why are Go's GC pauses so short (often tens of microseconds)?
Go uses a concurrent tri-color mark-sweep collector with a write barrier, so the heavy marking runs while your code runs. Only two short stop-the-world phases remain (start-mark and mark-termination), each typically well under a millisecond.
2. What is the write barrier for?
Because marking runs concurrently with your code, a pointer write could hide a still-reachable object from the marker (the lost-object problem). The write barrier shades the involved object grey so it's re-scanned, preserving correctness without a full stop-the-world.
3. How big is a goroutine's stack, and what happens when it needs more?
About 2 KB to start. Small, growable, contiguous stacks are why millions of goroutines fit in memory. A prologue check detects an impending overflow; the runtime copies the stack to a new contiguous block roughly double the size and rewrites pointers into it. Idle deep stacks can be shrunk back during GC.
4. What does escape analysis decide?
At compile time, if the compiler proves a value does not outlive its function, it stays on the stack — no allocation, no GC pressure. If it 'escapes' (returned by pointer, captured by a goroutine, stored in an interface), it goes on the heap. Inspect with `go build -gcflags=-m`.
5. What does GOMEMLIMIT (Go 1.19+) do that GOGC alone cannot?
GOGC triggers GC purely on heap growth ratio, which can overshoot a container's memory limit during a burst. GOMEMLIMIT adds a soft byte ceiling: as live heap nears it, the GC runs harder to stay under. It's soft (it won't crash you) but a runaway can still thrash; pair it with GOGC=off only when you understand the workload.