The Go Reference


The Go Scheduler

The M:N scheduler — G/M/P, local and global run queues, work-stealing, syscall handoff, the netpoller, preemption, and GOMAXPROCS.


👩‍🍳 Analogy

A kitchen has a few stations (P), a pool of cooks (M), and a stream of orders (G). A cook works one station, pulling tickets from that station’s rail. If a cook’s rail empties, they grab tickets from a busier station (work-stealing). If a cook has to step into the walk-in freezer and wait (a blocking syscall), they hand the station to another cook so it never sits idle. And if one cook keeps hogging a station, the head chef (sysmon) taps them on the shoulder and rotates them out. Master this one picture and the rest of the runtime track falls into place.

The M:N model

Many languages and runtimes map each unit of concurrency to its own OS thread, so 100,000 of them means 100,000 threads: gigabytes of stack and a thrashing kernel scheduler. Go instead multiplexes many goroutines onto a few threads (an M:N scheduler) with three actors:

  • G — goroutine. Your concurrent function, its stack, and its scheduling state. Cheap: a g struct plus a ~2 KB growable stack. You can have millions.
  • M — machine. An OS thread, the only thing that actually executes instructions. Expensive: each is a real kernel thread.
  • P — processor. A scheduling context — the permission-to-run-Go-code token. A P owns a local run queue of runnable goroutines and a slice of caches (the mcache for allocation). To run Go code, an M must hold a P.

The key invariant: you have exactly GOMAXPROCS Ps, so at most GOMAXPROCS goroutines run Go code simultaneously. That is the cap on true parallelism. Threads (Ms) can outnumber Ps — extras exist to absorb blocking syscalls — but only GOMAXPROCS of them hold a P and run Go at any instant.

graph TD
subgraph P1["P0 (local run queue)"]
  G1["G"] --- G2["G"] --- G3["G"]
end
subgraph P2["P1 (local run queue)"]
  G4["G"] --- G5["G"]
end
GQ["global run queue"]
M1["M0 (OS thread)"] --> P1
M2["M1 (OS thread)"] --> P2
P2 -. "steals half" .-> P1
GQ -. "drained when local empties" .-> P2

Think of it as three layers: G is the work, P is a seat with a to-do list, M is a worker who must sit in a seat to do anything. The scheduler’s whole job is keeping a goroutine on every seat.
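The M:N idea can be seen in a few lines. The sketch below is illustrative only (the helper name runMany is made up for this example): it starts far more goroutines than there are Ps and lets the runtime multiplex them.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// runMany starts n goroutines that each bump a shared counter once,
// waits for all of them, and returns the final count.
// The name and shape are illustrative, not part of any library.
func runMany(n int) int64 {
	var counter int64
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			atomic.AddInt64(&counter, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&counter)
}

func main() {
	const n = 100_000
	// The number of Ps is small (usually the core count)...
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0))
	// ...yet all n goroutines get scheduled onto them.
	fmt.Println("goroutines run: ", runMany(n))
}
```

The count of Ps printed depends on the machine; the number of goroutines that ran does not.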

Run queues: local, global, and runnext

Runnable goroutines live in three places, checked in a deliberate order:

| Queue | Scope | Capacity | Role |
|---|---|---|---|
| runnext slot | per-P | 1 G | The just-woken goroutine, run next for cache locality (e.g. the receiver after a channel send). |
| Local run queue | per-P | 256 Gs (ring buffer) | The hot path: lock-free for the owning P, so most scheduling needs no contention. |
| Global run queue | shared | unbounded (linked list) | Overflow from full local queues and goroutines with no obvious home; guarded by a lock. |

When a P needs work it looks in this order: runnext → local queue → global queue → the netpoller → steal from another P. Every so often it checks the global queue first, so goroutines waiting there are not starved. The local queue is lock-free for its owner, which is why scheduling is cheap: in the common case a P pops its own ring buffer with a few atomic operations and never takes a lock. New goroutines from go f() go onto the current P’s runnext/local queue; if the local queue is full, a batch overflows to the global queue so one busy producer can’t monopolize a single P.
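The lookup order and the overflow rule can be modeled in a few lines. This is a toy, single-threaded sketch, not the runtime's code: the types toyP and toySched, the tiny queue limit, and the string goroutine names are all assumptions made for illustration. It leaves out the netpoller, stealing, and the occasional global-first check.

```go
package main

import (
	"fmt"
	"strings"
)

// toyP is a hypothetical model of one P: a runnext slot plus a bounded local queue.
type toyP struct {
	runnext string   // "" means the slot is empty
	local   []string // FIFO, at most limit entries
	limit   int
}

// toySched holds the shared global queue (a real one is guarded by a lock).
type toySched struct{ global []string }

// put models `go f()`: the new goroutine takes the runnext slot and the
// goroutine it displaces moves to the local queue. If the local queue is
// full, half of it plus the displaced goroutine overflow to the global queue.
func (s *toySched) put(p *toyP, g string) {
	old := p.runnext
	p.runnext = g
	if old == "" {
		return
	}
	if len(p.local) < p.limit {
		p.local = append(p.local, old)
		return
	}
	half := len(p.local) / 2
	s.global = append(s.global, p.local[:half]...)
	s.global = append(s.global, old)
	p.local = append([]string(nil), p.local[half:]...)
}

// next picks the goroutine to run: runnext, then local queue, then global queue.
func (s *toySched) next(p *toyP) (string, bool) {
	if p.runnext != "" {
		g := p.runnext
		p.runnext = ""
		return g, true
	}
	if len(p.local) > 0 {
		g := p.local[0]
		p.local = p.local[1:]
		return g, true
	}
	if len(s.global) > 0 {
		g := s.global[0]
		s.global = s.global[1:]
		return g, true
	}
	return "", false
}

// drainOrder enqueues g1..gn on one P and returns the order they would run in.
func drainOrder(n, limit int) string {
	s := &toySched{}
	p := &toyP{limit: limit}
	for i := 1; i <= n; i++ {
		s.put(p, fmt.Sprintf("g%d", i))
	}
	var order []string
	for {
		g, ok := s.next(p)
		if !ok {
			break
		}
		order = append(order, g)
	}
	return strings.Join(order, " ")
}

func main() {
	fmt.Println(drainOrder(6, 4)) // prints "g6 g3 g4 g1 g2 g5"
}
```

With a limit of 4, the sixth goroutine triggers an overflow: g1, g2, and the displaced g5 land on the global queue, so they run last even though g1 and g2 were created first.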

Work-stealing keeps cores busy

When a P’s local queue and runnext are empty, it does not go idle. It runs the findrunnable loop: peek at the global queue, poll the netpoller, then steal from a peer. Stealing picks a random victim P and takes half of its local queue.

graph LR
IDLE["P1 local queue empty"] --> GLOBAL{"global queue<br/>has work?"}
GLOBAL -->|yes| TAKE["take a batch"]
GLOBAL -->|no| POLL{"netpoller<br/>ready Gs?"}
POLL -->|yes| RUNNET["run I/O-ready Gs"]
POLL -->|no| STEAL["steal half of a<br/>random P's queue"]
STEAL --> RUN["run stolen Gs"]

Why half and not one? Taking half means the victim isn’t drained on the very next steal, and the stealer gets enough work to stay busy for a while, so load spreads across Ps in a few hops instead of one-at-a-time ping-pong. Why random victims? It avoids every idle P piling onto the same unlucky queue. The result is balanced load with no central scheduler on the hot path, which is what lets Go scale to many cores. If findrunnable finds nothing anywhere, the P is parked on an idle list and its M may sleep, to be woken when work appears.
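A minimal sketch of the steal-half rule follows. It is a toy model under stated assumptions: queues are plain slices of ints, the helper stealHalf is hypothetical, and rounding the half up and taking from the head of the victim's queue are modeling choices for this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// stealHalf takes half of the victim's queue (rounded up), starting from the
// head, and returns the stolen goroutines plus what the victim keeps.
// Hypothetical helper for illustration only.
func stealHalf(victim []int) (stolen, rest []int) {
	n := len(victim) - len(victim)/2
	stolen = append(stolen, victim[:n]...)
	rest = append(rest, victim[n:]...)
	return stolen, rest
}

func main() {
	// Four toy Ps; P1 is idle and will try to steal.
	queues := [][]int{{1, 2, 3, 4, 5, 6, 7, 8}, {}, {9, 10}, {}}
	thief := 1

	// Pick random victims until one with work is found (bounded retries).
	for tries := 0; tries < 100 && len(queues[thief]) == 0; tries++ {
		v := rand.Intn(len(queues))
		if v == thief || len(queues[v]) == 0 {
			continue
		}
		queues[thief], queues[v] = stealHalf(queues[v])
	}

	for i, q := range queues {
		fmt.Printf("P%d: %v\n", i, q)
	}
}
```

Which victim gets picked varies from run to run, but in every case the thief ends up with work and the victim keeps some of its own.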

Blocking syscalls: handing off the P

A goroutine reading a file with (*os.File).Read, making a cgo call, or issuing any other blocking syscall must block its M in the kernel; there’s no avoiding that. The trick is that the P does not block with it.

sequenceDiagram
participant G as G (syscall)
participant M0 as M0
participant P as P
participant M1 as M1 (idle/new)
G->>M0: enter blocking syscall
M0->>P: detach P (handoff)
P->>M1: M1 acquires P
Note over M1,P: P's other goroutines keep running
G-->>M0: syscall returns
M0->>M0: try to reacquire a P
Note over M0: if none free, park G on a queue and M0 sleeps

On entering a blocking syscall (entersyscall), the runtime marks the P as detachable. The background sysmon thread (or the M itself, on the fast path via entersyscallblock) hands the P to an idle M — or spins up a new one — so the P’s remaining goroutines keep executing. When the syscall returns (exitsyscall), the original M tries to grab a P back; if none is free, it parks its goroutine on the global queue and goes to sleep in the M pool. This is why one slow read() never freezes your whole program — and also why a flood of blocking calls can spawn many Ms (see the gotcha). Non-blocking syscalls that return instantly use a cheaper path that doesn’t hand off the P.
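The handoff can be observed with a small experiment. This is a sketch under assumptions: it is Unix-only (it uses raw syscall.Pipe and syscall.Read so the read really blocks a thread instead of going through the netpoller), the helper name workWhileBlocked is made up, and the short sleep is only a best-effort way to let the reader enter the syscall first. With a single P, other goroutines still finish while one goroutine sits in a blocking read.

```go
//go:build unix

package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"syscall"
	"time"
)

// workWhileBlocked runs with one P, blocks one goroutine in a raw read(2),
// and reports how many other goroutines completed in the meantime.
// Hypothetical helper for illustration only.
func workWhileBlocked(workers int) (int64, error) {
	prev := runtime.GOMAXPROCS(1) // a single P makes the handoff visible
	defer runtime.GOMAXPROCS(prev)

	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	readDone := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		for {
			// Raw blocking read: this goroutine's M waits in the kernel.
			_, err := syscall.Read(fds[0], buf)
			if err == syscall.EINTR {
				continue // retry if a signal interrupted the call
			}
			readDone <- err
			return
		}
	}()
	time.Sleep(10 * time.Millisecond) // best effort: let the reader block first

	var finished int64
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			atomic.AddInt64(&finished, 1)
		}()
	}
	wg.Wait() // completes even though the reader is still blocked

	// Unblock the reader and wait for it to return.
	if _, err := syscall.Write(fds[1], []byte{1}); err != nil {
		return 0, err
	}
	if err := <-readDone; err != nil {
		return 0, err
	}
	return atomic.LoadInt64(&finished), nil
}

func main() {
	n, err := workWhileBlocked(100)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("workers finished while a read was blocked:", n)
}
```

If the P stayed attached to the blocked thread, wg.Wait() could not return before the write that unblocks the reader, and the program would hang.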

The netpoller: parking on I/O

Network I/O is special. If every blocked socket read tied up an M, a server with 10,000 idle connections would need 10,000 threads. Instead, Go’s network, timer, and (on some platforms) file primitives are built on the network poller — a thin layer over epoll (Linux), kqueue (BSD/macOS), or IOCP (Windows).

When a goroutine reads from a socket with no data ready, the runtime parks the goroutine (removes it from any run queue, no M held) and registers the fd with the poller. The M is now free to run other goroutines. A dedicated check of the poller — inside findrunnable and periodically by sysmon — asks the kernel “which fds are ready?” and un-parks the corresponding goroutines back onto a run queue.

graph TD
G["G: conn.Read() — no data"] --> PARK["park G, register fd with epoll/kqueue"]
PARK --> FREE["M is free to run other Gs"]
POLLER["netpoller checks kernel"] --> READY{"fd ready?"}
READY -->|yes| WAKE["mark G runnable → run queue"]
READY -->|no| POLLER

The payoff: blocking-style code, non-blocking performance. You write the obvious n, err := conn.Read(buf) and the runtime quietly turns it into an event-loop registration — no callbacks, no async/await. Tens of thousands of idle connections cost a handful of threads, not one each. This is the engine under every Go HTTP server.
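Here is a minimal sketch of that blocking style, assuming a loopback TCP listener is available. The helper name echoWithIdleConns is made up for this example. The server goroutines call plain blocking reads (through io.Copy); each idle connection costs a parked goroutine rather than a thread.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// echoWithIdleConns starts a loopback echo server, opens `idle` connections
// that never send anything, then echoes msg over one more connection.
// Hypothetical helper for illustration only.
func echoWithIdleConns(msg string, idle int) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func() {
				defer c.Close()
				// Blocking-style echo: the goroutine parks inside Read
				// whenever the socket has no data.
				io.Copy(c, c)
			}()
		}
	}()

	conns := make([]net.Conn, 0, idle+1)
	defer func() {
		for _, c := range conns {
			c.Close()
		}
	}()
	for i := 0; i < idle+1; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return "", err
		}
		conns = append(conns, c)
	}

	// Use the last connection; the others stay idle on the server side.
	active := conns[idle]
	if _, err := active.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(active, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	reply, err := echoWithIdleConns("ping", 50)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("echoed:", reply)
}
```

There is no event loop or callback in the source; the registration with the poller happens inside the net package's Read and Write.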

Preemption: cooperative, then asynchronous

A goroutine must eventually yield its P so others get a turn. Go uses two mechanisms:

  • Cooperative preemption (always). At function-call prologues the compiler inserts a tiny stack-growth/preempt check. When sysmon flags a goroutine as running too long (~10 ms), that check trips on the next call and the goroutine yields. Cheap, but it only fires at a call.
  • Asynchronous preemption (Go 1.14+). A goroutine in a tight loop with no calls, such as for { x++ }, never hits a prologue, so before 1.14 it could hog a P indefinitely, starving every other goroutine on that P and stalling the GC (which needs all goroutines to reach a safe point). Go 1.14 fixed this: sysmon sends the M a signal (SIGURG on Unix-like systems); the signal handler stops the goroutine at a register-safe instruction, puts it back on a run queue, and frees the P.
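The effect of asynchronous preemption can be checked with a single P. This is a sketch under assumptions: it needs Go 1.14 or later on a platform that supports async preemption, and the helper name spinUntilStopped is made up. On an older runtime, or with GODEBUG=asyncpreemptoff=1, the same program would be expected to hang, because the spinning goroutine would never give the P back.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// spinUntilStopped runs a call-free busy loop on the only P, sleeps on the
// calling goroutine, then stops the loop. It returns the elapsed milliseconds.
// Hypothetical helper for illustration only.
func spinUntilStopped(pauseMillis int) int64 {
	prev := runtime.GOMAXPROCS(1) // one P: the spinner and the caller must share it
	defer runtime.GOMAXPROCS(prev)

	var stop int32
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		// On common platforms the atomic load compiles to an instruction,
		// not a function call, so there is no cooperative preemption point.
		for atomic.LoadInt32(&stop) == 0 {
		}
	}()

	begin := time.Now()
	// While the caller sleeps, the spinner owns the P. The caller can only
	// resume if the runtime preempts the spinner.
	time.Sleep(time.Duration(pauseMillis) * time.Millisecond)
	atomic.StoreInt32(&stop, 1)
	<-exited
	return time.Since(begin).Milliseconds()
}

func main() {
	fmt.Println("caller resumed after (ms):", spinUntilStopped(30))
}
```

The exact elapsed time varies; the point is that the function returns at all.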

⚠️ A tight loop is no longer a deadlock — but it's still rude

With async preemption, a busy loop won’t freeze the program — sysmon will rotate it out. But it still burns a whole P spinning on nothing, and tight loops in cgo or with the signal masked can still resist preemption. Don’t spin-wait. Block on a channel or sync.Cond, or select on a context, so the goroutine parks (zero CPU) instead of busy-waiting. A spin loop “works” but wastes a core another goroutine could use.
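A minimal sketch of parking instead of spinning follows; the helper names waitReady and readyWithin are made up for this example. The waiting goroutine uses select on a channel and a context, so it consumes no CPU until one of them fires.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitReady parks the calling goroutine until ready is closed or ctx ends.
// It returns nil if the ready signal arrived first.
func waitReady(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// readyWithin wires up a demo: the signal fires after signalAfterMs, and the
// waiter gives up after timeoutMs. It reports whether the signal won.
func readyWithin(signalAfterMs, timeoutMs int) bool {
	ready := make(chan struct{})
	go func() {
		time.Sleep(time.Duration(signalAfterMs) * time.Millisecond)
		close(ready)
	}()
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return waitReady(ctx, ready) == nil
}

func main() {
	fmt.Println("signal before timeout:", readyWithin(10, 2000)) // true
	fmt.Println("signal after timeout: ", readyWithin(2000, 10)) // false
}
```

The spin-loop equivalent would poll a flag in a for loop and occupy a P the whole time it waits.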

The runtime also preempts for STW events: when the GC needs every goroutine at a safe point, async preemption is what guarantees even a calls-free loop stops promptly, which is one reason GC stop-the-world pauses are typically well under a millisecond.

GOMAXPROCS and the container default

GOMAXPROCS sets the number of Ps. Read it with runtime.GOMAXPROCS(0) (the 0 means “report, don’t change”); set it with the env var GOMAXPROCS or the same call with a positive argument.

GOMAXPROCS=4 ./server          # pin to 4 Ps regardless of core count
GODEBUG=schedtrace=1000 ./server   # log scheduler state (G/M/P counts) every 1s
GODEBUG=scheddetail=1,schedtrace=1000 ./server  # add per-P detail

// Read (never blindly hard-code) the current value; setting returns the old one.
n := runtime.GOMAXPROCS(0) // report only
old := runtime.GOMAXPROCS(2) // set to 2, returns previous value
_ = old
_ = n

| Go version | Default GOMAXPROCS | Container behavior |
|---|---|---|
| ≤ 1.4 | 1 | one P unless you set it |
| 1.5 – 1.24 | runtime.NumCPU() (host cores) | ignores cgroup CPU limits; over-provisions in pods |
| 1.25+ | cgroup CPU limit (Linux), rounded up, min 2; else host cores | container-aware by default; updates live if the limit changes |

🐹 You almost never set GOMAXPROCS — except in containers (pre-1.25)

The default — one P per available CPU — is right for nearly everything; pinning it manually usually hurts. The classic trap is containers before Go 1.25: the runtime read the host’s core count, not your cgroup quota, so a 2-CPU-limited pod on a 64-core node spun up 64 Ps. That over-subscription thrashes the OS scheduler, inflates GC assist work, and tanks tail latency. Go 1.25 fixed the default to honor the cgroup limit on Linux. On older runtimes, set GOMAXPROCS explicitly from your limit, or import go.uber.org/automaxprocs to do it for you.
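For older runtimes, the arithmetic behind "set GOMAXPROCS from your limit" can be sketched as below. This is a hypothetical helper that mirrors the rule summarized in the table above (round the quota up, never go below 2, never exceed the host's CPUs); it is not the runtime's implementation, and reading the quota and period from the cgroup filesystem is left out.

```go
package main

import (
	"fmt"
	"runtime"
)

// procsFromQuota turns a CPU quota and period (both in microseconds) into a
// GOMAXPROCS value. A non-positive quota or period means "no limit".
// Hypothetical helper for illustration only.
func procsFromQuota(quotaMicros, periodMicros int64, hostCPUs int) int {
	if quotaMicros <= 0 || periodMicros <= 0 {
		return hostCPUs // no limit configured: use every available CPU
	}
	limit := int((quotaMicros + periodMicros - 1) / periodMicros) // round up
	if limit < 2 {
		limit = 2 // keep a little parallelism for the GC and runtime
	}
	if limit > hostCPUs {
		limit = hostCPUs // never more Ps than CPUs
	}
	return limit
}

func main() {
	// Example: a pod limited to 1.5 CPUs (150ms of CPU per 100ms period).
	procs := procsFromQuota(150_000, 100_000, runtime.NumCPU())
	fmt.Println("GOMAXPROCS for a 1.5-CPU limit:", procs)

	// To apply it on a pre-1.25 runtime: runtime.GOMAXPROCS(procs)
}
```

On a 64-core node this yields 2 for the 1.5-CPU pod instead of 64.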

See the runtime for yourself

This prints stable facts about the scheduler. A WaitGroup barrier makes the goroutine count deterministic: we hold all 1000 workers at a gate, observe the live count, then release them.

scheduler.go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	fmt.Println("NumCPU>=1:    ", runtime.NumCPU() >= 1) // true on any machine
	fmt.Println("GOMAXPROCS>=1:", runtime.GOMAXPROCS(0) >= 1)
	fmt.Println("goroutines at start:", runtime.NumGoroutine()) // 1 (main)

	const N = 1000
	var start sync.WaitGroup // gate: workers wait here
	var done sync.WaitGroup  // barrier: main waits for finish
	start.Add(1)

	for i := 0; i < N; i++ {
		done.Add(1)
		go func() {
			defer done.Done()
			start.Wait() // park until main opens the gate
		}()
	}

	// All N workers exist and none can finish until the gate opens,
	// so the count is N workers plus main = N+1.
	fmt.Println("goroutines with workers parked:", runtime.NumGoroutine() == N+1)

	start.Done() // open the gate: every worker can now finish
	done.Wait()

	// done.Wait() can return just before the last workers have fully exited,
	// so yield until the runtime no longer counts them.
	for runtime.NumGoroutine() > 1 {
		runtime.Gosched()
	}
	fmt.Println("all workers finished:", runtime.NumGoroutine() == 1) // back to just main
}

The output is deterministic across machines because the program prints booleans that don’t depend on core count. NumGoroutine() == N+1 is exact while the workers are held at the gate, since none can return yet. The final check first yields until the count drops back to 1, because done.Wait() can return a moment before the last worker has fully exited.

When this matters in practice

  • Don’t fear goroutine count. 100,000 goroutines is normal; the scheduler is built for it. The limit is rarely the scheduler — it’s blocking work spawning Ms, or memory.
  • Bound blocking work. Each blocking syscall can park an M; thousands at once means thousands of threads. Cap concurrency with a worker pool or semaphore.
  • Never spin-wait. Park on a channel/context so the goroutine yields its P instead of burning a core (the gotcha above).
  • Set GOMAXPROCS from your cgroup on pre-1.25 runtimes. Otherwise containers over-subscribe.
  • CPU-bound stages scale to GOMAXPROCS, not beyond. More goroutines than Ps on pure CPU work just adds scheduling overhead — see concurrency vs parallelism.
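The "bound blocking work" advice above can be sketched with a buffered channel used as a semaphore. The helper name runBounded is made up for this example, and the peak counter exists only to show that the limit holds.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runBounded runs `tasks` calls to work with at most `limit` in flight at
// once and returns the highest concurrency it observed. limit must be >= 1.
// Hypothetical helper for illustration only.
func runBounded(tasks, limit int, work func()) int {
	sem := make(chan struct{}, limit) // one slot per allowed in-flight task
	var (
		mu       sync.Mutex
		inFlight int
		peak     int
		wg       sync.WaitGroup
	)
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		sem <- struct{}{} // parks the producer once `limit` tasks are running
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the task ends

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			work() // the potentially blocking call

			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// Stand-in for blocking work such as a file read or a cgo call.
	slow := func() { time.Sleep(5 * time.Millisecond) }
	peak := runBounded(100, 8, slow)
	fmt.Println("peak concurrency stayed within limit:", peak <= 8)
}
```

Because a slot is acquired before each goroutine starts, at most `limit` blocking calls (and therefore at most that many blocked threads from this code path) exist at once.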

✅ Treat goroutines as cheap, threads as precious

Goroutines are nearly free (see Goroutines); OS threads are not. The runtime grows the M pool whenever goroutines block in syscalls, and a burst of blocking calls can spawn many threads that linger. Keep blocking work bounded so the thread count stays sane, and prefer the netpoller-backed primitives (net, os pipes) that park goroutines instead of pinning threads.

Next: why those goroutine stacks are so cheap and the pauses so short — the garbage collector & stacks.

Check your understanding


1. What do G, M and P stand for in the scheduler?

G is a goroutine, M is a machine (an OS thread), and P is a processor (a scheduling context). The runtime multiplexes many goroutines (G) onto a smaller pool of OS threads (M), coordinated by Ps. A P is the permission-to-run-Go-code token: an M must hold a P to execute Go, and there are exactly GOMAXPROCS of them.

2. What is work-stealing?

To keep every P busy without a central bottleneck, an idle P first checks the global queue and the netpoller, then steals half of a randomly chosen victim P's local queue. Stealing half (not one) spreads the work so the victim isn't drained on the next steal.

3. A goroutine makes a blocking syscall (e.g. reading a file). What happens to its P?

On entering a blocking syscall the runtime detaches the P from the blocked M (handoff). An idle or new M picks up that P and runs its remaining goroutines, so one slow syscall never stalls a whole logical CPU. When the syscall returns, the M tries to reacquire a P.

4. Since Go 1.14, how can the runtime preempt a goroutine stuck in a tight loop with no function calls?

Before 1.14 preemption was purely cooperative (only at function-call/stack-growth checkpoints), so a calls-free loop could monopolize a P. Go 1.14 added asynchronous preemption: the sysmon thread signals a long-running goroutine, the signal handler parks it at a safe point, and the P is freed.

5. What does GOMAXPROCS control, and what is its default on Go 1.25 in a container?

GOMAXPROCS is the count of Ps, capping true parallelism (you can still have millions of goroutines). Through Go 1.24 the default was runtime.NumCPU() (host cores); Go 1.25 made it cgroup-aware on Linux, rounding the CPU quota up and never going below 2.
