Observability · Cloud-Native · Intermediate

Distributed Tracing

Following one request across many services — traces and spans, context propagation, OpenTelemetry, sampling, and why a trace is the tool you reach for when metrics say 'slow' but not 'where'.


📦 Analogy

Tracing a request across microservices is package tracking for your data. A metric tells you “average delivery is slow this week”; a log tells you “the Memphis hub scanned a package.” But a tracking number follows one parcel through every hub, with a timestamp at each — so you see it sat in Memphis for two days. A trace is that tracking number for a request: one ID, a timestamped scan (span) at every service, assembled into the full journey so you can see exactly where it got stuck.

Traces and spans

A trace is one request’s entire journey, sharing a single trace ID. A span is one timed unit of work within it — handling the request, calling a downstream service, running a query — each with a span ID, a parent, a duration, and attributes. Spans nest into a waterfall:

graph TD
A["span: GET /checkout<br/>(gateway) — 240ms"] --> B["span: auth.Verify<br/>15ms"]
A --> C["span: orders.Create<br/>200ms"]
C --> D["span: db.INSERT<br/>30ms"]
C --> E["span: payments.Charge<br/>160ms ← the slow hop"]

That waterfall instantly shows payments.Charge ate most of the 240ms — something neither a metric (“/checkout is slow”) nor a log (one service’s events) could pinpoint.

See it: context propagation builds the trace

The mechanism is context propagation: the trace/span IDs ride in context.Context locally and in headers across the network. The program below is a simplified tracer that threads IDs through nested calls to build the span tree:


package main

import (
	"context"
	"fmt"
)

type traceKey struct{}
type spanCtx struct {
	traceID string
	spanID  string
	depth   int
}

// startSpan derives a child span from whatever is in the context.
func startSpan(ctx context.Context, name string) (context.Context, func()) {
	parent, _ := ctx.Value(traceKey{}).(spanCtx)
	if parent.traceID == "" {
		parent.traceID = "trace-7f3a" // root: would be random in real code
	}
	child := spanCtx{traceID: parent.traceID, spanID: name, depth: parent.depth + 1}
	fmt.Printf("%*sspan start: %-16s trace=%s\n", child.depth*2, "", name, child.traceID)
	ctx = context.WithValue(ctx, traceKey{}, child)
	return ctx, func() { fmt.Printf("%*sspan end:   %s\n", child.depth*2, "", name) }
}

func main() {
	ctx, end := startSpan(context.Background(), "GET /checkout")
	defer end()

	ctx2, end2 := startSpan(ctx, "orders.Create")
	// payments.Charge is a CHILD of orders.Create: the same trace ID propagates.
	_, end3 := startSpan(ctx2, "payments.Charge")
	end3()
	end2()
}

Every span shares the one trace=trace-7f3a because the ID propagates through ctx. Across the network, the client injects it into the W3C traceparent header and the server extracts it — so spans from different services join the same trace. This is the formalized version of the correlation IDs from logging.

OpenTelemetry does this for real

You don’t hand-roll tracers. OpenTelemetry (opentelemetry.io) is the vendor-neutral standard: instrument once, export to any backend (Jaeger, Tempo, Datadog). Instrumentation libraries for Go wrap net/http, gRPC, and database/sql so the common spans are created for you. The fragment below depends on the third-party SDK, so it is not runnable on its own:

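// Imports this fragment relies on:
//   "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
//   "go.opentelemetry.io/otel"
//   "go.opentelemetry.io/otel/attribute"

// otelhttp wraps a handler: it extracts traceparent, starts a server span,
// and puts the context (with trace IDs) into r.Context() for you.
handler := otelhttp.NewHandler(mux, "api")

// In code, create child spans and add attributes:
tracer := otel.Tracer("payments") // tracer name is illustrative
ctx, span := tracer.Start(ctx, "payments.Charge")
defer span.End()
span.SetAttributes(attribute.Int("amount", amount))
// Pass ctx to the next call: in-process propagation happens through ctx,
// and instrumented clients inject it into outgoing requests.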

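At scale you sample, so you keep the signal without the storage cost. Head-based sampling decides at the root span (for example, keep ~1% of traces), and that decision rides in the trace context as the sampled flag so all services agree. Keeping 100% of slow or errored traces needs tail-based sampling: the decision is made after the trace completes, typically in a collector that can see all of its spans.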

🐹 Pass ctx everywhere — it's the propagation backbone

Tracing only works if the trace context flows through your whole call path, which is exactly why Go’s convention is ctx context.Context as the first argument of every request-scoped function. The context that carries cancellation and deadlines also carries the trace and span IDs. Thread it from the HTTP handler through your business logic to the database and downstream clients, and OpenTelemetry’s instrumentation creates correctly-parented spans automatically. A function that drops ctx breaks the trace (and cancellation) for everything below it.

⚠️ A trace is incomplete if any hop doesn't propagate

Tracing breaks silently at the weakest link: if one service (or one client call) fails to propagate the trace context — a handler that ignores ctx, an HTTP client without the otel transport, a message published without injecting the headers — the trace splits, and you get a truncated waterfall that hides the very hop you’re hunting. Use the auto-instrumentation for net/http/gRPC/SQL, propagate context across async message boundaries too (inject traceparent into the message), and verify end-to-end traces in staging. A trace is only as complete as its least-instrumented service.

Check your understanding


1. What problem does distributed tracing solve that logs and metrics don't?

Metrics tell you the service is slow; logs show events within one service. Neither follows a single request as it fans out across gateway → auth → orders → payments → db. A trace does: each service records spans tied to one trace ID, assembled into a waterfall that shows exactly which downstream call ate the latency or returned the error.

2. What are a trace and a span?

A trace = the full request, sharing one trace ID. A span = a timed operation within it (handle the HTTP request, call payments, run a query), each with a span ID, a parent span ID, a duration, and attributes. Spans nest into a tree/waterfall, so you see both the structure and the timing of the request across services.

3. How does a trace stay connected as a request crosses a service boundary?

Within a process the trace context rides in context.Context; across the network it's injected into headers (the W3C traceparent header) by the client and extracted by the server, which starts a child span under the same trace ID. That propagation is what links spans from different services into one trace — and it's why passing ctx everywhere matters.

4. What is sampling in tracing, and why use it?

Tracing every request at scale generates enormous data. Sampling keeps a representative or interesting subset so you keep the signal while controlling storage and cost. Head-based sampling decides up front (e.g. 1% of all traces), and the trace context carries that decision so all services in one trace agree to record or skip it. Tail-based sampling decides after the trace completes, which is how you keep 100% of slow or errored ones.

5. What is OpenTelemetry (OTel)?

OpenTelemetry is the CNCF standard for telemetry: a common API/SDK (with a Go implementation) and wire format (OTLP) so you instrument your code once and export to whatever backend you choose, avoiding vendor lock-in. Its libraries auto-instrument net/http, gRPC, and database/sql, so much of the propagation and span creation is handled for you.
