golang-observability
samber/cc-skills-golang
Production observability for Go services: structured logging, metrics, tracing, profiling, and RUM in one skill.
What is golang-observability?
Comprehensive observability toolkit for Go services covering slog structured logging, Prometheus metrics, OpenTelemetry distributed tracing, pprof/Pyroscope continuous profiling, and server-side RUM event tracking. Use when instrumenting Go services for production monitoring, setting up metrics and alerting, adding tracing, correlating logs with traces, migrating legacy loggers, or implementing GDPR/CCPA-compliant tracking.
- Structured logging with log/slog for production-grade JSON logs and context correlation
- Prometheus metrics collection with Histograms, Counters, and Gauges for alerting and SLOs
- OpenTelemetry distributed tracing to track request flows across services and identify latency bottlenecks
- Continuous profiling with pprof and Pyroscope for CPU, memory, and lock contention analysis
- Server-side RUM event tracking and GDPR/CCPA-compliant customer data platform integration
- Grafana dashboard setup and multi-window SLO burn rate alerting with ~500 ready-to-use rules
How to install golang-observability
npx skills add https://github.com/samber/cc-skills-golang --skill golang-observability
- Go installed (go binary required)
- Familiarity with Go context propagation patterns
- Access to observability backends (Prometheus, Grafana, Jaeger/Tempo for tracing, Pyroscope for continuous profiling)
How to use golang-observability
1. Choose your observability signals: start with structured logging (slog) and Prometheus metrics, add OpenTelemetry tracing for request correlation
2. Configure slog with JSON handler and context propagation; use slog.InfoContext(ctx, ...) to inject trace IDs into logs
3. Declare Prometheus metrics (Histograms for latency, Counters for rates) at package level with PromQL-as-comments for discoverability
4. Set up OpenTelemetry TracerProvider early in main(); add spans to service methods, DB queries, and external API calls
5. Enable pprof via environment variables and secure the endpoint with authentication; optionally configure Pyroscope for continuous profiling
6. Create Grafana dashboards querying your metrics; set up multi-window SLO burn rate alerts using awesome-prometheus-alerts rules
Use cases
- Instrument a new Go microservice with all five signals (logs, metrics, traces, profiles, RUM) before production deployment
- Migrate an existing service from zap/logrus/zerolog to slog while maintaining trace correlation
- Add OpenTelemetry spans to HTTP handlers, database queries, and external API calls to diagnose slow requests
- Set up Prometheus metrics and alerting rules for a service using awesome-prometheus-alerts templates
- Enable pprof profiling in production via environment variables to investigate CPU or memory issues without redeploying
- Go backend engineers building production services
- DevOps/SRE teams setting up monitoring and alerting infrastructure
- Engineering leads ensuring observability is built into new features from the start
- Teams migrating from legacy logging frameworks to structured logging
golang-observability FAQ
Always prefer Histogram over Summary. Histograms support server-side aggregation with histogram_quantile() in PromQL and allow percentile queries without storing raw data. Summaries require client-side computation and cannot be aggregated across instances.
Use slog.InfoContext(ctx, ...) to emit logs with context. Configure your slog handler to extract trace_id and span_id from the context and inject them as structured log fields. This allows log aggregation systems to link logs to their corresponding traces.
Yes. Enable pprof via environment variables (e.g., ENABLE_PPROF=true) checked at startup. For continuous profiling, use Pyroscope which can be toggled via configuration without code changes. Always secure pprof endpoints with authentication.
Keep label cardinality low (typically <10 unique values per label). Never use unbounded values like user IDs, full URLs, or request paths as label values—this causes memory exhaustion and query slowdown. Use fixed dimensions (service, endpoint, method, status_code) instead.
Gradually replace logger calls with slog equivalents. Use slog.InfoContext(ctx, ...) for context-aware logging. For Go 1.26+, use slog.NewMultiHandler to fan-out to multiple handlers if needed. Refer to the Structured Logging guide for migration patterns and handler ecosystem options.
Full instructions (SKILL.md)
Source of truth, from samber/cc-skills-golang.
name: golang-observability
description: "Golang everyday observability — the always-on signals in production. Covers structured logging with slog, Prometheus metrics, OpenTelemetry distributed tracing, continuous profiling with pprof/Pyroscope, server-side RUM event tracking, alerting, and Grafana dashboards. Apply when instrumenting Go services for production monitoring, setting up metrics or alerting, adding OpenTelemetry tracing, correlating logs with traces, migrating legacy loggers (zap/logrus/zerolog) to slog, adding observability to new features, or implementing GDPR/CCPA-compliant tracking with Customer Data Platforms (CDP). Not for temporary deep-dive performance investigation (→ See samber/cc-skills-golang@golang-benchmark and samber/cc-skills-golang@golang-performance skills)."
user-invocable: true
license: MIT
compatibility: Designed for Claude Code or similar AI coding agents, and for projects using Golang.
metadata:
author: samber
version: "1.2.1"
openclaw:
emoji: "📡"
homepage: https://github.com/samber/cc-skills-golang
requires:
bins:
- go
install: []
allowed-tools: Read Edit Write Glob Grep Bash(go:*) Bash(golangci-lint:*) Bash(git:*) Agent WebFetch WebSearch AskUserQuestion
Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
- Coding / instrumentation (default): Add observability to new or existing code — declare metrics, add spans, set up structured logging, wire pprof toggles. Follow the sequential instrumentation guide.
- Review mode — reviewing a PR's instrumentation changes. Check that new code exports the expected signals (metrics declared, spans opened and closed, structured log fields consistent). Sequential.
- Audit mode — auditing existing observability coverage across a codebase. Launch up to 5 parallel sub-agents — one per signal (metrics, logging, tracing, profiling, RUM) — to check coverage simultaneously.
Community default. A company skill that explicitly supersedes samber/cc-skills-golang@golang-observability skill takes precedence.
Go Observability Best Practices
Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.
Best Practices Summary
- Use structured logging with log/slog — production services MUST emit structured logs (JSON), not freeform strings
- Choose the right log level — Debug for development, Info for normal operations, Warn for degraded states, Error for failures requiring attention
- Log with context — use slog.InfoContext(ctx, ...) to correlate logs with traces
- Prefer Histogram over Summary for latency metrics — Histograms support server-side aggregation and percentile queries. Every HTTP endpoint MUST have latency and error rate metrics.
- Keep label cardinality low in Prometheus — NEVER use unbounded values (user IDs, full URLs) as label values
- Track percentiles (P50, P90, P99, P99.9) using Histograms + histogram_quantile() in PromQL
- Set up OpenTelemetry tracing on new projects — configure the TracerProvider early, then add spans everywhere
- Add spans to every meaningful operation — service methods, DB queries, external API calls, message queue operations
- Propagate context everywhere — context is the vehicle that carries trace_id, span_id, and deadlines across service boundaries
- Enable profiling via environment variables — toggle pprof and continuous profiling on/off without redeploying
- Correlate signals — inject trace_id into logs, use exemplars to link metrics to traces
- A feature is not done until it is observable — declare metrics, add proper logging, create spans
- awesome-prometheus-alerts provides ~500 ready-to-use alerting rules organized by technology for infrastructure and dependency monitoring
Cross-References
- See samber/cc-skills-golang@golang-error-handling skill for the single handling rule.
- See samber/cc-skills-golang@golang-troubleshooting skill for using observability signals to diagnose production issues.
- See samber/cc-skills-golang@golang-security skill for protecting pprof endpoints and avoiding PII in logs.
- See samber/cc-skills-golang@golang-context skill for propagating trace context across service boundaries.
- See samber/cc-skills@promql-cli skill for querying and exploring PromQL expressions against Prometheus from the CLI.
Go 1.26+: slog multi-handler
For simple fan-out to multiple slog handlers, prefer stdlib slog.NewMultiHandler before adding third-party handler-composition dependencies.
logger := slog.New(slog.NewMultiHandler(
	slog.NewJSONHandler(os.Stdout, nil),
	auditHandler,
))
Use third-party slog handler libraries only when the stdlib handler composition is insufficient.
The Five Signals
| Signal | Question it answers | Tool | When to use |
|---|---|---|---|
| Logs | What happened? | log/slog | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |
Detailed Guides
Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
- Structured Logging — Why structured logging matters for log aggregation at scale. Covers log/slog setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with slog.InfoContext, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
- Metrics Collection — Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation, and Summary for client-side quantiles). Deep dive: why Histograms beat Summaries (server-side aggregation, support for histogram_quantile in PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
- Distributed Tracing — When and how to use OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), otelhttp middleware for HTTP instrumentation, error recording with span.RecordError(), trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
- Profiling — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles): how to enable it in production, secure it with auth, and toggle via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
- Real User Monitoring — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
- Alerting — Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), the ~500 ready-to-use rules from awesome-prometheus-alerts organized by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using irate instead of rate, omitting the for: duration that prevents flapping).
- Grafana Dashboards — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.
Correlating Signals
Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.
Logs + Traces: otelslog bridge
import (
	"log/slog"

	"go.opentelemetry.io/contrib/bridges/otelslog"
)
// Create a handler that forwards slog records to the OpenTelemetry Logs SDK.
// Records logged with a context carry the trace_id and span_id of its active span.
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))
// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Exported record includes trace_id and span_id alongside msg="order created"
Metrics + Traces: Exemplars
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace
obs := histogram.WithLabelValues("POST", "/orders")
if eo, ok := obs.(prometheus.ExemplarObserver); ok {
eo.ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})
} else {
obs.Observe(duration)
}
Migrating Legacy Loggers
If the project currently uses zap, logrus, or zerolog, migrate to log/slog. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.
Migration strategy:
- Add slog as the new logger with slog.SetDefault()
- Bridge handlers during migration route slog output through the existing logger: samber/slog-zap, samber/slog-logrus, samber/slog-zerolog
- Gradually replace all zap.L().Info(...) / logrus.Info(...) / log.Info().Msg(...) calls with slog.Info(...)
- Once fully migrated, remove the bridge handler and the old logger dependency
Definition of Done for Observability
A feature is not production-ready until it is observable. Before marking a feature as done, verify:
- Metrics declared — counters for operations/errors, histograms for latencies, gauges for saturation. Each metric var has PromQL queries and alert rules as comments above its declaration.
- Logging is proper — structured key-value pairs with slog, context variants used (slog.InfoContext), no PII in logs, errors MUST be either logged OR returned (NEVER both).
- Spans created — every service method, DB query, and external API call has a span with relevant attributes, errors recorded with span.RecordError().
- Dashboards and alerts exist — the PromQL from your metric comments is wired into Grafana dashboards and Prometheus alerting rules. Ready-to-use alert rules for common infrastructure dependencies are available at awesome-prometheus-alerts.
- RUM events tracked — key business events tracked server-side (PostHog/Segment), identity key is user_id (not email), consent checked before tracking.
Common Mistakes
// ✗ Bad — log AND return (error gets logged multiple times up the chain)
if err != nil {
	slog.Error("query failed", "error", err)
	return fmt.Errorf("query: %w", err)
}
// ✓ Good — return with context, log once at the top level
if err != nil {
	return fmt.Errorf("querying users: %w", err)
}
// ✗ Bad — high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()
// ✓ Good — bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()
// ✗ Bad — not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")
// ✓ Good — context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")
// ✗ Bad — using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
	Name:       "http_request_duration_seconds",
	Objectives: map[float64]float64{0.99: 0.001},
})
// ✓ Good — use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Buckets: prometheus.DefBuckets,
})
Related skills
More from samber/cc-skills-golang and the wider catalog.
golang-code-style
Go code style conventions for clarity, control flow, and readability—line breaking, variable declarations, and when comments help.
golang-error-handling
Idiomatic Go error handling: wrapping, inspection, structured logging, and production-grade error tracking.
golang-performance
Go performance optimization patterns: identify bottlenecks with profiling, then apply the right fix.
golang-design-patterns
Idiomatic Go design patterns: functional options, constructors, error handling, resource lifecycle, graceful shutdown, and resilience.
golang-testing
Production-ready Go tests with table-driven patterns, testify integration, parallel execution, fuzzing, and leak detection.
golang-security
Security best practices and vulnerability prevention for Go code—injection, crypto, secrets, and authentication.