grafana-dashboards

wshobson/agents

Create and manage production Grafana dashboards for real-time visualization of system and application metrics.

What is grafana-dashboards?

Build effective Grafana dashboards to monitor applications, infrastructure, and business metrics using Prometheus data. Use this skill when you need to visualize system performance, create operational observability interfaces, or implement SLO dashboards.

Design dashboards following RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) methods
Create panel types including stat panels, time series graphs, tables, and heatmaps with Prometheus queries
Configure dashboard variables for dynamic filtering by namespace, service, and other labels
Set up alerting rules within dashboards with thresholds and notification channels
Provision dashboards as code using Terraform or Ansible for infrastructure automation
Implement dashboard patterns for API monitoring, infrastructure, database, and application observability

How to install grafana-dashboards

npx skills add https://github.com/wshobson/agents --skill grafana-dashboards

Prerequisites

Grafana instance running and accessible
Prometheus data source configured in Grafana
Prometheus metrics being collected from your applications or infrastructure

How to use grafana-dashboards

1. Design your dashboard structure using the hierarchy of information principle (critical metrics, trends, detailed metrics)
2. Choose appropriate panel types (stat, graph, table, heatmap) based on your metric visualization needs
3. Write Prometheus queries for each panel using PromQL expressions
4. Add dashboard variables to enable dynamic filtering and multi-select options
5. Configure alert conditions on critical panels with thresholds and notification channels
6. Test the dashboard with different time ranges and variable combinations
7. Provision the dashboard JSON using Terraform, Ansible, or Grafana's file provisioning

Use cases

Good for

Monitor API request rates, error rates, and latency percentiles across services
Track infrastructure metrics like CPU, memory, disk I/O, and network traffic per node
Visualize database performance including queries per second, connection pools, and replication lag
Create SLO dashboards to track service level objectives and error budgets
Build application dashboards showing request rates, response times, cache hit rates, and active sessions

Who it's for

DevOps engineers building monitoring infrastructure
SRE teams implementing observability and alerting
Backend engineers creating operational dashboards
Platform teams provisioning dashboards for multiple services
Operations teams monitoring production systems

grafana-dashboards FAQ

What metrics should I prioritize on a dashboard?

Start with the RED method for services (Rate, Errors, Duration) and USE method for resources (Utilization, Saturation, Errors). Place critical metrics at the top as big numbers, trends in the middle as time series, and detailed metrics at the bottom as tables or heatmaps.
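
The top-to-bottom ordering described here maps onto the `gridPos` coordinates used in the dashboard JSON. The following is a minimal sketch, assuming Grafana's 24-column grid and an even split of each row; `layout_rows` and the panel titles are hypothetical names used only for illustration.

```python
from typing import Dict, List

GRID_WIDTH = 24  # Grafana's dashboard grid is 24 columns wide


def layout_rows(rows: List[List[str]], heights: List[int]) -> List[Dict]:
    """Place each row's panels side by side, splitting the width evenly."""
    panels = []
    y = 0
    for titles, height in zip(rows, heights):
        width = GRID_WIDTH // len(titles)
        for i, title in enumerate(titles):
            panels.append({
                "title": title,
                "gridPos": {"x": i * width, "y": y, "w": width, "h": height},
            })
        y += height  # the next row starts below the current one
    return panels


panels = layout_rows(
    rows=[
        ["Request Rate", "Error Rate %", "P95 Latency"],  # critical numbers
        ["Requests over time", "Latency over time"],      # key trends
        ["Per-service detail"],                           # tables / heatmaps
    ],
    heights=[4, 8, 10],
)
```

Each returned entry still needs a `type` and `targets` before it is a usable panel; the sketch only covers placement.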

How do I make dashboards reusable across multiple services?

Use dashboard variables for dynamic values like namespace, service name, and instance. Reference these variables in your Prometheus queries using the $variable syntax so that users can filter the dashboard for different services. For multi-select variables, match with the regex operator (=~) rather than equality (=).
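
To make the substitution concrete, here is a rough sketch of what happens to a query when variables are filled in. It is a simplification, not Grafana's actual implementation: `interpolate` is a hypothetical helper, and it skips the escaping and formatting options that Grafana applies to variable values.

```python
import re
from typing import List


def interpolate(expr: str, name: str, values: List[str]) -> str:
    """Replace $name with a single value, or with a|b|c for multiple values."""
    replacement = "|".join(values)
    pattern = r"\$" + re.escape(name) + r"\b"
    return re.sub(pattern, lambda _match: replacement, expr)


query = 'sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))'
query = interpolate(query, "namespace", ["prod"])
query = interpolate(query, "service", ["checkout", "cart"])
print(query)
```

The joined `checkout|cart` form only behaves as intended inside a `=~` matcher, which is why multi-value variables are paired with the regex operator.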

Can I set up alerts directly in Grafana dashboards?

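Yes, you can configure alert conditions on individual panels with evaluators, thresholds, and notification channels. Set the evaluation frequency and how long the condition must hold before the alert fires. The panel-level alert blocks shown in this skill use Grafana's legacy dashboard alerting format; newer Grafana releases manage alert rules through unified alerting instead, so check which model your version supports.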

How should I organize multiple dashboards?

Group related dashboards in folders by domain (e.g., API Monitoring, Infrastructure, Databases). Use consistent naming conventions and link related dashboards together for easy navigation.

What's the best way to provision dashboards across environments?

Use Terraform or Ansible to provision dashboards as code from JSON files. This enables version control, consistent deployment across dev/staging/production, and easy rollback of changes.

Full instructions (SKILL.md)

Source of truth, from wshobson/agents.

name: grafana-dashboards
description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

When to Use

Visualize Prometheus metrics
Create custom dashboards
Implement SLO dashboards
Monitor infrastructure
Track business KPIs

Dashboard Design Principles

1. Hierarchy of Information

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘

2. RED Method (Services)

Rate - Requests per second
Errors - Error rate
Duration - Latency/response time
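
As a worked example of the Rate and Errors figures, the sketch below derives both from two readings of a request counter. The counter values are made up for illustration, and the arithmetic is deliberately simpler than PromQL's `rate()`, which also handles counter resets and extrapolation.

```python
def per_second_rate(start: float, end: float, window_seconds: float) -> float:
    """Average per-second increase of a counter over the window."""
    return (end - start) / window_seconds


def error_percentage(error_rate: float, total_rate: float) -> float:
    """Share of requests that failed, as a percentage."""
    if total_rate == 0:
        return 0.0  # no traffic, so report no errors rather than divide by zero
    return error_rate / total_rate * 100


# Hypothetical counter readings taken 300 seconds (5m) apart.
total = per_second_rate(start=120_000, end=150_000, window_seconds=300)
errors = per_second_rate(start=400, end=1_000, window_seconds=300)
pct = error_percentage(errors, total)
print(f"{total:.1f} req/s, {errors:.1f} errors/s, {pct:.1f}% errors")
```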

3. USE Method (Resources)

Utilization - % time resource is busy
Saturation - Queue length/wait time
Errors - Error count

Dashboard Structure

API Monitoring Dashboard

{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}

Reference: See assets/api-dashboard.json
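
The top-level "dashboard" key in the JSON above matches the request body that Grafana's HTTP API accepts at `POST /api/dashboards/db`. The sketch below builds such a request with the standard library. The base URL and token are placeholders, `build_upload_request` is a hypothetical helper, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_upload_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Wrap a {"dashboard": ...} payload in a POST request for Grafana."""
    body = dict(payload, overwrite=True)  # replace an existing dashboard of the same uid/title
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {"dashboard": {"title": "API Monitoring", "panels": []}}
request = build_upload_request("http://localhost:3000", "example-token", payload)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```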

Panel Types

1. Stat Panel (Single Value)

{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}
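
The threshold steps read as "use this color from this value upward". A small sketch of that lookup, assuming the steps are sorted in ascending order as in the panel above; `threshold_color` is a hypothetical helper, not part of Grafana.

```python
from typing import Dict, List


def threshold_color(value: float, steps: List[Dict]) -> str:
    """Return the color of the highest step whose value is <= the reading."""
    color = steps[0]["color"]
    for step in steps:
        if value >= step["value"]:
            color = step["color"]
    return color


steps = [
    {"value": 0, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]

for reading in (42, 85, 90):
    print(reading, threshold_color(reading, steps))
```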

2. Time Series Graph

{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}

3. Table Panel

{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}

4. Heatmap

{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
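
The same `le` buckets feed both this heatmap and the P95 latency query. The sketch below shows how a quantile can be estimated from cumulative bucket counts by linear interpolation, which is the general idea behind `histogram_quantile`. It is a simplified illustration with made-up bucket counts, not Prometheus's implementation.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            span = count - lower_count
            fraction = (rank - lower_count) / span if span else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for le = 0.1s, 0.5s, 1s, +Inf.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
```

With these counts the 95th percentile falls inside the 0.5s to 1s bucket, so the estimate is only as precise as the bucket boundaries allow. This is one reason to choose bucket bounds close to the latencies you care about.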

Variables

Query Variables

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}

Use Variables in Queries

sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))

Alerts in Dashboards

{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": { "type": "and" },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}
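
The interplay of "frequency" and "for" can be hard to read from the JSON alone. The sketch below walks through successive evaluations: with a 1m frequency and a 5m "for", a breach is first reported as pending and only becomes alerting once it has persisted for the full duration. This is an approximation of the behaviour for illustration; `alert_states` is a hypothetical function and the exact boundary handling in Grafana may differ.

```python
from typing import List


def alert_states(averages: List[float], threshold: float, for_evaluations: int) -> List[str]:
    """Report the alert state after each evaluation of the reduced value."""
    states = []
    breaching = 0  # consecutive evaluations above the threshold
    for value in averages:
        if value > threshold:
            breaching += 1
            states.append("alerting" if breaching > for_evaluations else "pending")
        else:
            breaching = 0
            states.append("ok")
    return states


# One averaged error-rate reading per minute (hypothetical values, in percent).
readings = [2, 6, 7, 8, 6, 9, 7, 3]
states = alert_states(readings, threshold=5, for_evaluations=5)
print(states)
```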

Dashboard Provisioning

dashboards.yml:

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards

Common Dashboard Patterns

Infrastructure Dashboard

Key Panels:

CPU utilization per node
Memory usage per node
Disk I/O
Network traffic
Pod count by namespace
Node status

Reference: See assets/infrastructure-dashboard.json

Database Dashboard

Key Panels:

Queries per second
Connection pool usage
Query latency (P50, P95, P99)
Active connections
Database size
Replication lag
Slow queries

Reference: See assets/database-dashboard.json

Application Dashboard

Key Panels:

Request rate
Error rate
Response time (percentiles)
Active users/sessions
Cache hit rate
Queue length

Best Practices

Start with templates (Grafana community dashboards)
Use consistent naming for panels and variables
Group related metrics in rows
Set appropriate time ranges (default: Last 6 hours)
Use variables for flexibility
Add panel descriptions for context
Configure units correctly
Set meaningful thresholds for colors
Use consistent colors across dashboards
Test with different time ranges

Dashboard as Code

Terraform Provisioning

resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}

Ansible Provisioning

- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana

Related Skills

prometheus-configuration - For metric collection
slo-implementation - For SLO dashboards
