Grafana Dashboard Creation and Management Skill: grafana-dashboards

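The Grafana dashboard creation and management skill covers designing and deploying real-time monitoring dashboards in Grafana to visualize system performance, application metrics, and business KPIs, improving operational efficiency and observability. Keywords: Grafana, dashboards, monitoring, visualization, Prometheus, SLO, DevOps, real-time data, metrics management, operational observability.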

DevOps · Updated 3/22/2026

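name: grafana-dashboards
description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.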

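Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

When to Use

  • Visualizing Prometheus metrics
  • Creating custom dashboards
  • Implementing SLO dashboards
  • Monitoring infrastructure
  • Tracking business KPIs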

Dashboard Design Principles

1. Information Hierarchy

┌─────────────────────────────────────┐
│  Critical metrics (big numbers)     │
├─────────────────────────────────────┤
│  Key trends (time series)           │
├─────────────────────────────────────┤
│  Detailed metrics (tables/heatmaps) │
└─────────────────────────────────────┘
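The three tiers above map onto Grafana's 24-column grid, which is what the `gridPos` values later in this document address. A minimal Python sketch, assuming each tier splits the width evenly among its panels; `layout_tiers` and the tier sizes are hypothetical and only illustrate the idea:

```python
GRID_WIDTH = 24  # Grafana's dashboard grid is 24 columns wide


def layout_tiers(tiers):
    """Return gridPos dicts for tiers listed top to bottom.

    Each tier is (panel_count, height); panels in a tier share the width equally.
    """
    positions = []
    y = 0
    for panel_count, height in tiers:
        width = GRID_WIDTH // panel_count
        for i in range(panel_count):
            positions.append({"x": i * width, "y": y, "w": width, "h": height})
        y += height  # the next tier starts below this one
    return positions


# Hypothetical layout: 4 stat panels, 2 time series panels, 1 full-width table
grid = layout_tiers([(4, 4), (2, 8), (1, 10)])
```

Keeping the layout in a helper like this makes it easier to keep big-number panels on top and detail panels at the bottom as panels are added.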

2. RED Method (Services)

  • Rate - requests per second
  • Errors - error rate
  • Duration - latency/response time
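The three RED signals can be expressed as PromQL built from one label selector. The sketch below is illustrative only: it assumes the `http_requests_total` and `http_request_duration_seconds_bucket` metrics and the `service` and `status` labels used elsewhere in this document, and `red_queries` and the `checkout` service name are hypothetical.

```python
def red_queries(service, window="5m"):
    """Build Rate, Errors, and Duration PromQL expressions for one service."""
    selector = f'service="{service}"'
    return {
        # Rate: requests per second
        "rate": f"sum(rate(http_requests_total{{{selector}}}[{window}]))",
        # Errors: fraction of requests answered with a 5xx status
        "errors": (
            f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{selector}}}[{window}]))"
        ),
        # Duration: 95th percentile latency from histogram buckets
        "duration": (
            "histogram_quantile(0.95, "
            f"sum(rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])) by (le))"
        ),
    }


queries = red_queries("checkout")
```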

3. USE Method (Resources)

  • Utilization - percentage of time the resource is busy
  • Saturation - queue length/wait time
  • Errors - error count

Dashboard Structure

API Monitoring Dashboard

{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "params": [], "type": "avg" },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}

Reference: see assets/api-dashboard.json
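The P95 latency panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below approximates that idea. It is a simplified illustration, not Prometheus's actual implementation: the bucket values are invented, and it assumes 0 < q <= 1 with buckets sorted by `le` and ending in +Inf.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (le, cumulative_count) pairs sorted by le."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return prev_le
            # Linear interpolation between the bucket's lower and upper bound
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Invented example: 100 requests, 90 of them at or below 0.5 s
example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies that matter for the SLO.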

Panel Types

1. Stat Panel (Single Value)

{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}
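In `absolute` mode, a value takes the color of the highest threshold step it has reached. A small Python sketch of that lookup, assuming the steps are sorted in ascending order as in the panel above; `threshold_color` is a hypothetical helper, not part of Grafana:

```python
def threshold_color(value, steps):
    """Return the color of the last step whose value is <= the given value."""
    color = steps[0]["color"]  # the first step acts as the base color
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


steps = [
    {"value": 0, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]
colors = [threshold_color(v, steps) for v in (50, 80, 95)]
```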

2. Time Series Graph

{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}

3. Table Panel

{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}

4. Heatmap

{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
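Prometheus histogram buckets are cumulative: each `le` bucket also counts everything in the smaller buckets. A heatmap needs the count that falls inside each bucket, which is the difference between neighbouring cumulative counts. The Python sketch below shows that conversion on invented numbers; `per_bucket_counts` is a hypothetical helper used only for illustration.

```python
def per_bucket_counts(cumulative):
    """Convert (le, cumulative_count) pairs sorted by le into per-bucket counts."""
    result = []
    previous = 0
    for le, count in cumulative:
        result.append((le, count - previous))
        previous = count
    return result


# Invented example: 50 requests <= 0.1 s, 90 <= 0.5 s, 100 <= 1.0 s
cells = per_bucket_counts([(0.1, 50), (0.5, 90), (1.0, 100)])
```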

Variables

Query Variables

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}

Using Variables in Queries

sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
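The `service` variable is multi-value, which is why the query matches it with `=~` instead of `=`: several selected values are joined into a regex alternation. The Python sketch below imitates that substitution under simplifying assumptions. `interpolate` is hypothetical, it only handles the `$name` syntax, and unlike a real data source it does not escape regex metacharacters in the values.

```python
import re


def interpolate(expr, variables):
    """Replace $name placeholders; lists become a regex alternation."""

    def render(match):
        value = variables[match.group(1)]
        if isinstance(value, list):
            return "(" + "|".join(value) + ")"
        return value

    return re.sub(r"\$(\w+)", render, expr)


query = interpolate(
    'sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))',
    {"namespace": "prod", "service": ["checkout", "cart"]},  # invented values
)
```

A single-value variable such as `namespace` can safely use the exact-match `=` operator, while a multi-value variable used with `=` would stop matching as soon as two values are selected.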

Alerts in Dashboards

{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": { "type": "and" },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate exceeds 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}
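The condition above reads as: reduce the samples of query `A` over the last 5 minutes with `avg`, then fire if the result is greater than 5. The Python sketch below evaluates one such condition on invented samples. It is only a mental model: `condition_fires` is hypothetical, supports just a few reducer and evaluator types, and does not model the `for` or `frequency` settings.

```python
from statistics import mean

REDUCERS = {"avg": mean, "max": max, "min": min, "last": lambda xs: xs[-1]}
EVALUATORS = {"gt": lambda v, t: v > t, "lt": lambda v, t: v < t}


def condition_fires(condition, samples):
    """Reduce the samples to one number and compare it with the threshold."""
    reduce_fn = REDUCERS[condition["reducer"]["type"]]
    compare = EVALUATORS[condition["evaluator"]["type"]]
    threshold = condition["evaluator"]["params"][0]
    return compare(reduce_fn(samples), threshold)


condition = {
    "evaluator": {"params": [5], "type": "gt"},
    "reducer": {"type": "avg"},
}
firing = condition_fires(condition, [4.0, 6.5, 7.0])  # invented error-rate samples
```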

Dashboard Provisioning

dashboards.yml:

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
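Files picked up by a file provider typically contain the bare dashboard model, whereas the JSON examples earlier in this document wrap the model under a `dashboard` key. The Python sketch below unwraps such a payload before it is written to the provisioning path. `to_provisioning_model` is hypothetical, and dropping the instance-specific `id` is an assumption that suits copying dashboards between Grafana instances.

```python
import json


def to_provisioning_model(payload):
    """Return the bare dashboard model from a wrapped or unwrapped payload."""
    model = dict(payload.get("dashboard", payload))
    model.pop("id", None)  # assumption: numeric ids are not portable across instances
    return model


wrapped = {"dashboard": {"id": 7, "title": "API Monitoring", "panels": []}}
model = to_provisioning_model(wrapped)
serialized = json.dumps(model, indent=2, sort_keys=True)
```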

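Common Dashboard Patterns

Infrastructure Dashboard

Key panels:

  • CPU utilization per node
  • Memory usage per node
  • Disk I/O
  • Network traffic
  • Pod count by namespace
  • Node status

Reference: see assets/infrastructure-dashboard.json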

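Database Dashboard

Key panels:

  • Queries per second
  • Connection pool usage
  • Query latency (P50, P95, P99)
  • Active connections
  • Database size
  • Replication lag
  • Slow queries

Reference: see assets/database-dashboard.json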

Application Dashboard

Key panels:

  • Request rate
  • Error rate
  • Response time (percentiles)
  • Active users/sessions
  • Cache hit rate
  • Queue length

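Best Practices

  1. Start from templates (Grafana community dashboards)
  2. Use consistent naming for panels and variables
  3. Group related metrics into rows
  4. Set an appropriate time range (default: last 6 hours)
  5. Use variables for flexibility
  6. Add panel descriptions to provide context
  7. Configure units correctly
  8. Set meaningful color thresholds
  9. Use consistent colors across dashboards
  10. Test with different time ranges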
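Some of these practices can be checked mechanically before a dashboard is deployed. The Python sketch below flags panels that lack a description or a unit. `lint_dashboard` is a hypothetical helper, and it assumes the unit lives under `fieldConfig.defaults.unit`, as it does for panels that use `fieldConfig` like the stat panel earlier in this document.

```python
def lint_dashboard(dashboard):
    """Return a list of human-readable problems found in a dashboard model."""
    problems = []
    for panel in dashboard.get("panels", []):
        title = panel.get("title", "<untitled>")
        if not panel.get("description"):
            problems.append(f"{title}: missing description")
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            problems.append(f"{title}: no unit configured")
    return problems


# Invented dashboard: one complete panel, one missing both fields
sample = {
    "panels": [
        {
            "title": "Request Rate",
            "description": "Requests per second by service",
            "fieldConfig": {"defaults": {"unit": "reqps"}},
        },
        {"title": "P95 Latency"},
    ]
}
problems = lint_dashboard(sample)
```

Running a check like this in CI keeps dashboards stored as code consistent without relying on manual review.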

Dashboards as Code

Terraform Configuration

resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}

Ansible Configuration

- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana

Reference Files

  • assets/api-dashboard.json - API monitoring dashboard
  • assets/infrastructure-dashboard.json - Infrastructure dashboard
  • assets/database-dashboard.json - Database monitoring dashboard
  • references/dashboard-design.md - Dashboard design guide

Related Skills

  • prometheus-configuration - for metrics collection
  • slo-implementation - for SLO dashboards