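name: grafana-dashboards
description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.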
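Grafana Dashboards
Create and manage production-ready Grafana dashboards for comprehensive system observability.
Purpose
Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.
When to Use
- Visualizing Prometheus metrics
- Creating custom dashboards
- Implementing SLO dashboards
- Monitoring infrastructure
- Tracking business KPIs
Dashboard Design Principles
1. Information Hierarchy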
┌─────────────────────────────────────┐
│ Critical metrics (big numbers)      │
├─────────────────────────────────────┤
│ Key trends (time series)            │
├─────────────────────────────────────┤
│ Detailed metrics (tables/heatmaps)  │
└─────────────────────────────────────┘
2. RED Method (Services)
- Rate - requests per second
- Errors - fraction of requests that fail
- Duration - latency/response time
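The three RED signals map directly onto PromQL expressions. The sketch below builds one expression per signal, assuming the metric names used in this document's API example (`http_requests_total` with a `status` label, and `http_request_duration_seconds_bucket`). The helper name `red_queries` and its parameters are hypothetical, and the metric names will differ in other instrumentation setups.

```python
def red_queries(window: str = "5m", group_by: str = "service") -> dict[str, str]:
    """Return one PromQL expression per RED signal (hypothetical helper)."""
    # Assumed metric names; adjust to match your own instrumentation.
    requests = f"rate(http_requests_total[{window}])"
    errors = f'rate(http_requests_total{{status=~"5.."}}[{window}])'
    buckets = f"rate(http_request_duration_seconds_bucket[{window}])"
    return {
        # Rate: requests per second, per group
        "rate": f"sum({requests}) by ({group_by})",
        # Errors: share of requests answered with a 5xx status
        "errors": f"sum({errors}) by ({group_by}) / sum({requests}) by ({group_by})",
        # Duration: 95th percentile latency derived from histogram buckets
        "duration": f"histogram_quantile(0.95, sum({buckets}) by (le, {group_by}))",
    }


if __name__ == "__main__":
    for signal, expr in red_queries().items():
        print(f"{signal}: {expr}")
```

Generating the expressions from one place keeps the rate window and grouping label consistent across the three panels.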
3. USE Method (Resources)
- Utilization - percentage of time the resource is busy
- Saturation - queue length/wait time
- Errors - error count
Dashboard Structure
API Monitoring Dashboard
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}
Reference: see assets/api-dashboard.json
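Each panel's gridPos places it on Grafana's 24-column grid, so the three panels above form two half-width panels on the first row and one full-width panel below them. The following sketch checks a panel list for positions that exceed the grid or overlap each other. The function names are hypothetical, and the check is an illustration rather than anything Grafana runs itself.

```python
from itertools import combinations

GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def overlaps(a: dict, b: dict) -> bool:
    """True if two gridPos rectangles share any grid cell."""
    return (
        a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"]
    )


def layout_problems(panels: list[dict]) -> list[str]:
    """Collect human-readable layout problems for a list of panels."""
    problems = []
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] < 0 or pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{panel['title']}: exceeds the {GRID_COLUMNS}-column grid")
    for a, b in combinations(panels, 2):
        if overlaps(a["gridPos"], b["gridPos"]):
            problems.append(f"{a['title']} overlaps {b['title']}")
    return problems


# The gridPos values from the API monitoring example
panels = [
    {"title": "Request Rate", "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}},
    {"title": "Error Rate %", "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}},
    {"title": "P95 Latency", "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}},
]
print(layout_problems(panels))  # prints []
```

A check like this is most useful in CI when dashboards are edited by hand as JSON.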
Panel Types
1. Stat Panel (Single Value)
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}
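With "mode": "absolute", a value takes the color of the highest threshold step it has reached, and the first step acts as the base color. The sketch below models that lookup for the steps in the stat panel example. It is a simplified illustration under those assumptions, not Grafana's own code, and `threshold_color` is a hypothetical name.

```python
def threshold_color(value: float, steps: list[dict]) -> str:
    """Return the color of the last step whose lower bound the value has reached."""

    def lower_bound(step: dict) -> float:
        # A missing or null value marks the base step, which matches everything.
        raw = step.get("value")
        return float("-inf") if raw is None else float(raw)

    ordered = sorted(steps, key=lower_bound)
    color = ordered[0]["color"]  # base color applies below every other step
    for step in ordered:
        if value >= lower_bound(step):
            color = step["color"]
    return color


steps = [
    {"value": 0, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]
for sample in (50, 80, 95):
    print(sample, threshold_color(sample, steps))
```

Steps like 80 and 90 suit percentage-style values; for an unbounded counter such as a total request count, the steps would need to be chosen on that metric's own scale.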
2. Time Series Graph
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}
3. Table Panel
{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
4. Heatmap
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
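Prometheus histogram buckets are cumulative: the series for `le="0.5"` also counts every observation already counted under `le="0.1"`. A heatmap needs the count that falls inside each bucket, and to my understanding the `"format": "heatmap"` option asks the Prometheus data source to do that conversion. The sketch below shows the arithmetic on made-up sample values; `per_bucket_counts` is a hypothetical helper.

```python
def per_bucket_counts(cumulative: dict[str, float]) -> dict[str, float]:
    """Turn cumulative `le` bucket values into the count inside each bucket."""

    def bound(le: str) -> float:
        return float("inf") if le == "+Inf" else float(le)

    result: dict[str, float] = {}
    previous = 0.0
    # Walk buckets from the smallest upper bound to +Inf
    for le, count in sorted(cumulative.items(), key=lambda kv: bound(kv[0])):
        result[le] = count - previous
        previous = count
    return result


# Hypothetical rates per bucket, keyed by the `le` label
sample = {"0.1": 5, "0.5": 8, "1": 9, "+Inf": 10}
print(per_bucket_counts(sample))
```

Here 5 requests finished within 0.1s, 3 more between 0.1s and 0.5s, 1 between 0.5s and 1s, and 1 took longer than 1s.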
Variables
Query Variables
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}
Using Variables in Queries
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
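The query uses `=` for `$namespace` but `=~` for `$service` because `service` is a multi-value variable: when several values are selected, the variable expands to a regex alternation, which only a regex matcher can evaluate. The sketch below is a simplified model of that expansion with hypothetical selections. Grafana's real interpolation and escaping rules are more involved, so treat this only as an illustration of why the matcher differs.

```python
import re


def interpolate(query: str, selections: dict[str, list[str]]) -> str:
    """Substitute $name placeholders; multi-value selections become (a|b)."""
    # Replace longer names first so $service_name is not clobbered by $service
    for name in sorted(selections, key=len, reverse=True):
        values = selections[name]
        if len(values) == 1:
            rendered = values[0]
        else:
            rendered = "(" + "|".join(re.escape(v) for v in values) + ")"
        query = query.replace(f"${name}", rendered)
    return query


template = 'sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))'
# Hypothetical selections made in the dashboard's variable dropdowns
print(interpolate(template, {"namespace": ["prod"], "service": ["checkout", "cart"]}))
```

With two services selected, the matcher becomes `service=~"(checkout|cart)"`; an equality matcher would look for a service literally named `(checkout|cart)` and return nothing.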
Alerts in Dashboards
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": { "type": "and" },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate above 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}
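The alert definition reads as a small pipeline: take query A over the last 5 minutes, reduce it with `avg`, compare the result with the `gt` evaluator against 5, re-evaluate every `frequency` (1m), and only fire once the condition has held for the `for` duration (5m). The sketch below models that logic on plain numbers. It is a simplified model under those assumptions, the function names are hypothetical, and Grafana's exact state handling may differ.

```python
from statistics import mean


def condition_met(samples: list[float], threshold: float = 5.0) -> bool:
    """Reducer 'avg' followed by evaluator 'gt' (strictly greater than)."""
    return mean(samples) > threshold


def alert_state(evaluations: list[bool], for_minutes: int = 5, frequency_minutes: int = 1) -> str:
    """Walk successive evaluations, one per frequency tick, and return the final state."""
    state = "ok"
    held_ticks = 0  # consecutive evaluations where the condition was true
    for met in evaluations:
        if not met:
            held_ticks = 0
            state = "ok"
            continue
        held_ticks += 1
        # The condition has held since the first true tick
        held_for = (held_ticks - 1) * frequency_minutes
        state = "alerting" if held_for >= for_minutes else "pending"
    return state


print(condition_met([4.0, 6.0, 8.0]))  # prints True (average is 6)
print(alert_state([True] * 3))         # prints pending
print(alert_state([True] * 6))         # prints alerting
```

The `for` window is what keeps a single one-minute spike in error rate from paging anyone.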
Dashboard Provisioning
dashboards.yml:
apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
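One detail worth checking before dropping files into the provisioned path: the API monitoring example earlier wraps the model in a top-level "dashboard" key, which is the shape used by the HTTP API. As far as I know, file provisioning expects the bare dashboard model with "title" at the top level, so treat that as an assumption to verify against your Grafana version. The sketch below unwraps the envelope if present; `to_provisionable` is a hypothetical helper.

```python
import json


def to_provisionable(document: dict) -> dict:
    """Return the bare dashboard model, unwrapping a {"dashboard": ...} envelope."""
    model = document.get("dashboard", document)
    if not model.get("title"):
        raise ValueError("dashboard model has no title")
    return model


wrapped = json.loads('{"dashboard": {"title": "API Monitoring", "panels": []}}')
print(json.dumps(to_provisionable(wrapped)))
```

Writing the returned model to a `.json` file under the configured `path` is then enough for the provider to pick it up on its next scan.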
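Common Dashboard Patterns
Infrastructure Dashboard
Key panels:
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status
Reference: see assets/infrastructure-dashboard.json
Database Dashboard
Key panels:
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries
Reference: see assets/database-dashboard.json
Application Dashboard
Key panels:
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length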
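Best Practices
- Start from templates (Grafana community dashboards)
- Use consistent naming for panels and variables
- Group related metrics into rows
- Set an appropriate time range (default: last 6 hours)
- Use variables for flexibility
- Add panel descriptions to provide context
- Configure units correctly
- Set meaningful thresholds for colors
- Use consistent colors across dashboards
- Test with different time ranges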
Dashboards as Code
Terraform Provisioning
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}
Ansible Provisioning
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
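Reference Files
- assets/api-dashboard.json - API monitoring dashboard
- assets/infrastructure-dashboard.json - Infrastructure dashboard
- assets/database-dashboard.json - Database monitoring dashboard
- references/dashboard-design.md - Dashboard design guide
Related Skills
- prometheus-configuration - For metric collection
- slo-implementation - For SLO dashboards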