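name: prometheus-grafana
description: Expert skill for Prometheus metrics and Grafana dashboards. Write and validate PromQL queries, generate Grafana dashboard JSON, create alerting and recording rules, analyze metric cardinality, and debug scrape configurations.
allowed-tools: Bash(*) Read Write Edit Glob WebFetch
metadata:
  author: babysitter-sdk
  version: "1.0.0"
  category: observability
  backlog-id: SK-003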
prometheus-grafana
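You are prometheus-grafana, a skill specialized in Prometheus metrics and Grafana dashboards. This skill provides expert capabilities for building and maintaining observability infrastructure.
Overview
This skill supports AI-driven observability operations, including:
- Writing and validating PromQL queries
- Generating Grafana dashboard JSON configurations
- Creating alerting rules and recording rules
- Analyzing metric cardinality and performance
- Debugging scrape configurations
- Interpreting metric patterns and anomalies
Prerequisites
- Access to a Prometheus server
- A Grafana instance with API access
- Optional: Alertmanager for alerting
- Optional: Thanos/Cortex for long-term storage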
Capabilities
1. PromQL Query Writing
Write and optimize PromQL queries:
# Request rate
rate(http_requests_total{job="api"}[5m])
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# Availability (SLI)
sum(rate(http_requests_total{status!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d])) * 100
# Resource saturation
avg(rate(container_cpu_usage_seconds_total[5m]))
  / avg(kube_pod_container_resource_limits{resource="cpu"}) * 100
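To make the P99 latency query less opaque, the following is a minimal Python sketch of the linear interpolation that `histogram_quantile` applies to cumulative `le` buckets. It is a simplified approximation (classic histograms only, no NaN or negative-bound handling), and the function name and the sample bucket counts are illustrative assumptions rather than anything Prometheus ships.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified sketch: locate the bucket that contains the target rank,
    then interpolate linearly between that bucket's lower and upper bound.
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include an le=+Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                # Empty bucket matched (only possible for rank 0): avoid 0/0.
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts for one service.
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
```

The sketch shows why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the rank lands in, and any rank that lands in the `+Inf` bucket is capped at the highest finite bound.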
2. Recording Rules
Create recording rules to precompute expensive queries:
groups:
  - name: api_metrics
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      - record: job:http_error_ratio:rate5m
        expr: |
          job:http_errors:rate5m / job:http_requests:rate5m
  - name: slo_metrics
    interval: 1m
    rules:
      - record: slo:availability:ratio_30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
            / sum(rate(http_requests_total[30d]))
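The rule names above follow the `level:metric:operations` naming convention for recording rules. As a rough sketch, a small Python check like the one below could flag names that do not split into those three parts. The regular expression and the `parse_rule_name` helper are assumptions made for illustration; Prometheus itself only requires a valid metric name.

```python
import re

# level:metric:operations, each part built from letters, digits and underscores.
# The pattern is an illustrative convention check, not a Prometheus validation rule.
RULE_NAME = re.compile(
    r"(?P<level>[a-zA-Z_][a-zA-Z0-9_]*)"
    r":(?P<metric>[a-zA-Z_][a-zA-Z0-9_]*)"
    r":(?P<operations>[a-zA-Z0-9_]+)"
)


def parse_rule_name(name):
    """Return the three name parts as a dict, or None if the name does not fit."""
    match = RULE_NAME.fullmatch(name)
    return match.groupdict() if match else None


print(parse_rule_name("job:http_error_ratio:rate5m"))
print(parse_rule_name("http_requests_total"))
```

A check like this fits naturally in a pre-commit hook or CI step that runs before the rule file is loaded.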
3. Alerting Rules
Create comprehensive alerting rules:
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          job:http_error_ratio:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate for {{ $labels.job }} is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is unreachable"
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency"
          description: "P99 latency for {{ $labels.service }} is {{ $value }}s"
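The `for:` clause in these rules delays firing until the expression has been true on every evaluation for the stated duration. The Python sketch below models that state machine (inactive, pending, firing) under simplifying assumptions: a single series, a plain "greater than" condition, and no `keep_firing_for` or restart handling. The function name and the sample values are hypothetical.

```python
def alert_states(evaluations, threshold, for_seconds):
    """Return the alert state at each rule evaluation.

    evaluations: list of (timestamp_seconds, value) pairs in time order.
    An alert is 'pending' from the first evaluation where the condition
    holds and becomes 'firing' once it has held for at least for_seconds.
    Any evaluation where the condition fails resets it to 'inactive'.
    """
    states = []
    active_since = None
    for timestamp, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


# Error ratio sampled every 60 s, checked against "> 0.05" with "for: 5m".
samples = [(t * 60, 0.10) for t in range(7)]
print(alert_states(samples, threshold=0.05, for_seconds=300))
```

A single evaluation below the threshold resets the timer, which is why short dips in a noisy signal can keep an alert in the pending state indefinitely.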
4. Grafana Dashboard Generation
Generate Grafana dashboard JSON:
{
  "dashboard": {
    "title": "Service Overview",
    "uid": "service-overview",
    "tags": ["production", "api"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api\"}[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps"
          }
        }
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            }
          }
        }
      }
    ]
  }
}
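The `gridPos` values in the dashboard JSON place each panel on Grafana's 24-column grid. When generating dashboards programmatically, those coordinates can be computed instead of written by hand. The Python sketch below is one possible approach: it lays out equally sized panels left to right and wraps to a new row when a panel would overflow. The `layout_panels` helper and its defaults are assumptions for illustration.

```python
import json

GRID_WIDTH = 24  # Grafana dashboards use a 24-column grid


def layout_panels(titles, width=12, height=8):
    """Build panel stubs with gridPos filled in, wrapping rows at 24 columns."""
    if not 0 < width <= GRID_WIDTH:
        raise ValueError("width must be between 1 and 24")
    panels = []
    x = y = 0
    for title in titles:
        if x + width > GRID_WIDTH:
            # Panel would overflow the row: start a new row below.
            x = 0
            y += height
        panels.append({
            "title": title,
            "type": "timeseries",
            "gridPos": {"h": height, "w": width, "x": x, "y": y},
        })
        x += width
    return panels


panels = layout_panels(["Request Rate", "Error Rate", "P99 Latency"])
print(json.dumps(panels, indent=2))
```

The stubs still need `targets` and `fieldConfig` before they form useful panels; the sketch only covers placement.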
5. Scrape Configuration
Debug and generate scrape configurations:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use prometheus.io/path as the metrics path when it is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Replace the port in __address__ with prometheus.io/port when it is set
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
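The last relabel rule is the one that most often needs debugging. Prometheus joins the `source_labels` values with `;` by default and anchors the regex at both ends, so the pattern sees a string such as `10.0.0.5:8080;9102`. The Python sketch below reproduces that behaviour for IPv4 or hostname addresses; the `rewrite_address` helper and the sample addresses are illustrative assumptions, and Python's `re` stands in for the RE2 engine Prometheus uses.

```python
import re

# Same pattern as the relabel rule: host, optional existing port, ';', annotated port.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Mimic the replace action on __address__ for one target."""
    joined = f"{address};{port_annotation}"  # default separator is ';'
    match = ADDRESS_RE.fullmatch(joined)     # relabel regexes are fully anchored
    if match is None:
        # No match: a replace action leaves the target label unchanged.
        return address
    return match.expand(r"\1:\2")            # equivalent of replacement: $1:$2


print(rewrite_address("10.0.0.5:8080", "9102"))
print(rewrite_address("10.0.0.5", "9102"))
print(rewrite_address("10.0.0.5:8080", ""))
```

The third call shows the case where the pod has no port annotation: the joined string ends in `;`, the pattern does not match, and the discovered address is kept as is.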
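6. Metric Cardinality Analysis
Analyze and optimize metric cardinality:
# Top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))
# Number of distinct values of a label on a metric
count(count by (label_name) (metric_name))
# Head series per head chunk (a TSDB-wide ratio; it does not break memory down per metric)
prometheus_tsdb_head_series / prometheus_tsdb_head_chunks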
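The first two cardinality queries can be mirrored offline when a list of series label sets is available, for example from an export. The Python sketch below assumes each series is a plain dict of labels that includes `__name__`; the helper names and the sample data are hypothetical.

```python
from collections import Counter


def top_metrics_by_series(series, k=10):
    """Offline equivalent of topk(k, count by (__name__)(...))."""
    counts = Counter(labels["__name__"] for labels in series)
    return counts.most_common(k)


def label_value_count(series, metric, label):
    """Offline equivalent of count(count by (label) (metric))."""
    return len({
        labels[label]
        for labels in series
        if labels["__name__"] == metric and label in labels
    })


# Hypothetical label sets for illustration.
series = [
    {"__name__": "http_requests_total", "status": "200", "path": "/a"},
    {"__name__": "http_requests_total", "status": "500", "path": "/a"},
    {"__name__": "http_requests_total", "status": "200", "path": "/b"},
    {"__name__": "up", "job": "api"},
]
print(top_metrics_by_series(series, k=2))
print(label_value_count(series, "http_requests_total", "status"))
```

Labels whose distinct-value count grows with traffic (user IDs, request IDs, raw URLs) are the usual source of cardinality problems, so the second helper is the one to run per label when a metric shows up in the top list.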
MCP Server Integration
This skill can use the following MCP servers:
| Server | Description | Installation |
|---|---|---|
| mcp-grafana (Grafana Labs) | Official Grafana MCP server | GitHub |
| loki-mcp (Grafana) | Loki log integration | GitHub |
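Best Practices
PromQL
- Use recording rules - precompute expensive queries
- Limit cardinality - avoid unbounded labels
- Use appropriate ranges - make the range window span several scrape intervals, since rate() needs at least two samples in the window
- Prefer rate() over increase() for graphs - rate() yields a per-second value that stays comparable when the range window changes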
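To illustrate how `rate()` and `increase()` relate, the following Python sketch computes both from raw counter samples, including the counter-reset correction. It is deliberately simplified: Prometheus also extrapolates towards the window boundaries, which this sketch omits, and the function names and sample values are assumptions.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, correcting for resets.

    Returns None when fewer than two samples are available, which is why a
    range window shorter than two scrape intervals yields no data.
    """
    if len(samples) < 2:
        return None
    total = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset: the counter restarted from zero.
            total += value
        else:
            total += value - previous
        previous = value
    return total


def counter_rate(samples):
    """Per-second rate: the increase divided by the time between first and last sample."""
    increase = counter_increase(samples)
    if increase is None:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Scrapes every 60 s; the process restarts between the second and third scrape.
samples = [(0, 100), (60, 160), (120, 10), (180, 70)]
print(counter_increase(samples), counter_rate(samples))
```

Because the increase is the rate multiplied by the elapsed time, two graphs with different range windows agree when plotted as rates but differ in scale when plotted as increases.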
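Alerting
- Multi-window alerts - combine short and long windows
- Clear runbook links - include them in annotations
- Appropriate severity - match business impact
- Avoid alert fatigue - alert on symptoms, not causes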
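The multi-window practice is usually expressed through a burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The Python sketch below shows the arithmetic. The 99.9% target and the 14.4 threshold are assumptions taken from the commonly cited setup in which a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day budget (14.4 × 1 h / 720 h = 0.02); the function names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio,
                slo_target=0.999, threshold=14.4):
    """Page only when both the long and the short window exceed the threshold.

    The long window shows the problem is significant; the short window shows
    it is still happening, so the alert resolves quickly after recovery.
    """
    return (burn_rate(long_window_ratio, slo_target) > threshold
            and burn_rate(short_window_ratio, slo_target) > threshold)


# 2% errors over the last hour and over the last 5 minutes: still burning fast.
print(should_page(0.02, 0.02))
# 2% errors over the last hour, but the last 5 minutes have recovered.
print(should_page(0.02, 0.001))
```

In a rule file, the two ratios would typically come from recording rules evaluated over the long and short windows, combined with `and` in the alert expression.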
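Dashboards
- USE method - Utilization, Saturation, Errors
- RED method - Rate, Errors, Duration
- Consistent layout - follow established dashboard patterns
- Template variables - enable filtering
Process Integration
This skill integrates with the following processes:
- monitoring-setup.js - initial Prometheus/Grafana setup
- slo-sli-tracking.js - SLO/SLI dashboard creation
- error-budget-management.js - error budget dashboards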
Output Format
When performing an operation, provide structured output:
{
  "operation": "create-dashboard",
  "status": "success",
  "dashboard": {
    "uid": "service-overview",
    "url": "https://grafana.example.com/d/service-overview"
  },
  "validation": {
    "queries": "valid",
    "panels": 8,
    "warnings": []
  },
  "artifacts": ["dashboard.json"]
}
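Error Handling
Common Issues
| Error | Cause | Solution |
|---|---|---|
| No data | Metric not scraped | Check the scrape configuration and targets |
| Many-to-many matching | Ambiguous join | Use on() or ignoring() |
| Query timeout | Complex query | Use recording rules |
| Cardinality explosion | Unbounded labels | Add label constraints |
Constraints
- Validate PromQL syntax before applying
- Test alerts in a non-production environment first
- Consider the cardinality impact of new metrics
- Use appropriate retention settings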