---
name: prometheus
description: |
  Prometheus monitoring and alerting for cloud-native observability.
  Use when: writing PromQL queries, configuring Prometheus scrape targets, creating alerting rules, setting up recording rules, instrumenting applications with Prometheus metrics, or configuring service discovery.
  Do not use for: building dashboards (use /grafana), log analysis (use /logging-observability), or general observability architecture (use a senior software engineer with an infrastructure focus).
triggers:
  - metrics
  - prometheus
  - promql
  - counter
  - gauge
  - histogram
  - summary
  - alert
  - alertmanager
  - alerting rules
  - recording rules
  - scrape
  - targets
  - labels
  - service discovery
  - relabeling
  - exporter
  - instrumentation
  - slo
  - error budget
allowed-tools: Read, Grep, Glob, Edit, Write, Bash
---
# Prometheus Monitoring and Alerting

## Overview

Prometheus is a powerful open-source monitoring and alerting system designed for reliability and scalability in cloud-native environments. It stores multi-dimensional time-series data and provides flexible querying through PromQL.

## Architecture Components

- Prometheus server: core component that scrapes and stores time-series data in a local TSDB
- Alertmanager: handles alert deduplication, grouping, routing, and notification receivers
- Pushgateway: lets ephemeral jobs push metrics (use sparingly; prefer the pull model)
- Exporters: translate metrics from third-party systems into the Prometheus format (node, blackbox, etc.)
- Client libraries: instrument application code (Go, Java, Python, Rust, etc.)
- Prometheus Operator: Kubernetes-native deployment and management via CRDs
- Remote storage: long-term storage and multi-cluster federation via Thanos, Cortex, or Mimir
## Data Model

- Metric: a time series identified by a metric name plus key-value labels
- Format: `metric_name{label1="value1", label2="value2"} sample_value timestamp`
- Metric types:
  - Counter: monotonically increasing value (requests, errors); query with `rate()` or `increase()`
  - Gauge: value that can go up or down (temperature, memory usage, queue length)
  - Histogram: observations in configurable buckets (latency, request sizes); exposes `_bucket`, `_sum`, `_count` series
  - Summary: like a histogram, but quantiles are computed client-side; prefer histograms when you need aggregation
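The `_bucket`/`_sum`/`_count` series a histogram exposes can be illustrated with a small pure-Python sketch (no client library; the metric name and bucket bounds here are invented for the example). It accumulates observations into cumulative buckets the way a client library does before exposition:

```python
# Minimal sketch of how a Prometheus histogram accumulates observations.
# Metric name and bucket boundaries are illustrative, not from a real app.
def histogram_series(observations, bounds=(0.1, 0.5, 1.0)):
    buckets = {le: 0 for le in bounds}
    for v in observations:
        for le in bounds:
            if v <= le:  # buckets are cumulative: each bound counts every value <= it
                buckets[le] += 1
    lines = [
        f'http_request_duration_seconds_bucket{{le="{le}"}} {n}'
        for le, n in buckets.items()
    ]
    # The +Inf bucket always equals the total observation count
    lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {len(observations)}')
    lines.append(f"http_request_duration_seconds_sum {sum(observations)}")
    lines.append(f"http_request_duration_seconds_count {len(observations)}")
    return lines

for line in histogram_series([0.05, 0.2, 0.7, 2.0]):
    print(line)
```

Note the cumulative property: the `le="0.5"` bucket also contains everything counted under `le="0.1"`, which is why `histogram_quantile()` needs the `le` label preserved when aggregating.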
## Setup and Configuration

### Basic Prometheus Server Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-east-1"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rule files
rule_files:
  - "alerts/*.yml"
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application services
  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-1:8080"
          - "app-2:8080"
        labels:
          env: "production"
          team: "backend"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods carrying the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use a custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use a custom port if specified
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add a namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add a pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # Add an app name label
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

  # Node exporter for host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-exporter:9100"
```
### Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Template files for customizing notifications
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Route alerts to the appropriate receivers
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "default"
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true
    # Database alerts go to the DBA team
    - match:
        team: database
      receiver: "dba-team"
      group_by: ["alertname", "instance"]
    # Development environment alerts
    - match:
        env: development
      receiver: "slack-dev"
      group_wait: 5m
      repeat_interval: 4h

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning alerts while a critical alert is firing
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]
  # Suppress per-instance alerts when the whole service is down
  - source_match:
      alertname: "ServiceDown"
    target_match_re:
      alertname: ".*"
    equal: ["service"]

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: "Alert: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        description: "{{ .GroupLabels.alertname }}"
  - name: "dba-team"
    slack_configs:
      - channel: "#database-alerts"
    email_configs:
      - to: "dba-team@example.com"
        headers:
          Subject: "Database alert: {{ .GroupLabels.alertname }}"
  - name: "slack-dev"
    slack_configs:
      - channel: "#dev-alerts"
        send_resolved: true
```
## Best Practices

### Metric Naming Conventions

Follow these naming patterns for consistency:

```
# Format: <namespace>_<subsystem>_<metric>_<unit>

# Counters (always use the _total suffix)
http_requests_total
http_request_errors_total
cache_hits_total

# Gauges
memory_usage_bytes
active_connections
queue_size

# Histograms (automatically get _bucket, _sum, _count suffixes)
http_request_duration_seconds
response_size_bytes
db_query_duration_seconds
```

Use consistent base units:

- seconds for durations (not milliseconds)
- bytes for sizes (not kilobytes)
- ratios for proportions (0.0-1.0, not 0-100)

### Label Cardinality Management

Do:

```
# Good: bounded cardinality
http_requests_total{method="GET", status="200", endpoint="/api/users"}

# Good: a reasonable number of label values
db_queries_total{table="users", operation="select"}
```

Don't:

```
# Bad: unbounded cardinality (user IDs, email addresses, timestamps)
http_requests_total{user_id="12345"}
http_requests_total{email="user@example.com"}
http_requests_total{timestamp="1234567890"}

# Bad: high cardinality (full URLs, IP addresses)
http_requests_total{url="/api/users/12345/profile"}
http_requests_total{client_ip="192.168.1.100"}
```

Guidelines:

- Keep values per label under 10 (ideally)
- Keep total unique time series per metric under 10,000
- Use recording rules to pre-aggregate high-cardinality metrics
- Avoid labels with unbounded values (IDs, timestamps, user input)
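The series count a metric can produce is bounded above by the product of each label's distinct values, while the actual count is the number of distinct label combinations observed. A small pure-Python sketch (the sample label sets are invented) makes the difference concrete:

```python
# Hypothetical samples: each dict is the label set of one observed series.
samples = [
    {"method": "GET", "status": "200"},
    {"method": "GET", "status": "500"},
    {"method": "POST", "status": "200"},
]

# Actual cardinality: number of distinct label combinations seen.
actual = len({tuple(sorted(s.items())) for s in samples})

# Worst-case cardinality: product of per-label distinct value counts.
values = {}
for s in samples:
    for k, v in s.items():
        values.setdefault(k, set()).add(v)
worst_case = 1
for vs in values.values():
    worst_case *= len(vs)

print(actual, worst_case)  # 3 combinations seen, 2*2 = 4 possible
```

Adding one unbounded label (say, `user_id`) multiplies the worst case by the number of users, which is why such labels are disallowed above.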
### Recording Rules for Performance

Use recording rules to precompute expensive queries:

```yaml
# rules/recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Precompute the request rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # Precompute the error rate
      - record: job:http_request_errors:rate5m
        expr: sum(rate(http_request_errors_total[5m])) by (job)
      # Precompute the error ratio
      - record: job:http_request_error_ratio:rate5m
        expr: |
          job:http_request_errors:rate5m
          /
          job:http_requests:rate5m
      # Pre-aggregate latency percentiles
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

  - name: aggregation_rules
    interval: 1m
    rules:
      # Multi-level aggregation for dashboards
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      - record: cluster:node_cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)
      # Memory aggregation
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )
```
### Alert Design (Symptoms vs. Causes)

Alert on symptoms (user impact), not causes:

```yaml
# alerts/symptom_based.yml
groups:
  - name: symptom_alerts
    rules:
      # Good: alert on a user-facing symptom
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"
          impact: "Users are experiencing slow page loads"

      # Good: SLO-based alerting
      - alert: SLOBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999))  # 14.4x burn rate for a 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "SLO budget burning too fast"
          description: "At the current rate, the monthly error budget will be exhausted soon"
```
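The 14.4x factor above comes from multiwindow burn-rate alerting (popularized by the Google SRE Workbook): it is the burn rate at which 2% of a 30-day error budget is consumed in one hour. A quick sanity check in plain Python, assuming that 30-day window:

```python
# Burn-rate threshold sanity check for the 99.9% SLO alert above.
# Assumes a 30-day SLO window, per the multiwindow burn-rate approach.
slo = 0.999                      # 99.9% availability target
error_budget = 1 - slo           # 0.1% of requests may fail over the window
window_hours = 30 * 24           # 720 hours in the SLO window

# Page when 2% of the budget would burn in 1 hour:
burn_rate = 0.02 * window_hours  # = 14.4
threshold = burn_rate * error_budget

print(burn_rate, round(threshold, 4))  # 14.4 0.0144
```

So the alert fires when the short-term error ratio exceeds 1.44%, i.e. 14.4 times the budgeted failure rate.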
Cause-based alerts (for debugging, not paging):

```yaml
# alerts/cause_based.yml
groups:
  - name: infrastructure_alerts
    rules:
      # Lower-severity alerts for infrastructure issues
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning  # not critical unless a symptom appears
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          ) < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          action: "Clean up logs or expand the disk"
```

### Alerting Best Practices

- Duration: use the `for` clause to avoid flapping
- Meaningful annotations: include a summary, description, runbook URL, and impact
- Appropriate severity levels: critical (page immediately), warning (ticket), info (log)
- Actionable alerts: every alert should require human action
- Include context: add labels for team ownership, service, and environment
## PromQL Query Patterns

PromQL is Prometheus's query language. Key concepts: instant vectors, range vectors, scalars, string literals, selectors, operators, functions, and aggregations.

### Selectors and Matchers

```promql
# Instant vector selector (latest sample per time series)
http_requests_total

# Filter by label values
http_requests_total{method="GET", status="200"}

# Regex match (=~) and negated regex (!~)
http_requests_total{status=~"5.."}          # 5xx errors
http_requests_total{endpoint!~"/admin.*"}   # exclude admin endpoints

# Label presence/absence
http_requests_total{job="api", status=""}   # empty label
http_requests_total{job="api", status!=""}  # non-empty label

# Range vector selector (samples over time)
http_requests_total[5m]  # samples from the last 5 minutes
```

### Rate Calculations

```promql
# Request rate (requests per second) - always use rate() on counters
rate(http_requests_total[5m])

# Sum by service
sum(rate(http_requests_total[5m])) by (service)

# Increase over a time window - for alerts/dashboards that show totals
increase(http_requests_total[1h])

# irate() for volatile, fast-moving counters (more sensitive to spikes)
irate(http_requests_total[5m])
```

### Error Ratios

```promql
# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success ratio
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

### Histogram Queries

```promql
# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P50, P95, P99 latency by service
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
```
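Under the hood, `histogram_quantile()` finds the bucket containing the target rank and interpolates linearly within it. A pure-Python sketch of that interpolation (bucket data invented for the example; Prometheus's own implementation handles more edge cases than this sketch does):

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile().

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    bound, ending with the +Inf bucket. Interpolates linearly inside
    the bucket containing the q-th rank, assuming a lower bound of 0
    for the first bucket.
    """
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound   # cannot interpolate into +Inf
            # Linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Invented data: 100 observations across three finite buckets.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # rank 95 falls in the (0.5, 1.0] bucket -> 0.75
```

This is also why quantile accuracy depends on bucket layout: within a bucket, the estimate is a straight-line guess between its bounds.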
### Aggregation Operations

```promql
# Sum across all instances
sum(node_memory_MemTotal_bytes) by (cluster)

# Average CPU idle rate
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Maximum
max(http_request_duration_seconds) by (service)

# Minimum
min(node_filesystem_avail_bytes) by (instance)

# Instance count
count(up == 1) by (job)

# Standard deviation
stddev(http_request_duration_seconds) by (service)
```

### Advanced Queries

```promql
# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

# 3 instances with the least available memory
bottomk(3, node_memory_MemAvailable_bytes)

# Predict when the disk will fill (linear regression)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

# Compare with 1 day ago
http_requests_total - http_requests_total offset 1d

# Rate of change (derivative)
deriv(node_memory_MemAvailable_bytes[5m])

# Missing metric detection
absent(up{job="critical-service"})
```

### Complex Aggregations

```promql
# Apdex score (Application Performance Index):
# (satisfied + tolerating/2) / total. Because buckets are cumulative,
# this is (bucket(0.1) + bucket(0.5)) / 2 / count.
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

# Multiwindow, multi-burn-rate SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
  > 0.001 * 14.4
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.001 * 14.4
)
```
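The Apdex arithmetic is easy to verify in plain Python. With cumulative bucket counts (invented numbers), satisfied requests are those at the 0.1s bound and tolerating requests fall between 0.1s and 0.5s:

```python
# Apdex = (satisfied + tolerating/2) / total.
# Cumulative bucket counts (invented): 70 requests <= 0.1s, 90 <= 0.5s, 100 total.
bucket_0_1 = 70
bucket_0_5 = 90
total = 100

satisfied = bucket_0_1
tolerating = bucket_0_5 - bucket_0_1   # buckets are cumulative

apdex = (satisfied + tolerating / 2) / total
# Equivalent to the PromQL form (bucket(0.1) + bucket(0.5)) / 2 / count,
# since bucket(0.5) = satisfied + tolerating:
apdex_promql_form = (bucket_0_1 + bucket_0_5) / 2 / total

print(apdex, apdex_promql_form)  # 0.8 0.8
```

The two forms agree because `bucket(0.1) + bucket(0.5) = 2*satisfied + tolerating`, and halving recovers `satisfied + tolerating/2`.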
### Binary Operators and Vector Matching

```promql
# Arithmetic operators (+, -, *, /, %, ^)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Comparison operators (==, !=, >, <, >=, <=) - filter to matching values
http_request_duration_seconds > 1

# Logical operators (and, or, unless)
up{job="api"} and rate(http_requests_total[5m]) > 100

# One-to-one matching (default)
method:http_requests:rate5m / method:http_requests:total

# Many-to-one matching with group_left
sum(rate(http_requests_total[5m])) by (instance, method)
/ on(instance) group_left
sum(rate(http_requests_total[5m])) by (instance)

# One-to-many matching with group_right
sum(rate(http_requests_total[5m])) by (instance)
/ on(instance) group_right
sum(rate(http_requests_total[5m])) by (instance, method)
```

### Time Functions and Offsets

```promql
# Compare with the previous hour
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

# Day-over-day comparison
http_requests_total - http_requests_total offset 1d

# Time-based filtering
http_requests_total and hour() >= 9 and hour() < 17  # business hours
day_of_week() == 0 or day_of_week() == 6             # weekends

# Timestamp functions
time() - process_start_time_seconds  # uptime in seconds
```
## Service Discovery

Prometheus supports multiple service discovery mechanisms for dynamic environments where targets appear and disappear.

### Static Configuration

```yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets:
          - "host1:9100"
          - "host2:9100"
        labels:
          env: production
          region: us-east-1
```

### File-Based Service Discovery

```yaml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s
```

Example `targets/webservers.json`:

```json
[
  {
    "targets": ["web1:8080", "web2:8080"],
    "labels": {
      "job": "web",
      "env": "prod"
    }
  }
]
```

### Kubernetes Service Discovery

```yaml
scrape_configs:
  # Pod-based discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Extract a custom scrape path from the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Extract a custom port from the annotation
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add standard Kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

  # Service-based discovery
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # Node-based discovery (for node metrics)
  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Endpoints discovery (for service endpoints)
  - job_name: "kubernetes-endpoints"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
```

### Consul Service Discovery

```yaml
scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        datacenter: "dc1"
        services: ["web", "api", "cache"]
        tags: ["production"]
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_tags]
        target_label: tags
```

### EC2 Service Discovery

```yaml
scrape_configs:
  - job_name: "ec2-instances"
    ec2_sd_configs:
      - region: us-east-1
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]
          - name: instance-state-name
            values: [running]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
```

### DNS Service Discovery

```yaml
scrape_configs:
  - job_name: "dns-srv-records"
    dns_sd_configs:
      - names:
          - "_prometheus._tcp.example.com"
        type: "SRV"
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance
```
## Relabeling Actions Reference

| Action | Description | Use case |
|---|---|---|
| `keep` | Keep targets whose source labels match the regex | Filter targets by annotation/label |
| `drop` | Drop targets whose source labels match the regex | Exclude specific targets |
| `replace` | Set a target label from source label values | Extract custom labels/paths/ports |
| `labelmap` | Map source label names to target labels via regex | Copy all Kubernetes labels |
| `labeldrop` | Drop labels matching the regex | Remove internal metadata labels |
| `labelkeep` | Keep only labels matching the regex | Reduce cardinality |
| `hashmod` | Set a target label to the hash of source labels modulo N | Sharding/routing |
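The `hashmod` action can be sketched in plain Python. This sketch uses Python's md5 purely for illustration; Prometheus uses its own hash function internally, so real shard assignments will not match these values — only the modulo-partitioning idea carries over:

```python
import hashlib

def shard_for(pod_name: str, modulus: int = 3) -> int:
    # Illustrative only: hash the concatenated source label values and take
    # the result modulo N, as the hashmod relabel action does conceptually.
    # Prometheus's actual hash differs, so do not expect matching shards.
    digest = hashlib.md5(pod_name.encode()).hexdigest()
    return int(digest, 16) % modulus

pods = ["web-abc123", "web-def456", "api-xyz789"]
assignment = {p: shard_for(p) for p in pods}
print(assignment)  # each pod lands deterministically on one of shards 0-2
```

The key property is determinism: every Prometheus replica computes the same shard for a given target, so each target is scraped by exactly one shard without coordination.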
## High Availability and Scalability

### Prometheus HA Setup

```yaml
# Run multiple identical Prometheus instances scraping the same targets.
# Use external labels to distinguish the replicas.
global:
  external_labels:
    replica: prometheus-1  # change to prometheus-2, etc.
    cluster: production

# Alertmanager deduplicates alerts from multiple Prometheus instances
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093
```

### Alertmanager Clustering

```yaml
# alertmanager.yml - HA cluster configuration
global:
  resolve_timeout: 5m
route:
  receiver: "default"
  group_by: ["alertname", "cluster"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
receivers:
  - name: "default"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK"
        channel: "#alerts"
```

```shell
# Start the Alertmanager cluster members
# alertmanager-1: --cluster.peer=alertmanager-2:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-2: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-3: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-2:9094
```

### Federation for Hierarchical Monitoring

```yaml
# A global Prometheus federating from regional instances
scrape_configs:
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        # Pull only aggregated metrics
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # recording rules
        - "up"
    static_configs:
      - targets:
          - "prometheus-us-east-1:9090"
          - "prometheus-us-west-2:9090"
          - "prometheus-eu-west-1:9090"
```

### Remote Storage for Long-Term Retention

```yaml
# Prometheus remote write to Thanos/Cortex/Mimir
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 100ms
    write_relabel_configs:
      # Drop high-cardinality metrics before remote write
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop

# Prometheus remote read from long-term storage
remote_read:
  - url: "http://thanos-query:9090/api/v1/read"
    read_recent: true
```

### Thanos Architecture for a Global View

```shell
# Thanos Sidecar - runs alongside Prometheus
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/prometheus \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Store - queries object storage
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Query - global query interface
thanos query \
  --http-address=0.0.0.0:9090 \
  --grpc-address=0.0.0.0:10901 \
  --store=prometheus-1-sidecar:10901 \
  --store=prometheus-2-sidecar:10901 \
  --store=thanos-store:10901

# Thanos Compactor - downsamples and compacts blocks
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d
```

### Horizontal Sharding with Hashmod

```yaml
# Split scrape targets across multiple Prometheus instances using hashmod
scrape_configs:
  - job_name: "kubernetes-pods-shard-0"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash the pod name and keep only shard 0 (mod 3)
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

  - job_name: "kubernetes-pods-shard-1"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "1"
        action: keep

  # shard-2 follows the same pattern...
```
## Kubernetes Integration

### ServiceMonitor for the Prometheus Operator

```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: myapp
    release: prometheus
spec:
  # Select the services to monitor
  selector:
    matchLabels:
      app: myapp
  # Namespaces to search
  namespaceSelector:
    matchNames:
      - production
      - staging
  # Endpoint configuration
  endpoints:
    - port: metrics  # service port name
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
      # Metric relabeling (filter/modify metrics)
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop  # drop Go runtime metrics
        - sourceLabels: [status]
          regex: "[45].."
          targetLabel: error
          replacement: "true"
      # Optional: TLS configuration
      # tlsConfig:
      #   insecureSkipVerify: true
      #   ca:
      #     secret:
      #       name: prometheus-tls
      #       key: ca.crt
```

### PodMonitor for Direct Pod Scraping

```yaml
# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pods
  namespace: monitoring
  labels:
    release: prometheus
spec:
  # Select the pods to monitor
  selector:
    matchLabels:
      app: myapp
  # Namespace selection
  namespaceSelector:
    matchNames:
      - production
  # Pod metrics endpoints
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s
      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
```

### PrometheusRule for Alerting and Recording Rules

```yaml
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
  labels:
    release: prometheus
    role: alert-rules
spec:
  groups:
    - name: app_alerts
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
              /
              sum(rate(http_requests_total{app="myapp"}[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
            dashboard: "https://grafana.example.com/d/app-overview"
            runbook: "https://wiki.example.com/runbooks/high-error-rate"
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} restarted {{ $value }} times in 15m"
    - name: app_recording_rules
      interval: 30s
      rules:
        - record: app:http_requests:rate5m
          expr: sum(rate(http_requests_total{app="myapp"}[5m])) by (namespace, pod, method, status)
        - record: app:http_request_duration_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le, namespace, pod)
            )
```

### Prometheus Custom Resource

```yaml
# prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  version: v2.45.0
  # Service account for Kubernetes API access
  serviceAccountName: prometheus
  # Select ServiceMonitors
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
  # Select PodMonitors
  podMonitorSelector:
    matchLabels:
      release: prometheus
  # Select PrometheusRules
  ruleSelector:
    matchLabels:
      release: prometheus
      role: alert-rules
  # Resource limits
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m
  # Storage
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd
  # Retention
  retention: 30d
  retentionSize: 45GB
  # Alertmanager configuration
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager
        port: web
  # External labels
  externalLabels:
    cluster: production
    region: us-east-1
  # Security context
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  # Admin API for management operations
  enableAdminAPI: false
  # Additional scrape configs (loaded from a Secret)
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
```
## Application Instrumentation Examples

### Go Application

```go
// main.go
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter for total requests
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	// Histogram for request duration
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "endpoint"},
	)

	// Gauge for active connections
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)

	// Summary for response sizes
	responseSizeBytes = promauto.NewSummaryVec(
		prometheus.SummaryOpts{
			Name:       "http_response_size_bytes",
			Help:       "HTTP response size in bytes",
			Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		},
		[]string{"endpoint"},
	)
)

// Middleware that instruments an HTTP handler
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		activeConnections.Inc()
		defer activeConnections.Dec()

		// Wrap the response writer to capture the status code
		wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
		handler(wrapped, r)

		duration := time.Since(start).Seconds()
		httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
		// Record the numeric status code ("200", "500"), not its text form,
		// so PromQL regexes like status=~"5.." keep working
		httpRequestsTotal.WithLabelValues(r.Method, endpoint,
			strconv.Itoa(wrapped.statusCode)).Inc()
	}
}

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`{"users": []}`))
}

func main() {
	// Register handlers
	http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
	http.Handle("/metrics", promhttp.Handler())

	// Start the server
	http.ListenAndServe(":8080", nil)
}
```
### Python Application (Flask)

```python
# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

active_requests = Gauge(
    'active_requests',
    'Number of active requests'
)

# Instrumentation middleware
@app.before_request
def before_request():
    active_requests.inc()
    request.start_time = time.time()

@app.after_request
def after_request(response):
    active_requests.dec()
    duration = time.time() - request.start_time
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)
    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    return response

@app.route('/metrics')
def metrics():
    return generate_latest()

@app.route('/api/users')
def users():
    return {'users': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```
## Production Deployment Checklist

- [ ] Set an appropriate retention period (balance storage against historical needs)
- [ ] Configure persistent storage with sufficient capacity
- [ ] Enable high availability (multiple Prometheus replicas or federation)
- [ ] Set up remote storage for long-term retention (Thanos, Cortex, Mimir)
- [ ] Configure service discovery for dynamic environments
- [ ] Implement recording rules for frequently used queries
- [ ] Create symptom-based alerts with proper annotations
- [ ] Set up Alertmanager with appropriate routing and receivers
- [ ] Configure inhibition rules to reduce alert noise
- [ ] Add runbook URLs to all critical alerts
- [ ] Practice label hygiene (avoid high cardinality)
- [ ] Monitor Prometheus itself (meta-monitoring)
- [ ] Set up authentication and authorization
- [ ] Enable TLS for scrape targets and remote storage
- [ ] Configure rate limiting for queries
- [ ] Validate alerting and recording rules (`promtool check rules`)
- [ ] Implement backup and disaster recovery procedures
- [ ] Document metric naming conventions for the team
- [ ] Create Grafana dashboards for common queries
- [ ] Set up log aggregation alongside metrics (Loki)
## Troubleshooting Commands

```shell
# Check Prometheus configuration syntax
promtool check config prometheus.yml

# Check rule file syntax
promtool check rules alerts/*.yml

# Test a PromQL query
promtool query instant http://localhost:9090 'up'

# Check which targets are up
curl http://localhost:9090/api/v1/targets

# Query a current metric value
curl 'http://localhost:9090/api/v1/query?query=up'

# Inspect discovered target metadata
curl http://localhost:9090/api/v1/targets/metadata

# View TSDB statistics
curl http://localhost:9090/api/v1/status/tsdb

# Check runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo
```
## Quick Reference

### Common PromQL Patterns

```promql
# Requests per second
rate(http_requests_total[5m])

# Error ratio as a percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P95 latency from a histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency from a histogram
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

# Memory usage percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU usage (non-idle)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Disk space remaining percentage
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes

# Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))

# Service uptime in days
(time() - process_start_time_seconds) / 86400

# Request rate growth compared with 1 hour ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)
```
### Alerting Rule Patterns

```yaml
# High error rate (symptom)
alert: HighErrorRate
expr: |
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
  severity: critical
annotations:
  summary: "Error rate is {{ $value | humanizePercentage }}"
  runbook: "https://runbooks.example.com/high-error-rate"

# High P95 latency
alert: HighLatency
expr: |
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
  ) > 1
for: 5m
labels:
  severity: warning

# Service down
alert: ServiceDown
expr: up{job="critical-service"} == 0
for: 2m
labels:
  severity: critical

# Low disk space (cause, warning only)
alert: DiskSpaceLow
expr: |
  node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} < 0.1
for: 10m
labels:
  severity: warning

# Pod crash looping
alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
  severity: warning
```
### Recording Rule Naming Convention

```yaml
# Format: level:metric:operations
#   level      = aggregation level (job, instance, cluster)
#   metric     = base metric name
#   operations = transformations applied (rate5m, sum, ratio)
groups:
  - name: aggregation_rules
    rules:
      # Instance-level aggregation
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      # Job-level aggregation
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # Job-level error ratio
      - record: job:http_request_errors:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)
      # Cluster-level aggregation
      - record: cluster:cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)
```
### Metric Naming Best Practices

| Pattern | Good example | Bad example |
|---|---|---|
| Counter suffix | `http_requests_total` | `http_requests` |
| Base units | `http_request_duration_seconds` | `http_request_duration_ms` |
| Ratio range | `cache_hit_ratio` (0.0-1.0) | `cache_hit_percentage` (0-100) |
| Byte units | `response_size_bytes` | `response_size_kb` |
| Namespace prefix | `myapp_http_requests_total` | `http_requests_total` |
| Label naming | `{method="GET", status="200"}` | `{httpMethod="GET", statusCode="200"}` |

### Label Cardinality Guidelines

| Cardinality | Examples | Recommendation |
|---|---|---|
| Low (<10) | HTTP methods, status codes, environments | Safe for any label |
| Medium (10-100) | API endpoints, service names, pod names | Safe when aggregated |
| High (100-1000) | Container IDs, hostnames | Only when necessary |
| Unbounded | User IDs, IP addresses, timestamps, URL paths | Never use as labels |
### Kubernetes Annotation-Based Scraping

```yaml
# Pod annotations for automatic Prometheus scraping
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
    prometheus.io/scheme: "http"
spec:
  containers:
    - name: app
      image: myapp:latest
      ports:
        - containerPort: 8080
          name: metrics
```
### Alertmanager Routing Patterns

```yaml
route:
  receiver: default
  group_by: ["alertname", "cluster"]
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true  # also send to default
    # Team-based routing
    - match:
        team: database
      receiver: dba-team
      group_by: ["alertname", "instance"]
    # Environment-based routing
    - match:
        env: development
      receiver: slack-dev
      repeat_interval: 4h
    # Time-based routing (business hours only)
    - match:
        severity: warning
      receiver: email
      active_time_intervals:
        - business-hours

time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: "09:00"
            end_time: "17:00"
        weekdays: ["monday:friday"]
```