Prometheus Monitoring and Alerting Skill: prometheus

The Prometheus monitoring and alerting skill covers system monitoring, alert configuration, and data analysis in cloud-native environments. Prometheus is an open-source time series database and monitoring platform supporting the PromQL query language, dynamic service discovery, and alert rule management. It is widely used in DevOps, SRE, and microservice architectures for observability and performance optimization. Keywords: Prometheus monitoring, time series data, PromQL queries, alert management, cloud native, DevOps tooling, monitoring systems, Kubernetes integration, SLO monitoring.

DevOps · Updated 3/24/2026

Name: prometheus
Description: Prometheus monitoring and alerting for cloud-native observability.

Use when: writing PromQL queries, configuring Prometheus scrape targets, creating alert rules, setting up recording rules, instrumenting applications with Prometheus metrics, configuring service discovery. Do not use for: building dashboards (use /grafana), log analysis (use /logging-observability), or general observability architecture (use a senior software engineer with an infrastructure focus).

Trigger words:

  • metrics
  • prometheus
  • promql
  • counter
  • gauge
  • histogram
  • summary
  • alert
  • alertmanager
  • alert rules
  • recording rules
  • scrape
  • targets
  • labels
  • service discovery
  • relabeling
  • exporters
  • instrumentation
  • slo
  • error budget

Allowed tools: Read, Grep, Glob, Edit, Write, Bash

Prometheus Monitoring and Alerting

Overview

Prometheus is a powerful open-source monitoring and alerting system designed for reliability and scalability in cloud-native environments. It is built around multidimensional time series data with flexible querying through PromQL.

Architecture Components

  • Prometheus server: core component that scrapes and stores time series data in a local TSDB
  • Alertmanager: handles alert deduplication, grouping, routing, and notification receivers
  • Pushgateway: lets ephemeral jobs push metrics (use sparingly - prefer the pull model)
  • Exporters: convert metrics from third-party systems into the Prometheus format (node, blackbox, etc.)
  • Client libraries: instrument application code (Go, Java, Python, Rust, etc.)
  • Prometheus Operator: Kubernetes-native deployment and management via CRDs
  • Remote storage: long-term storage and multi-cluster federation via Thanos, Cortex, or Mimir

Data Model

  • Metrics: time series identified by a metric name and key-value labels
  • Format: metric_name{label1="value1", label2="value2"} sample_value timestamp
  • Metric types:
    • Counter: a monotonically increasing value (requests, errors) - query with rate() or increase()
    • Gauge: a value that can go up or down (temperature, memory usage, queue length)
    • Histogram: observations in configurable buckets (latency, request sizes) - exposes _bucket, _sum, and _count series
    • Summary: like a histogram, but quantiles are computed client-side - prefer histograms when aggregation is needed
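To make the counter/rate() relationship concrete, here is a small Python sketch (hypothetical sample numbers, and omitting the counter-reset correction Prometheus performs internally) of how a per-second rate falls out of two cumulative counter samples:

```python
def per_second_rate(sample_start, sample_end, window_seconds):
    """Approximate rate() over a window from two counter samples.

    Counters only go up, so the per-second rate is the increase
    divided by elapsed time. Real Prometheus also detects and
    corrects counter resets; this sketch only handles the naive case.
    """
    increase = sample_end - sample_start
    if increase < 0:  # a counter reset happened between the samples
        increase = sample_end  # naive handling: assume it restarted at zero
    return increase / window_seconds

# http_requests_total went from 1200 to 1500 over 5 minutes (300s)
print(per_second_rate(1200, 1500, 300))  # → 1.0 requests/second
```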

Setup and Configuration

Basic Prometheus Server Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-east-1"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files to load
rule_files:
  - "alerts/*.yml"
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application services
  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-1:8080"
          - "app-2:8080"
        labels:
          env: "production"
          team: "backend"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use a custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use a custom port if specified
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add a namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add a pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # Add a service name label from the pod's app label
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

  # Node exporter for host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-exporter:9100"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Template files for custom notifications
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Route alerts to the appropriate receivers
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "default"

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true

    # Database alerts go to the DBA team
    - match:
        team: database
      receiver: "dba-team"
      group_by: ["alertname", "instance"]

    # Development environment alerts
    - match:
        env: development
      receiver: "slack-dev"
      group_wait: 5m
      repeat_interval: 4h

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning alerts while a matching critical alert is firing
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]

  # Suppress instance-level alerts when the whole service is down
  - source_match:
      alertname: "ServiceDown"
    target_match_re:
      alertname: ".*"
    equal: ["service"]

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: "Alert: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        description: "{{ .GroupLabels.alertname }}"

  - name: "dba-team"
    slack_configs:
      - channel: "#database-alerts"
    email_configs:
      - to: "dba-team@example.com"
        headers:
          Subject: "Database alert: {{ .GroupLabels.alertname }}"

  - name: "slack-dev"
    slack_configs:
      - channel: "#dev-alerts"
        send_resolved: true

Best Practices

Metric Naming Conventions

Follow these naming patterns for consistency:

# Format: <namespace>_<subsystem>_<metric>_<unit>

# Counters (always use the _total suffix)
http_requests_total
http_request_errors_total
cache_hits_total

# Gauges
memory_usage_bytes
active_connections
queue_size

# Histograms (automatically expose _bucket, _sum, _count series)
http_request_duration_seconds
response_size_bytes
db_query_duration_seconds

# Use consistent base units
- seconds for durations (not milliseconds)
- bytes for sizes (not kilobytes)
- ratios for percentages (0.0-1.0, not 0-100)

Label Cardinality Management

# Good: bounded cardinality
http_requests_total{method="GET", status="200", endpoint="/api/users"}

# Good: a reasonable number of label values
db_queries_total{table="users", operation="select"}

Avoid:

# Bad: unbounded cardinality (user IDs, email addresses, timestamps)
http_requests_total{user_id="12345"}
http_requests_total{email="user@example.com"}
http_requests_total{timestamp="1234567890"}

# Bad: high cardinality (full URLs, IP addresses)
http_requests_total{url="/api/users/12345/profile"}
http_requests_total{client_ip="192.168.1.100"}

Guidelines

  • Keep the number of values per label under 10 (ideally)
  • Keep total unique time series per metric under 10,000
  • Use recording rules to pre-aggregate high-cardinality metrics
  • Avoid labels with unbounded values (IDs, timestamps, user input)
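As a back-of-the-envelope check for the guidelines above: the worst-case series count of a metric is the product of each label's distinct-value count. A quick sketch with hypothetical label counts:

```python
from math import prod

def worst_case_series(label_value_counts):
    """Upper bound on the time series one metric can produce:
    the product of the distinct-value counts of all its labels."""
    return prod(label_value_counts.values())

# method x status x endpoint: 4 * 5 * 30 = 600 series - fine
print(worst_case_series({"method": 4, "status": 5, "endpoint": 30}))  # → 600

# adding a user_id label with 50k values explodes cardinality
print(worst_case_series({"method": 4, "status": 5, "user_id": 50000}))  # → 1000000
```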

Recording Rules for Performance

Use recording rules to precompute expensive queries:

# rules/recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Precompute the request rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Precompute the error rate
      - record: job:http_request_errors:rate5m
        expr: sum(rate(http_request_errors_total[5m])) by (job)

      # Precompute the error ratio
      - record: job:http_request_error_ratio:rate5m
        expr: |
          job:http_request_errors:rate5m
          /
          job:http_requests:rate5m

      # Pre-aggregate latency percentiles
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

  - name: aggregation_rules
    interval: 1m
    rules:
      # Multi-level aggregation for dashboards
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      - record: cluster:node_cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)

      # Memory aggregation
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )

Alert Design (Symptoms vs. Causes)

Alert on symptoms (user impact), not causes.

# alerts/symptom_based.yml
groups:
  - name: symptom_alerts
    rules:
      # Good: alert on user-facing symptoms
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"
          impact: "Users are experiencing slow page loads"

      # Good: SLO-based alerting
      - alert: SLOBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999))  # 14.4x burn rate against a 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "SLO error budget burning too fast"
          description: "Error rate over the last hour is {{ $value | humanizePercentage }}, 14.4x the sustainable burn rate; at this pace the 30-day error budget is exhausted in about two days"
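The 14.4x factor follows from standard burn-rate arithmetic: a burn rate of 1 spends the error budget exactly over the SLO window, so a rate of 14.4 spends a 30-day budget in 720/14.4 = 50 hours. A quick sanity check of both numbers in the rule:

```python
def hours_to_exhaust_budget(burn_rate, window_days=30):
    """Hours until the whole error budget is gone at a constant burn
    rate. The budget fraction cancels out: burn rate 1 makes the
    budget last exactly the SLO window."""
    return window_days * 24 / burn_rate

print(hours_to_exhaust_budget(14.4))  # → 50.0 hours, i.e. about two days

# The alert threshold itself: burn_rate * (1 - SLO)
print(round(14.4 * (1 - 0.999), 4))  # → 0.0144, a 1.44% error rate
```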

Cause-based Alerts (for debugging, not paging)

# alerts/cause_based.yml
groups:
  - name: infrastructure_alerts
    rules:
      # Lower-severity alerts for infrastructure issues
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning # not critical unless symptoms appear
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is at {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          ) < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} of disk space remains"
          action: "Clean up logs or expand the disk"

Alerting Best Practices

  1. Duration: use the for clause to avoid flapping
  2. Meaningful annotations: include a summary, description, runbook URL, and impact
  3. Appropriate severity levels: critical (page immediately), warning (ticket), info (log)
  4. Actionable alerts: every alert should require human action
  5. Include context: add labels for team ownership, service, and environment

PromQL Query Patterns

PromQL is Prometheus's query language. Key concepts: instant vectors, range vectors, scalars, string literals, selectors, operators, functions, and aggregations.

Selectors and Matchers

# Instant vector selector (latest sample of each time series)
http_requests_total

# Filter by label values
http_requests_total{method="GET", status="200"}

# Regex match (=~) and negated regex (!~)
http_requests_total{status=~"5.."}  # 5xx errors
http_requests_total{endpoint!~"/admin.*"}  # exclude admin endpoints

# Label presence/absence
http_requests_total{job="api", status=""}  # empty label
http_requests_total{job="api", status!=""}  # non-empty label

# Range vector selector (samples over time)
http_requests_total[5m]  # samples from the last 5 minutes

Rate Calculations

# Request rate (requests per second) - always use rate() on counters
rate(http_requests_total[5m])

# Sum by service
sum(rate(http_requests_total[5m])) by (service)

# Increase over a time window (totals) - for alerts/dashboards that show totals
increase(http_requests_total[1h])

# irate() for volatile, fast-moving counters (more sensitive to spikes)
irate(http_requests_total[5m])

Error Ratios

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success ratio
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

Histogram Queries

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P50, P95, P99 latency by service
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
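histogram_quantile operates on the cumulative _bucket counts: it finds the bucket containing the target rank and linearly interpolates within it. A simplified re-implementation (ignoring the +Inf and empty-bucket edge cases Prometheus handles) shows the mechanics:

```python
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count) pairs sorted
    by bound, as exposed by a Prometheus histogram's _bucket series.
    Linearly interpolates within the bucket containing the q-th rank,
    in the spirit of PromQL's histogram_quantile (edge cases omitted).
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # position of `rank` inside this bucket, interpolated linearly
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# cumulative counts for bounds 0.1s, 0.5s, 1s: 60, 90, 100 observations
print(histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (1.0, 100)]))
# → 0.75 : rank 95 falls halfway into the (0.5, 1.0] bucket
```

This is also why bucket boundaries matter: the reported quantile can never be more precise than the bucket it lands in.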

Aggregation Operations

# Sum across all instances
sum(node_memory_MemTotal_bytes) by (cluster)

# Average CPU usage
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Maximum
max(http_request_duration_seconds) by (service)

# Minimum
min(node_filesystem_avail_bytes) by (instance)

# Instance count
count(up == 1) by (job)

# Standard deviation
stddev(http_request_duration_seconds) by (service)

Advanced Queries

# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

# 3 instances with the least available memory
bottomk(3, node_memory_MemAvailable_bytes)

# Predict when the disk will fill (linear regression)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

# Compare with 1 day ago
http_requests_total - http_requests_total offset 1d

# Rate of change (derivative)
deriv(node_memory_MemAvailable_bytes[5m])

# Detect a missing metric
absent(up{job="critical-service"})

Complex Aggregations

# Compute the Apdex score (Application Performance Index)
# satisfied threshold 0.1s, tolerating threshold 0.5s
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
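The standard Apdex formula is (satisfied + tolerating/2) / total. Because Prometheus buckets are cumulative, the tolerating bucket already contains the satisfied requests, so this is equivalent to (satisfied_bucket + tolerating_bucket) / 2 / total. A plain-number sketch with hypothetical counts:

```python
def apdex(satisfied_cum, tolerating_cum, total):
    """Apdex from cumulative histogram bucket counts.

    (satisfied + tolerating) / 2 / total is algebraically equal to
    (satisfied + tolerating_only/2) / total when the tolerating
    bucket is cumulative and already includes the satisfied one.
    """
    return (satisfied_cum + tolerating_cum) / 2 / total

# 700 requests under 0.1s, 900 under 0.5s (cumulative), 1000 total
print(apdex(700, 900, 1000))  # → 0.8
```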

# Multi-window, multi-burn-rate SLO check
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
  > 0.001 * 14.4
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.001 * 14.4
)

Binary Operators and Vector Matching

# Arithmetic operators (+, -, *, /, %, ^)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Comparison operators (==, !=, >, <, >=, <=) - filter to matching values
http_request_duration_seconds > 1

# Logical operators (and, or, unless)
up{job="api"} and rate(http_requests_total[5m]) > 100

# One-to-one matching (the default)
method:http_requests:rate5m / method:http_requests:total

# Many-to-one matching with group_left
sum(rate(http_requests_total[5m])) by (instance, method)
  / on(instance) group_left
sum(rate(http_requests_total[5m])) by (instance)

# One-to-many matching with group_right
sum(rate(http_requests_total[5m])) by (instance)
  / on(instance) group_right
sum(rate(http_requests_total[5m])) by (instance, method)

Time Functions and Offsets

# Compare with the previous period
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

# Day-over-day comparison
http_requests_total - http_requests_total offset 1d

# Time-based filtering
http_requests_total and hour() >= 9 and hour() < 17  # business hours
day_of_week() == 0 or day_of_week() == 6  # weekends

# Timestamp functions
time() - process_start_time_seconds  # uptime in seconds

Service Discovery

Prometheus supports multiple service discovery mechanisms for dynamic environments where targets come and go.

Static Configuration

scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets:
          - "host1:9100"
          - "host2:9100"
        labels:
          env: production
          region: us-east-1

File-based Service Discovery

scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s

# targets/webservers.json
[
  {
    "targets": ["web1:8080", "web2:8080"],
    "labels": {
      "job": "web",
      "env": "prod"
    }
  }
]

Kubernetes Service Discovery

scrape_configs:
  # Pod-based discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Extract a custom scrape path from the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # Extract a custom port from the annotation
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

      # Add standard Kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

  # Service-based discovery
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # Node-based discovery (for node exporter)
  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Endpoints discovery (for service endpoints)
  - job_name: "kubernetes-endpoints"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics

Consul Service Discovery

scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        datacenter: "dc1"
        services: ["web", "api", "cache"]
        tags: ["production"]
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_tags]
        target_label: tags

EC2 Service Discovery

scrape_configs:
  - job_name: "ec2-instances"
    ec2_sd_configs:
      - region: us-east-1
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]
          - name: instance-state-name
            values: [running]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type

DNS Service Discovery

scrape_configs:
  - job_name: "dns-srv-records"
    dns_sd_configs:
      - names:
          - "_prometheus._tcp.example.com"
        type: "SRV"
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance

Relabeling Action Reference

| Action | Description | Use case |
| --- | --- | --- |
| keep | Keep targets whose source labels match the regex | Filter targets by annotation/label |
| drop | Drop targets whose source labels match the regex | Exclude specific targets |
| replace | Set the target label from the source label values | Extract custom labels/paths/ports |
| labelmap | Map source label names to target labels via regex | Copy all Kubernetes labels |
| labeldrop | Drop labels matching the regex | Remove internal metadata labels |
| labelkeep | Keep only labels matching the regex | Reduce cardinality |
| hashmod | Set the target label to the hash of source labels mod N | Sharding/routing |

High Availability and Scalability

Prometheus HA Setup

# Deploy multiple identical Prometheus instances scraping the same targets
# Use external labels to distinguish replicas
global:
  external_labels:
    replica: prometheus-1 # change to prometheus-2, etc.
    cluster: production

# Alertmanager deduplicates alerts coming from multiple Prometheus instances
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093

Alertmanager Clustering

# alertmanager.yml - HA cluster configuration
global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "cluster"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
  - name: "default"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK"
        channel: "#alerts"

# Start the Alertmanager cluster members
# alertmanager-1: --cluster.peer=alertmanager-2:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-2: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-3: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-2:9094

Federation for Hierarchical Monitoring

# A global Prometheus federates from regional instances
scrape_configs:
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        # Pull only aggregated metrics
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}' # recording rules
        - "up"
    static_configs:
      - targets:
          - "prometheus-us-east-1:9090"
          - "prometheus-us-west-2:9090"
          - "prometheus-eu-west-1:9090"
        labels:
          region: "us-east-1"

Remote Storage for Long-term Retention

# Prometheus remote write to Thanos/Cortex/Mimir
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 100ms
    write_relabel_configs:
      # Drop high-cardinality metrics before remote write
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop

# Prometheus remote read from long-term storage
remote_read:
  - url: "http://thanos-query:9090/api/v1/read"
    read_recent: true

Thanos Architecture for a Global View

# Thanos Sidecar - runs alongside Prometheus
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/prometheus \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Store - queries object storage
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Query - global query interface
thanos query \
  --http-address=0.0.0.0:9090 \
  --grpc-address=0.0.0.0:10901 \
  --store=prometheus-1-sidecar:10901 \
  --store=prometheus-2-sidecar:10901 \
  --store=thanos-store:10901

# Thanos Compactor - downsamples and compacts blocks
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d

Horizontal Sharding with Hashmod

# Split scrape targets across multiple Prometheus instances with hashmod
scrape_configs:
  - job_name: "kubernetes-pods-shard-0"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash the pod name and keep only shard 0 (mod 3)
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

  - job_name: "kubernetes-pods-shard-1"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "1"
        action: keep

  # shard-2 follows the same pattern...
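The sharding scheme above can be illustrated in miniature: every scraper computes the same hash of each pod name and keeps only the targets in its own shard. Prometheus's hashmod action derives the shard from an MD5 hash of the joined source label values; this sketch uses a simpler MD5 reduction, so treat it as an illustration of the idea rather than a bit-for-bit reimplementation of the relabel action:

```python
import hashlib

def shard_of(pod_name, modulus=3):
    """Assign a target to a shard by hashing its pod name,
    in the spirit of relabel action: hashmod."""
    digest = hashlib.md5(pod_name.encode()).hexdigest()
    return int(digest, 16) % modulus

pods = ["api-7d9f-abc12", "api-7d9f-def34", "worker-5c8-xyz99"]
for pod in pods:
    print(pod, "-> shard", shard_of(pod))
# Because the assignment is deterministic, each pod is scraped by
# exactly one of the three Prometheus instances.
```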

Kubernetes Integration

ServiceMonitor for the Prometheus Operator

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: myapp
    release: prometheus
spec:
  # Select the services to monitor
  selector:
    matchLabels:
      app: myapp

  # Namespaces to search
  namespaceSelector:
    matchNames:
      - production
      - staging

  # Endpoint configuration
  endpoints:
    - port: metrics # the service port name
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace

      # Metric relabeling (filter/modify metrics)
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop # drop Go runtime metrics
        - sourceLabels: [status]
          regex: "[45].."
          targetLabel: error
          replacement: "true"

  # Optional: TLS configuration
  # tlsConfig:
  #   insecureSkipVerify: true
  #   ca:
  #     secret:
  #       name: prometheus-tls
  #       key: ca.crt

PodMonitor for Direct Pod Scraping

# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pods
  namespace: monitoring
  labels:
    release: prometheus
spec:
  # Select the pods to monitor
  selector:
    matchLabels:
      app: myapp

  # Namespace selection
  namespaceSelector:
    matchNames:
      - production

  # Pod metrics endpoints
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s

      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node

PrometheusRule for Alerting and Recording Rules

# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
  labels:
    release: prometheus
    role: alert-rules
spec:
  groups:
    - name: app_alerts
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
              /
              sum(rate(http_requests_total{app="myapp"}[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
            dashboard: "https://grafana.example.com/d/app-overview"
            runbook: "https://wiki.example.com/runbooks/high-error-rate"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} is restarting at {{ $value }} restarts/sec over the last 15m"

    - name: app_recording_rules
      interval: 30s
      rules:
        - record: app:http_requests:rate5m
          expr: sum(rate(http_requests_total{app="myapp"}[5m])) by (namespace, pod, method, status)

        - record: app:http_request_duration_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le, namespace, pod)
            )

Prometheus Custom Resource

# prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  version: v2.45.0

  # Service account for Kubernetes API access
  serviceAccountName: prometheus

  # Select ServiceMonitors
  serviceMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PodMonitors
  podMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PrometheusRules
  ruleSelector:
    matchLabels:
      release: prometheus
      role: alert-rules

  # Resource limits
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m

  # Storage
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd

  # Retention
  retention: 30d
  retentionSize: 45GB

  # Alertmanager configuration
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager
        port: web

  # External labels
  externalLabels:
    cluster: production
    region: us-east-1

  # Security context
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000

  # Admin API for management operations (disabled here)
  enableAdminAPI: false

  # Additional scrape configs (loaded from a Secret)
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml

Application Instrumentation Examples

Go Application

// main.go
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Total request counter
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Request duration histogram
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "endpoint"},
    )

    // Active connections gauge
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )

    // Response size summary
    responseSizeBytes = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "http_response_size_bytes",
            Help:       "HTTP response size in bytes",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"endpoint"},
    )
)

// Middleware to instrument HTTP handlers
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeConnections.Inc()
        defer activeConnections.Dec()

        // Wrap the response writer to capture the status code
        wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}

        handler(wrapped, r)

        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, endpoint,
            strconv.Itoa(wrapped.statusCode)).Inc()
    }
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(`{"users": []}`))
}

func main() {
    // Register handlers
    http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
    http.Handle("/metrics", promhttp.Handler())

    // Start the server
    http.ListenAndServe(":8080", nil)
}

Python Application (Flask)

# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

active_requests = Gauge(
    'active_requests',
    'Number of active requests'
)

# Instrumentation middleware
@app.before_request
def before_request():
    active_requests.inc()
    request.start_time = time.time()

@app.after_request
def after_request(response):
    active_requests.dec()

    duration = time.time() - request.start_time
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)

    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()

    return response

@app.route('/metrics')
def metrics():
    return generate_latest()

@app.route('/api/users')
def users():
    return {'users': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Production Deployment Checklist

  • [ ] Set an appropriate retention period (balance storage against history needs)
  • [ ] Configure persistent storage with sufficient capacity
  • [ ] Enable high availability (multiple Prometheus replicas or federation)
  • [ ] Set up remote storage for long-term retention (Thanos, Cortex, Mimir)
  • [ ] Configure service discovery for dynamic environments
  • [ ] Implement recording rules for frequently used queries
  • [ ] Create symptom-based alerts with proper annotations
  • [ ] Set up Alertmanager with appropriate routing and receivers
  • [ ] Configure inhibition rules to reduce alert noise
  • [ ] Add runbook URLs to all critical alerts
  • [ ] Practice label hygiene (avoid high cardinality)
  • [ ] Monitor Prometheus itself (meta-monitoring)
  • [ ] Set up authentication and authorization
  • [ ] Enable TLS for scrape targets and remote storage
  • [ ] Configure rate limiting for queries
  • [ ] Validate alerting and recording rules (promtool check rules)
  • [ ] Implement backup and disaster recovery procedures
  • [ ] Document metric naming conventions for the team
  • [ ] Build Grafana dashboards for common queries
  • [ ] Set up log aggregation alongside metrics (Loki)

Troubleshooting Commands

# Check Prometheus configuration syntax
promtool check config prometheus.yml

# Check rule file syntax
promtool check rules alerts/*.yml

# Test a PromQL query
promtool query instant http://localhost:9090 'up'

# Check which targets are up
curl http://localhost:9090/api/v1/targets

# Query current metric values
curl 'http://localhost:9090/api/v1/query?query=up'

# Inspect metric metadata from discovered targets
curl http://localhost:9090/api/v1/targets/metadata

# View TSDB statistics
curl http://localhost:9090/api/v1/status/tsdb

# Check runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo

Quick Reference

Common PromQL Patterns

# Requests per second
rate(http_requests_total[5m])

# Error ratio as a percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P95 latency from a histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency from a histogram
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

# Memory usage percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU usage (non-idle)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Disk space remaining, as a percentage
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes

# Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))

# Service uptime in days
(time() - process_start_time_seconds) / 86400

# Request rate growth compared with 1 hour ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

Alert Rule Patterns

# High error rate (symptom)
alert: HighErrorRate
expr: |
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
  severity: critical
annotations:
  summary: "Error rate is {{ $value | humanizePercentage }}"
  runbook: "https://runbooks.example.com/high-error-rate"

# High P95 latency
alert: HighLatency
expr: |
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
  ) > 1
for: 5m
labels:
  severity: warning

# Service down
alert: ServiceDown
expr: up{job="critical-service"} == 0
for: 2m
labels:
  severity: critical

# Low disk space (cause, warning only)
alert: DiskSpaceLow
expr: |
  node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} < 0.1
for: 10m
labels:
  severity: warning

# Pod crash looping
alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
  severity: warning

Recording Rule Naming Convention

# Format: level:metric:operations
# level = aggregation level (job, instance, cluster)
# metric = base metric name
# operations = transformations applied (rate5m, sum, ratio)

groups:
  - name: aggregation_rules
    rules:
      # Instance-level aggregation
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      # Job-level aggregation
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Job-level error ratio
      - record: job:http_request_errors:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)

      # Cluster-level aggregation
      - record: cluster:cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)

Metric Naming Best Practices

| Pattern | Good example | Bad example |
| --- | --- | --- |
| Counter suffix | http_requests_total | http_requests |
| Base units | http_request_duration_seconds | http_request_duration_ms |
| Ratio range | cache_hit_ratio (0.0-1.0) | cache_hit_percentage (0-100) |
| Byte units | response_size_bytes | response_size_kb |
| Namespace prefix | myapp_http_requests_total | http_requests_total |
| Label naming | {method="GET", status="200"} | {httpMethod="GET", statusCode="200"} |

Label Cardinality Guidelines

| Cardinality | Examples | Recommendation |
| --- | --- | --- |
| Low (<10) | HTTP methods, status codes, environments | Safe for any label |
| Medium (10-100) | API endpoints, service names, pod names | Safe when aggregated |
| High (100-1,000) | Container IDs, hostnames | Use only when necessary |
| Unbounded | User IDs, IP addresses, timestamps, URL paths | Never use as labels |

Kubernetes Annotation-based Scraping

# Pod annotations for automatic Prometheus scraping
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
    prometheus.io/scheme: "http"
spec:
  containers:
    - name: app
      image: myapp:latest
      ports:
        - containerPort: 8080
          name: metrics

Alertmanager Routing Patterns

route:
  receiver: default
  group_by: ["alertname", "cluster"]
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true # also send to the default receiver

    # Team-based routing
    - match:
        team: database
      receiver: dba-team
      group_by: ["alertname", "instance"]

    # Environment-based routing
    - match:
        env: development
      receiver: slack-dev
      repeat_interval: 4h

    # Time-based routing (business hours only)
    - match:
        severity: warning
      receiver: email
      active_time_intervals:
        - business-hours

time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: "09:00"
            end_time: "17:00"
        weekdays: ["monday:friday"]

Additional Resources