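name: prometheus-configuration
description: Set up Prometheus for comprehensive metrics collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.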

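Prometheus Configuration

A complete guide to Prometheus setup, metrics collection, scrape configuration, and recording and alerting rules.

Purpose

Configure Prometheus for comprehensive metrics collection, alerting, and monitoring of infrastructure and applications.

When to Use

Setting up Prometheus monitoring
Configuring metrics scraping
Creating recording rules
Designing alerting rules
Implementing service discovery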

Prometheus Architecture

┌──────────────┐
│ Application  │ ← instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← scrapes metrics periodically
│    server    │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerting)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)

Installation

Install on Kubernetes with Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
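
The same settings can be kept in a values file instead of `--set` flags, which is easier to review and version. The following is a minimal sketch, assuming the `prometheus.prometheusSpec` layout of the kube-prometheus-stack chart; the file name `values.yaml` is hypothetical, and the storage class is left to the cluster default.

```yaml
# values.yaml (hypothetical file name)
prometheus:
  prometheusSpec:
    # Keep 30 days of metrics
    retention: 30d
    # Persistent volume claim for the Prometheus data directory
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
```

The file would be passed to the install command with `-f values.yaml` in place of the two `--set` flags.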

Install with Docker Compose

version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:
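
The `prometheus.yml` in this guide loads rule files from `/etc/prometheus/rules/` and sends alerts to `alertmanager:9093`, and the Compose file above provides neither. The sketch below is one way to close that gap. It assumes a local `./rules` directory and a local `./alertmanager.yml`, both hypothetical, and relies on the `prom/alertmanager` image reading its configuration from `/etc/alertmanager/alertmanager.yml` by default.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Assumed local directory holding recording and alerting rule files
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

  # Reachable from the prometheus service as alertmanager:9093
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      # Assumed local Alertmanager configuration file
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus-data:
```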

Configuration File

prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods that carry scrape annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key

Reference: see assets/prometheus.yml.template
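
The `kubernetes-pods` job only keeps pods that opt in through annotations. Prometheus exposes each pod annotation as a `__meta_kubernetes_pod_annotation_<name>` label, with unsupported characters replaced by underscores, so `prometheus.io/scrape` becomes `prometheus_io_scrape`. A minimal sketch of a pod that this job would pick up follows; the pod name, image, and port are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical name
  namespace: default
  annotations:
    # Matched by the "keep" rule (regex: true)
    prometheus.io/scrape: "true"
    # Copied into __metrics_path__
    prometheus.io/path: "/metrics"
    # Joined with the pod IP to rewrite __address__
    prometheus.io/port: "8080"
spec:
  containers:
    - name: app
      image: example/app:1.0   # hypothetical image
      ports:
        - containerPort: 8080
```

Annotation values must be strings, which is why `"true"` and `"8080"` are quoted.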

Scrape Configuration

Static Targets

scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"

File-Based Service Discovery

scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m

targets/production.json:

[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
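
Because the `files` globs above also match `*.yml`, target groups can be written in YAML as well as JSON. A minimal sketch of an equivalent YAML target file follows; the file name, hosts, and label values are hypothetical.

```yaml
# /etc/prometheus/targets/staging.yml (hypothetical file)
- targets:
    - "app3:9090"
    - "app4:9090"
  labels:
    env: "staging"
    service: "api"
```

Prometheus watches these files for changes and also re-reads them on each `refresh_interval`, so targets can be added or removed without a reload.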

Kubernetes Service Discovery

scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Reference: see references/scrape-configs.md
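
For the `kubernetes-services` job, the opt-in annotations live on the Service object. With `role: service`, Prometheus discovers one target per service port and uses the service's DNS name as the address. A minimal sketch of a matching Service follows; the service name, selector, and port are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-api              # hypothetical name
  annotations:
    # Matched by the "keep" rule
    prometheus.io/scrape: "true"
    # Copied into __scheme__ (http or https)
    prometheus.io/scheme: "http"
    # Copied into __metrics_path__
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: example-api
  ports:
    - name: http
      port: 8080
      targetPort: 8080
```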

Recording Rules

Create precomputed metrics for frequently queried expressions:

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate as a percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization as a percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization as a percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage as a percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

Reference: see references/recording-rules.md

Alerting Rules

# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
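
The `severity` labels set by these rules only matter once Alertmanager routes on them. The sketch below shows one possible `alertmanager.yml` that sends `critical` alerts to a separate receiver. It assumes Alertmanager 0.22 or later for the `matchers` syntax; the receiver names and webhook URLs are hypothetical.

```yaml
# alertmanager.yml (sketch)
route:
  # Fallback receiver, used here for warning-level alerts
  receiver: default
  group_by: ["alertname", "job"]
  routes:
    # Alerts labeled severity=critical go to the on-call receiver
    - matchers:
        - severity="critical"
      receiver: oncall

receivers:
  - name: default
    webhook_configs:
      - url: "http://alert-sink.example.com/warning"    # hypothetical endpoint
  - name: oncall
    webhook_configs:
      - url: "http://alert-sink.example.com/critical"   # hypothetical endpoint
```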

Validation

# Validate the configuration
promtool check config prometheus.yml

# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml

# Run a test query
promtool query instant http://localhost:9090 'up'

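Reference: see scripts/validate-prometheus.sh

Best Practices

Use consistent metric naming (prefix_name_unit)
Set appropriate scrape intervals (typically 15-60 seconds)
Use recording rules for expensive queries
Implement high availability (multiple Prometheus instances)
Configure retention according to storage capacity
Use relabeling to clean up metrics
Monitor Prometheus itself
Implement federation for large-scale deployments
Use Thanos/Cortex for long-term storage
Document custom metrics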
Troubleshooting

Check scrape targets:

curl http://localhost:9090/api/v1/targets

Check the loaded configuration:

curl http://localhost:9090/api/v1/status/config

Run a test query:

curl 'http://localhost:9090/api/v1/query?query=up'

Reference Files

assets/prometheus.yml.template - Complete configuration template
references/scrape-configs.md - Scrape configuration patterns
references/recording-rules.md - Recording rule examples
scripts/validate-prometheus.sh - Validation script
