Instrumentation Planning Skill: instrumentation-planning

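This skill plans an application's instrumentation (monitoring) strategy, covering what to instrument, naming conventions, cardinality management, and budgets, to improve observability and operational efficiency. It applies to DevOps and the software development lifecycle. Keywords: instrumentation, monitoring, naming conventions, cardinality management, DevOps, observability, performance budget.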

DevOps · Updated 3/11/2026

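name: instrumentation-planning
description: Plan an instrumentation strategy before implementation, covering what to instrument, naming conventions, cardinality management, and instrumentation budgets
allowed-tools: Read, Glob, Grep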

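Instrumentation Planning

Strategic planning of application instrumentation before implementation.

When to use this skill

  • Planning instrumentation for a new service
  • Reviewing an instrumentation strategy
  • Establishing naming conventions
  • Managing telemetry cardinality
  • Setting instrumentation budgets

Instrumentation strategy framework

What to instrument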

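Instrumentation tiers:

┌─────────────────────────────────────────────────────────────────┐
│  Tier 1: Automatic / library instrumentation                    │
│  - HTTP clients/servers (captured automatically)                │
│  - Database clients (captured automatically)                    │
│  - Message queue clients (captured automatically)               │
│  - Framework-provided metrics                                   │
│  Effort: low | Coverage: broad | Customization: limited         │
├─────────────────────────────────────────────────────────────────┤
│  Tier 2: Business transaction instrumentation                   │
│  - Critical user journeys                                       │
│  - Business operations (checkout, sign-up, etc.)                │
│  - Revenue-generating flows                                     │
│  - SLA-bound operations                                         │
│  Effort: medium | Coverage: targeted | Value: high              │
├─────────────────────────────────────────────────────────────────┤
│  Tier 3: Debugging / diagnostic instrumentation                 │
│  - Algorithmic hot paths                                        │
│  - Cache behavior                                               │
│  - Circuit breaker state                                        │
│  - Retry/fallback paths                                         │
│  Effort: medium | Coverage: deep | Use: troubleshooting         │
├─────────────────────────────────────────────────────────────────┤
│  Tier 4: Business metrics                                       │
│  - Domain-specific counters                                     │
│  - Conversion rates                                             │
│  - Feature usage                                                │
│  - Customer behavior                                            │
│  Effort: high | Coverage: custom | Value: business insight      │
└─────────────────────────────────────────────────────────────────┘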

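Instrumentation decision matrix

instrumentation_decisions:
  always_instrument:
    - "Inbound HTTP/gRPC requests"
    - "Outbound HTTP/gRPC calls"
    - "Database queries"
    - "Message publish/consume"
    - "Authentication/authorization"
    - "External API calls"
    - "Cache operations"

  consider_instrumenting:
    - "Complex business logic"
    - "Feature flag evaluations"
    - "Background jobs"
    - "Scheduled tasks"
    - "File I/O operations"
    - "CPU-intensive operations"

  avoid_instrumenting:
    - "Every method call (too noisy)"
    - "Tight loops (performance impact)"
    - "Data transformations (low value)"
    - "Validation helpers"
    - "Utility functions"

  decision_criteria:
    business_value:
      weight: 0.3
      question: "Does this help explain business outcomes?"

    debugging_value:
      weight: 0.25
      question: "Does this help diagnose production issues?"

    slo_relevance:
      weight: 0.25
      question: "Does this contribute to SLI measurement?"

    cost_impact:
      weight: 0.2
      question: "Is the cardinality/volume acceptable?"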
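
The four weights above sum to 1.0, so they can be combined into a single score per candidate. The sketch below is one possible way to do that; the 0 to 1 rating scale, the function name, and the sample ratings are assumptions for illustration, not part of the skill.

```python
# Weights copied from decision_criteria above.
WEIGHTS = {
    "business_value": 0.30,
    "debugging_value": 0.25,
    "slo_relevance": 0.25,
    "cost_impact": 0.20,
}


def instrumentation_score(ratings: dict[str, float]) -> float:
    """Weighted sum of per-criterion ratings, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


# Hypothetical ratings for a checkout span (1.0 = a clear "yes").
checkout = {
    "business_value": 1.0,
    "debugging_value": 0.5,
    "slo_relevance": 1.0,
    "cost_impact": 0.5,
}
score = instrumentation_score(checkout)
```

A team would still need to agree on a cut-off (for example, instrument anything scoring above some threshold); the source does not prescribe one.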

Naming conventions

Metric naming

metric_naming:
  format: "[namespace]_[subsystem]_[name]_[unit]"

  rules:
    case: "snake_case"
    unit_suffix: "Always include a unit or type suffix (_seconds, _bytes, _total)"
    base_units: "Use base units (seconds, not milliseconds)"
    counter_suffix: "Counters use the _total suffix"

  examples:
    good:
      - "http_server_requests_total"
      - "http_server_request_duration_seconds"
      - "http_server_response_size_bytes"
      - "db_connections_current"
      - "order_processing_duration_seconds"
      - "payment_transactions_total"

    bad:
      - "requests (no unit, no namespace)"
      - "HttpRequestDuration (wrong case; use snake_case)"
      - "order_latency_ms (not a base unit; use seconds)"
      - "totalOrders (camelCase, no unit)"

  label_naming:
    case: "snake_case"
    avoid:
      - "Label values embedded in the metric name (e.g. path=/users)"
      - "High-cardinality labels"
    good_labels:
      - "method, status_code, path"
      - "service, version, environment"
    bad_labels:
      - "user_id (high cardinality)"
      - "request_id (high cardinality)"
      - "timestamp (not a dimension)"
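
The metric naming rules above are mechanical enough to lint. Below is a minimal sketch of such a check; the function name and the list of accepted suffixes (taken from the good examples) are assumptions, and a real project would extend the suffix list with its own units.

```python
import re

# Lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Suffixes seen in the "good" examples; extend as needed.
ALLOWED_SUFFIXES = ("_total", "_seconds", "_bytes", "_current")


def check_metric_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case")
    if not name.endswith(ALLOWED_SUFFIXES):
        problems.append("missing unit or type suffix")
    return problems


good = check_metric_name("http_server_request_duration_seconds")
bad_unit = check_metric_name("order_latency_ms")
bad_case = check_metric_name("totalOrders")
```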

Span naming

span_naming:
  format: "[operation] [resource]"

  rules:
    - "Use a verb + noun pattern"
    - "Keep names low-cardinality"
    - "Include the operation type, not specific values"
    - "Stay consistent across services"

  examples:
    http:
      pattern: "HTTP {METHOD} {route_template}"
      good: "HTTP GET /users/{id}"
      bad: "HTTP GET /users/12345"

    database:
      pattern: "{operation} {table}"
      good: "SELECT orders"
      bad: "SELECT * FROM orders WHERE id=123"

    messaging:
      pattern: "{operation} {queue/topic}"
      good: "PUBLISH order-events"
      bad: "Publish a message to the order-events queue"

    rpc:
      pattern: "{service}/{method}"
      good: "OrderService/CreateOrder"
      bad: "gRPC call to the order service"

  attributes:
    required:
      - "service.name"
      - "service.version"
      - "deployment.environment"

    recommended:
      http:
        - "http.method"
        - "http.route"
        - "http.status_code"
        - "http.target"

      database:
        - "db.system"
        - "db.name"
        - "db.operation"
        - "db.statement (sanitized)"

      messaging:
        - "messaging.system"
        - "messaging.destination"
        - "messaging.operation"

Log field naming

log_naming:
  format: "snake_case for all fields"

  standard_fields:
    timestamp: "ISO 8601 format"
    level: "INFO, WARN, ERROR, etc."
    message: "Human-readable description"
    service: "Service name"
    trace_id: "Correlation ID"
    span_id: "Current span"

  domain_fields:
    pattern: "{domain}_{field}"
    examples:
      - "order_id"
      - "customer_id"
      - "payment_amount"
      - "product_sku"

  avoid:
    - "Nested objects (flatten them for indexing)"
    - "Arrays of unknown length"
    - "Large text blobs"
    - "Sensitive data (PII, secrets)"
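
Flattening nested objects naturally yields the {domain}_{field} pattern. The helper below is a small sketch of that idea; the function name and the sample record are hypothetical, and it does not handle arrays or redact sensitive values.

```python
def flatten_fields(record: dict, sep: str = "_", prefix: str = "") -> dict:
    """Flatten nested dicts into single-level snake_case keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the parent key as the prefix.
            flat.update(flatten_fields(value, sep, name))
        else:
            flat[name] = value
    return flat


# Hypothetical log record with nested domain data.
nested = {"level": "INFO", "order": {"id": "A1", "payment": {"amount": 12.5}}}
flat = flatten_fields(nested)
```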

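Cardinality management

Understanding cardinality

Cardinality = number of unique time series

Example:
http_requests_total{method="GET", path="/api/users", status="200"}

Cardinality = methods × paths × statuses
            = 5 × 100 × 10
            = 5,000 time series

Adding user_id (1M users):
            = 5 × 100 × 10 × 1,000,000
            = 5,000,000,000 time series ← explosion!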
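
The same arithmetic as a quick sketch in code. The product is a worst-case upper bound, since it assumes every label combination actually occurs; the function name is illustrative.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_value_counts.values())


base = {"method": 5, "path": 100, "status": 10}
base_series = max_series(base)

# Adding a per-user label multiplies the bound by the number of users.
with_user_id = max_series({**base, "user_id": 1_000_000})
```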

Cardinality budget

cardinality_budget:
  planning:
    total_budget: 100000  # target maximum time series per service
    allocation:
      automatic_instrumentation: 30%  # 30,000
      business_transactions: 40%      # 40,000
      custom_metrics: 20%             # 20,000
      buffer: 10%                     # 10,000

  per_metric_limits:
    low_cardinality:
      max_series: 100
      example: "Status codes, methods"

    medium_cardinality:
      max_series: 1000
      example: "Endpoints, operations"

    high_cardinality:
      max_series: 10000
      example: "Hourly aggregates"
      requires: "Justification and approval"

  monitoring:
    - "Alert when cardinality grows by more than 10% per day"
    - "Weekly cardinality review"
    - "Automatic label-value limits"
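
A budget like this can be checked automatically. The sketch below derives per-category limits from the percentages and flags categories that exceed them, plus a daily-growth check matching the 10% alert rule. The function names and the sample usage figures are assumptions.

```python
TOTAL_BUDGET = 100_000
# Shares in whole percent, as in the allocation above.
ALLOCATION_PERCENT = {
    "automatic_instrumentation": 30,
    "business_transactions": 40,
    "custom_metrics": 20,
    "buffer": 10,
}


def category_limits(total: int, percents: dict[str, int]) -> dict[str, int]:
    """Convert percentage shares into absolute series limits."""
    return {name: total * pct // 100 for name, pct in percents.items()}


def over_budget(usage: dict[str, int], limits: dict[str, int]) -> list[str]:
    """Names of categories whose series count exceeds their limit."""
    return sorted(name for name, used in usage.items() if used > limits[name])


def growth_alert(yesterday: int, today: int, threshold: float = 0.10) -> bool:
    """True when day-over-day growth exceeds the threshold."""
    return yesterday > 0 and (today - yesterday) / yesterday > threshold


limits = category_limits(TOTAL_BUDGET, ALLOCATION_PERCENT)
# Hypothetical current usage per category.
violations = over_budget(
    {"automatic_instrumentation": 28_000, "custom_metrics": 23_500}, limits
)
```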

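Cardinality reduction techniques

cardinality_reduction:
  bucketing:
    before: "path=/users/12345"
    after: "path=/users/{id}"
    technique: "Path template extraction"

  sampling:
    description: "Sample high-volume, low-value traces"
    strategies:
      head_sampling: "Decide at the start of the trace"
      tail_sampling: "Decide after seeing the complete trace"
      adaptive: "Adjust the rate based on volume"

  aggregation:
    description: "Pre-aggregate before export"
    example: "Count by status rather than per request"

  value_limiting:
    description: "Limit the number of unique values per label"
    example: "At most 100 unique paths, then use 'other'"

  dropping:
    description: "Drop low-value dimensions"
    candidates:
      - "Instance ID (use service name)"
      - "Request ID (not for metrics)"
      - "Full URL (use route template)"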
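
Two of these techniques, path templating and value limiting, can be sketched in a few lines. This is an illustration under simplifying assumptions: it only templates purely numeric path segments (in practice the route template usually comes from the web framework's router), and the class and function names are hypothetical.

```python
import re

# A path segment made only of digits, e.g. "/12345".
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def template_path(path: str) -> str:
    """Replace numeric path segments with an {id} placeholder."""
    return _NUMERIC_SEGMENT.sub("/{id}", path)


class LabelValueLimiter:
    """Pass through the first max_values distinct values, then return overflow."""

    def __init__(self, max_values: int, overflow: str = "other") -> None:
        self.max_values = max_values
        self.overflow = overflow
        self._seen: set[str] = set()

    def limit(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return self.overflow


limiter = LabelValueLimiter(max_values=2)
labels = [limiter.limit(p) for p in ["/a", "/b", "/c", "/a"]]
```

Values admitted before the cap was reached keep their own label, which is why "/a" still passes through on its second appearance.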

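Instrumentation budget

Performance impact

performance_budget:
  cpu_overhead:
    target: "< 1% CPU increase"
    measurement: "Profile with and without instrumentation"

  memory_overhead:
    target: "< 50 MB additional heap"
    components:
      - "Metric registry"
      - "Span buffers"
      - "Log buffers"

  latency_overhead:
    target: "< 1 ms per request"
    hot_paths: "< 100 μs"

  data_volume:
    metrics:
      target: "< 1 GB/day per service"
      calculation: "series × scrapes per day (86,400 s ÷ scrape interval) × 8 bytes"

    traces:
      target: "< 10 GB/day per service (with sampling)"
      sampling_rate: "1-10% for high-volume services"

    logs:
      target: "< 5 GB/day per service"
      strategies: "Sampling, level gating"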
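
Daily metric volume depends on how many samples each series produces per day, that is, 86,400 seconds divided by the scrape interval. A rough sketch of that estimate follows; the 8 bytes per sample is the uncompressed figure used in the budget, real storage engines typically compress well below it, and the function name is an assumption.

```python
SECONDS_PER_DAY = 86_400


def metric_bytes_per_day(
    series: int, scrape_interval_s: int, bytes_per_sample: int = 8
) -> int:
    """Uncompressed bytes per day: series x scrapes per day x bytes per sample."""
    scrapes_per_day = SECONDS_PER_DAY // scrape_interval_s
    return series * scrapes_per_day * bytes_per_sample


# 10,000 series scraped every 15 s versus every 60 s.
at_15s = metric_bytes_per_day(10_000, 15)
at_60s = metric_bytes_per_day(10_000, 60)
```

Under these assumptions, quadrupling the scrape interval cuts the raw volume to a quarter.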

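Cost planning

cost_planning:
  estimation_formula:
    metrics:
      monthly_cost: "time series × $0.003 (typical cloud pricing; varies by vendor)"
      example: "10,000 series × $0.003 = $30/month"

    traces:
      monthly_cost: "spans per month × $0.000005"
      example: "100M spans × $0.000005 = $500/month"

    logs:
      monthly_cost: "GB per month × $0.50"
      example: "500 GB × $0.50 = $250/month"

  optimization_strategies:
    - "Increase the scrape interval (15s → 60s)"
    - "Reduce the trace sampling rate"
    - "Gate log levels in production"
    - "Shorten retention for debug data"
    - "Downsample old metrics"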
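
The three formulas combine into a simple monthly estimator. The rates below are the illustrative ones from the formulas above, not quotes from any vendor; the function name is an assumption.

```python
# Illustrative unit prices from the estimation formulas above.
SERIES_RATE = 0.003      # dollars per time series per month
SPAN_RATE = 0.000005     # dollars per span
LOG_GB_RATE = 0.50       # dollars per GB of logs


def monthly_cost(series: int, spans: int, log_gb: float) -> dict[str, float]:
    """Estimated monthly cost per signal and in total, rounded to cents."""
    cost = {
        "metrics": series * SERIES_RATE,
        "traces": spans * SPAN_RATE,
        "logs": log_gb * LOG_GB_RATE,
    }
    cost["total"] = sum(cost.values())
    return {name: round(value, 2) for name, value in cost.items()}


# The three worked examples from the formulas, combined.
estimate = monthly_cost(series=10_000, spans=100_000_000, log_gb=500)
```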

Instrumentation plan template

instrumentation_plan:
  service: "{service name}"
  version: "1.0"
  date: "{date}"
  owner: "{team}"

  objectives:
    - "Track SLIs for order processing"
    - "Enable distributed tracing for debugging"
    - "Monitor payment success rate"

  automatic_instrumentation:
    framework: "OpenTelemetry .NET"
    enabled:
      - "ASP.NET Core (HTTP server)"
      - "HttpClient (HTTP client)"
      - "Entity Framework Core (database)"
      - "Azure.Messaging.ServiceBus"
    configuration:
      sampling_rate: 0.1  # 10% of traces
      batch_export_interval: 5000  # milliseconds

  custom_spans:
    - name: "ProcessOrder"
      purpose: "Track order processing duration"
      attributes:
        - "order.id"
        - "order.item_count"
        - "order.total_amount"
      events:
        - "inventory.reserved"
        - "payment.processed"

    - name: "ValidatePayment"
      purpose: "Track payment validation steps"
      attributes:
        - "payment.method"
        - "payment.provider"
      sensitive: false

  custom_metrics:
    counters:
      - name: "orders_total"
        labels: ["status", "payment_method"]
        purpose: "Count orders by outcome"
        cardinality_estimate: 20

      - name: "payment_failures_total"
        labels: ["reason", "provider"]
        purpose: "Track payment failure reasons"
        cardinality_estimate: 50

    histograms:
      - name: "order_processing_duration_seconds"
        labels: ["order_type"]
        purpose: "Track order processing latency"
        buckets: [0.1, 0.5, 1, 2, 5, 10]
        cardinality_estimate: 10

    gauges:
      - name: "pending_orders_current"
        labels: []
        purpose: "Number of currently pending orders"
        cardinality_estimate: 1

  cardinality_summary:
    estimated_total: 81
    budget: 1000
    status: "Within budget"

  log_strategy:
    production_level: "INFO"
    structured_fields:
      standard:
        - "trace_id"
        - "span_id"
        - "service"
        - "environment"
      domain:
        - "order_id"
        - "customer_id (hashed)"
    sampling:
      debug_logs: "1% in production"

  cost_estimate:
    monthly:
      metrics: "$30"
      traces: "$200"
      logs: "$150"
      total: "$380"

  review_schedule:
    frequency: "Quarterly"
    metrics_to_review:
      - "Cardinality growth"
      - "Data volume"
      - "Cost versus budget"
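
The cardinality summary in the template is the sum of the per-metric estimates (20 + 50 + 10 + 1 = 81). The sketch below reproduces that total and also shows a more conservative count, under the assumption that the backend uses Prometheus-style classic histograms, where each label combination exports one series per bucket plus +Inf, _sum and _count. Function names are illustrative.

```python
# cardinality_estimate values copied from custom_metrics in the template.
PLAN = {
    "counters": {"orders_total": 20, "payment_failures_total": 50},
    "histograms": {"order_processing_duration_seconds": 10},
    "gauges": {"pending_orders_current": 1},
}
BUDGET = 1000
HISTOGRAM_BUCKETS = 6  # len([0.1, 0.5, 1, 2, 5, 10])


def label_combination_total(plan: dict[str, dict[str, int]]) -> int:
    """Sum of per-metric estimates, as in cardinality_summary."""
    return sum(sum(metrics.values()) for metrics in plan.values())


def exported_series_total(plan: dict[str, dict[str, int]], buckets: int) -> int:
    """Assumption: classic histograms export buckets + 3 series per combination."""
    per_histogram = buckets + 3  # explicit buckets, +Inf, _sum, _count
    total = 0
    for kind, metrics in plan.items():
        factor = per_histogram if kind == "histograms" else 1
        total += factor * sum(metrics.values())
    return total


simple_total = label_combination_total(PLAN)
expanded_total = exported_series_total(PLAN, HISTOGRAM_BUCKETS)
within_budget = expanded_total <= BUDGET
```

Either way of counting stays well inside the 1,000-series budget for this example.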

Related skills

  • observability-patterns - overview of the three pillars
  • distributed-tracing - tracing implementation details
  • slo-sli-error-budget - what to measure for SLOs

Last updated: 2025-12-26