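name: instrumentation-planning
description: Plan an instrumentation strategy before implementation, covering what to instrument, naming conventions, cardinality management, and instrumentation budgets
allowed-tools: Read, Glob, Grep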
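Instrumentation Planning
Strategic planning of application instrumentation before implementation.
When to Use This Skill
- Planning instrumentation for a new service
- Reviewing an instrumentation strategy
- Establishing naming conventions
- Managing telemetry cardinality
- Setting instrumentation budgets
Instrumentation Strategy Framework
What to Instrument
Instrumentation tiers: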
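┌─────────────────────────────────────────────────────────────────┐
│ Tier 1: Automatic/library instrumentation                       │
│   - HTTP clients/servers (captured automatically)               │
│   - Database clients (captured automatically)                   │
│   - Message queue clients (captured automatically)              │
│   - Framework-provided metrics                                  │
│ Effort: low | Coverage: broad | Customization: limited          │
├─────────────────────────────────────────────────────────────────┤
│ Tier 2: Business transaction instrumentation                    │
│   - Critical user journeys                                      │
│   - Business operations (checkout, signup, etc.)                │
│   - Revenue-generating flows                                    │
│   - SLA-bound operations                                        │
│ Effort: medium | Coverage: targeted | Value: high               │
├─────────────────────────────────────────────────────────────────┤
│ Tier 3: Debug/diagnostic instrumentation                        │
│   - Algorithmic hot paths                                       │
│   - Cache behavior                                              │
│   - Circuit breaker state                                       │
│   - Retry/fallback paths                                        │
│ Effort: medium | Coverage: deep | Use: troubleshooting          │
├─────────────────────────────────────────────────────────────────┤
│ Tier 4: Business metrics                                        │
│   - Domain-specific counters                                    │
│   - Conversion rates                                            │
│   - Feature usage                                               │
│   - Customer behavior                                           │
│ Effort: high | Coverage: custom | Value: business insight       │
└─────────────────────────────────────────────────────────────────┘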
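Instrumentation Decision Matrix
instrumentation_decisions:
  always_instrument:
    - "Inbound HTTP/gRPC requests"
    - "Outbound HTTP/gRPC calls"
    - "Database queries"
    - "Message publish/consume"
    - "Authentication/authorization"
    - "External API calls"
    - "Cache operations"
  consider_instrumenting:
    - "Complex business logic"
    - "Feature flag evaluations"
    - "Background jobs"
    - "Scheduled tasks"
    - "File I/O operations"
    - "CPU-intensive operations"
  avoid_instrumenting:
    - "Every method call (too noisy)"
    - "Tight loops (performance impact)"
    - "Data transformations (low value)"
    - "Validation helpers"
    - "Utility functions"
  decision_criteria:
    business_value:
      weight: 0.3
      question: "Does this help us understand business outcomes?"
    debugging_value:
      weight: 0.25
      question: "Does this help diagnose production issues?"
    slo_relevance:
      weight: 0.25
      question: "Does this contribute to SLI measurement?"
    cost_impact:
      weight: 0.2
      question: "Is the cardinality/volume acceptable?"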
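The four weights in decision_criteria sum to 1.0, so they can be combined into a single score per candidate. The sketch below is one possible reading: the 0.0 to 1.0 answer scale, the function name `instrumentation_score`, and the sample `checkout` answers are all assumptions for illustration, not part of the matrix itself.

```python
# Weights copied from decision_criteria; the 0.0-1.0 answer scale is an assumption.
WEIGHTS = {
    "business_value": 0.30,
    "debugging_value": 0.25,
    "slo_relevance": 0.25,
    "cost_impact": 0.20,
}


def instrumentation_score(answers: dict[str, float]) -> float:
    """Return the weighted sum of per-criterion answers (each 0.0 to 1.0)."""
    missing = WEIGHTS.keys() - answers.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= answers[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return sum(WEIGHTS[name] * answers[name] for name in WEIGHTS)


# Hypothetical answers for a checkout operation.
checkout = {
    "business_value": 1.0,
    "debugging_value": 0.5,
    "slo_relevance": 1.0,
    "cost_impact": 1.0,
}
print(round(instrumentation_score(checkout), 3))
```

A team would still need to agree on a cut-off (for example, instrument anything above some threshold); the matrix does not define one.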
Naming Conventions
Metric Naming
metric_naming:
  format: "[namespace]_[subsystem]_[name]_[unit]"
  rules:
    case: "snake_case"
    unit_suffix: "Always include a unit suffix (_seconds, _bytes, _total)"
    base_units: "Use base units (seconds, not milliseconds)"
    counter_suffix: "Counters use the _total suffix"
  examples:
    good:
      - "http_server_requests_total"
      - "http_server_request_duration_seconds"
      - "http_server_response_size_bytes"
      - "db_connections_current"
      - "order_processing_duration_seconds"
      - "payment_transactions_total"
    bad:
      - "requests (no unit, no namespace)"
      - "HttpRequestDuration (wrong case)"
      - "order_latency_ms (milliseconds; use the base unit, seconds)"
      - "totalOrders (camelCase, no unit)"
  label_naming:
    case: "snake_case"
    avoid:
      - "Embedding values in names (path=/users)"
      - "High-cardinality labels"
    good_labels:
      - "method, status_code, path"
      - "service, version, environment"
    bad_labels:
      - "user_id (high cardinality)"
      - "request_id (high cardinality)"
      - "timestamp (not a dimension)"
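The metric naming rules lend themselves to an automated lint step. The following is a minimal sketch, not a complete validator: `check_metric_name` is a hypothetical helper, the accepted suffix list is inferred from the good examples (including `_current` for gauges), and the list of non-base units is an assumption that a real project would extend.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Suffixes inferred from the rules and good examples; extend for other units.
ALLOWED_SUFFIXES = ("_total", "_seconds", "_bytes", "_current")
# Assumed list of non-base units that the base_units rule would reject.
NON_BASE_UNITS = ("_ms", "_milliseconds", "_kb", "_mb")


def check_metric_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case")
    lowered = name.lower()
    if lowered.endswith(NON_BASE_UNITS):
        problems.append("non-base unit")
    elif not lowered.endswith(ALLOWED_SUFFIXES):
        problems.append("missing unit suffix")
    if "_" not in name:
        problems.append("missing namespace")
    return problems


for candidate in ("http_server_requests_total", "order_latency_ms", "totalOrders"):
    print(candidate, check_metric_name(candidate))
```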
Span Naming
span_naming:
  format: "[operation] [resource]"
  rules:
    - "Use a verb + noun pattern"
    - "Keep names low-cardinality"
    - "Include the operation type, not specific values"
    - "Be consistent across services"
  examples:
    http:
      pattern: "HTTP {METHOD} {route_template}"
      good: "HTTP GET /users/{id}"
      bad: "HTTP GET /users/12345"
    database:
      pattern: "{operation} {table}"
      good: "SELECT orders"
      bad: "SELECT * FROM orders WHERE id=123"
    messaging:
      pattern: "{operation} {queue/topic}"
      good: "PUBLISH order-events"
      bad: "Publish message to the order-events queue"
    rpc:
      pattern: "{service}/{method}"
      good: "OrderService/CreateOrder"
      bad: "gRPC call to the order service"
  attributes:
    required:
      - "service.name"
      - "service.version"
      - "deployment.environment"
    recommended:
      http:
        - "http.method"
        - "http.route"
        - "http.status_code"
        - "http.target"
      database:
        - "db.system"
        - "db.name"
        - "db.operation"
        - "db.statement (sanitized)"
      messaging:
        - "messaging.system"
        - "messaging.destination"
        - "messaging.operation"
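The HTTP and database patterns can be enforced with small helper functions so that span names are built in one place. This is a sketch under stated assumptions: `http_span_name` and `db_span_name` are hypothetical helpers, and the numeric-segment check is only a rough heuristic for spotting a concrete path passed in place of a route template.

```python
import re

# Heuristic: a purely numeric path segment suggests a concrete ID, not a template.
NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def http_span_name(method: str, route_template: str) -> str:
    """Build a span name following the pattern HTTP {METHOD} {route_template}."""
    if NUMERIC_SEGMENT.search(route_template):
        raise ValueError(f"expected a route template, got a concrete path: {route_template}")
    return f"HTTP {method.upper()} {route_template}"


def db_span_name(operation: str, table: str) -> str:
    """Build a span name following the pattern {operation} {table}."""
    return f"{operation.upper()} {table}"


print(http_span_name("get", "/users/{id}"))
print(db_span_name("select", "orders"))
```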
Log Field Naming
log_naming:
  format: "snake_case for all fields"
  standard_fields:
    timestamp: "ISO 8601 format"
    level: "INFO, WARN, ERROR, etc."
    message: "Human-readable description"
    service: "Service name"
    trace_id: "Correlation ID"
    span_id: "Current span"
  domain_fields:
    pattern: "{domain}_{field}"
    examples:
      - "order_id"
      - "customer_id"
      - "payment_amount"
      - "product_sku"
  avoid:
    - "Nested objects (flatten for indexing)"
    - "Arrays of unknown length"
    - "Large text blobs"
    - "Sensitive data (PII, secrets)"
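The advice to flatten nested objects maps directly onto the `{domain}_{field}` pattern. A minimal sketch, assuming log records are plain dictionaries; `flatten_fields` is a hypothetical helper and does not handle arrays or redaction of sensitive data.

```python
def flatten_fields(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {domain}_{field} keys for easier indexing."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_fields(value, name))
        else:
            flat[name] = value
    return flat


nested = {"level": "INFO", "order": {"id": "A1"}, "payment": {"amount": 9.5}}
print(flatten_fields(nested))
```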
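Cardinality Management
Understanding Cardinality
Cardinality = number of unique time series
Example:
http_requests_total{method="GET", path="/api/users", status="200"}
Cardinality = methods × paths × statuses
            = 5 × 100 × 10
            = 5,000 time series
Adding user_id (1M users):
            = 5 × 100 × 10 × 1,000,000
            = 5,000,000,000 time series ← explosion!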
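The arithmetic above is a product of the number of distinct values per label, which is easy to script when reviewing a proposed metric. The sketch below reproduces the same numbers; `series_count` is a hypothetical helper, and the result is a worst-case upper bound because not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_values.values())


base = {"method": 5, "path": 100, "status": 10}
print(series_count(base))                            # 5000
print(series_count({**base, "user_id": 1_000_000}))  # 5000000000
```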
Cardinality Budget
cardinality_budget:
  planning:
    total_budget: 100000  # target maximum time series per service
    allocation:
      automatic_instrumentation: 30%  # 30,000
      business_transactions: 40%  # 40,000
      custom_metrics: 20%  # 20,000
      buffer: 10%  # 10,000
  per_metric_limits:
    low_cardinality:
      max_series: 100
      example: "Status codes, methods"
    medium_cardinality:
      max_series: 1000
      example: "Endpoints, operations"
    high_cardinality:
      max_series: 10000
      example: "Hourly aggregations"
      requires: "Justification and approval"
  monitoring:
    - "Alert when cardinality grows by more than 10% per day"
    - "Weekly cardinality review"
    - "Automatic label value limits"
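The allocation percentages and the daily growth alert can both be expressed as small calculations. This is a sketch only: `allocate` and `growth_alert` are hypothetical helpers, and how the previous day's series count is obtained (for example, from a metrics backend) is left out.

```python
TOTAL_BUDGET = 100_000
# Percentages copied from the allocation block.
ALLOCATION_PERCENT = {
    "automatic_instrumentation": 30,
    "business_transactions": 40,
    "custom_metrics": 20,
    "buffer": 10,
}


def allocate(total: int, percent: dict[str, int]) -> dict[str, int]:
    """Split a total series budget according to whole-number percentages."""
    if sum(percent.values()) != 100:
        raise ValueError("allocation percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in percent.items()}


def growth_alert(yesterday: int, today: int, threshold: float = 0.10) -> bool:
    """Return True when day-over-day series growth exceeds the threshold."""
    if yesterday <= 0:
        return today > 0
    return (today - yesterday) / yesterday > threshold


print(allocate(TOTAL_BUDGET, ALLOCATION_PERCENT))
print(growth_alert(yesterday=1_000, today=1_200))
```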
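Cardinality Reduction Techniques
cardinality_reduction:
  bucketing:
    before: "path=/users/12345"
    after: "path=/users/{id}"
    technique: "Path template extraction"
  sampling:
    description: "Sample high-volume, low-value traces"
    strategies:
      head_sampling: "Decide at the start of the trace"
      tail_sampling: "Decide after seeing the complete trace"
      adaptive: "Adjust the rate based on volume"
  aggregation:
    description: "Pre-aggregate before export"
    example: "Count by status rather than per request"
  value_limiting:
    description: "Limit the number of unique values per label"
    example: "At most 100 unique paths, then use 'other'"
  dropping:
    description: "Drop low-value dimensions"
    candidates:
      - "Instance ID (use the service name)"
      - "Request ID (not for metrics)"
      - "Full URL (use the route template)"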
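Two of these techniques, bucketing and value limiting, are simple enough to sketch in a few lines. The code below is illustrative only: `template_path` handles just numeric segments (real routers usually expose the route template directly), and `LabelLimiter` is a hypothetical class that keeps the first N values it sees and maps the rest to an overflow value.

```python
import re

NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def template_path(path: str) -> str:
    """Bucketing: replace numeric path segments with an {id} placeholder."""
    return NUMERIC_SEGMENT.sub("/{id}", path)


class LabelLimiter:
    """Value limiting: pass through the first max_values distinct values only."""

    def __init__(self, max_values: int = 100, overflow: str = "other") -> None:
        self.max_values = max_values
        self.overflow = overflow
        self._seen: set[str] = set()

    def limit(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return self.overflow


print(template_path("/users/12345"))
limiter = LabelLimiter(max_values=2)
print([limiter.limit(v) for v in ("/a", "/b", "/c", "/a")])
```

A first-come limiter is the simplest policy; keeping the most frequent values instead would need counting and is not shown here.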
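Instrumentation Budget
Performance Impact
performance_budget:
  cpu_overhead:
    target: "< 1% CPU increase"
    measurement: "Profile with and without instrumentation"
  memory_overhead:
    target: "< 50 MB additional heap"
    components:
      - "Metric registry"
      - "Span buffers"
      - "Log buffers"
  latency_overhead:
    target: "< 1 ms per request"
    hot_paths: "< 100 μs"
data_volume:
  metrics:
    target: "< 1 GB/day per service"
    calculation: "series × samples per day (86,400 s ÷ scrape interval in s) × 8 bytes per sample"
  traces:
    target: "< 10 GB/day per service (with sampling)"
    sampling_rate: "1-10% for high-volume services"
  logs:
    target: "< 5 GB/day per service"
    strategies: "Sampling, level gating"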
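To check a service against the metrics volume target, the calculation can be scripted. The sketch below interprets the metrics calculation as series × samples per day × 8 bytes per sample, where samples per day is 86,400 seconds divided by the scrape interval. The 8-byte figure is the raw sample size from the calculation and ignores any compression a backend applies, so treat the result as a rough upper estimate.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_SAMPLE = 8  # raw sample size from the calculation; compression ignored


def metric_bytes_per_day(series: int, scrape_interval_s: int) -> int:
    """Estimate raw daily metric volume for a given series count and interval."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    samples_per_series = SECONDS_PER_DAY // scrape_interval_s
    return series * samples_per_series * BYTES_PER_SAMPLE


for interval in (15, 60):
    gb = metric_bytes_per_day(100_000, interval) / 1e9
    print(f"100,000 series at {interval}s: {gb:.2f} GB/day")
```

Under these assumptions a service using the full 100,000-series budget would exceed 1 GB/day of raw samples even at a 60-second interval, so the series budget and the volume target are worth checking against each other.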
Cost Planning
cost_planning:
  estimation_formula:
    metrics:
      monthly_cost: "time series × $0.003 (typical cloud pricing)"
      example: "10,000 series × $0.003 = $30/month"
    traces:
      monthly_cost: "spans per month × $0.000005"
      example: "100M spans × $0.000005 = $500/month"
    logs:
      monthly_cost: "GB per month × $0.50"
      example: "500 GB × $0.50 = $250/month"
  optimization_strategies:
    - "Increase the scrape interval (15s → 60s)"
    - "Lower the trace sampling rate"
    - "Gate log levels in production"
    - "Shorten retention for debug data"
    - "Downsample old metrics"
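The three estimation formulas combine into a single monthly estimate. A minimal sketch using the unit prices quoted in the formulas (which are described only as typical, so substitute a vendor's actual rates); `monthly_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

# Unit prices copied from estimation_formula; replace with real vendor pricing.
PRICE_PER_SERIES = Decimal("0.003")
PRICE_PER_SPAN = Decimal("0.000005")
PRICE_PER_LOG_GB = Decimal("0.50")


def monthly_cost(series: int, spans: int, log_gb: int) -> dict[str, Decimal]:
    """Estimate monthly telemetry cost per signal and in total."""
    costs = {
        "metrics": series * PRICE_PER_SERIES,
        "traces": spans * PRICE_PER_SPAN,
        "logs": log_gb * PRICE_PER_LOG_GB,
    }
    costs["total"] = sum(costs.values(), Decimal(0))
    return costs


estimate = monthly_cost(series=10_000, spans=100_000_000, log_gb=500)
for signal, cost in estimate.items():
    print(f"{signal}: ${cost:.2f}")
```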
Instrumentation Plan Template
instrumentation_plan:
  service: "{service name}"
  version: "1.0"
  date: "{date}"
  owner: "{team}"
  objectives:
    - "Track SLIs for order processing"
    - "Enable distributed tracing for debugging"
    - "Monitor payment success rate"
  automatic_instrumentation:
    framework: "OpenTelemetry .NET"
    enabled:
      - "ASP.NET Core (HTTP server)"
      - "HttpClient (HTTP client)"
      - "Entity Framework Core (database)"
      - "Azure.Messaging.ServiceBus"
    configuration:
      sampling_rate: 0.1  # 10% of traces
      batch_export_interval: 5000  # milliseconds
  custom_spans:
    - name: "ProcessOrder"
      purpose: "Track order processing duration"
      attributes:
        - "order.id"
        - "order.item_count"
        - "order.total_amount"
      events:
        - "inventory.reserved"
        - "payment.processed"
    - name: "ValidatePayment"
      purpose: "Track payment validation steps"
      attributes:
        - "payment.method"
        - "payment.provider"
      sensitive: false
  custom_metrics:
    counters:
      - name: "orders_total"
        labels: ["status", "payment_method"]
        purpose: "Count orders by outcome"
        cardinality_estimate: 20
      - name: "payment_failures_total"
        labels: ["reason", "provider"]
        purpose: "Track payment failure reasons"
        cardinality_estimate: 50
    histograms:
      - name: "order_processing_duration_seconds"
        labels: ["order_type"]
        purpose: "Track order processing latency"
        buckets: [0.1, 0.5, 1, 2, 5, 10]
        cardinality_estimate: 10
    gauges:
      - name: "pending_orders_current"
        labels: []
        purpose: "Number of currently pending orders"
        cardinality_estimate: 1
  cardinality_summary:
    estimated_total: 81
    budget: 1000
    status: "Within budget"
  log_strategy:
    production_level: "INFO"
    structured_fields:
      standard:
        - "trace_id"
        - "span_id"
        - "service"
        - "environment"
      domain:
        - "order_id"
        - "customer_id (hashed)"
    sampling:
      debug_logs: "1% in production"
  cost_estimate:
    monthly:
      metrics: "$30"
      traces: "$200"
      logs: "$150"
      total: "$380"
  review_schedule:
    frequency: "Quarterly"
    metrics_to_review:
      - "Cardinality growth"
      - "Data volume"
      - "Cost vs. budget"
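The cardinality_summary in the template can be derived from the per-metric estimates rather than maintained by hand. A small sketch, with the estimates copied from the custom_metrics section; `cardinality_summary` is a hypothetical helper.

```python
# Per-metric cardinality_estimate values copied from the template.
ESTIMATES = {
    "orders_total": 20,
    "payment_failures_total": 50,
    "order_processing_duration_seconds": 10,
    "pending_orders_current": 1,
}


def cardinality_summary(estimates: dict[str, int], budget: int) -> dict:
    """Sum per-metric estimates and compare the total against the budget."""
    total = sum(estimates.values())
    status = "within budget" if total <= budget else "over budget"
    return {"estimated_total": total, "budget": budget, "status": status}


print(cardinality_summary(ESTIMATES, budget=1000))
```

The histogram estimate counts label combinations only. In Prometheus-style exposition each combination also produces one series per bucket plus sum and count series, so the real total for that metric may be several times higher.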
Related Skills
- observability-patterns - Three pillars overview
- distributed-tracing - Tracing implementation details
- slo-sli-error-budget - What to measure for SLOs
Last updated: 2025-12-26