Distributed Tracing Skill distributed-tracing

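The distributed tracing skill implements end-to-end request tracing in microservice architectures, helping to identify performance bottlenecks, analyze service dependencies, and debug errors. It uses tools such as Jaeger and Tempo to collect and visualize trace data and improve system observability. Keywords: distributed tracing, microservices, Jaeger, Tempo, performance debugging, observability, request tracing, DevOps, cloud native.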

Microservices | Updated 3/22/2026

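name: distributed-tracing
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.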

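Distributed Tracing

Implement distributed tracing with Jaeger and Tempo to gain visibility into request flows across microservices.

Purpose

Track requests through a distributed system to understand latency, dependencies, and failure points.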

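When to Use

  • Debugging latency issues
  • Understanding service dependencies
  • Identifying bottlenecks
  • Tracing error propagation
  • Analyzing request paths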

Distributed Tracing Concepts

Trace Structure

Trace (request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (API gateway) [80ms]
  ├→ Span (auth service) [10ms]
  └→ Span (user service) [60ms]
      └→ Span (database) [40ms]
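
To make the structure concrete, here is a minimal sketch that models the trace above as a tree and computes each span's self time (its duration minus the time spent in its children). The `Span` class and `self_times` helper are hypothetical names for illustration, not part of Jaeger or OpenTelemetry, and the sketch assumes child spans run one after another.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A simplified span: a name, a total duration, and child spans."""
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def self_times(span: Span) -> dict[str, int]:
    """Return the time each span spent on its own work, excluding children.

    Assumes children run sequentially; with parallel children this
    subtraction is only an approximation.
    """
    own = span.duration_ms - sum(child.duration_ms for child in span.children)
    result = {span.name: own}
    for child in span.children:
        result.update(self_times(child))
    return result


# The example trace from the diagram above.
trace = Span("frontend", 100, [
    Span("api-gateway", 80, [
        Span("auth-service", 10),
        Span("user-service", 60, [Span("database", 40)]),
    ]),
])

times = self_times(trace)
slowest = max(times, key=times.get)  # span with the largest self time
```

In this example the database span accounts for 40 ms of its own work, more than any other span, so it is the first place to look for a bottleneck.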

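Key Components

  • Trace - the end-to-end journey of a request
  • Span - a single operation within a trace
  • Context - metadata propagated between services
  • Tags - key-value pairs used for filtering (called attributes in OpenTelemetry)
  • Logs - timestamped events within a span (called span events in OpenTelemetry)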

Jaeger Setup

Kubernetes Deployment

# Deploy the Jaeger Operator (it expects cert-manager to be installed in the cluster)
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy a Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF

Docker Compose

version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686" # UI
      - "14268:14268" # Collector (HTTP)
      - "14250:14250" # gRPC
      - "9411:9411" # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411

Reference: see references/jaeger-setup.md

Application Instrumentation

OpenTelemetry (Recommended)

Python (Flask)

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

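# Initialize the tracer provider.
# Note: the Jaeger Thrift exporter is deprecated upstream; recent Jaeger
# versions accept OTLP directly, so new projects may prefer an OTLP exporter.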
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("get_users") as span:
        # Business logic
        users = fetch_users_from_db()
        # Record the actual result size instead of a hard-coded value
        span.set_attribute("user.count", len(users))
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query (query_database is a placeholder for your data access code)
        return query_database()

Node.js (Express)

const { trace } = require("@opentelemetry/api");
const { Resource } = require("@opentelemetry/resources");
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const {
  ExpressInstrumentation,
} = require("@opentelemetry/instrumentation-express");

// Initialize the tracer provider (this uses the OpenTelemetry JS SDK 1.x API)
const provider = new NodeTracerProvider({
  resource: new Resource({ "service.name": "my-service" }),
});

const exporter = new JaegerExporter({
  endpoint: "http://jaeger:14268/api/traces",
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries (register before requiring express so it gets patched)
registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});

const express = require("express");
const app = express();

app.get("/api/users", async (req, res) => {
  const tracer = trace.getTracer("my-service");
  const span = tracer.startSpan("get_users");

  try {
    const users = await fetchUsers();
    span.setAttributes({ "user.count": users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});

Go

package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "get_users")
    defer span.End()

    span.SetAttributes(attribute.String("user.filter", "active"))

    users, err := fetchUsersFromDB(ctx)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.Int("user.count", len(users)))
    return users, nil
}

Reference: see references/instrumentation.md

Context Propagation

HTTP Headers

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
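
The `traceparent` header follows the W3C Trace Context layout of `version-trace_id-parent_id-flags`. As a minimal sketch of how a receiving service could read it, the hypothetical `parse_traceparent` helper below splits the header and checks the sampled flag. It handles only the version `00` layout shown above and is not a replacement for a real propagator.

```python
import re
from typing import NamedTuple


class TraceParent(NamedTuple):
    version: str
    trace_id: str   # 32 lowercase hex characters
    parent_id: str  # 16 lowercase hex characters
    sampled: bool   # lowest bit of the flags field


_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value: str) -> TraceParent:
    """Parse a traceparent header value, raising ValueError if malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    version, trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are not valid trace or span IDs.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("trace-id and parent-id must not be all zeros")
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
```

For the header above, `parsed.sampled` is `True` because the flags field is `01`.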

Propagation in HTTP Requests

Python

import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Inject the current trace context into the headers

response = requests.get('http://downstream-service/api', headers=headers)

Node.js

const axios = require("axios");
const { context, propagation } = require("@opentelemetry/api");

const headers = {};
propagation.inject(context.active(), headers); // Inject the current trace context

axios.get("http://downstream-service/api", { headers });

Tempo Setup (Grafana)

Kubernetes Deployment

apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
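spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec: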
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config

Reference: see assets/jaeger-config.yaml.template

Sampling Strategies

Probabilistic Sampling

# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01

Rate-Limiting Sampling

# Sample at most 100 traces per second
sampler:
  type: ratelimiting
  param: 100
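
A rate-limiting sampler is commonly built on a token bucket. The sketch below illustrates that idea only; it is not Jaeger's implementation. The `RateLimitingSampler` class and its injectable `clock` parameter are assumptions made for illustration and testing.

```python
import time
from typing import Callable


class RateLimitingSampler:
    """Token bucket that allows at most `max_per_second` sampled traces per second."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = max_per_second
        self.capacity = max(max_per_second, 1.0)
        self.tokens = self.capacity  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Usage with a fake clock so the behavior is reproducible.
fake_now = [0.0]
sampler = RateLimitingSampler(100, clock=lambda: fake_now[0])
first_burst = sum(sampler.should_sample() for _ in range(250))
fake_now[0] = 0.5  # half a second later, half the budget has refilled
second_burst = sum(sampler.should_sample() for _ in range(250))
```

Unlike probabilistic sampling, the number of sampled traces stays bounded during traffic spikes, at the cost of sampling a smaller fraction of requests when load is high.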

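Parent-Based Ratio Sampling

from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Follow the parent span's sampling decision; for root spans,
# sample 1% deterministically based on the trace ID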
sampler = ParentBased(root=TraceIdRatioBased(0.01))
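
Trace-ID-based sampling is deterministic because the decision is a pure function of the trace ID, so every service that sees the same trace reaches the same answer. The sketch below shows the general idea by comparing the low 64 bits of the trace ID against a threshold. The `should_sample` function is a hypothetical helper, and the SDK's exact threshold calculation may differ.

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Return a deterministic sampling decision for a 128-bit trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits as a uniform random number and keep the
    # trace if it falls below ratio * 2**64.
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


trace_id = int("0af7651916cd43dd8448eb211c80319c", 16)
decision_a = should_sample(trace_id, 0.6)
decision_b = should_sample(trace_id, 0.6)  # same inputs, same decision
```

Because no random state is involved, retrying the decision or making it in a different process yields the same result for a given trace ID and ratio.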

Trace Analysis

Finding Slow Requests

Jaeger search (UI fields):

Service: my-service
Min Duration: 1s

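Finding Errors

Jaeger search (UI fields):

Service: my-service
Tags: error=true

Tag search matches exact values, so filter on a specific status with a tag such as http.status_code=500 rather than a range like >= 500.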

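Service Dependency Graph

Jaeger automatically generates a service dependency graph that shows:

  • Service relationships
  • Request rates
  • Error rates
  • Average latency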
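
Conceptually, a dependency graph can be derived from parent-child span relationships: whenever a span's parent belongs to a different service, that is one call from the parent's service to the child's. The sketch below illustrates this with hypothetical `SpanRecord` and `dependency_edges` names. It does not describe how Jaeger computes its graph internally.

```python
from collections import Counter
from typing import NamedTuple, Optional


class SpanRecord(NamedTuple):
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str


def dependency_edges(spans: list[SpanRecord]) -> Counter:
    """Count (caller, callee) pairs where parent and child are in different services."""
    service_of = {span.span_id: span.service for span in spans}
    edges: Counter = Counter()
    for span in spans:
        parent_service = service_of.get(span.parent_id)
        if parent_service is not None and parent_service != span.service:
            edges[(parent_service, span.service)] += 1
    return edges


spans = [
    SpanRecord("a", None, "frontend"),
    SpanRecord("b", "a", "api-gateway"),
    SpanRecord("c", "b", "auth-service"),
    SpanRecord("d", "b", "user-service"),
    SpanRecord("e", "d", "user-service"),  # internal span, no cross-service edge
]
edges = dependency_edges(spans)
```

Aggregating these counts over many traces gives the request rate per edge; error rates and latencies could be accumulated the same way from span status and duration.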

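Best Practices

  1. Sample appropriately (1-10% in production)
  2. Add meaningful tags (user_id, request_id)
  3. Propagate context across all service boundaries
  4. Record exceptions on spans
  5. Use consistent operation names
  6. Monitor tracing overhead (<1% CPU impact)
  7. Set up alerts on trace errors
  8. Implement distributed context (baggage)
  9. Use span events to record important milestones
  10. Document instrumentation standards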
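
Baggage travels between services as a `baggage` HTTP header made of comma-separated `key=value` members with percent-encoded values. The sketch below shows a simplified encoder and decoder for that format. The `encode_baggage` and `decode_baggage` helpers are hypothetical, ignore optional member properties after `;`, and are not a substitute for a real propagator.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict[str, str]) -> str:
    """Serialize entries as a baggage header value with percent-encoded values."""
    return ",".join(
        f"{key}={quote(value, safe='')}" for key, value in entries.items()
    )


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a baggage header value, skipping members without '='."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue
        key, _, value = member.partition("=")
        # Optional properties after ';' are dropped in this sketch.
        value = value.split(";", 1)[0]
        entries[key.strip()] = unquote(value.strip())
    return entries


header_value = encode_baggage({"user_id": "42", "request_id": "req 7"})
round_trip = decode_baggage(header_value)
```

Baggage is sent with every downstream request, so keep it small and avoid putting sensitive data in it.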

Integration with Logging

Correlating Logs

import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
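
Passing `extra` on every call is easy to forget, so one option is a `logging.Filter` that stamps the trace ID on every record. The sketch below uses only the standard library. The `TraceIdFilter` class is hypothetical, and the `get_trace_id` callable stands in for a lookup such as `trace.get_current_span().get_span_context().trace_id`.

```python
import io
import logging
from typing import Callable


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID (as 32 hex characters) to every log record."""

    def __init__(self, get_trace_id: Callable[[], int]):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = format(self._get_trace_id(), "032x")
        return True  # never drop the record, only annotate it


# Demo wiring: a fixed trace ID and an in-memory stream instead of stderr.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter(lambda: 0x0AF7651916CD43DD8448EB211C80319C))

logger = logging.getLogger("traced-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("Processing request")
output = stream.getvalue()
```

With the trace ID in every log line, a log search by `trace_id` leads straight from a log entry to the matching trace in Jaeger or Tempo.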

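Troubleshooting

No traces appear:

  • Check the collector endpoint
  • Verify network connectivity
  • Check the sampling configuration
  • Review application logs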
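
A quick way to rule out basic network problems is to test whether the collector's TCP port is reachable from the application host. The sketch below uses a hypothetical `can_connect` helper built on the standard library. It only checks TCP reachability, so it cannot verify the UDP agent ports (such as 6831) or that the endpoint accepts spans.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For the Docker Compose setup in this document, calling `can_connect("jaeger", 14268)` from inside the application container would check the collector's HTTP port.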

High latency overhead:

  • Lower the sampling rate
  • Use a batch span processor
  • Check the exporter configuration
Reference Files

  • references/jaeger-setup.md - Jaeger installation
  • references/instrumentation.md - Instrumentation patterns
  • assets/jaeger-config.yaml.template - Jaeger configuration

Related Skills

  • prometheus-configuration - for metrics
  • grafana-dashboards - for visualization
  • slo-implementation - for latency SLOs