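name: distributed-tracing
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.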
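Distributed Tracing
Implement distributed tracing with Jaeger and Tempo to gain visibility into request flows across microservices.
Purpose
Track requests through a distributed system to understand latency, dependencies, and failure points.
When to Use
- Debugging latency issues
- Understanding service dependencies
- Identifying bottlenecks
- Tracing error propagation
- Analyzing request paths
Distributed Tracing Concepts
Trace Structure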
Trace (request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (API gateway) [80ms]
  ├→ Span (auth service) [10ms]
  └→ Span (user service) [60ms]
       └→ Span (database) [40ms]
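Key Components
- Trace - the end-to-end journey of a request
- Span - a single operation within a trace
- Context - metadata propagated between services
- Tags - key-value pairs used for filtering
- Logs - timestamped events within a span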
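To make the trace and span relationship concrete, here is a minimal sketch that models the example trace as a tree and computes each span's self time (its duration minus its children's durations). The `Span` class and `self_times` helper are illustrative names, not part of Jaeger or OpenTelemetry, and the calculation assumes child spans run one after another.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """One operation in a trace (simplified: real spans also carry IDs, timestamps, tags)."""
    name: str
    duration_ms: int
    children: List["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        # Time spent in this span itself, assuming its children ran sequentially.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def self_times(root: Span) -> Dict[str, int]:
    """Walk the span tree and return the self time for each span name."""
    result = {root.name: root.self_time_ms()}
    for child in root.children:
        result.update(self_times(child))
    return result


# The example trace: frontend -> API gateway -> (auth service, user service -> database).
database = Span("database", 40)
user_service = Span("user-service", 60, [database])
auth_service = Span("auth-service", 10)
api_gateway = Span("api-gateway", 80, [auth_service, user_service])
frontend = Span("frontend", 100, [api_gateway])

times = self_times(frontend)
slowest = max(times, key=times.get)
print(times)
print("largest self time:", slowest)  # database (40 ms)
```

Under these assumptions the database span accounts for the largest share of the request, which is the kind of bottleneck a trace view is meant to surface.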
Jaeger Setup
Kubernetes Deployment
# Deploy the Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability
# Deploy a Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
Docker Compose
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686" # UI
      - "14268:14268" # Collector (HTTP)
      - "14250:14250" # gRPC
      - "9411:9411" # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
Reference: see references/jaeger-setup.md
Application Instrumentation
OpenTelemetry (Recommended)
Python (Flask)
from opentelemetry import trace
# Requires the opentelemetry-exporter-jaeger package; newer setups typically export OTLP to Jaeger instead.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

# Initialize the tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("get_users") as span:
        # Business logic
        users = fetch_users_from_db()
        span.set_attribute("user.count", len(users))
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query; query_database() is a placeholder for the real call
        return query_database()
Node.js (Express)
const { trace } = require("@opentelemetry/api");
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { Resource } = require("@opentelemetry/resources");
const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const {
  ExpressInstrumentation,
} = require("@opentelemetry/instrumentation-express");

// Initialize the tracer (written against the 1.x SDK API)
const provider = new NodeTracerProvider({
  resource: new Resource({ "service.name": "my-service" }),
});
const exporter = new JaegerExporter({
  endpoint: "http://jaeger:14268/api/traces",
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries (register before requiring express)
registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});

const express = require("express");
const app = express();

app.get("/api/users", async (req, res) => {
  const tracer = trace.getTracer("my-service");
  const span = tracer.startSpan("get_users");
  try {
    // fetchUsers() is a placeholder for the application's data access
    const users = await fetchUsers();
    span.setAttributes({ "user.count": users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});
Go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
	))
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

// User and fetchUsersFromDB are placeholders defined elsewhere in the application.
func getUsers(ctx context.Context) ([]User, error) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "get_users")
	defer span.End()
	span.SetAttributes(attribute.String("user.filter", "active"))
	users, err := fetchUsersFromDB(ctx)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	span.SetAttributes(attribute.Int("user.count", len(users)))
	return users, nil
}
Reference: see references/instrumentation.md
Context Propagation
HTTP Headers
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
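The `traceparent` header above follows the W3C Trace Context layout `version-traceid-parentid-flags`. As a rough illustration of what a receiving service reads from it, the sketch below parses the version `00` form using only the standard library. `TraceParent` and `parse_traceparent` are hypothetical helper names; in practice an OpenTelemetry propagator does this work for you.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(parsed)
```

The trailing `01` in the example header means the upstream service sampled this trace, so downstream services using parent-based sampling would record their spans too.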
Propagation in HTTP Requests
Python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Inject the current trace context into the headers
response = requests.get('http://downstream-service/api', headers=headers)
Node.js
const axios = require("axios");
const { context, propagation } = require("@opentelemetry/api");

const headers = {};
propagation.inject(context.active(), headers); // Inject the active trace context
axios.get("http://downstream-service/api", { headers });
Tempo Setup (Grafana)
Kubernetes Deployment
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200
    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:
    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
Reference: see assets/jaeger-config.yaml.template
Sampling Strategies
Probabilistic Sampling
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
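A probabilistic sampler can make its decision from the trace ID alone, so every service that sees the same trace reaches the same answer. The sketch below shows one way to do that by comparing the low 64 bits of the trace ID against a threshold. It is an illustration of the idea rather than Jaeger's or OpenTelemetry's exact implementation, and `should_sample` is a hypothetical helper name.

```python
import random

TRACE_ID_LIMIT = 1 << 64  # only the low 64 bits of the 128-bit trace ID are compared


def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    bound = round(rate * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


rng = random.Random(42)  # fixed seed so the demo is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
sampled = sum(should_sample(t, 0.01) for t in trace_ids)
print(f"sampled {sampled} of {len(trace_ids)} traces")  # roughly 1%
```

Because the decision is a pure function of the trace ID, no coordination between services is needed to keep a trace complete.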
Rate-Limiting Sampling
# Sample at most 100 traces per second
sampler:
  type: ratelimiting
  param: 100
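One common way to cap sampling at a fixed number of traces per second is a token bucket: tokens refill at the configured rate, and a trace is sampled only when a token is available. The following is a minimal sketch of that idea. The `RateLimitingSampler` class is hypothetical, and the injectable `clock` parameter exists only so the demo does not depend on real time.

```python
import time
from typing import Callable


class RateLimitingSampler:
    """Token bucket that allows up to `max_per_second` sampled traces per second."""

    def __init__(self, max_per_second: float, clock: Callable[[], float] = time.monotonic):
        self.rate = max_per_second
        self.capacity = max_per_second
        self.tokens = max_per_second  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Demo with a fake clock: 250 requests arrive at t=0, then 250 more at t=0.5s.
fake_now = [0.0]
sampler = RateLimitingSampler(100, clock=lambda: fake_now[0])
burst = sum(sampler.should_sample() for _ in range(250))
fake_now[0] = 0.5
refill = sum(sampler.should_sample() for _ in range(250))
print(burst, refill)  # 100 50
```

Unlike probabilistic sampling, the number of sampled traces here stays bounded during traffic spikes, at the cost of sampling a smaller fraction when load is high.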
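Adaptive Sampling
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Follow the parent's sampling decision; for root spans, sample 1% by trace ID (deterministic).
# Note: this is parent-based ratio sampling in the SDK, not Jaeger's collector-side adaptive sampling.
sampler = ParentBased(root=TraceIdRatioBased(0.01))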
Trace Analysis
Finding Slow Requests
Jaeger query:
service=my-service
duration > 1s
Finding Errors
Jaeger query:
service=my-service
error=true
tags.http.status_code >= 500
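The same two filters can be applied offline to spans you have already exported. The sketch below assumes a hypothetical list of span records as plain dictionaries. The field names are illustrative and do not correspond to a Jaeger export schema.

```python
from typing import Dict, List

# Hypothetical span records; the field names are assumptions for illustration.
spans: List[Dict] = [
    {"service": "my-service", "operation": "get_users", "duration_ms": 1500, "error": False, "http.status_code": 200},
    {"service": "my-service", "operation": "get_orders", "duration_ms": 120, "error": True, "http.status_code": 503},
    {"service": "my-service", "operation": "get_cart", "duration_ms": 80, "error": False, "http.status_code": 200},
    {"service": "other-service", "operation": "sync", "duration_ms": 4000, "error": True, "http.status_code": 500},
]


def slow_requests(records: List[Dict], service: str, min_duration_ms: int = 1000) -> List[Dict]:
    """Spans of one service that took longer than the threshold (default 1s)."""
    return [s for s in records if s["service"] == service and s["duration_ms"] > min_duration_ms]


def failed_requests(records: List[Dict], service: str) -> List[Dict]:
    """Spans of one service flagged as errors with a 5xx HTTP status."""
    return [
        s for s in records
        if s["service"] == service and s["error"] and s["http.status_code"] >= 500
    ]


slow = [s["operation"] for s in slow_requests(spans, "my-service")]
failed = [s["operation"] for s in failed_requests(spans, "my-service")]
print(slow)    # ['get_users']
print(failed)  # ['get_orders']
```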
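Service Dependency Graph
Jaeger automatically generates a service dependency graph that shows:
- Service relationships
- Request rates
- Error rates
- Average latency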
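Conceptually, a dependency graph can be derived from parent-child span links: whenever a span's parent belongs to a different service, that is a call from one service to another. The sketch below illustrates the idea on a hypothetical flattened span list. It is not how Jaeger computes its graph internally, and `dependency_edges` is an illustrative helper name.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical flattened spans: (span_id, parent_span_id, service).
SpanRow = Tuple[str, Optional[str], str]

spans: List[SpanRow] = [
    ("1", None, "frontend"),
    ("2", "1", "api-gateway"),
    ("3", "2", "auth-service"),
    ("4", "2", "user-service"),
    ("5", "4", "user-service"),  # internal span: same service as its parent
    ("6", "5", "database"),
]


def dependency_edges(records: List[SpanRow]) -> Counter:
    """Count caller -> callee service edges implied by parent-child span links."""
    service_of = {span_id: service for span_id, _, service in records}
    edges: Counter = Counter()
    for _, parent_id, service in records:
        if parent_id is None or parent_id not in service_of:
            continue  # root span, or parent not present in this batch
        parent_service = service_of[parent_id]
        if parent_service != service:
            edges[(parent_service, service)] += 1
    return edges


edges = dependency_edges(spans)
for (caller, callee), calls in sorted(edges.items()):
    print(f"{caller} -> {callee}: {calls} call(s)")
```

Aggregating the same edges over many traces, together with span durations and error flags, would yield per-edge request counts, error rates, and latencies.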
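Best Practices
- Sample appropriately (1-10% in production)
- Add meaningful tags (user_id, request_id)
- Propagate context across all service boundaries
- Record exceptions in spans
- Use consistent operation names
- Monitor tracing overhead (<1% CPU impact)
- Set up alerts for trace errors
- Implement distributed context (baggage)
- Use span events to record important milestones
- Document instrumentation standards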
Integration with Logging
Correlated Logs
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    # Attach the current trace ID (32 hex characters) to the log record
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
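Passing `extra=` on every call is easy to forget. One alternative is a `logging.Filter` that stamps the trace ID onto every record. The sketch below uses only the standard library: `TraceIdFilter` is a hypothetical class, and `fake_trace_id` stands in for a call such as `trace.get_current_span().get_span_context().trace_id`.

```python
import io
import logging
from typing import Callable


class TraceIdFilter(logging.Filter):
    """Attach a 32-character hex trace ID to every record passing through."""

    def __init__(self, get_trace_id: Callable[[], int]):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = format(self._get_trace_id(), "032x")
        return True  # never drop the record, only enrich it


def fake_trace_id() -> int:
    # Stand-in for the active trace ID (an assumption for this demo).
    return 0x0AF7651916CD43DD8448EB211C80319C


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter(fake_trace_id))

logger = logging.getLogger("tracing-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output on our handler only

logger.info("Processing request")
output = stream.getvalue()
print(output, end="")
```

With the trace ID in every log line, a log search for one ID returns all log entries produced while handling that request.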
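Troubleshooting
No traces appearing:
- Check the collector endpoint
- Verify network connectivity
- Check the sampling configuration
- Review application logs
High latency overhead:
- Reduce the sampling rate
- Use a batch span processor
- Check the exporter configuration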
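For the network connectivity check, a quick TCP probe from the application's host or pod can rule out DNS and firewall problems before you dig into exporter settings. This is a minimal sketch using only the standard library. `can_reach` is a hypothetical helper, and it applies only to TCP endpoints such as the collector's HTTP port, not to the UDP agent ports.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (host and port taken from the Jaeger examples in this document):
#   can_reach("jaeger", 14268)  # collector HTTP endpoint
#   can_reach("jaeger", 16686)  # UI
```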
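Reference Files
- references/jaeger-setup.md - Jaeger installation
- references/instrumentation.md - Instrumentation patterns
- assets/jaeger-config.yaml.template - Jaeger configuration
Related Skills
- prometheus-configuration - for metrics
- grafana-dashboards - for visualization
- slo-implementation - for latency SLOs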