结构化日志记录
核心原则
每个请求每个服务发出一个宽泛的事件,而不是分散的日志语句。
- 在请求生命周期中构建上下文
- 在最后(在一个finally块中)发出一次
- 优化查询,而不是编写
- 包括高基数字段(user_id, request_id)以便于调试
- 包括高维度(50+字段)捕获完整上下文
反模式
避免这些常见错误:
// 字符串连接 - 失去结构
log("User " + userId + " payment failed: " + error)
// 分散的语句 - 一个请求17行
log("Starting request")
log("Validating token")
log("Token valid")
log("Fetching user")
log("User found")
log("Processing payment")
log("Payment failed")
// 只有低基数 - 无法调试特定用户
log({level: "error", message: "Payment failed"})
// 缺少业务上下文 - 没有调试能力
log({user_id: "123", error: "failed"})
宽事件结构
每个请求一个全面事件:
{
"timestamp": "2025-01-15T10:23:45.612Z",
"level": "error",
"request_id": "req_8bf7ec2d",
"trace_id": "abc123",
"span_id": "span_456",
"service": "checkout-service",
"version": "2.4.1",
"deployment_id": "deploy_789",
"region": "us-east-1",
"method": "POST",
"path": "/api/checkout",
"status_code": 500,
"duration_ms": 1247,
"user": {
"id": "user_456",
"subscription": "premium",
"account_age_days": 847,
"lifetime_value_cents": 284700
},
"cart": {
"id": "cart_xyz",
"item_count": 3,
"total_cents": 15999,
"coupon_applied": "SAVE20"
},
"payment": {
"method": "card",
"provider": "stripe",
"latency_ms": 1089,
"attempt": 3
},
"error": {
"type": "PaymentError",
"code": "card_declined",
"message": "Card declined by issuer",
"retriable": false,
"stripe_decline_code": "insufficient_funds"
},
"feature_flags": {
"new_checkout_flow": true,
"express_payment": false
},
"timing": {
"db_query_ms": 45,
"external_api_ms": 1089,
"db_queries_count": 3,
"cache_hit": false
}
}
有关全面字段参考,请参见references/wide-event-fields.md。
日志级别
使用日志级别来表示严重性和所需行动:
| 级别 | 何时使用 | 示例 |
|---|---|---|
| ERROR | 请求失败,需要调查 | 未处理的异常,5xx响应,数据损坏 |
| WARN | 降级但已恢复 | 重试成功,回退使用,接近速率限制,调用已弃用的API |
| INFO | 正常请求完成 | 成功请求的标准日志行 |
| DEBUG | 详细诊断 | 缓存查找,查询计划,中间状态(通常在生产中抽样) |
宽事件的level字段反映了请求的最坏结果:
- 请求正常成功:
info - 请求在重试/回退后成功:
warn - 请求失败:
error
示例
坏:分散日志
log.info("Incoming request", {path: "/checkout"})
log.debug("Fetching user")
log.debug("User found", {user_id: "123"})
log.debug("Fetching cart")
log.info("Processing payment")
log.warn("Payment attempt 1 failed, retrying")
log.warn("Payment attempt 2 failed, retrying")
log.error("Payment failed after 3 attempts")
好:单个宽事件
log.error({
request_id: "req_abc",
path: "/checkout",
method: "POST",
status_code: 500,
duration_ms: 3200,
user: {id: "123", subscription: "premium"},
cart: {item_count: 3, total_cents: 15999},
payment: {
provider: "stripe",
attempts: 3,
latency_ms: 2800
},
error: {
type: "PaymentError",
code: "card_declined",
retriable: false
}
})
坏:缺少上下文
log.error("Database query failed", {error: "timeout"})
好:完整上下文
log.error({
request_id: "req_xyz",
user: {id: "user_789", subscription: "enterprise"},
operation: "fetch_orders",
database: {
host: "db-prod-01",
query_duration_ms: 30000,
timeout_ms: 30000,
table: "orders",
rows_scanned: 1500000
},
error: {
type: "DatabaseError",
code: "QUERY_TIMEOUT",
message: "Query exceeded 30s timeout",
retriable: true,
suggestion: "Add index on orders.user_id or paginate results"
}
})
实施模式
在请求过程中构建事件,最后发出一次:
function loggingMiddleware(handler):
return function(request):
start_time = now()
// 初始化宽事件
event = {
request_id: request.id,
trace_id: request.trace_id,
timestamp: start_time,
method: request.method,
path: request.path,
service: SERVICE_NAME,
version: SERVICE_VERSION
}
// 使事件对处理程序可用
request.wide_event = event
try:
response = handler(request)
event.status_code = response.status
event.level = "info"
return response
catch error:
event.status_code = 500
event.level = "error"
event.error = {
type: error.name,
message: error.message,
code: error.code,
retriable: error.retriable ?? false
}
throw error
finally:
event.duration_ms = now() - start_time
logger.log(event)
在处理程序中,用业务上下文丰富:
function checkoutHandler(request):
event = request.wide_event
// 添加用户上下文
user = getUser(request)
event.user = {
id: user.id,
subscription: user.tier,
account_age_days: daysSince(user.created_at),
lifetime_value_cents: user.ltv
}
// 添加业务上下文
cart = getCart(user.id)
event.cart = {
id: cart.id,
item_count: cart.items.length,
total_cents: cart.total
}
// 跟踪外部调用的时间
payment_start = now()
result = processPayment(cart)
event.payment = {
provider: "stripe",
latency_ms: now() - payment_start,
attempt: result.attempt_number
}
return response(result)
查询能力
宽事件使查询成为可能,而分散的日志无法实现:
-- 查找所有新结账流程中的高级用户结账失败
SELECT * FROM logs
WHERE path = '/checkout'
AND status_code >= 500
AND user.subscription = 'premium'
AND feature_flags.new_checkout_flow = true
AND timestamp > now() - interval '1 hour'
-- 按错误代码和用户层级分析支付失败
SELECT
error.code,
user.subscription,
count(*) as failures,
avg(payment.latency_ms) as avg_latency
FROM logs
WHERE error.type = 'PaymentError'
GROUP BY error.code, user.subscription
-- 查找企业客户的慢速请求
SELECT * FROM logs
WHERE duration_ms > 2000
AND user.subscription = 'enterprise'
ORDER BY duration_ms DESC