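name: Error Detective description: Searches logs and codebases for error patterns, stack traces, and exceptions. Use when debugging issues, analyzing logs, or investigating production errors.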

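Error Detection

Find and analyze errors in logs and code.

When to Use

Investigating production errors
Analyzing log patterns
Finding the root cause of an error
Correlating errors across systems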

Log Analysis

Finding Errors

# Recent errors
grep -i "error\|exception\|fatal" /var/log/app.log | tail -100

# Errors with surrounding context
grep -B 5 -A 10 "ERROR" /var/log/app.log

# Count by error type
grep -oE "Error: [^:]*" app.log | sort | uniq -c | sort -rn

# Errors within a time range
awk '/2024-01-15 14:/ && /ERROR/' app.log

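Pattern Detection

# Find recurring errors
grep "ERROR" app.log | cut -d']' -f2 | sort | uniq -c | sort -rn | head -20

# Correlate by request ID (-h drops the filename prefix so the sort sees the timestamp)
grep -h "req-12345" *.log | sort -t' ' -k1,2

# Find error spikes
grep "ERROR" app.log | cut -d' ' -f1-2 | uniq -c | sort -rn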
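
The spike check above counts errors per timestamp; when a per-minute view is easier to reason about, the same idea can be sketched in Python. This is a minimal sketch, assuming each line starts with a `YYYY-MM-DD HH:MM:SS` timestamp. The function name `find_error_spikes` and the `factor` threshold are illustrative and not part of any existing tool.

```python
import re
from collections import Counter
from statistics import median

# Assumed line layout: "YYYY-MM-DD HH:MM:SS ... ERROR ..."
TIMESTAMP = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}')

def find_error_spikes(lines, factor=3.0):
    """Return (minute, count) pairs whose ERROR count exceeds factor x the median."""
    per_minute = Counter()
    for line in lines:
        if 'ERROR' not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            # Bucket by minute: the captured group drops the seconds
            per_minute[match.group(1)] += 1
    if not per_minute:
        return []
    baseline = median(per_minute.values())
    return sorted(
        (minute, count)
        for minute, count in per_minute.items()
        if count > factor * baseline
    )
```

Comparing against the median rather than the mean keeps a single large burst from raising its own baseline; the threshold of 3x is an arbitrary starting point and would need tuning per service.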

Stack Trace Analysis

Parsing Stack Traces

import re

def parse_stack_trace(log_content: str) -> list[dict]:
    # Match an exception header line ("SomeError: message") followed by
    # one or more indented "at ..." frame lines.
    pattern = r'(?P<exception>\w+Error|\w+Exception): (?P<message>.*?)\n(?P<trace>(?:\s+at .+\n)+)'

    traces = []
    for match in re.finditer(pattern, log_content):
        traces.append({
            'type': match.group('exception'),
            'message': match.group('message'),
            # One list entry per frame line
            'trace': match.group('trace').strip().split('\n')
        })
    return traces
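
Once traces are parsed, grouping them usually shows which failure dominates. The following is a minimal sketch, assuming the input is a list of dicts with the same `type` and `trace` keys that `parse_stack_trace` returns; the helper name `group_traces` is hypothetical.

```python
from collections import Counter

def group_traces(traces: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """Count traces by (exception type, top frame), most frequent first."""
    counts = Counter()
    for trace in traces:
        # Fall back to a placeholder when a trace carries no frame lines
        frames = trace.get('trace') or ['<no frames>']
        counts[(trace['type'], frames[0].strip())] += 1
    return counts.most_common()
```

Keying on the top frame as well as the exception type separates, for example, two unrelated `ValueError` sites that would otherwise be counted together.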

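Common Patterns

Pattern	Indicates	Action
NullPointer	Missing null check	Add validation
Timeout	Slow dependency	Add timeouts, retry
Connection refused	Service down	Check health, retry
OOM	Memory leak	Profile, increase limits
Rate limit	Too many requests	Add backoff, queue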
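
The table above can be turned into a first-pass triage lookup. This is a rough sketch only: the keyword lists are assumptions chosen for illustration (real messages vary by language and framework), and `classify_error` is a hypothetical helper name.

```python
# Each row: (keywords to look for, likely cause, suggested action).
# Keywords are illustrative assumptions, not an exhaustive catalogue.
PATTERNS = [
    (('nullpointer',), 'Missing null check', 'Add validation'),
    (('timeout', 'timed out'), 'Slow dependency', 'Add timeouts, retry'),
    (('connection refused',), 'Service down', 'Check health, retry'),
    (('outofmemory', 'out of memory'), 'Memory leak', 'Profile, increase limits'),
    (('rate limit',), 'Too many requests', 'Add backoff, queue'),
]

def classify_error(message: str):
    """Return (cause, action) for the first matching row, or None."""
    lowered = message.lower()
    for keywords, cause, action in PATTERNS:
        if any(keyword in lowered for keyword in keywords):
            return cause, action
    return None
```

A `None` result means the message needs manual inspection; the lookup is a hint for where to start, not a diagnosis.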

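Investigation Checklist

Capture - Get the full error message and stack trace
Timestamp - When did it start?
Frequency - How often does it occur? Is it increasing?
Scope - All users or specific users?
Changes - Any recent deployments?
Dependencies - Are external services affected?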
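
The timestamp and frequency questions in the checklist can be answered mechanically once the occurrence times of one error have been extracted. This is a minimal sketch under that assumption; `summarize_occurrences` is a hypothetical name, and comparing the two halves of the observed window is just one simple way to judge whether the rate is rising.

```python
from datetime import datetime

def summarize_occurrences(timestamps: list[datetime]) -> dict:
    """Report first occurrence, total count, and whether the rate is rising."""
    if not timestamps:
        return {'first_seen': None, 'count': 0, 'increasing': False}
    ordered = sorted(timestamps)
    first, last = ordered[0], ordered[-1]
    # Split the observed window in half and compare occurrence counts
    midpoint = first + (last - first) / 2
    early = sum(1 for ts in ordered if ts <= midpoint)
    late = len(ordered) - early
    return {
        'first_seen': first,
        'count': len(ordered),
        'increasing': late > early,
    }
```

The `first_seen` value is also the natural timestamp to compare against recent deployments when checking for related changes.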

Correlation Queries

-- Error count by endpoint
SELECT endpoint, count(*) as errors
FROM logs
WHERE level = 'ERROR' AND time > NOW() - INTERVAL '1 hour'
GROUP BY endpoint ORDER BY errors DESC;

-- Error rate over time
SELECT
  date_trunc('minute', time) as minute,
  count(*) filter (where level = 'ERROR') as errors,
  count(*) as total
FROM logs
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY minute ORDER BY minute;

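Examples

Input: "Find out why the API is returning 500 errors" Action: Search the logs for 500 status codes, find the stack traces, identify the root cause

Input: "Analyze error patterns from the past hour" Action: Aggregate errors by type, look for spikes, correlate with events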