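name: coralogix-analysis
description: Coralogix log analysis using the DataPrime query language. Use when querying Coralogix logs, metrics, or traces. Provides a syntax reference and smart investigation scripts.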

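Coralogix Analysis

Authentication

Important: Credentials are injected automatically by the proxy layer. Do not check environment variables for CORALOGIX_API_KEY or any other API key; they are not visible to you. Just run the scripts; authentication is handled transparently.

Configuration environment variables you can inspect (non-secret):

CORALOGIX_DOMAIN - team hostname (e.g., myteam.app.cx498.coralogix.com)
CORALOGIX_REGION - region code (e.g., us2, eu1); used as a fallback when the domain is not set

Region mapping (the scripts auto-detect the region from the domain):

US1: *.app.coralogix.us → api.us1.coralogix.com
US2: *.app.cx498.coralogix.com → api.us2.coralogix.com
EU1: *.coralogix.com → api.eu1.coralogix.com
EU2: *.app.eu2.coralogix.com → api.eu2.coralogix.com
AP1: *.app.coralogix.in → api.ap1.coralogix.com
AP2: *.app.coralogixsg.com → api.ap2.coralogix.com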
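
The mapping above can be sketched as a suffix lookup. This is only an illustration: the function name resolve_api_host and the table DOMAIN_SUFFIXES are hypothetical, and the bundled scripts may detect the region differently. Note that the EU1 pattern (*.coralogix.com) also matches US2 and EU2 hostnames, so the sketch checks it last.

```python
# Hypothetical helper; the real scripts' detection logic is not shown in this document.
# Order matters: ".coralogix.com" (EU1) is the broadest suffix, so it comes last.
DOMAIN_SUFFIXES = [
    (".app.coralogix.us", "us1"),
    (".app.cx498.coralogix.com", "us2"),
    (".app.eu2.coralogix.com", "eu2"),
    (".app.coralogix.in", "ap1"),
    (".app.coralogixsg.com", "ap2"),
    (".coralogix.com", "eu1"),
]


def resolve_api_host(domain=None, region=None):
    """Return the API host for a team domain, falling back to a region code."""
    if domain:
        host = domain.strip().lower()
        for suffix, code in DOMAIN_SUFFIXES:
            if host.endswith(suffix):
                return f"api.{code}.coralogix.com"
    if region:
        return f"api.{region.strip().lower()}.coralogix.com"
    raise ValueError("set CORALOGIX_DOMAIN or CORALOGIX_REGION")
```

In practice the two arguments would come from os.environ.get("CORALOGIX_DOMAIN") and os.environ.get("CORALOGIX_REGION"), which are the non-secret variables listed above.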

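Mandatory: Statistics-First Investigation

Never dump raw logs. Always follow this pattern:

Statistics → Sampling → Signatures → Correlation

Statistics first - understand volume, error rate, and top patterns before sampling
Strategic sampling - choose the right strategy based on the statistics
Pattern extraction - cluster similar errors to find the root cause
Contextual correlation - investigate around anomalous timestamps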

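Available Scripts

All scripts are located in .claude/skills/observability-coralogix/scripts/

Primary Investigation Scripts

get_statistics.py - always start here

Comprehensive statistics with pattern extraction and anomaly detection.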

python .claude/skills/observability-coralogix/scripts/get_statistics.py [--service SERVICE] [--app APP] [--time-range MINUTES]

# Examples:
python .claude/skills/observability-coralogix/scripts/get_statistics.py --time-range 60
python .claude/skills/observability-coralogix/scripts/get_statistics.py --service payment --app otel-demo

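Output includes:

Total count, error count, error rate percentage
Severity distribution
Top error patterns (key for fast triage)
Time-bucket anomalies (spikes/drops detected via z-score)
Top services by log volume
Actionable recommendations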
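
To make the z-score idea concrete, here is a minimal sketch of flagging anomalous time buckets. It assumes the per-bucket counts are already available as a list; the function name find_anomalous_buckets and the threshold of 2.0 are illustrative choices, not the values get_statistics.py necessarily uses.

```python
from statistics import mean, pstdev


def find_anomalous_buckets(counts, threshold=2.0):
    """Return (index, z_score) for buckets whose count deviates strongly from the mean."""
    if not counts:
        return []
    mu = mean(counts)
    sigma = pstdev(counts)  # population standard deviation of the bucket counts
    if sigma == 0:
        return []  # perfectly flat series: nothing stands out
    anomalies = []
    for index, count in enumerate(counts):
        z = (count - mu) / sigma
        if abs(z) >= threshold:  # positive z = spike, negative z = drop
            anomalies.append((index, round(z, 2)))
    return anomalies


# Hypothetical 5-minute bucket counts with one obvious spike at index 5.
buckets = [10, 12, 11, 9, 10, 95, 11, 10]
spikes = find_anomalous_buckets(buckets)
```

The timestamp of a flagged bucket is what you would then pass to the around_anomaly sampling strategy.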

sample_logs.py - strategic sampling

Choose the right sampling strategy based on the statistics.

python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy STRATEGY [--service SERVICE] [--app APP]

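# Strategies:
#   errors_only    - error/critical logs only (default for incidents)
#   around_anomaly - logs within a time window around a specific timestamp
#   first_last     - first N/2 + last N/2 logs (timeline view)
#   random         - random sample across the time range
#   all            - all severity levels (use with caution)

# Examples: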
python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy errors_only --service payment
python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy around_anomaly --timestamp "2026-01-27T05:00:00Z" --window 60
python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy first_last --service checkout --limit 50
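
As an illustration of the first_last strategy, the sketch below takes the first N/2 and last N/2 entries of a list that is assumed to be sorted by timestamp in ascending order. The function name first_last is hypothetical and the script's own implementation may differ, for example in how it handles odd limits.

```python
def first_last(logs, limit):
    """Return up to `limit` entries: the earliest half followed by the latest half."""
    if limit <= 0:
        return []
    if len(logs) <= limit:
        return list(logs)  # nothing to trim
    head = limit // 2
    tail = limit - head  # the tail gets the extra entry when limit is odd
    return list(logs[:head]) + list(logs[-tail:])


# Hypothetical usage with ten already-sorted entries.
timeline = first_last(list(range(10)), 4)
```

This gives a timeline view: how the window started and how it ended, without reading everything in between.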

extract_signatures.py - pattern clustering

Normalizes and clusters log messages to surface the unique problem patterns.

python .claude/skills/observability-coralogix/scripts/extract_signatures.py --service SERVICE [--severity SEVERITY] [--max-signatures N]

# Examples:
python .claude/skills/observability-coralogix/scripts/extract_signatures.py --service payment --severity ERROR
python .claude/skills/observability-coralogix/scripts/extract_signatures.py --app otel-demo --max-signatures 30

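Normalizes variable parts (UUIDs, IPs, timestamps, numbers) to find:

Dominant error patterns (> 50% = likely a single root cause)
Diverse errors (many patterns = multiple problems)
Affected services per pattern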
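
The normalization step can be sketched with a few regular expressions. This is a simplified illustration, not the code of extract_signatures.py: the placeholder tokens (<UUID>, <TS>, <IP>, <NUM>) and the function names signature and cluster are assumptions.

```python
import re
from collections import Counter

# Order matters: replace the most specific shapes first, bare numbers last.
_RULES = [
    (re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"), "<UUID>"),
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<TS>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def signature(message):
    """Replace variable parts of a log message with placeholder tokens."""
    for pattern, token in _RULES:
        message = pattern.sub(token, message)
    return message


def cluster(messages):
    """Group messages by signature, most frequent first."""
    return Counter(signature(m) for m in messages).most_common()


# Hypothetical messages: two share a signature, one is different.
clusters = cluster([
    "timeout connecting to 10.0.0.5 after 30s",
    "timeout connecting to 10.0.0.9 after 45s",
    "order 6f1c2a34-1b2c-4d5e-8f90-abcdef123456 not found",
])
top_share = clusters[0][1] / sum(count for _, count in clusters)
```

A top_share above 0.5 corresponds to the "dominant error pattern" case in the list above.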

Utility Scripts

list_services.py - service discovery

python .claude/skills/observability-coralogix/scripts/list_services.py [--time-range MINUTES]

get_health.py - quick health check

python .claude/skills/observability-coralogix/scripts/get_health.py <service> [--time-range MINUTES]

get_errors.py - quick error retrieval

python .claude/skills/observability-coralogix/scripts/get_errors.py <service> [--app APPLICATION] [--time-range MINUTES]

query_logs.py - raw DataPrime queries

For custom queries not covered by the other scripts.

python .claude/skills/observability-coralogix/scripts/query_logs.py "<dataprime_query>" [--time-range MINUTES] [--limit N]

DataPrime Syntax Quick Reference

Filters

# Equality (use == rather than =)
$l.subsystemname == 'api-server'

# Severity - use enum values (unquoted!)
# Valid values: VERBOSE, DEBUG, INFO, WARNING, ERROR, CRITICAL
$m.severity == ERROR
$m.severity == WARNING || $m.severity == ERROR

# Text search (case-insensitive) - use ~~ rather than 'contains'
$d ~~ 'timeout'
$d ~~ 'connection refused'

# Combine filters with &&
$l.subsystemname == 'payment' && $m.severity == ERROR
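
When filters are assembled programmatically, the quoting rules above are easy to get wrong. The following sketch builds a filter expression that keeps severity unquoted and string values single-quoted. The helper names quote and build_filter are hypothetical, and because quote escaping in DataPrime is not covered here, values containing a single quote are rejected rather than escaped.

```python
VALID_SEVERITIES = {"VERBOSE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def quote(value):
    """Wrap a string literal in single quotes; refuse values we cannot quote safely."""
    if "'" in value:
        raise ValueError("values containing single quotes are not handled in this sketch")
    return f"'{value}'"


def build_filter(service=None, severity=None, text=None):
    """Combine optional conditions with && following the syntax rules above."""
    clauses = []
    if service:
        clauses.append(f"$l.subsystemname == {quote(service)}")
    if severity:
        level = severity.upper()
        if level not in VALID_SEVERITIES:
            raise ValueError(f"unknown severity: {severity}")
        clauses.append(f"$m.severity == {level}")  # enum value, deliberately unquoted
    if text:
        clauses.append(f"$d ~~ {quote(text)}")
    return " && ".join(clauses)


# Hypothetical usage: the result can be placed after "source logs | filter ".
expression = build_filter(service="payment", severity="error")
```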

Aggregations

# Count
| aggregate count() as total

# Group by field
| groupby $l.subsystemname aggregate count() as cnt

# Time bucketing
| timebucket 5m aggregate count() as cnt

# Multiple aggregations
| groupby $l.subsystemname aggregate count() as cnt, avg($d.duration) as avg_duration

# Sort and limit
| orderby cnt desc | limit 20

Common Fields

$l.applicationname - application/environment name (e.g., "otel-demo")
$l.subsystemname - service name (e.g., "payment", "checkout")
$m.severity - log level enum: VERBOSE, DEBUG, INFO, WARNING, ERROR, CRITICAL
$m.timestamp - event timestamp
$d - log message/data (use ~~ for text search)

Common Query Patterns

1. List all services with their log counts

source logs | groupby $l.subsystemname aggregate count() as cnt | orderby cnt desc | limit 30

2. Error count by service

source logs | filter $m.severity == ERROR | groupby $l.subsystemname aggregate count() as errors | orderby errors desc

3. Error count over time

source logs | filter $m.severity == ERROR | groupby $m.timestamp / 5m as bucket aggregate count() as errors | orderby bucket asc

4. Errors for a specific service

source logs | filter $l.subsystemname == 'payment' | filter $m.severity == ERROR | limit 50

5. Search for a specific error message

source logs | filter $d ~~ 'connection refused' | limit 20

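Advanced DataPrime Patterns

Bracket Notation for Special Fields

K8s field names often contain dots. Use bracket notation:

# Wrong - treated as a nested path
$d.kubernetes.namespace

# Right - literal field name containing dots
$d['kubernetes.namespace']
$d['resource.attributes.k8s_pod_name']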
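
If field references are generated from code, a small helper can pick the right notation automatically. This is a sketch under the assumption that any name that is not a plain identifier (for example, one containing dots) needs brackets; the function name field_ref is hypothetical.

```python
def field_ref(root, name):
    """Return a DataPrime field reference, using brackets when the name needs them."""
    if "'" in name:
        raise ValueError("field names containing single quotes are not handled in this sketch")
    if name.isidentifier():
        return f"{root}.{name}"  # simple name: dot notation is fine
    return f"{root}['{name}']"  # dots or other special characters: literal field name


# Hypothetical usage.
namespace_field = field_ref("$d", "kubernetes.namespace")
service_field = field_ref("$l", "subsystemname")
```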

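Time-Based Comparisons

Compare logs before and after a time threshold:

# Count logs from the last hour versus older logs
source logs | countby if($m.timestamp > now() - 1h, 'last_hour', 'older')

# Find logs older than 5 minutes
source logs | filter $m.timestamp < now() - 5m

K8s Container Restarts

Find unstable containers: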

source logs
| choose resource.attributes.k8s_container_restart_count:number as restarts,
         resource.attributes.k8s_container_name as container,
         resource.attributes.k8s_deployment_name as deployment
| filter restarts > 0
| groupby deployment aggregate max(restarts) as max_restarts
| orderby max_restarts desc

Peak Error Windows

Find the 10-minute windows with the most errors:

source logs
| filter $m.severity == ERROR
| groupby $m.timestamp / 10m as bucket aggregate count() as cnt
| orderby cnt desc
| limit 5

Fuzzy Search Across All Fields

When you do not know which field contains the value:

# Search for text across all fields
source logs | filter $d ~~ 'connection refused'

# Or use wildfind
source logs | wildfind 'timeout'

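Anti-Patterns to Avoid

❌ Skipping statistics - get_statistics.py is the mandatory first step
❌ Unbounded queries - always specify a time range and a limit
❌ Quoting severity values - use the enum: ERROR, not 'ERROR'
❌ Using 'contains' - use the ~~ operator for text search
❌ Missing application filter - in multi-tenant setups, filter by $l.applicationname
❌ Fetching all logs - use a sampling strategy instead of limit 10000
❌ Ignoring anomaly timestamps - use around_anomaly to investigate spikes
❌ Reading logs without extracting patterns - always extract signatures for root cause analysis
❌ Dot notation for K8s fields - use bracket notation: $d['k8s.pod.name']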

Investigation Workflow

Standard Incident Investigation

┌─────────────────────────────────────────────────────────────┐
│ 1. Statistics first (mandatory)                             │
│    python get_statistics.py --service <service>             │
│    → understand volume, error rate, top patterns, anomalies │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
                     Dominant problem?
               ┌─────────────┴─────────────┐
               │                           │
   Yes (>80% single pattern)       No (mixed errors)
               │                           │
               ▼                           ▼
┌─────────────────────────────┐  ┌───────────────────────────────────────────┐
│ 2. Fast path                │  │ 2. Deep dive                              │
│    Sample errors directly   │  │    python extract_signatures.py           │
│    python sample_logs.py    │  │    python sample_logs.py --strategy ...   │
│    → validate hypothesis    │  │    → cluster and analyze patterns         │
└─────────────────────────────┘  └───────────────────────────────────────────┘
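
The branch in the diagram can be expressed as a small decision function. This sketch assumes the top error patterns are available as a mapping from pattern text to count; the function name choose_path and the return labels are illustrative, and only the 80% threshold comes from the diagram.

```python
def choose_path(pattern_counts, dominant_share=0.80):
    """Pick the next investigation step from error-pattern counts."""
    total = sum(pattern_counts.values())
    if total == 0:
        return "no_errors"
    top_count = max(pattern_counts.values())
    if top_count / total > dominant_share:
        return "fast_path"  # one pattern dominates: sample errors and validate
    return "deep_dive"  # mixed errors: extract signatures, then sample


# Hypothetical pattern counts taken from a statistics run.
next_step = choose_path({"connection timeout": 90, "invalid card": 10})
```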

Example: Payment Service Investigation

# Step 1: Statistics first - always
python .claude/skills/observability-coralogix/scripts/get_statistics.py --service payment --time-range 60
# Output: 15,432 logs, 847 errors (5.5%), top pattern: "connection timeout to downstream"

# If a dominant pattern is found:
# Step 2: Validate with samples
python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy errors_only --service payment --limit 10

Quick Command Reference

Goal	Command
Start an investigation	`get_statistics.py --service X`
See error diversity	`extract_signatures.py --service X`
Sample errors only	`sample_logs.py --strategy errors_only --service X`
Investigate a spike	`sample_logs.py --strategy around_anomaly --timestamp T`
Timeline view	`sample_logs.py --strategy first_last --service X`
List all services	`list_services.py`
Custom query	`query_logs.py "source logs

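Trace Investigation

Use traces to understand request flow and latency across services.

When to Use Traces vs. Logs

Use case	Tool
"What errors occurred?"	Logs (`get_statistics.py`)
"Why is this request slow?"	Traces (`get_slow_spans.py`)
"Where did the request fail?"	Traces (`get_traces.py`)
"What are the service dependencies?"	Traces (operation analysis)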

Trace Scripts

get_traces.py - find spans

# Get spans for a service
python .claude/skills/observability-coralogix/scripts/get_traces.py --service checkout --time-range 30

# Get all spans for a trace ID
python .claude/skills/observability-coralogix/scripts/get_traces.py --trace-id abc123def456

# Filter by operation
python .claude/skills/observability-coralogix/scripts/get_traces.py --operation "/api/checkout" --service checkout

get_slow_spans.py - latency analysis

# Find spans slower than 500ms
python .claude/skills/observability-coralogix/scripts/get_slow_spans.py --min-duration 500

# Find slow spans in a specific service
python .claude/skills/observability-coralogix/scripts/get_slow_spans.py --min-duration 200 --service checkout

# Get latency statistics by service (recommended first step)
python .claude/skills/observability-coralogix/scripts/get_slow_spans.py --stats

DataPrime Span Syntax

Spans use source spans, but the field names differ from logs:

# List spans for a service (use serviceName, not $l.subsystemname)
source spans | filter serviceName == 'checkout' | limit 50

# Find slow spans (duration is in microseconds)
source spans | filter duration > 500000 | orderby duration desc | limit 20

# Get all spans for a trace (use the top-level traceID)
source spans | filter traceID == 'abc123def456...' | limit 100

# Latency statistics by service
source spans | groupby serviceName aggregate avg(duration) as avg_dur, max(duration) as max_dur | orderby avg_dur desc
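
Because span durations are stored in microseconds while thresholds are usually stated in milliseconds, the unit conversion is a common source of mistakes. The sketch below builds the slow-span query from a millisecond threshold. The function name slow_span_query is hypothetical, and the assumption that --min-duration is given in milliseconds is inferred from the "slower than 500ms" example rather than confirmed by the script.

```python
def slow_span_query(min_duration_ms, service=None, limit=20):
    """Build a DataPrime query for spans slower than a millisecond threshold."""
    micros = int(min_duration_ms * 1000)  # 1 ms = 1000 microseconds
    stages = ["source spans"]
    if service:
        if "'" in service:
            raise ValueError("service names containing single quotes are not handled in this sketch")
        stages.append(f"filter serviceName == '{service}'")
    stages.append(f"filter duration > {micros}")
    stages.append("orderby duration desc")
    stages.append(f"limit {limit}")
    return " | ".join(stages)


# Hypothetical usage: 500 ms becomes 500000 microseconds in the query text.
query = slow_span_query(500)
```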

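Span Field Reference (different from logs!)

operationName - operation name (e.g., HTTP GET /checkout)
serviceName - service name (equivalent to $l.subsystemname for logs)
applicationName - application name
duration - span duration, in microseconds
traceID - trace identifier (32 hex characters)
spanID - span identifier
parentId - parent span ID (used to build the trace tree)
tags - span metadata (e.g., http.status_code, rpc.method)
process.tags - resource attributes (e.g., k8s.pod.name)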
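
The parentId field is what allows the spans of one trace to be arranged as a tree. The sketch below shows one way to do that, assuming each span is a dict keyed by the field names listed above; the actual output shape of get_traces.py is not documented here, and the function names build_tree and render are hypothetical.

```python
from collections import defaultdict


def build_tree(spans):
    """Split spans into roots and a parent-to-children index using parentId."""
    known_ids = {span["spanID"] for span in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent = span.get("parentId")
        if parent and parent in known_ids:
            children[parent].append(span)
        else:
            roots.append(span)  # no parent, or the parent is missing from this result set
    return roots, children


def render(spans):
    """Return indented lines, slowest child first, with durations converted to ms."""
    roots, children = build_tree(spans)
    lines = []

    def walk(span, depth):
        millis = span["duration"] / 1000  # duration is in microseconds
        lines.append(f"{'  ' * depth}{span['serviceName']} {span['operationName']} ({millis:.1f} ms)")
        for child in sorted(children[span["spanID"]], key=lambda c: c["duration"], reverse=True):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


# Hypothetical three-span trace.
example_spans = [
    {"spanID": "a", "parentId": None, "serviceName": "frontend", "operationName": "GET /checkout", "duration": 820000},
    {"spanID": "b", "parentId": "a", "serviceName": "checkout", "operationName": "PlaceOrder", "duration": 790000},
    {"spanID": "c", "parentId": "b", "serviceName": "payment", "operationName": "Charge", "duration": 650000},
]
tree_lines = render(example_spans)
```

Reading the tree top-down shows which child span accounts for most of its parent's duration.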

Trace Investigation Workflow

┌─────────────────────────────────────────────────────────────┐
│ 1. Check latency statistics                                 │
│    python get_slow_spans.py --stats                         │
│    → see which services have high latency                   │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Find slow spans                                          │
│    python get_slow_spans.py --min-duration 500 --service X  │
│    → get specific slow spans with trace IDs                 │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Trace the full request                                   │
│    python get_traces.py --trace-id <id>                     │
│    → see all spans in the slow request                      │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Correlate with logs                                      │
│    python sample_logs.py --strategy around_anomaly          │
│    → get logs around the same timestamp                     │
└─────────────────────────────────────────────────────────────┘
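
For the correlation step, it helps to see what "around the same timestamp" means in concrete terms. The sketch below computes a symmetric window around an ISO 8601 timestamp. It assumes the window is given in seconds and applied on both sides; the unit and semantics of the script's --window flag are not stated in this document, and the function name anomaly_window is hypothetical.

```python
from datetime import datetime, timedelta


def anomaly_window(timestamp, window_seconds):
    """Return (start, end) datetimes spanning window_seconds before and after timestamp."""
    # Replace a trailing "Z" so fromisoformat also works on Python versions before 3.11.
    center = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    half = timedelta(seconds=window_seconds)
    return center - half, center + half


# Hypothetical usage with the timestamp of a slow span or an anomalous bucket.
start, end = anomaly_window("2026-01-27T05:00:00Z", 60)
```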