name: grafana
description: |
  Observability visualization with Grafana and the LGTM stack. Dashboard design, panel configuration, alerting, variables/templating, and data sources.
when_to_use: Creating Grafana dashboards, configuring panels and visualizations, writing LogQL/TraceQL queries, setting up Grafana data sources, configuring dashboard variables and templates, building Grafana alerts. Do not use for: writing PromQL queries (use /prometheus), alert-rule strategy (use /prometheus), or general observability architecture (use a senior software engineer with an infrastructure focus).
triggers:
- grafana
- dashboard
- panel
- visualization
- logql
- traceql
- loki
- tempo
- mimir
- data source
- annotation
- variable
- template
- row
- stat
- graph
- table
- heatmap
- gauge
- bar chart
- pie chart
- time series
- logs panel
- traces panel
- LGTM stack
allowed_tools: Read, Grep, Glob, Edit, Write, Bash
Grafana and LGTM Stack Skill
Overview
The LGTM stack provides a complete observability solution with comprehensive visualization and dashboarding capabilities:
- Loki: log aggregation and querying (LogQL)
- Grafana: visualization, dashboards, alerting, and exploration
- Tempo: distributed tracing (TraceQL)
- Mimir: long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.
When to Use This Skill
Primary use cases
- Creating or modifying Grafana dashboards
- Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
- Writing queries (PromQL, LogQL, TraceQL)
- Configuring data sources (Prometheus, Loki, Tempo, Mimir)
- Setting up alert rules and notification policies
- Implementing dashboard variables and templates
- Dashboard provisioning and GitOps workflows
- Troubleshooting observability queries
- Analyzing application performance, errors, or system behavior
Who uses this skill
- Senior software engineers (primary): production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
- Software engineers: application dashboards, service metrics visualization
LGTM Stack Components
Loki - Log Aggregation
Architecture - Loki
Horizontally scalable log aggregation, inspired by Prometheus
- Indexes only metadata (labels), not log content
- Cost-effective storage using object stores (S3, GCS, etc.)
- The LogQL query language resembles PromQL
Key concepts - Loki
- Labels are used for indexing (keep cardinality low)
- A log stream is identified by a unique label set
- Parsers: logfmt, JSON, regex, pattern
- Line filters and label filters
Grafana - Visualization
Features
- Multi-data-source dashboards
- Panel types: graph, stat, table, heatmap, bar chart, pie chart, gauge, logs, traces, time series
- Templating and variables for dynamic dashboards
- Alerting (unified alerting with contact points and notification policies)
- Dashboard provisioning and GitOps integration
- Role-based access control (RBAC)
- Explore mode for ad-hoc queries
- Annotations for event markers
- Dashboard folders and organization
Tempo - Distributed Tracing
Architecture - Tempo
Scalable distributed tracing backend
- Cost-effective trace storage
- TraceQL for trace queries
- Integrates with logs and metrics (trace-to-logs, trace-to-metrics)
- OpenTelemetry-compatible
Mimir - Metrics Storage
Architecture - Mimir
Horizontally scalable long-term Prometheus storage
- Multi-tenancy support
- Query federation
- High availability
- Prometheus remote_write compatible
Dashboard Design and Best Practices
Dashboard organization principles
- Hierarchy: overview -> service -> component -> drill-down
- Golden signals: latency, traffic, errors, saturation (RED/USE methods)
- Variable-driven: use templating for cross-environment flexibility
- Consistent layout: grid alignment (24-column grid), logical top-to-bottom flow
- Performance: limit queries, use query caching, appropriate intervals
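The principles above can be sketched as a top "golden signals" row on the 24-column grid. This fragment is an illustrative sketch only: panel ids, titles, and sizes are placeholder choices, not a required layout.

```json
{
  "panels": [
    { "id": 1, "title": "Request Rate (RED: Rate)",    "type": "timeseries", "gridPos": { "h": 8, "w": 8, "x": 0,  "y": 0 } },
    { "id": 2, "title": "Error Ratio (RED: Errors)",   "type": "stat",       "gridPos": { "h": 8, "w": 8, "x": 8,  "y": 0 } },
    { "id": 3, "title": "P95 Latency (RED: Duration)", "type": "timeseries", "gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 } }
  ]
}
```

Three 8-unit-wide panels fill the 24-column grid exactly, keeping the overview row aligned.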
Panel Types and When to Use Them
| Panel type | Use case | Best for |
|---|---|---|
| Time series / Graph | Trends over time | Request rates, latency, resource usage |
| Stat | Single metric value | Error rate, current value, percentage |
| Gauge | Progress toward a limit | CPU usage, memory, disk space |
| Bar gauge | Comparing values | Top N items, distributions |
| Table | Structured data | Service lists, error details, inventories |
| Pie chart | Proportions | Traffic distribution, error breakdown |
| Heatmap | Distributions over time | Latency percentiles, request patterns |
| Logs | Log streams | Error investigation, debugging |
| Traces | Distributed traces | Performance analysis, dependency mapping |
Panel Configuration Best Practices
Titles and descriptions
- Clear, descriptive titles: include units and metric context
- Tooltips: add a description field for panel documentation
- Examples:
  - Good: "P95 Latency (seconds) by Endpoint"
  - Bad: "Latency"
Legends and labels
- Show a legend only when it is needed (multiple series)
- Use the {{label}} format for dynamic legend names
- Place the legend appropriately (bottom, right, or hidden)
- Sort by value to surface the top N
Axes and units
- Always label axes with units
- Use appropriate unit formats (seconds, bytes, percent, req/s)
- Set sensible min/max ranges to avoid misleading scales
- Use a log scale for wide value ranges
Thresholds and colors
- Use thresholds for visual cues (green/yellow/red)
- Standard threshold pattern:
  - Green: normal operation
  - Yellow: warning (action may be needed)
  - Red: critical (needs immediate attention)
- Examples:
  - Error rate: 0% (green), 1% (yellow), 5% (red)
  - P95 latency: <1s (green), 1-3s (yellow), >3s (red)
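The error-rate threshold pattern above maps directly onto a panel's fieldConfig. A hedged sketch (assuming the error rate is expressed as a percentunit ratio, so 1% = 0.01):

```json
{
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green",  "value": null },
          { "color": "yellow", "value": 0.01 },
          { "color": "red",    "value": 0.05 }
        ]
      }
    }
  }
}
```

The first step's `value: null` makes green the base color; each subsequent step takes over once the value crosses it.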
Links and drill-downs
- Link panels to related dashboards
- Use data links for context (logs, traces, related services)
- Create drill-down paths: overview -> service -> component -> details
- Link alert panels to runbooks
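A panel-to-logs drill-down can be wired with a data link; this fragment is an illustrative sketch (the Loki datasource name and the `app` label are assumptions about your setup):

```json
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "View logs for ${__field.labels.app}",
          "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"${__field.labels.app}\\\"}\"}]}"
        }
      ]
    }
  }
}
```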
Dashboard Variables and Templating
Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.
Variable types
| Type | Purpose | Example |
|---|---|---|
| Query | Populated from a data source | Namespaces, services, pods |
| Custom | Static list of options | Environment (prod/staging/dev) |
| Interval | Time interval selection | Auto-adjusted query intervals |
| Data source | Switch between data sources | Multiple Prometheus instances |
| Constant | Hidden value for queries | Cluster name, region |
| Text box | Free-form input | Custom filters |
Common variable patterns
{
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"description": "选择Prometheus数据源"
},
{
"name": "namespace",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info, namespace)",
"multi": true,
"includeAll": true,
"description": "Kubernetes命名空间过滤器"
},
{
"name": "app",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
"multi": true,
"includeAll": true,
"description": "应用过滤器(依赖于命名空间)"
},
{
"name": "interval",
"type": "interval",
"auto": true,
"auto_count": 30,
"auto_min": "10s",
"options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
"description": "查询分辨率间隔"
},
{
"name": "environment",
"type": "custom",
"options": [
{ "text": "生产", "value": "prod" },
{ "text": "预演", "value": "staging" },
{ "text": "开发", "value": "dev" }
],
"current": { "text": "生产", "value": "prod" }
}
]
}
}
Using variables in queries
Variables are referenced with $variable_name or ${variable_name} syntax:
# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])
# Multi-select with regex matching
rate(http_requests_total{namespace=~"$namespace"}[5m])
# Variables in legends
sum(rate(http_requests_total{app="$app"}[5m])) by (method)
# Legend format: "{{method}}"
# Adaptive queries with the interval variable
rate(http_requests_total[$__interval])
# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])
Advanced variable techniques
Regex filtering:
{
"name": "pod",
"type": "query",
"query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
"regex": "/^$app-.*/",
"description": "按应用前缀过滤Pods"
}
All option with a custom value:
{
"name": "status",
"type": "custom",
"options": ["200", "404", "500"],
"includeAll": true,
"allValue": ".*",
"description": "HTTP状态码过滤器"
}
Dependent variables (variable chains):
$datasource (data source type) -> $cluster (query: depends on datasource) -> $namespace (query: depends on cluster) -> $app (query: depends on namespace) -> $pod (query: depends on app)
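When a multi-value variable is used with the `=~` matcher, Grafana joins the selected values with `|` into one regex alternation. A minimal Python sketch of that expansion (not Grafana's actual implementation; escaping of regex metacharacters in values is omitted for brevity):

```python
def interpolate(query: str, variables: dict) -> str:
    """Expand $var / ${var} references the way a multi-value
    template variable is interpolated into a =~ regex matcher."""
    for name, values in variables.items():
        # A single selection is substituted as-is; multiple
        # selections are joined with | and wrapped in parentheses.
        expansion = values[0] if len(values) == 1 else "(" + "|".join(values) + ")"
        query = query.replace("${" + name + "}", expansion)
        query = query.replace("$" + name, expansion)
    return query

print(interpolate(
    'rate(http_requests_total{namespace=~"$namespace"}[5m])',
    {"namespace": ["prod", "staging"]},
))
# rate(http_requests_total{namespace=~"(prod|staging)"}[5m])
```

This is why multi-select variables must be used with `=~` rather than `=`: the expanded value is a regex, not a literal label value.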
Annotations
Annotations display events as vertical markers on time series panels:
{
"annotations": {
"list": [
{
"name": "部署",
"datasource": "Prometheus",
"expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
"tagKeys": "deployment,namespace",
"textFormat": "部署: {{deployment}}",
"iconColor": "蓝色"
},
{
"name": "告警",
"datasource": "Loki",
"expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
"textFormat": "告警: {{alertname}}",
"iconColor": "红色"
}
]
}
}
Dashboard Performance Optimization
Query optimization
- Limit the number of panels (< 15 per dashboard)
- Use appropriate time ranges (avoid querying months of data)
- Leverage $__interval for adaptive sampling
- Avoid high-cardinality groupings (too many series)
- Use query caching where available
Panel performance
- Set max data points to a sensible value
- Use instant queries for current-state panels
- Combine related metrics into a single query where possible
- Disable auto-refresh on heavy dashboards
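These limits map onto per-panel options in the dashboard JSON; an illustrative fragment (the values shown are examples, not recommendations):

```json
{
  "maxDataPoints": 500,
  "interval": "1m",
  "targets": [
    { "refId": "A", "expr": "sum(rate(http_requests_total[$__rate_interval]))", "instant": false }
  ]
}
```

Setting `"instant": true` on a target turns it into a single-point query, which is what current-state stat panels need.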
Dashboards as Code and Provisioning
Dashboard provisioning
Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.
Provisioning provider configuration
File: /etc/grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
- name: "默认"
orgId: 1
folder: ""
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /etc/grafana/provisioning/dashboards
- name: "应用"
orgId: 1
folder: "应用"
type: file
disableDeletion: true
editable: false
options:
path: /var/lib/grafana/dashboards/application
- name: "基础设施"
orgId: 1
folder: "基础设施"
type: file
options:
path: /var/lib/grafana/dashboards/infrastructure
Dashboard JSON structure
A complete dashboard JSON with metadata for provisioning:
{
"dashboard": {
"title": "应用可观测性 - ${app}",
"uid": "app-observability",
"tags": ["可观测性", "应用"],
"timezone": "浏览器",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s",
"templating": { "list": [] },
"panels": [],
"links": []
},
"overwrite": true,
"folderId": null,
"folderUid": null
}
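In a GitOps pipeline, dashboard JSON can be linted before it is provisioned. A minimal sketch of such a CI check (the rules enforced here, a required title, a stable uid, and 24-column grid bounds, follow this document's conventions; this is not a Grafana API):

```python
import json

def lint_dashboard(payload: str) -> list:
    """Return a list of problems found in a dashboard JSON payload."""
    doc = json.loads(payload)
    dash = doc.get("dashboard", doc)  # accept bare or wrapped form
    problems = []
    if not dash.get("title"):
        problems.append("missing title")
    if not dash.get("uid"):
        problems.append("missing uid (needed for stable provisioning)")
    for panel in dash.get("panels", []):
        pos = panel.get("gridPos", {})
        if pos.get("x", 0) + pos.get("w", 0) > 24:
            problems.append("panel %s overflows the 24-column grid" % panel.get("id"))
    return problems

print(lint_dashboard('{"dashboard": {"title": "App", "panels": []}}'))
# ['missing uid (needed for stable provisioning)']
```

Running a check like this in CI catches dashboards that would silently get a new uid on every re-import, which breaks links and alert references.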
Kubernetes ConfigMap provisioning
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
application-dashboard.json: |
{
"dashboard": {
"title": "应用指标",
"uid": "app-metrics",
"tags": ["应用"],
"panels": []
}
}
Grafana Operator (CRD)
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: application-observability
namespace: monitoring
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"dashboard": {
"title": "应用可观测性",
"panels": []
}
}
Data Source Provisioning
Loki data source
File: /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki:3100
jsonData:
maxLines: 1000
derivedFields:
- datasourceUid: tempo_uid
matcherRegex: "trace_id=(\\w+)"
name: TraceID
url: "$${__value.raw}"
editable: false
Tempo data source
File: /etc/grafana/provisioning/datasources/tempo.yaml
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
uid: tempo_uid
jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: loki_uid
tags: ["job", "instance", "pod", "namespace"]
mappedTags: [{ key: "service.name", value: "service" }]
spanStartTimeShift: "1h"
spanEndTimeShift: "1h"
tracesToMetrics:
datasourceUid: prometheus_uid
tags: [{ key: "service.name", value: "service" }]
serviceMap:
datasourceUid: prometheus_uid
nodeGraph:
enabled: true
editable: false
Mimir/Prometheus data source
File: /etc/grafana/provisioning/datasources/mimir.yaml
apiVersion: 1
datasources:
- name: Mimir
type: prometheus
access: proxy
url: http://mimir:8080/prometheus
uid: prometheus_uid
jsonData:
httpMethod: POST
exemplarTraceIdDestinations:
- datasourceUid: tempo_uid
name: trace_id
prometheusType: Mimir
prometheusVersion: 2.40.0
cacheLevel: "高"
incrementalQuerying: true
incrementalQueryOverlapWindow: 10m
editable: false
Alerting
Alert rule configuration
Grafana unified alerting supports alerting across multiple data sources with flexible evaluation and routing.
Prometheus/Mimir alert rules
File: /etc/grafana/provisioning/alerting/rules.yaml
apiVersion: 1
groups:
- name: application_alerts
interval: 1m
rules:
- uid: error_rate_high
title: High error rate
condition: A
data:
- refId: A
queryType: ""
relativeTimeRange:
from: 300
to: 0
datasourceUid: prometheus_uid
model:
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
intervalMs: 1000
maxDataPoints: 43200
noDataState: NoData
execErrState: Error
for: 5m
annotations:
description: 'Error rate is {{ printf "%.2f" $values.A.Value }} (threshold: 0.05)'
summary: Application error rate exceeds threshold
runbook_url: https://wiki.company.com/runbooks/high-error-rate
labels:
severity: critical
team: platform
isPaused: false
- uid: high_latency
title: High P95 latency
condition: A
data:
- refId: A
datasourceUid: prometheus_uid
model:
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
) > 2
for: 10m
annotations:
description: "P95延迟为 {{ $values.A.Value }}s 在端点 {{ $labels.endpoint }}"
runbook_url: https://wiki.company.com/runbooks/high-latency
labels:
severity: warning
Loki alert rules
apiVersion: 1
groups:
- name: log_based_alerts
interval: 1m
rules:
- uid: error_spike
title: Error log spike
condition: A
data:
- refId: A
queryType: ""
datasourceUid: loki_uid
model:
expr: |
sum(rate({app="api"} | json | level="error" [5m]))
> 10
for: 2m
annotations:
description: "错误日志率为 {{ $values.A.Value }} 日志/秒"
summary: 检测到错误日志激增
labels:
severity: warning
- uid: critical_error_pattern
title: Critical error pattern detected
condition: A
data:
- refId: A
datasourceUid: loki_uid
model:
expr: |
sum(count_over_time({app="api"}
|~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
)) > 0
for: 0m
annotations:
description: "在日志中发现关键错误模式"
labels:
severity: critical
page: true
Contact Points and Notification Policies
File: /etc/grafana/provisioning/alerting/contactpoints.yaml
apiVersion: 1
contactPoints:
- orgId: 1
name: slack-critical
receivers:
- uid: slack_critical
type: slack
settings:
url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
title: "{{ .GroupLabels.alertname }}"
text: |
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
{{ end }}
disableResolveMessage: false
- orgId: 1
name: pagerduty-oncall
receivers:
- uid: pagerduty_oncall
type: pagerduty
settings:
integrationKey: YOUR_INTEGRATION_KEY
severity: critical
class: infrastructure
- orgId: 1
name: email-team
receivers:
- uid: email_team
type: email
settings:
addresses: team@company.com
singleEmail: true
notificationPolicies:
- orgId: 1
receiver: slack-critical
group_by: ["alertname", "namespace"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- receiver: pagerduty-oncall
matchers:
- severity = critical
- page = true
group_wait: 10s
repeat_interval: 1h
continue: true
- receiver: email-team
matchers:
- severity = warning
- team = platform
group_interval: 10m
repeat_interval: 12h
LogQL Query Patterns
Basic log queries
Stream selection
# Simple label matching
{namespace="production", app="api"}
# Regex matching
{app=~"api|web|worker"}
# Not equal
{env!="staging"}
# Multiple conditions
{namespace="production", app="api", level!="debug"}
Line filters
# Contains
{app="api"} |= "error"
# Does not contain
{app="api"} != "debug"
# Regex match
{app="api"} |~ "error|exception|fatal"
# Case-insensitive
{app="api"} |~ "(?i)error"
# Chained filters
{app="api"} |= "error" != "timeout"
Parsing and Extraction
JSON parsing
# Parse JSON logs
{app="api"} | json
# Extract specific fields
{app="api"} | json message="msg", level="severity"
# Filter on extracted fields
{app="api"} | json | level="error"
# Nested JSON (nested keys are flattened with _)
{app="api"} | json | line_format "{{.response_status}}"
Logfmt parsing
# Parse logfmt (key=value)
{app="api"} | logfmt
# Extract specific fields
{app="api"} | logfmt level, caller, msg
# Filter on parsed fields
{app="api"} | logfmt | level="error"
Pattern parsing
# Extract with a pattern
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`
# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400
# Complex patterns
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
Aggregations and Metrics
Count queries
# Log line count over time
count_over_time({app="api"}[5m])
# Log rate
rate({app="api"}[5m])
# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)
# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))
Extracted metrics
# Average duration
avg_over_time({app="api"}
| logfmt
| unwrap duration [5m]) by (endpoint)
# P95 latency
quantile_over_time(0.95, {app="api"}
| regexp `duration=(?P<duration>[0-9.]+)ms`
| unwrap duration [5m]) by (method)
# Top 10 error messages
topk(10,
sum by (msg) (
count_over_time({app="api"}
| json
| level="error" [1h]
)
)
)
TraceQL Query Patterns
Basic trace queries
# Find traces by service
{ .service.name = "api" }
# HTTP status code
{ .http.status_code = 500 }
# Combined conditions
{ .service.name = "api" && .http.status_code >= 400 }
# Duration filter
{ duration > 1s }
Advanced TraceQL
# Direct child spans (parent-child relationship)
{ .service.name = "frontend" }
> { .service.name = "backend" && .http.status_code = 500 }
# Descendant spans (any depth)
{ .service.name = "api" }
>> { .db.system = "postgresql" && duration > 1s }
# Failed database queries
{ .service.name = "api" }
>> { .db.system = "postgresql" && status = error }
Complete Dashboard Example
Application observability dashboard
{
"dashboard": {
"title": "应用可观测性 - ${app}",
"tags": ["可观测性", "应用"],
"timezone": "浏览器",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-1h",
"to": "now"
},
"templating": {
"list": [
{
"name": "app",
"type": "query",
"datasource": "Mimir",
"query": "label_values(up, app)",
"current": {
"selected": false,
"text": "api",
"value": "api"
}
},
{
"name": "namespace",
"type": "query",
"datasource": "Mimir",
"query": "label_values(up{app=\"$app\"}, namespace)",
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"title": "请求率",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
"legendFormat": "{{method}} - {{status}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"yaxes": [
{
"format": "reqps",
"label": "请求/秒"
}
]
},
{
"id": 2,
"title": "P95延迟",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
"legendFormat": "{{endpoint}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"yaxes": [
{
"format": "s",
"label": "持续时间"
}
],
"thresholds": [
{
"value": 1,
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt"
}
]
},
{
"id": 3,
"title": "错误率",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
"legendFormat": "错误 %"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"yaxes": [
{
"format": "percentunit",
"max": 1,
"min": 0
}
],
"alert": {
"conditions": [
{
"evaluator": {
"params": [0.01],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"type": "avg"
},
"type": "query"
}
],
"frequency": "1m",
"handler": 1,
"name": "错误率告警",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 4,
"title": "近期错误日志",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": false,
"showCommonLabels": false,
"wrapLogMessage": true,
"dedupStrategy": "none",
"enableLogDetails": true
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
}
}
],
"links": [
{
"title": "探索日志",
"url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
"type": "link",
"icon": "doc"
},
{
"title": "探索追踪",
"url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
"type": "link",
"icon": "gf-traces"
}
]
}
}
LGTM Stack Configuration
Loki configuration
File: loki.yaml
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
log_level: info
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: index_
period: 24h
storage_config:
aws:
s3: s3://us-east-1/my-loki-bucket
s3forcepathstyle: true
tsdb_shipper:
active_index_directory: /loki/tsdb-index
cache_location: /loki/tsdb-cache
shared_store: s3
limits_config:
retention_period: 744h # 31 days
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
max_query_series: 500
max_query_lookback: 30d
reject_old_samples: true
reject_old_samples_max_age: 168h
compactor:
working_directory: /loki/compactor
shared_store: s3
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
Tempo configuration
File: tempo.yaml
server:
http_listen_port: 3200
grpc_listen_port: 9096
distributor:
receivers:
otlp:
protocols:
http:
grpc:
jaeger:
protocols:
thrift_http:
grpc:
ingester:
max_block_duration: 5m
compactor:
compaction:
block_retention: 720h # 30 days
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
region: us-east-1
wal:
path: /var/tempo/wal
metrics_generator:
registry:
external_labels:
source: tempo
cluster: primary
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://mimir:9009/api/v1/push
send_exemplars: true
Production Best Practices
Performance optimization
Query optimization
- Put label filters before line filters
- Limit time ranges for expensive queries
- Use unwrap instead of parsing where possible
- Cache query results with the query frontend
Dashboard performance
- Limit the number of panels (< 15 per dashboard)
- Use appropriate time intervals
- Avoid high-cardinality groupings
- Use $__interval for adaptive sampling
Storage optimization
- Configure retention policies
- Use compaction for Loki and Tempo
- Implement tiered storage (hot/warm/cold)
- Monitor storage growth
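Retention can also be tuned per stream in Loki's limits_config, complementing the global retention_period shown earlier; a hedged sketch (the selectors and periods are examples, not recommendations):

```yaml
limits_config:
  retention_period: 744h          # global default: 31 days
  retention_stream:
    - selector: '{namespace="dev"}'
      priority: 1
      period: 24h                 # dev logs kept one day
    - selector: '{level="debug"}'
      priority: 2
      period: 72h                 # debug logs kept three days
```

When multiple selectors match a stream, the rule with the highest priority wins; the compactor must have retention_enabled for any of this to take effect.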
Security Best Practices
Authentication
- Enable authentication (auth_enabled: true in Loki/Tempo)
- Use OAuth/LDAP for Grafana
- Implement multi-tenancy with organization isolation
Authorization
- Configure RBAC in Grafana
- Restrict data source access by team
- Use dashboard folder permissions
Network security
- TLS for all components
- Network policies in Kubernetes
- Rate limiting at the ingress
Troubleshooting
Common issues
- High cardinality: too many unique label combinations
  - Solution: reduce label dimensions; use log parsing instead
- Query timeouts: complex queries over large datasets
  - Solution: reduce the time range, use aggregations, add query limits
- Storage growth: unbounded retention
  - Solution: configure retention policies, enable compaction
- Missing traces: incomplete trace data
  - Solution: check sampling rates, verify instrumentation