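name: data-visualization
description: Create effective data visualizations, charts, dashboards, and reports across analytics, infrastructure monitoring, and machine learning. Covers library selection, UX design, and accessibility. Trigger keywords: chart, graph, plot, dashboard, report, visualization, matplotlib, plotly, d3, seaborn, grafana, tableau, superset, metabase, KPI, metrics, analytics, histogram, heatmap, time series, scatter plot, bar chart.
allowed-tools: Read, Grep, Glob, Edit, Write, Bash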

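Data Visualization

Overview

This skill creates effective data visualizations that communicate insights clearly across three domains: analytics dashboards, infrastructure monitoring, and machine learning model performance tracking. It combines data analysis expertise, dashboard architecture patterns, UX design principles, and accessibility considerations.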

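Instructions

1. Understand the data and context

Analyze the data's structure, types, and granularity
Identify key metrics, dimensions, and relationships
Determine the analytical question to answer or the story to tell
Consider the target audience and its technical level
Assess update frequency and real-time requirements
Identify domain-specific requirements (analytics, monitoring, machine learning)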
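
A first pass over structure, types, and granularity can be sketched in plain Python. The rows and column names below are hypothetical sample data, and `profile_columns` is an illustrative helper rather than part of any library; a DataFrame summary would likely serve the same purpose on a real project.

```python
from collections import Counter
from datetime import date


def profile_columns(rows):
    """Summarize each column: dominant type, distinct values, missing values."""
    profile = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        kinds = Counter(type(v).__name__ for v in present)
        profile[col] = {
            "kind": kinds.most_common(1)[0][0] if kinds else "unknown",
            "distinct": len(set(present)),
            "missing": len(values) - len(present),
        }
    return profile


# Hypothetical sample: one row per region per day
rows = [
    {"day": date(2024, 1, 1), "region": "north", "revenue": 120.0},
    {"day": date(2024, 1, 1), "region": "south", "revenue": 95.5},
    {"day": date(2024, 1, 2), "region": "north", "revenue": None},
]
summary = profile_columns(rows)
for column, stats in summary.items():
    print(column, stats)
```

Low distinct counts suggest dimensions (here `region`), numeric columns suggest metrics, and the missing count flags gaps to resolve before charting.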

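2. Choose the visualization domain

Analytics dashboards: business metrics, KPIs, trends, comparisons

Tools: Tableau, Superset, Metabase, Plotly Dash, Streamlit
Focus: executive summaries, drill-down exploration, report generation

Infrastructure monitoring: system health, resource usage, alerts

Tools: Grafana, Prometheus, Datadog, CloudWatch
Focus: real-time metrics, alert thresholds, SLA tracking

Machine learning model performance: training metrics, predictions, model diagnostics

Tools: TensorBoard, Weights & Biases, MLflow, matplotlib
Focus: loss curves, confusion matrices, feature importance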

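3. Choose chart types and layout

Match the chart type to the data relationship (see the chart type selection guide in Example 3)
Consider data volume and complexity
Plan interactivity needs (tooltips, filters, zoom)
Design the information hierarchy (KPIs first, details below)
Consider accessibility (color, contrast, screen readers)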

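4. Design for clarity and usability

Visual design:

Choose an accessible color scheme (colorblind-safe palettes)
Label axes and data clearly, with units
Remove chartjunk and unnecessary decoration
Highlight key insights with annotations
Use consistent styling across the dashboard

UX considerations:

Provide context (baselines, targets, historical comparisons)
Enable progressive disclosure (summary → detail)
Add clear legends and tooltips
Include data freshness indicators
Design for both desktop and mobile views

Accessibility:

Meet WCAG AA contrast ratios (4.5:1 for text, 3:1 for graphical elements)
Supplement color with patterns or textures
Provide alt text for static visualizations
Support keyboard navigation for interactive charts
Include a data table as a fallback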
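
The contrast thresholds can be checked programmatically. The following is a minimal sketch of the WCAG 2.x relative-luminance and contrast-ratio formulas; the helper names and the `kind` parameter are illustrative assumptions, not an existing API.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel_8bit / 255.0
    # WCAG 2.x prints the cutoff as 0.03928; for 8-bit input the result is identical
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a '#RRGGBB' color, from 0 (black) to 1 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1:1 up to 21:1."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def meets_wcag_aa(foreground, background, kind="text"):
    """AA requires 4.5:1 for text and 3:1 for graphical elements."""
    required = 4.5 if kind == "text" else 3.0
    return contrast_ratio(foreground, background) >= required


print(round(contrast_ratio("#000000", "#FFFFFF"), 2))       # 21.0
print(meets_wcag_aa("#767676", "#FFFFFF"))                  # True
print(meets_wcag_aa("#F0E442", "#FFFFFF", kind="graphic"))  # False
```

The last line shows why a palette being colorblind-safe is not enough on its own: yellow bars on a white background still fail the 3:1 threshold for graphical elements.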

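5. Implement and validate

Build the visualization with the chosen tool
Test with realistic data at production scale
Verify calculations and aggregations
Verify performance (dashboard load time under 3 seconds)
Collect feedback from end users
Iterate based on usage patterns and analytics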
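
The aggregation and load-time checks could be automated along these lines. The row data and the helpers `aggregate_by`, `totals_match`, and `timed` are hypothetical names used only for illustration; the 3-second budget comes from the guideline above.

```python
import math
import time
from collections import defaultdict


def aggregate_by(rows, key, value):
    """Sum `value` per distinct `key`, as a dashboard query might."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def totals_match(rows, aggregated, value):
    """The grand total must survive aggregation unchanged."""
    raw_total = math.fsum(row[value] for row in rows)
    return math.isclose(raw_total, math.fsum(aggregated.values()), rel_tol=1e-9)


def timed(fn, *args, budget_s=3.0):
    """Run fn and return (result, elapsed seconds, within budget)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_s


# Hypothetical order rows
rows = [
    {"region": "north", "revenue": 120.0},
    {"region": "south", "revenue": 95.5},
    {"region": "north", "revenue": 30.25},
]
by_region, elapsed, within_budget = timed(aggregate_by, rows, "region", "revenue")
print(by_region, totals_match(rows, by_region, "revenue"), within_budget)
```

A check like this is most useful when run against production-scale data, since both rounding drift and slow queries tend to appear only at volume.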

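Best Practices

General principles

Right chart for the data: match the visualization to the data relationship and the question
Less is more: remove unnecessary elements and maximize the data-ink ratio
Consistent styling: use a coherent color scheme and typography
Design for accessibility: support colorblind users, screen readers, and keyboard navigation
Clear labeling: descriptive titles, axis labels with units, data sources
Context matters: include baselines, targets, and historical comparisons
Interactivity where it helps: add tooltips, filters, and zoom for exploration
Performance first: keep dashboard load time under 3 seconds
Mobile responsive: design for desktop and mobile viewports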

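Domain-specific guidelines

Analytics dashboards:

Start with an executive summary (4-6 KPIs with trends)
Support drill-down from summary to detail
Include a date range picker and common filters
Provide export options (PDF, CSV, PNG)
Cache expensive queries and refresh them on a schedule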
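
One way to approach the caching guideline is a small time-to-live cache. This is a sketch under stated assumptions: `TTLCache` and `expensive_query` are hypothetical, and the cache refreshes lazily on the first read after expiry rather than on a timer, which is a simplification of a true scheduled refresh.

```python
import time


class TTLCache:
    """Cache query results and recompute them once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the expensive query
        value = compute()
        self._store[key] = (now, value)
        return value


# A fake clock makes the refresh behavior easy to observe
fake_now = [0.0]
calls = []


def expensive_query():
    """Stand-in for a slow warehouse query."""
    calls.append(1)
    return {"revenue": 1200}


cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])
cache.get_or_compute("kpis", expensive_query)  # runs the query
cache.get_or_compute("kpis", expensive_query)  # served from cache
fake_now[0] = 301.0
cache.get_or_compute("kpis", expensive_query)  # stale: runs the query again
print(len(calls))  # 2
```

Injecting the clock keeps the cache testable without sleeping; in a dashboard the default `time.monotonic` would be used.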

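Infrastructure monitoring:

Use time-series line charts for metrics that change over time
Mark alert thresholds with colored bands
Highlight current status (healthy/degraded/down)
Include percentile metrics (p50, p95, p99), not just averages
Auto-refresh dashboards every 30-60 seconds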
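
The reason for reporting percentiles alongside averages is easiest to see with numbers. The sketch below uses hypothetical latency values and a simple nearest-rank `percentile` helper; monitoring systems often use interpolated or histogram-based estimates instead.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, with a slow tail
latencies_ms = [100] * 90 + [400] * 8 + [3000] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

Here the mean (182 ms) suggests a healthy service, while p99 (3000 ms) shows that the slowest requests are thirty times slower than the median.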

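Machine learning model performance:

Plot training and validation curves together
Show confusion matrices with normalized values
Visualize feature importance with horizontal bars
Include sample predictions for interpretability
Track metrics across experiments for comparison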
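
Normalizing a confusion matrix usually means dividing each row by the number of true samples in that class, so every cell reads as a per-class rate. A minimal sketch in plain Python, with hypothetical labels and illustrative helper names:

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so the row sums to 1 (empty rows stay 0)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Hypothetical labels for a 3-class problem
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]

counts = confusion_counts(y_true, y_pred, n_classes=3)
rates = normalize_rows(counts)
print(counts)
print(rates)
```

With imbalanced classes, raw counts make large classes dominate the color scale; row-normalized rates keep small classes readable.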

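Accessibility checklist

Use colorblind-safe palettes (avoid relying on red/green alone)
Ensure a text contrast ratio of at least 4.5:1
Provide alt text for static images
Support keyboard navigation (Tab, arrow keys)
Include a data table as an alternative representation
Test with screen readers (NVDA, JAWS)
Supplement color with patterns or textures
Avoid flashing or rapidly changing content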
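
Supplementing color with patterns can be handled once, at the point where series styles are assigned. The sketch below pairs each series with an Okabe-Ito color and a matplotlib hatch string; `series_styles` and the series names are hypothetical.

```python
from itertools import cycle

# Okabe-Ito colors and hatch strings that matplotlib understands
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]
HATCHES = ["/", "\\", "x", ".", "-", "+", "o"]


def series_styles(series_names):
    """Give every series both a color and a hatch, so color is never the only cue."""
    pairs = zip(cycle(OKABE_ITO), cycle(HATCHES))
    return {
        name: {"color": color, "hatch": hatch}
        for name, (color, hatch) in zip(series_names, pairs)
    }


styles = series_styles(["Organic", "Paid Search", "Social"])
for name, style in styles.items():
    print(name, style)
```

Each entry can be unpacked into a matplotlib bar call, for example `ax.bar(x, y, **styles["Social"])`, since `color` and `hatch` are both accepted keyword arguments there.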

Examples

Example 1: Python with Matplotlib/Seaborn

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Set a clean, professional style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Create a figure with a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Chart 1: time-series line chart
df_sales = pd.DataFrame({
    # 'MS' (month start) avoids the 'M' alias, which pandas 2.2+ deprecates
    'date': pd.date_range('2024-01-01', periods=12, freq='MS'),
    'revenue': [100, 120, 115, 140, 155, 170, 165, 180, 195, 210, 225, 250],
    'target': [110, 115, 120, 130, 145, 160, 175, 185, 200, 215, 230, 245]
})

ax1 = axes[0, 0]
ax1.plot(df_sales['date'], df_sales['revenue'], marker='o', linewidth=2, label='Actual')
ax1.plot(df_sales['date'], df_sales['target'], linestyle='--', linewidth=2, label='Target')
# Shade the gap between actual and target: green when ahead, red when behind
ax1.fill_between(df_sales['date'], df_sales['revenue'], df_sales['target'],
                  alpha=0.3, where=(df_sales['revenue'] >= df_sales['target']), color='green')
ax1.fill_between(df_sales['date'], df_sales['revenue'], df_sales['target'],
                  alpha=0.3, where=(df_sales['revenue'] < df_sales['target']), color='red')
ax1.set_title('Monthly Revenue vs Target', fontsize=14, fontweight='bold')
ax1.set_xlabel('Month')
ax1.set_ylabel('Revenue (thousand USD)')
ax1.legend()
ax1.tick_params(axis='x', rotation=45)

# Chart 2: horizontal bar chart for comparison
df_products = pd.DataFrame({
    'product': ['Product A', 'Product B', 'Product C', 'Product D', 'Product E'],
    'sales': [45, 32, 28, 22, 18]
})

ax2 = axes[0, 1]
colors = sns.color_palette("Blues_r", len(df_products))
bars = ax2.barh(df_products['product'], df_products['sales'], color=colors)
ax2.bar_label(bars, padding=3, fmt='$%.0fK')
ax2.set_title('Sales by Product', fontsize=14, fontweight='bold')
ax2.set_xlabel('Sales (thousand USD)')
ax2.invert_yaxis()  # largest value at the top

# Chart 3: scatter plot with a regression line
np.random.seed(42)
ad_spend = np.random.uniform(10, 100, 50)
df_scatter = pd.DataFrame({
    'ad_spend': ad_spend,
    # Synthetic linear relationship plus Gaussian noise
    'conversions': ad_spend * 2.5 + np.random.normal(0, 15, 50)
})

ax3 = axes[1, 0]
sns.regplot(data=df_scatter, x='ad_spend', y='conversions', ax=ax3,
            scatter_kws={'alpha': 0.6}, line_kws={'color': 'red'})
ax3.set_title('Ad Spend vs Conversions', fontsize=14, fontweight='bold')
ax3.set_xlabel('Ad spend (thousand USD)')
ax3.set_ylabel('Conversions')

# Chart 4: donut chart for composition
df_channels = pd.DataFrame({
    'channel': ['Organic', 'Paid Search', 'Social', 'Email', 'Direct'],
    'traffic': [35, 25, 20, 12, 8]
})

ax4 = axes[1, 1]
wedges, texts, autotexts = ax4.pie(
    df_channels['traffic'],
    labels=df_channels['channel'],
    autopct='%1.1f%%',
    pctdistance=0.75,
    wedgeprops=dict(width=0.5)  # a wedge width below 1 turns the pie into a donut
)
ax4.set_title('Traffic by Channel', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

Example 2: Interactive visualization with Plotly

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Build the data for an interactive time series (synthetic trend, seasonality, and noise)
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=365, freq='D'),
    'value': (pd.Series(range(365)) * 0.1 +
              np.sin(pd.Series(range(365)) * 0.1) * 20 +
              np.random.normal(0, 5, 365)).cumsum()
})

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=df['date'],
    y=df['value'],
    mode='lines',
    name='Daily value',
    line=dict(color='#1f77b4', width=1.5),
    hovertemplate='%{x|%B %d, %Y}<br>Value: %{y:.2f}<extra></extra>'
))

# Add a 7-day moving average
df['ma_7'] = df['value'].rolling(7).mean()
fig.add_trace(go.Scatter(
    x=df['date'],
    y=df['ma_7'],
    mode='lines',
    name='7-day moving average',
    line=dict(color='#ff7f0e', width=2, dash='dash')
))

fig.update_layout(
    title='Daily Performance with Moving Average',
    xaxis_title='Date',
    yaxis_title='Value',
    hovermode='x unified',
    template='plotly_white',
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=7, label="1w", step="day", stepmode="backward"),
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=3, label="3m", step="month", stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(visible=True)
    )
)

fig.write_html('interactive_chart.html')
fig.show()

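Example 3: Chart type selection guide

## Choosing a chart by data type

### Comparison

- **Bar chart**: compare values across categories
- **Grouped bar chart**: compare multiple series across categories
- **Bullet chart**: show performance against a target

### Distribution

- **Histogram**: show a frequency distribution
- **Box plot**: show summary statistics of a distribution
- **Violin plot**: show the shape of a distribution

### Composition

- **Pie/donut chart**: show parts of a whole (fewer than 6 categories)
- **Stacked bar chart**: show composition across categories
- **Treemap**: show hierarchical composition

### Relationship

- **Scatter plot**: show the correlation between two variables
- **Bubble chart**: add a third dimension through size
- **Heatmap**: show a correlation matrix

### Time series

- **Line chart**: show trends over time
- **Area chart**: show cumulative trends
- **Candlestick chart**: show OHLC financial data

### Geographic

- **Choropleth map**: show values by region
- **Point map**: show locations with values
- **Flow map**: show movement between locations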
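
The guide can also be encoded as a lookup table so that chart suggestions are consistent across a project. This is an illustrative sketch: `CHART_GUIDE` and `suggest_charts` are hypothetical names, and the only rule applied beyond the table is the guide's own limit of fewer than 6 categories for pie/donut charts.

```python
CHART_GUIDE = {
    "comparison": ["bar", "grouped bar", "bullet"],
    "distribution": ["histogram", "box plot", "violin plot"],
    "composition": ["pie/donut", "stacked bar", "treemap"],
    "relationship": ["scatter", "bubble", "heatmap"],
    "time series": ["line", "area", "candlestick"],
    "geographic": ["choropleth", "point map", "flow map"],
}


def suggest_charts(relationship, n_categories=None):
    """Return candidate chart types for a data relationship."""
    try:
        options = list(CHART_GUIDE[relationship])
    except KeyError:
        raise ValueError(f"unknown relationship: {relationship!r}") from None
    # The guide reserves pie/donut charts for fewer than 6 categories
    if n_categories is not None and n_categories >= 6:
        options = [chart for chart in options if chart != "pie/donut"]
    return options


print(suggest_charts("composition", n_categories=4))
print(suggest_charts("composition", n_categories=8))
```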

Example 4: Analytics dashboard with Streamlit

# Streamlit dashboard example
import streamlit as st
import pandas as pd
import plotly.express as px

st.set_page_config(page_title="Sales Dashboard", layout="wide")

# Header
st.title("Sales Performance Dashboard")
st.markdown("---")

# KPI row
col1, col2, col3, col4 = st.columns(4)
with col1:
    st.metric("Total Revenue", "$1.2M", "+12%")
with col2:
    st.metric("Orders", "8,543", "+8%")
with col3:
    st.metric("Average Order Value", "$140", "+3%")
with col4:
    st.metric("Conversion Rate", "3.2%", "-0.5%")

st.markdown("---")

# Filters
with st.sidebar:
    st.header("Filters")
    date_range = st.date_input("Date range", [])
    region = st.multiselect("Region", ["North", "South", "East", "West"])
    category = st.selectbox("Category", ["All", "Electronics", "Clothing", "Home"])

# Main charts
left_col, right_col = st.columns([2, 1])

with left_col:
    st.subheader("Revenue Trend")
    # Line chart goes here

with right_col:
    st.subheader("Sales by Region")
    # Pie chart goes here

# Detail table
st.subheader("Recent Orders")
# Data table goes here

Example 5: Grafana dashboard for infrastructure monitoring

# Dashboard provider configuration (YAML) for file-based provisioning
apiVersion: 1
providers:
  - name: "default"
    folder: "Infrastructure"
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards

---
# Grafana dashboard JSON structure
# Save as dashboard.json and import it through the Grafana UI
{
  "dashboard": {
    "title": "Service Health Dashboard",
    "tags": ["infrastructure", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{service}}"
          }
        ],
        "yaxes": [{ "format": "reqps", "label": "Requests/sec" }],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [100], "type": "gt" },
              "query": { "params": ["A", "5m", "now"] }
            }
          ]
        }
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_errors_total[5m]) / rate(http_requests_total[5m])",
            "legendFormat": "Error %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.01, "color": "yellow" },
                { "value": 0.05, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "id": 3,
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          }
        ]
      }
    ],
    "refresh": "30s",
    "time": { "from": "now-1h", "to": "now" }
  }
}

Example 6: Machine learning model performance visualization

# TensorBoard-style training visualization
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

# Training history visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Loss curves (synthetic data: exponential decay plus noise)
ax1 = axes[0, 0]
epochs = range(1, 51)
train_loss = np.exp(-np.array(epochs) / 10) + np.random.normal(0, 0.05, 50)
val_loss = np.exp(-np.array(epochs) / 10) + np.random.normal(0, 0.08, 50) + 0.1

ax1.plot(epochs, train_loss, label='Training loss', linewidth=2)
ax1.plot(epochs, val_loss, label='Validation loss', linewidth=2)
ax1.axvline(x=30, color='red', linestyle='--', alpha=0.5, label='Best model')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)

# Accuracy curves (derived from the synthetic loss, for illustration only)
ax2 = axes[0, 1]
train_acc = 1 - train_loss
val_acc = 1 - val_loss

ax2.plot(epochs, train_acc, label='Training accuracy', linewidth=2)
ax2.plot(epochs, val_acc, label='Validation accuracy', linewidth=2)
ax2.axvline(x=30, color='red', linestyle='--', alpha=0.5)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training and Validation Accuracy', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)

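# Confusion matrix, normalized per actual class so each row sums to 1
ax3 = axes[1, 0]
y_true = np.random.randint(0, 3, 500)
y_pred = y_true.copy()
# Replace roughly 15% of predictions with random labels to simulate errors;
# the replacement array must match the number of masked positions
noise_mask = np.random.random(500) < 0.15
y_pred[noise_mask] = np.random.randint(0, 3, noise_mask.sum())

cm = confusion_matrix(y_true, y_pred, normalize='true')
sns.heatmap(cm, annot=True, fmt='.2f', cmap='Blues', ax=ax3,
            xticklabels=['Class A', 'Class B', 'Class C'],
            yticklabels=['Class A', 'Class B', 'Class C'])
ax3.set_xlabel('Predicted')
ax3.set_ylabel('Actual')
ax3.set_title('Confusion Matrix (normalized)', fontsize=14, fontweight='bold')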

# Feature importance
ax4 = axes[1, 1]
features = ['Feature A', 'Feature B', 'Feature C', 'Feature D', 'Feature E']
importance = np.array([0.35, 0.28, 0.18, 0.12, 0.07])

colors = plt.cm.viridis(importance / importance.max())
bars = ax4.barh(features, importance, color=colors)
ax4.set_xlabel('Importance')
ax4.set_title('Feature Importance', fontsize=14, fontweight='bold')
ax4.invert_yaxis()

for i, (bar, val) in enumerate(zip(bars, importance)):
    ax4.text(val + 0.01, i, f'{val:.2f}', va='center')

plt.tight_layout()
plt.savefig('ml_performance_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

Example 7: Accessible color palettes

# Colorblind-safe palettes for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Define accessible color schemes
palettes = {
    'colorblind_safe': ['#0173B2', '#DE8F05', '#029E73', '#CC78BC', '#CA9161'],
    'high_contrast': ['#000000', '#E69F00', '#56B4E9', '#009E73', '#F0E442'],
    'okabe_ito': ['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7']
}

# Try each accessible palette on the same bar chart
# Five bars, so the five-color palettes never have to repeat a color
data = [25, 20, 18, 15, 12]
labels = ['A', 'B', 'C', 'D', 'E']

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (name, colors) in zip(axes, palettes.items()):
    ax.bar(labels, data, color=colors[:len(data)])
    ax.set_title(f'{name.replace("_", " ").title()}', fontsize=12, fontweight='bold')
    ax.set_ylabel('Value')

    # Add value labels so bars can be read without relying on color
    for i, (label, value) in enumerate(zip(labels, data)):
        ax.text(i, value + 0.5, str(value), ha='center', va='bottom')

plt.tight_layout()
plt.savefig('accessible_palettes.png', dpi=150, bbox_inches='tight')