name: tensorboard
description: Use TensorBoard - Google's machine learning visualization toolkit - to visualize training metrics, debug models with histograms, compare experiments, visualize model graphs, and profile performance.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [MLOps, TensorBoard, Visualization, Training Metrics, Model Debugging, PyTorch, TensorFlow, Experiment Tracking, Profiling]
dependencies: [tensorboard, torch, tensorflow]
TensorBoard: Machine Learning Visualization Toolkit
When to Use This Skill
Use TensorBoard when you need to:
- Visualize training metrics such as loss and accuracy over time
- Debug models with histograms and distributions
- Compare multiple experiments
- Visualize model graphs and architectures
- Project embeddings into lower-dimensional spaces (t-SNE, PCA)
- Track hyperparameter experiments
- Profile performance and identify bottlenecks
- Visualize images and text during training
Users: 20M+ downloads per year | GitHub stars: 27k+ | License: Apache 2.0
Installation
# Install TensorBoard
pip install tensorboard
# PyTorch integration
pip install torch torchvision tensorboard
# TensorFlow integration (TensorBoard is included)
pip install tensorflow
# Launch TensorBoard
tensorboard --logdir=runs
# Open http://localhost:6006
Quick Start
PyTorch
from torch.utils.tensorboard import SummaryWriter
# Create a writer
writer = SummaryWriter('runs/experiment_1')
# Training loop
for epoch in range(10):
    train_loss = train_epoch()
    val_acc = validate()
    # Log metrics
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)
# Close the writer
writer.close()
# Launch: tensorboard --logdir=runs
TensorFlow/Keras
import tensorflow as tf
# Create the callback
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs/fit',
    histogram_freq=1
)
# Train the model
model.fit(
    x_train, y_train,
    epochs=10,
    validation_data=(x_val, y_val),
    callbacks=[tensorboard_callback]
)
# Launch: tensorboard --logdir=logs
Core Concepts
1. SummaryWriter (PyTorch)
from torch.utils.tensorboard import SummaryWriter
# Default directory: runs/CURRENT_DATETIME
writer = SummaryWriter()
# Custom directory
writer = SummaryWriter('runs/experiment_1')
# Custom comment (appended to the default directory name)
writer = SummaryWriter(comment='baseline')
# Log data: add_scalar(tag, value, global_step)
writer.add_scalar('Loss/train', 0.5, 0)
writer.add_scalar('Loss/train', 0.3, 1)
# Flush and close
writer.flush()
writer.close()
2. Logging Scalars
# PyTorch
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
for epoch in range(100):
    train_loss, train_acc = train()
    val_loss, val_acc = validate()
    # Log individual metrics
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/val', val_loss, epoch)
    writer.add_scalar('Accuracy/train', train_acc, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)
    # Learning rate
    lr = optimizer.param_groups[0]['lr']
    writer.add_scalar('Learning_rate', lr, epoch)
writer.close()
# TensorFlow
import tensorflow as tf
train_summary_writer = tf.summary.create_file_writer('logs/train')
val_summary_writer = tf.summary.create_file_writer('logs/val')
for epoch in range(100):
    with train_summary_writer.as_default():
        tf.summary.scalar('loss', train_loss, step=epoch)
        tf.summary.scalar('accuracy', train_acc, step=epoch)
    with val_summary_writer.as_default():
        tf.summary.scalar('loss', val_loss, step=epoch)
        tf.summary.scalar('accuracy', val_acc, step=epoch)
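The Scalars dashboard's smoothing slider applies an exponential moving average to these curves before drawing them. A minimal pure-Python sketch of that smoothing (assuming TensorBoard's debiased-EMA behavior; the exact formula may vary by version):

```python
def smooth(values, weight=0.6):
    """Debiased exponential moving average over a scalar series,
    similar to the smoothing slider in the Scalars dashboard.
    `weight` plays the role of the slider value in [0, 1)."""
    smoothed, last = [], 0.0
    for i, v in enumerate(values, start=1):
        last = last * weight + (1.0 - weight) * v
        # Debiasing keeps early points from being pulled toward zero
        smoothed.append(last / (1.0 - weight ** i))
    return smoothed
```

With debiasing, a constant series stays constant at any smoothing weight, which is why the slider never distorts flat curves.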
3. Logging Multiple Scalars
# PyTorch: group related metrics in one chart
writer.add_scalars('Loss', {
    'train': train_loss,
    'validation': val_loss,
    'test': test_loss
}, epoch)
writer.add_scalars('Metrics', {
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1': f1_score
}, epoch)
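The prefix before the first '/' in a tag determines which section of the Scalars dashboard a chart lands in. A small helper illustrating that grouping rule (a hypothetical function, not part of the TensorBoard API):

```python
from collections import defaultdict

def group_tags(tags):
    """Group scalar tags the way TensorBoard arranges charts:
    everything before the first '/' becomes the section name,
    the remainder becomes the chart name within that section."""
    sections = defaultdict(list)
    for tag in tags:
        section, _, name = tag.partition('/')
        # A tag with no '/' forms its own single-chart section
        sections[section].append(name or section)
    return dict(sections)
```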
4. Logging Images
# PyTorch
import torch
from torchvision.utils import make_grid
# A single image
writer.add_image('Input/sample', img_tensor, epoch)
# Multiple images as a grid
img_grid = make_grid(images[:64], nrow=8)
writer.add_image('Batch/inputs', img_grid, epoch)
# Visualize predictions
pred_grid = make_grid(predictions[:16], nrow=4)
writer.add_image('Predictions', pred_grid, epoch)
# TensorFlow
import tensorflow as tf
with file_writer.as_default():
    # Log a batch of images (encoded as PNG)
    tf.summary.image('Training samples', images, step=epoch, max_outputs=25)
5. Logging Histograms
# PyTorch: track weight distributions
for name, param in model.named_parameters():
    writer.add_histogram(name, param, epoch)
    # Track gradients
    if param.grad is not None:
        writer.add_histogram(f'{name}.grad', param.grad, epoch)
# Track activations
writer.add_histogram('Activations/relu1', activations, epoch)
# TensorFlow
with file_writer.as_default():
    tf.summary.histogram('weights/layer1', layer1.kernel, step=epoch)
    tf.summary.histogram('activations/relu1', activations, step=epoch)
6. Logging the Model Graph
# PyTorch
import torch
model = MyModel()
dummy_input = torch.randn(1, 3, 224, 224)
writer.add_graph(model, dummy_input)
writer.close()
# TensorFlow (automatic with Keras)
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs',
    write_graph=True
)
model.fit(x, y, callbacks=[tensorboard_callback])
Advanced Features
Embedding Projector
Visualize high-dimensional data (embeddings, features) in 2D/3D.
import torch
from torch.utils.tensorboard import SummaryWriter
# Get embeddings (e.g., word embeddings, image features)
embeddings = model.get_embeddings(data)  # Shape: (N, embedding_dim)
# Metadata (a label for each point)
metadata = ['class_1', 'class_2', 'class_1', ...]
# Images (optional, for image embeddings)
label_images = torch.stack([img1, img2, img3, ...])
# Log to TensorBoard
writer.add_embedding(
    embeddings,
    metadata=metadata,
    label_img=label_images,
    global_step=epoch
)
In TensorBoard:
- Navigate to the "Projector" tab
- Choose PCA, t-SNE, or UMAP visualization
- Search, filter, and explore clusters
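The projector reads per-point labels from a metadata.tsv file: single-column metadata takes no header row, while multi-column metadata requires one. A sketch of building that file's contents (hypothetical helper, following the projector's documented TSV conventions):

```python
def metadata_tsv(rows, header=None):
    """Build the contents of a metadata.tsv for the Embedding Projector.
    `rows` is either a list of labels (single column, no header) or a
    list of tuples (multiple columns, `header` names the columns)."""
    lines = []
    if header is not None:
        lines.append('\t'.join(header))
    for row in rows:
        if isinstance(row, (list, tuple)):
            lines.append('\t'.join(str(c) for c in row))
        else:
            lines.append(str(row))
    # One row per embedding point, newline-terminated
    return '\n'.join(lines) + '\n'
```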
Hyperparameter Tuning
from torch.utils.tensorboard import SummaryWriter
# Try different hyperparameters
for lr in [0.001, 0.01, 0.1]:
    for batch_size in [16, 32, 64]:
        # Create a unique run directory
        writer = SummaryWriter(f'runs/lr{lr}_bs{batch_size}')
        # Train and log
        for epoch in range(10):
            loss = train(lr, batch_size)
            writer.add_scalar('Loss/train', loss, epoch)
        # Log hyperparameters alongside the final metrics
        writer.add_hparams(
            {'lr': lr, 'batch_size': batch_size},
            {'hparam/accuracy': final_acc, 'hparam/loss': final_loss}
        )
        writer.close()
# Compare in TensorBoard's "HParams" tab
Text Logging
# PyTorch: log text (e.g., model predictions, summaries)
writer.add_text('Predictions', f'Epoch {epoch}: {predictions}', epoch)
writer.add_text('Config', str(config), 0)
# Log a Markdown table
markdown_table = """
| Metric | Value |
|--------|-------|
| Accuracy | 0.95 |
| F1 Score | 0.93 |
"""
writer.add_text('Results', markdown_table, epoch)
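Since the Text tab renders Markdown, a small helper can turn a metrics dict into a table string for add_text (hypothetical helper, not part of the TensorBoard API):

```python
def metrics_table(metrics):
    """Format a {name: value} dict as a Markdown table string,
    suitable for writer.add_text in the Text dashboard."""
    lines = ['| Metric | Value |', '|--------|-------|']
    for name, value in metrics.items():
        lines.append(f'| {name} | {value:.4f} |')
    return '\n'.join(lines)
```

Usage: `writer.add_text('Results', metrics_table({'Accuracy': 0.95, 'F1': 0.93}), epoch)`.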
PR Curves
Precision-recall curves for classification.
from torch.utils.tensorboard import SummaryWriter
# Get predictions and labels; add_pr_curve expects probabilities in [0, 1]
probs = torch.softmax(model(test_data), dim=1)  # Shape: (N, num_classes)
labels = test_labels  # Shape: (N,)
# Log a PR curve for each class
for i in range(num_classes):
    writer.add_pr_curve(
        f'PR_curve/class_{i}',
        labels == i,
        probs[:, i],
        global_step=epoch
    )
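Under the hood, add_pr_curve sweeps thresholds over the scores and computes precision and recall at each. A pure-Python sketch of those quantities (illustrative only; TensorBoard uses its own fixed binning):

```python
def pr_points(labels, scores, thresholds):
    """(precision, recall) at each threshold for binary labels and
    scores in [0, 1] -- the quantities a PR curve is built from."""
    points = []
    for t in thresholds:
        preds = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        # Convention: precision is 1.0 when nothing is predicted positive
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((precision, recall))
    return points
```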
Integration Examples
PyTorch Training Loop
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
from torchvision.utils import make_grid
# Setup
writer = SummaryWriter('runs/resnet_experiment')
model = ResNet50()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Log the model graph
dummy_input = torch.randn(1, 3, 224, 224)
writer.add_graph(model, dummy_input)
# Training loop
for epoch in range(50):
    model.train()
    train_loss = 0.0
    train_correct = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        pred = output.argmax(dim=1)
        train_correct += pred.eq(target).sum().item()
        # Log batch metrics (every 100 batches)
        if batch_idx % 100 == 0:
            global_step = epoch * len(train_loader) + batch_idx
            writer.add_scalar('Loss/train_batch', loss.item(), global_step)
    # Epoch metrics
    train_loss /= len(train_loader)
    train_acc = train_correct / len(train_loader.dataset)
    # Validation
    model.eval()
    val_loss = 0.0
    val_correct = 0
    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            val_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            val_correct += pred.eq(target).sum().item()
    val_loss /= len(val_loader)
    val_acc = val_correct / len(val_loader.dataset)
    # Log epoch metrics
    writer.add_scalars('Loss', {'train': train_loss, 'val': val_loss}, epoch)
    writer.add_scalars('Accuracy', {'train': train_acc, 'val': val_acc}, epoch)
    # Log the learning rate
    writer.add_scalar('Learning_rate', optimizer.param_groups[0]['lr'], epoch)
    # Log histograms (every 5 epochs)
    if epoch % 5 == 0:
        for name, param in model.named_parameters():
            writer.add_histogram(name, param, epoch)
    # Log sample inputs
    if epoch % 10 == 0:
        sample_images = data[:8]
        writer.add_image('Sample_inputs', make_grid(sample_images), epoch)
writer.close()
TensorFlow/Keras Training
import tensorflow as tf
# Define the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# TensorBoard callback
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs/fit',
    histogram_freq=1,          # Log histograms every epoch
    write_graph=True,          # Visualize the model graph
    write_images=True,         # Visualize weights as images
    update_freq='epoch',       # Log metrics every epoch
    profile_batch='500,520',   # Profile batches 500-520
    embeddings_freq=1          # Log embeddings every epoch
)
# Train
model.fit(
    x_train, y_train,
    epochs=10,
    validation_data=(x_val, y_val),
    callbacks=[tensorboard_callback]
)
Comparing Experiments
Multiple Runs
# Run experiments with different configurations
python train.py --lr 0.001 --logdir runs/exp1
python train.py --lr 0.01 --logdir runs/exp2
python train.py --lr 0.1 --logdir runs/exp3
# View all runs together
tensorboard --logdir=runs
In TensorBoard:
- All runs appear on the same dashboard
- Toggle runs on and off to compare them
- Filter run names with a regular expression
- Overlay charts to compare metrics
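The sidebar's run filter matches run names by regular expression; the same selection can be reproduced in Python, e.g. to pick runs for offline analysis (hypothetical helper):

```python
import re

def filter_runs(run_names, pattern):
    """Select run names matching a regex, mirroring the run-filter
    box in the TensorBoard sidebar (which searches by regex)."""
    rx = re.compile(pattern)
    return [name for name in run_names if rx.search(name)]
```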
Organizing Experiments
# Hierarchical organization
runs/
├── baseline/
│   ├── run_1/
│   └── run_2/
├── improved/
│   ├── run_1/
│   └── run_2/
└── final/
    └── run_1/
# Log with the hierarchy
writer = SummaryWriter('runs/baseline/run_1')
Best Practices
1. Use Descriptive Run Names
# ✅ Good: descriptive names
from datetime import datetime
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
writer = SummaryWriter(f'runs/resnet50_lr0.001_bs32_{timestamp}')
# ❌ Bad: auto-generated names
writer = SummaryWriter()  # Creates runs/Jan01_12-34-56_hostname
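Such names can be assembled consistently from the model name and key hyperparameters (a hypothetical helper, shown with an explicit timestamp so the output is reproducible):

```python
from datetime import datetime

def run_name(model, hparams, when=None):
    """Build a descriptive, sortable run directory name:
    runs/<model>_<key><value>..._<timestamp>.
    Keys are sorted so the same config always yields the same name."""
    when = when or datetime.now()
    parts = [model] + [f'{k}{v}' for k, v in sorted(hparams.items())]
    parts.append(when.strftime('%Y%m%d_%H%M%S'))
    return 'runs/' + '_'.join(parts)
```

Usage: `writer = SummaryWriter(run_name('resnet50', {'lr': 0.001, 'bs': 32}))`.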
2. Group Related Metrics
# ✅ Good: grouped metrics
writer.add_scalar('Loss/train', train_loss, step)
writer.add_scalar('Loss/val', val_loss, step)
writer.add_scalar('Accuracy/train', train_acc, step)
writer.add_scalar('Accuracy/val', val_acc, step)
# ❌ Bad: flat namespace
writer.add_scalar('train_loss', train_loss, step)
writer.add_scalar('val_loss', val_loss, step)
3. Log Regularly, but Not Too Often
# ✅ Good: always log epoch metrics, log batch metrics occasionally
for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(data, target)
        # Log every 100 batches
        if batch_idx % 100 == 0:
            writer.add_scalar('Loss/batch', loss, global_step)
    # Always log epoch metrics
    writer.add_scalar('Loss/epoch', epoch_loss, epoch)
# ❌ Bad: logging every batch (creates huge log files)
for batch in train_loader:
    writer.add_scalar('Loss', loss, step)  # Too frequent
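The two conventions above, sparse batch logging and a monotonic step counter, can be factored into tiny helpers (hypothetical, shown for clarity):

```python
def should_log(batch_idx, every=100):
    """True on every N-th batch, keeping batch-level scalars sparse."""
    return batch_idx % every == 0

def global_step(epoch, batch_idx, batches_per_epoch):
    """Monotonic step counter spanning epochs, so batch-level curves
    share one continuous x-axis across the whole run."""
    return epoch * batches_per_epoch + batch_idx
```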
4. Close Writers When Done
# ✅ Good: use a context manager
with SummaryWriter('runs/exp1') as writer:
    for epoch in range(10):
        writer.add_scalar('Loss', loss, epoch)
# Closed automatically
# Or manually
writer = SummaryWriter('runs/exp1')
# ... log ...
writer.close()
5. Use Separate Writers for Training/Validation
# ✅ Good: separate log directories
train_writer = SummaryWriter('runs/exp1/train')
val_writer = SummaryWriter('runs/exp1/val')
train_writer.add_scalar('loss', train_loss, epoch)
val_writer.add_scalar('loss', val_loss, epoch)
Performance Profiling
TensorFlow Profiler
# Enable profiling
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs',
    profile_batch='10,20'  # Profile batches 10-20
)
model.fit(x, y, callbacks=[tensorboard_callback])
# View in TensorBoard's Profile tab
# Shows: GPU utilization, kernel stats, memory usage, bottlenecks
PyTorch Profiler
import torch.profiler as profiler
with profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA
    ],
    on_trace_ready=profiler.tensorboard_trace_handler('./runs/profiler'),
    record_shapes=True,
    with_stack=True
) as prof:
    for batch in train_loader:
        loss = train_step(batch)
        prof.step()
# View in TensorBoard's Profile tab (requires the torch-tb-profiler plugin)
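Before reaching for the full profiler, a coarse wall-clock pass over the pipeline stages often locates the bottleneck. A minimal sketch using only the standard library (hypothetical helper, not part of torch.profiler):

```python
import time

def time_sections(steps):
    """Measure wall-clock time per named section of a pipeline.
    `steps` maps section names to zero-argument callables; returns
    {name: seconds}, so the slowest stage is easy to spot."""
    timings = {}
    for name, fn in steps.items():
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings
```

Usage: `max(timings, key=timings.get)` names the slowest stage, a quick signal for whether data loading or the forward/backward pass dominates.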
Resources
- Documentation: https://www.tensorflow.org/tensorboard
- PyTorch integration: https://pytorch.org/docs/stable/tensorboard.html
- GitHub: https://github.com/tensorflow/tensorboard (27k+ stars)
- TensorBoard.dev: https://tensorboard.dev (public experiment sharing; note the hosted service has since been discontinued)
See Also
- references/visualization.md - comprehensive visualization guide
- references/profiling.md - performance profiling patterns
- references/integrations.md - framework-specific integration examples