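name: nccl-communication
description: NVIDIA Collective Communications Library (NCCL) integration for multi-GPU operations. Initialize NCCL communicators, run collective operations, configure communication topology, profile collective performance, and support RCCL for AMD compatibility.
allowed-tools: Bash(*) Read Write Edit Glob Grep WebFetch
metadata:
  author: babysitter-sdk
  version: "1.0.0"
  category: multi-gpu
  backlog-id: SK-007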

nccl-communication

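You are nccl-communication, a skill specialized in NVIDIA Collective Communications Library (NCCL) integration. This skill provides expert-level capabilities for multi-GPU collective operations.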

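Overview

This skill enables AI-driven multi-GPU communication, including:

Initializing NCCL communicators
Running all-reduce, all-gather, and reduce-scatter operations
Configuring ring and tree communication topologies
Handling multi-node NCCL communication
Profiling collective operation performance
Optimizing for NVLink versus PCIe topologies
Integrating with CUDA streams for asynchronous collectives
Supporting RCCL for AMD GPU compatibility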

Prerequisites

CUDA Toolkit 11.0+
NCCL 2.10+
Multiple GPUs (for meaningful use)
MPI (optional, for multi-node runs)

Capabilities

1. NCCL Initialization

Initialize communicators:

#include <nccl.h>

// Single-node, multi-GPU initialization (one process drives all GPUs)
int numGPUs = 4;
ncclComm_t comms[4];
int devs[4] = {0, 1, 2, 3};

ncclCommInitAll(comms, numGPUs, devs);

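// Per-rank initialization for MPI integration (one process per GPU).
// rank, worldSize, and localRank come from MPI; see the multi-node setup.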
ncclUniqueId id;
ncclComm_t comm;

if (rank == 0) {
    ncclGetUniqueId(&id);
}
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

cudaSetDevice(localRank);
ncclCommInitRank(&comm, worldSize, id, rank);

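// Cleanup: destroy every communicator once its work has completed
ncclCommDestroy(comm);
for (int i = 0; i < numGPUs; ++i) {
    ncclCommDestroy(comms[i]);
}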

2. All-Reduce

Reduce across all GPUs:

// All-reduce is enqueued on the stream; synchronize to block until the result is ready
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat,
    ncclSum, comm, stream);
cudaStreamSynchronize(stream);

// In-place all-reduce (send and receive buffers are the same)
ncclAllReduce(buff, buff, count, ncclFloat, ncclSum, comm, stream);

// Supported reduction operations:
// ncclSum, ncclProd, ncclMax, ncclMin, ncclAvg

// Supported data types:
// ncclInt8, ncclUint8, ncclInt32, ncclUint32, ncclInt64, ncclUint64
// ncclFloat16, ncclFloat32, ncclFloat64, ncclBfloat16
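
To check an all-reduce result, it can help to compute the expected values on the host. The following is a minimal sketch of such a reference, not an NCCL call: the function name `reference_allreduce_sum` and the rank-major input layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for a sum all-reduce (hypothetical helper, not part of NCCL).
 * inputs is rank-major: rank r's buffer starts at inputs + r * count.
 * expected receives the element-wise sum that every rank should hold afterwards. */
static void reference_allreduce_sum(const float *inputs, int nranks,
                                    size_t count, float *expected)
{
    assert(inputs != NULL && expected != NULL && nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        float acc = 0.0f;
        for (int r = 0; r < nranks; ++r) {
            acc += inputs[(size_t)r * count + i];
        }
        expected[i] = acc;
    }
}
```

Because the GPU reduction may add values in a different order, comparing device results against this reference with a small tolerance is usually safer than testing for exact equality.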

3. All-Gather

Gather data from all GPUs:

// All-gather: each GPU contributes sendcount elements
// Result: recvbuff on every GPU holds numGPUs * sendcount elements
ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);

// Expected output size, in elements
size_t totalElements = sendcount * numGPUs;
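
The output layout is easier to see in a host-side model: the data from rank r lands at element offset r * sendcount in every receive buffer. This sketch assumes hypothetical helper names (`allgather_recv_count`, `reference_allgather`) and plain host arrays standing in for the per-rank device buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements each rank's receive buffer must hold. */
static size_t allgather_recv_count(int nranks, size_t sendcount)
{
    assert(nranks > 0);
    return (size_t)nranks * sendcount;
}

/* Host-side model of all-gather (hypothetical helper, not part of NCCL).
 * sendbufs[r] is rank r's contribution of sendcount elements;
 * recvbuf receives the contributions concatenated in rank order. */
static void reference_allgather(const float *const *sendbufs, int nranks,
                                size_t sendcount, float *recvbuf)
{
    assert(sendbufs != NULL && recvbuf != NULL && nranks > 0);
    for (int r = 0; r < nranks; ++r) {
        for (size_t i = 0; i < sendcount; ++i) {
            recvbuf[(size_t)r * sendcount + i] = sendbufs[r][i];
        }
    }
}
```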

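4. Reduce-Scatter

// Reduce-scatter: reduce across GPUs, then scatter the result
// Each GPU receives 1/numGPUs of the reduced result (recvcount elements),
// so sendbuff must hold numGPUs * recvcount elements
ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat,
    ncclSum, comm, stream);

// Used for gradient reduction in data-parallel training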
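
As a sketch of which slice each rank ends up with, the host-side model below sums the inputs element-wise and hands rank r the elements from r * recvcount up to (r + 1) * recvcount. The name `reference_reducescatter_sum` is hypothetical, and host arrays stand in for the device buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of a sum reduce-scatter (hypothetical helper, not part of NCCL).
 * sendbufs[r] is rank r's input of nranks * recvcount elements.
 * recvbuf receives the recvcount reduced elements that belong to `rank`. */
static void reference_reducescatter_sum(const float *const *sendbufs, int nranks,
                                        size_t recvcount, int rank, float *recvbuf)
{
    assert(sendbufs != NULL && recvbuf != NULL);
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    for (size_t i = 0; i < recvcount; ++i) {
        float acc = 0.0f;
        for (int r = 0; r < nranks; ++r) {
            acc += sendbufs[r][(size_t)rank * recvcount + i];
        }
        recvbuf[i] = acc;
    }
}
```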

5. Broadcast and Reduce

// Broadcast from the root rank to all ranks
int root = 0;
ncclBroadcast(sendbuff, recvbuff, count, ncclFloat, root, comm, stream);

// In-place broadcast
ncclBroadcast(buff, buff, count, ncclFloat, root, comm, stream);

// Reduce to the root rank (recvbuff is only written on root)
ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, root, comm, stream);

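6. Group Operations

Batch multiple operations:

// Start the group
ncclGroupStart();

// Queue multiple operations
ncclAllReduce(buff1, buff1, count1, ncclFloat, ncclSum, comm, stream);
ncclAllReduce(buff2, buff2, count2, ncclFloat, ncclSum, comm, stream);
ncclBroadcast(buff3, buff3, count3, ncclFloat, 0, comm, stream);

// End the group; the queued operations are launched together
ncclGroupEnd();

// Useful for:
// - Multiple collectives in a single launch
// - Send/receive pairs for point-to-point communication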
7. Point-to-Point Communication

// Send from rank 0 to rank 1
if (rank == 0) {
    ncclSend(sendbuff, count, ncclFloat, 1, comm, stream);
} else if (rank == 1) {
    ncclRecv(recvbuff, count, ncclFloat, 0, comm, stream);
}

// Bidirectional exchange using a group, so the send and receive cannot block each other
ncclGroupStart();
ncclSend(sendbuff, count, ncclFloat, peerRank, comm, stream);
ncclRecv(recvbuff, count, ncclFloat, peerRank, comm, stream);
ncclGroupEnd();
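
The exchange above leaves peerRank undefined. Two common ways to choose it are a pairwise partner (ranks 0 and 1, 2 and 3, and so on) and ring neighbours. The helpers below are a sketch under that assumption; their names are hypothetical and they are not part of NCCL.

```c
#include <assert.h>

/* Next rank around a ring: the usual destination for ncclSend in a ring shift. */
static int ring_next(int rank, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    return (rank + 1) % nranks;
}

/* Previous rank around a ring: the matching source for ncclRecv. */
static int ring_prev(int rank, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    return (rank + nranks - 1) % nranks;
}

/* Pairwise partner (0<->1, 2<->3, ...). Returns -1 when the rank has no
 * partner, which happens for the last rank of an odd-sized communicator. */
static int pair_peer(int rank, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    int peer = rank ^ 1;
    return peer < nranks ? peer : -1;
}
```

With `pair_peer`, both sides of a pair name each other, so the grouped send and receive stay balanced. With the ring helpers, each rank would send to `ring_next` and receive from `ring_prev` inside one group.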

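8. Topology Optimization

Configure for the hardware topology:

# Inspect the GPU topology
nvidia-smi topo -m

# Environment variables for tuning
export NCCL_TOPO_FILE=/path/to/topo.xml
export NCCL_GRAPH_FILE=/path/to/graph.xml

# Algorithm selection (set one; by default NCCL chooses automatically)
export NCCL_ALGO=Tree           # Tree reduction
export NCCL_ALGO=Ring           # Ring reduction
export NCCL_ALGO=CollnetDirect  # CollNet with direct intra-node links (needs a CollNet-capable network plugin)

# Protocol selection (set one; by default NCCL chooses automatically)
export NCCL_PROTO=Simple    # Simple protocol, oriented toward bandwidth
export NCCL_PROTO=LL        # Low latency
export NCCL_PROTO=LL128     # Low latency, 128-byte

# Network settings
export NCCL_IB_DISABLE=0    # Leave InfiniBand enabled
export NCCL_NET_GDR_LEVEL=5 # GPUDirect RDMA level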
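
When the launcher does not export these variables, a program can supply its own defaults from C. This is a minimal sketch, assuming NCCL reads the variables when it initializes, so the calls must run before the first NCCL call. The helper names and the chosen default values are illustrative only.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/* Set a variable only if it is not already present, so a value exported
 * in the shell still takes precedence. Returns 0 on success. */
static int set_nccl_default(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    return setenv(name, value, 0 /* do not overwrite */);
}

/* Apply illustrative defaults; call this before creating any communicator. */
static int apply_nccl_defaults(void)
{
    int rc = 0;
    rc |= set_nccl_default("NCCL_ALGO", "Ring");
    rc |= set_nccl_default("NCCL_PROTO", "Simple");
    rc |= set_nccl_default("NCCL_DEBUG", "WARN");
    return rc;
}
```

Passing 0 as the last argument of `setenv` is the design choice here: it keeps the program from silently overriding a setting that an operator chose on the command line.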

9. Multi-Node Setup

// Multi-node setup using MPI
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int worldSize, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get the node-local rank, used to pick a GPU
    int localRank;
    MPI_Comm localComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
        MPI_INFO_NULL, &localComm);
    MPI_Comm_rank(localComm, &localRank);

    // Initialize NCCL
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(localRank);
    ncclComm_t comm;
    ncclCommInitRank(&comm, worldSize, id, rank);

    // Use comm for collective operations...

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
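
The program above passes localRank straight to cudaSetDevice, which only works when a node runs no more ranks than it has GPUs. The sketch below makes that check explicit and shows a fallback for deriving the local rank without a shared-memory communicator. It assumes ranks are placed on nodes in contiguous blocks of equal size, which depends on the launcher, and all function names are hypothetical.

```c
#include <assert.h>

/* Local rank under a block layout: ranks 0..k-1 on node 0, k..2k-1 on node 1, ... */
static int local_rank_from_block_layout(int rank, int ranks_per_node)
{
    assert(rank >= 0 && ranks_per_node > 0);
    return rank % ranks_per_node;
}

/* Index of the node that hosts `rank` under the same block layout. */
static int node_index_from_block_layout(int rank, int ranks_per_node)
{
    assert(rank >= 0 && ranks_per_node > 0);
    return rank / ranks_per_node;
}

/* Device index for a local rank, or -1 if the node has too few GPUs.
 * Two ranks of one communicator should not share a GPU, so this
 * refuses to wrap around instead of reusing a device. */
static int checked_device_for_local_rank(int local_rank, int device_count)
{
    assert(local_rank >= 0 && device_count >= 0);
    return local_rank < device_count ? local_rank : -1;
}
```

In the MPI program, `device_count` would come from cudaGetDeviceCount, and a result of -1 is a reason to abort before calling ncclCommInitRank.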

10. Performance Profiling

// Time an NCCL call with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
ncclAllReduce(buff, buff, count, ncclFloat, ncclSum, comm, stream);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float milliseconds;
cudaEventElapsedTime(&milliseconds, start, stop);

// Compute bandwidth (bytes per millisecond divided by 1e6 gives GB/s)
size_t bytes = count * sizeof(float);
float algoBW = bytes / milliseconds / 1e6;  // GB/s
float busBW = algoBW * 2 * (numGPUs - 1) / numGPUs;  // Bus bandwidth for all-reduce
printf("AllReduce: %.2f ms, %.2f GB/s (bus: %.2f GB/s)\n",
    milliseconds, algoBW, busBW);

# Enable NCCL debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Benchmark with nccl-tests
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
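
The bandwidth arithmetic from the timing snippet can be factored into two small functions, which makes the unit conversion easier to check. This is a sketch with hypothetical function names; the 2 * (n - 1) / n factor is the all-reduce correction already used above.

```c
#include <assert.h>
#include <stddef.h>

/* Algorithm bandwidth in GB/s: bytes / ms gives bytes per millisecond,
 * and dividing by 1e6 converts that to 1e9 bytes per second. */
static double algo_bandwidth_gbs(size_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    return (double)bytes / milliseconds / 1e6;
}

/* Bus bandwidth for all-reduce: scale by 2 * (n - 1) / n for n ranks. */
static double allreduce_bus_bandwidth_gbs(double algo_bw_gbs, int nranks)
{
    assert(nranks > 0);
    return algo_bw_gbs * 2.0 * (double)(nranks - 1) / (double)nranks;
}
```

For example, 268435456 bytes reduced in 2.34 ms across 4 GPUs works out to roughly 114.7 GB/s of algorithm bandwidth and roughly 172.1 GB/s of bus bandwidth. A single timed call includes launch overhead, so averaging over several iterations after a warm-up run tends to give steadier numbers.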

Process Integration

This skill integrates with the following processes:

multi-gpu-programming.js - Multi-GPU development
gpu-cluster-computing.js - Cluster computing

Output Format

{
  "operation": "all-reduce",
  "status": "success",
  "configuration": {
    "num_gpus": 4,
    "data_size_bytes": 268435456,
    "data_type": "float32",
    "reduction": "sum"
  },
  "performance": {
    "time_ms": 2.34,
    "algorithm_bandwidth_gbps": 114.5,
    "bus_bandwidth_gbps": 171.8
  },
  "topology": {
    "interconnect": "NVLink",
    "algorithm": "Tree",
    "protocol": "LL128"
  }
}

Dependencies

CUDA Toolkit 11.0+
NCCL 2.10+
MPI (optional, for multi-node runs)

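Constraints

All ranks must call collective operations in the same order
Buffer sizes (element count and data type) must match across ranks
Stream ordering must be consistent
Group operations must be balanced (every send needs a matching receive)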