14.2
GPU COMMUNICATION

NCCL vs RCCL

Multi-GPU Communication for NVIDIA B200 & AMD MI350X

NVIDIA NCCL 2.21
AMD RCCL (ROCm 6.1)

NVIDIA NCCL

NVIDIA Collective Communications Library

  • 🔗 NVLink 5.0: 1.8 TB/s per GPU
  • 🔀 NVSwitch: full-fabric switching (DGX B200)
  • 🌐 InfiniBand: GPUDirect RDMA
  • 🎯 SHARP: in-network compute
  • 📈 Scaling: 10,000+ GPUs
  • 🌳 Algorithms: Ring / Tree / RHD

AMD RCCL

ROCm Communication Collectives Library

  • 🔗 XGMI: Infinity Fabric GPU-to-GPU links (the 8 TB/s figure is HBM3E memory bandwidth)
  • 🔄 NCCL-compatible: drop-in API
  • 🌐 InfiniBand: RDMA via ROCm
  • 📖 Open source: BSD license
  • 📈 Scaling: OAM baseboards / multi-node
  • 🌳 Algorithms: Ring / Tree

Collective Operations

AllReduce
Reduce + broadcast to all ranks. The backbone of data-parallel training.
Before: GPU 0 [1, 2] · GPU 1 [3, 4] · GPU 2 [5, 6] · GPU 3 [7, 8] → Sum → After: every GPU holds [16, 20]
Time: O(S·(n-1)/n) | BW: 2S(n-1)/n
AllGather
Gather from all, distribute to all. Essential for tensor parallelism.
Before: GPU 0 A · GPU 1 B · GPU 2 C · GPU 3 D → Gather → After: every GPU holds ABCD
Time: O(S·(n-1)/n) | BW: S(n-1)/n
ReduceScatter
Reduce + scatter result chunks. Key for ZeRO optimizer sharding.
Before: GPU 0 [A B C D] · GPU 1 [E F G H] · GPU 2 [I J K L] · GPU 3 [M N O P] → Sum + scatter → After: GPU i holds chunk i of the elementwise sum (Σ[i])
Time: O(S·(n-1)/n) | BW: S(n-1)/n
Broadcast
One-to-all distribution. Initialize models, sync seeds.
Before: GPU 0 DATA, GPUs 1–3 empty → Broadcast → After: every GPU holds DATA
Time: O(S·log(n)) | BW: S (tree)
All-to-All
Personalized exchange. Powers MoE token routing.
Each GPU sends unique data to every other GPU. Before: GPU 0 [A₀ A₁ A₂ A₃] · GPU 1 [B₀ B₁ B₂ B₃] · GPU 2 [C₀ C₁ C₂ C₃] · GPU 3 [D₀ D₁ D₂ D₃] → Exchange → After: GPU i holds [Aᵢ, Bᵢ, Cᵢ, Dᵢ]
Time: O(S) | BW: S(n-1)/n
Send/Recv
Point-to-point transfer. Pipeline parallelism backbone.
Pipeline parallelism passes activations between stages: stage 0 (GPU 0) computes activations and sends them to stage 1 (GPU 1), which is waiting to receive. Forward passes flow stage 0 → stage 1; backward passes return gradients stage 1 → stage 0.
Time: O(S/BW) | Latency-bound

Algorithm Deep Dive

Ring Algorithm

Bandwidth-optimal for large messages (>256KB)
Ring: 0 → 1 → 2 → 3 → 0
Steps
2(n-1)
Bandwidth
2S(n-1)/n
Latency
O(n)
Best For
>256KB

Tree Algorithm

Latency-optimal for small messages (<256KB)
Tree: root → two L1 nodes → leaves 0–3
Steps
2·log₂(n)
Bandwidth
~2S per GPU (double binary tree)
Latency
O(log n)
Best For
<256KB

Interconnect Topology

NVIDIA B200 (NVSwitch Full Mesh)

NVSwitch fabric connects GPUs 0–7 in a full mesh: any-to-any at 1.8 TB/s
Topology
Full Mesh
Per-GPU BW
1.8 TB/s
Bisection
14.4 TB/s
Hops
1 (direct)

AMD MI350X (XGMI Mesh)

GPUs 0–7 connected over XGMI; each GPU carries 288 GB HBM3E at 8 TB/s memory bandwidth
Topology
Mesh/Ring
Memory
288 GB
Mem BW
8 TB/s
Precision
FP4/FP6/FP8

Bandwidth Comparison

Intra-node GPU-to-GPU (per link)
NVIDIA B200
1.8 TB/s
AMD MI350X
8 TB/s (HBM3E memory bandwidth, not XGMI link bandwidth)
Inter-node (InfiniBand, 800 Gb/s per NIC)
NVIDIA
800 Gb/s
AMD
800 Gb/s
Total Bisection Bandwidth (8 GPU node)
NVIDIA DGX
14.4 TB/s
AMD OAM
~64 TB/s (aggregate HBM bandwidth, 8 × 8 TB/s; not a switch-fabric bisection figure)

Detailed Comparison

Feature | NCCL | RCCL

General

License | BSD (NVIDIA hardware only) | BSD (open source)
API header | nccl.h | nccl.h (compatible)

Interconnect

GPU-to-GPU | NVLink 5.0 | XGMI / Infinity Fabric
Switch fabric | NVSwitch (full mesh) | N/A (direct mesh)
RDMA support | GPUDirect RDMA | ROCm RDMA

Advanced Features

In-network compute | SHARP ✓ | not supported
User buffer registration | ✓ | ✓ (same API)

Code Examples

NCCL - AllReduce
#include <nccl.h>
#include <cuda_runtime.h>

// Initialize communicator (id comes from ncclGetUniqueId on
// rank 0 and is shared with all ranks out-of-band, e.g. via MPI)
ncclComm_t comm;
ncclCommInitRank(&comm, nRanks, id, myRank);

// Create CUDA stream
cudaStream_t stream;
cudaStreamCreate(&stream);

// Allocate GPU buffers
float *sendbuff, *recvbuff;
cudaMalloc(&sendbuff, size * sizeof(float));
cudaMalloc(&recvbuff, size * sizeof(float));

// Perform AllReduce (sum)
ncclAllReduce(
    sendbuff,              // send buffer
    recvbuff,              // recv buffer
    size,                  // count
    ncclFloat,             // datatype
    ncclSum,               // reduction op
    comm,                  // communicator
    stream                 // CUDA stream
);

// Wait for completion, then release the communicator
cudaStreamSynchronize(stream);
ncclCommDestroy(comm);
RCCL - AllReduce (Same API!)
#include <rccl/rccl.h>
#include <hip/hip_runtime.h>

// Initialize (same API as NCCL; id comes from ncclGetUniqueId
// on rank 0 and is shared with all ranks out-of-band)
ncclComm_t comm;
ncclCommInitRank(&comm, nRanks, id, myRank);

// Create HIP stream
hipStream_t stream;
hipStreamCreate(&stream);

// Allocate GPU buffers
float *sendbuff, *recvbuff;
hipMalloc(&sendbuff, size * sizeof(float));
hipMalloc(&recvbuff, size * sizeof(float));

// Perform AllReduce (identical call!)
ncclAllReduce(
    sendbuff,              // send buffer
    recvbuff,              // recv buffer
    size,                  // count
    ncclFloat,             // datatype
    ncclSum,               // reduction op
    comm,                  // communicator
    stream                 // HIP stream
);

// Wait for completion, then release the communicator
hipStreamSynchronize(stream);
ncclCommDestroy(comm);

Use Cases in Training

📊
Data Parallelism
AllReduce gradients after backward pass. Each GPU processes different data batches.
AllReduce
🔲
Tensor Parallelism
Split layers across GPUs. AllGather/ReduceScatter for activations.
AllGather + ReduceScatter
🔗
Pipeline Parallelism
Sequential stages across GPUs. Point-to-point between stages.
Send/Recv
💾
ZeRO Sharding
Shard optimizer/gradients/params for memory efficiency.
ReduceScatter + AllGather
🎯
Expert Parallelism
Route tokens to expert GPUs in MoE models.
All-to-All
📝
Sequence Parallelism
Split sequences for long-context training.
AllGather + ReduceScatter