14.2
GPU COMMUNICATION

NCCL vs RCCL

Multi-GPU Communication for NVIDIA B200 & AMD MI350X

NVIDIA NCCL 2.21
AMD RCCL (ROCm 6.1)

NVIDIA NCCL

NVIDIA Collective Communications Library

  • 🔗 NVLink 5.0: 1.8 TB/s per GPU
  • 🔀 NVSwitch: full-fabric switching (DGX B200)
  • 🌐 InfiniBand: GPUDirect RDMA
  • 🎯 SHARP: in-network compute
  • 📈 Scaling: 10,000+ GPUs
  • 🌳 Algorithms: Ring / Tree / RHD

AMD RCCL

ROCm Communication Collectives Library

  • 🔗 XGMI: Infinity Fabric GPU-to-GPU links (the 8 TB/s figure is HBM3E memory bandwidth)
  • 🔄 NCCL-compatible: drop-in API
  • 🌐 InfiniBand: RDMA via ROCm
  • 📖 Open source: BSD license
  • 📈 Scaling: OAM baseboards / multi-node
  • 🌳 Algorithms: Ring / Tree

Collective Operations

AllReduce
Reduce + broadcast to all ranks. The backbone of data-parallel training.
Before: GPU 0 [1, 2] · GPU 1 [3, 4] · GPU 2 [5, 6] · GPU 3 [7, 8] → Sum → After: every GPU holds [16, 20]
Time: O(S·(n-1)/n) | BW: 2S(n-1)/n
AllGather
Gather from all, distribute to all. Essential for tensor parallelism.
Before: GPU 0 A · GPU 1 B · GPU 2 C · GPU 3 D → Gather → After: every GPU holds ABCD
Time: O(S·(n-1)/n) | BW: S(n-1)/n
ReduceScatter
Reduce + scatter result chunks. Key for ZeRO optimizer sharding.
Before: GPU 0 [A B C D] · GPU 1 [E F G H] · GPU 2 [I J K L] · GPU 3 [M N O P] → Sum + scatter → After: GPU i holds chunk i of the elementwise sum (Σ[i])
Time: O(S·(n-1)/n) | BW: S(n-1)/n
Broadcast
One-to-all distribution. Initialize models, sync seeds.
Before: GPU 0 DATA, GPUs 1–3 empty → Broadcast → After: every GPU holds DATA
Time: O(S·log(n)) | BW: S (tree)
All-to-All
Personalized exchange. Powers MoE token routing.
Each GPU sends unique data to every other GPU. Before: GPU 0 [A₀ A₁ A₂ A₃] · GPU 1 [B₀ B₁ B₂ B₃] · GPU 2 [C₀ C₁ C₂ C₃] · GPU 3 [D₀ D₁ D₂ D₃] → Exchange → After: GPU i holds [Aᵢ, Bᵢ, Cᵢ, Dᵢ]
Time: O(S) | BW: S(n-1)/n
Send/Recv
Point-to-point transfer. Pipeline parallelism backbone.
Pipeline parallelism passes activations between stages: stage 0 (GPU 0) computes activations and sends them to stage 1 (GPU 1), which is waiting to receive. Forward passes flow stage 0 → stage 1; backward passes return gradients stage 1 → stage 0.
Time: O(S/BW) | Latency-bound

Algorithm Deep Dive

Ring Algorithm

Bandwidth-optimal for large messages (>256KB)
Ring: 0 → 1 → 2 → 3 → 0
Steps
2(n-1)
Bandwidth
2S(n-1)/n
Latency
O(n)
Best For
>256KB

Tree Algorithm

Latency-optimal for small messages (<256KB)
Tree: root → two L1 nodes → leaves 0–3
Steps
2·log₂(n)
Bandwidth
~2S per GPU (double binary tree)
Latency
O(log n)
Best For
<256KB

Interconnect Topology

NVIDIA B200 (NVSwitch Full Mesh)

NVSwitch fabric connects GPUs 0–7 in a full mesh: any-to-any at 1.8 TB/s
Topology
Full Mesh
Per-GPU BW
1.8 TB/s
Bisection
14.4 TB/s
Hops
1 (direct)

AMD MI350X (XGMI Mesh)

GPUs 0–7 connected over XGMI; each GPU carries 288 GB HBM3E at 8 TB/s memory bandwidth
Topology
Mesh/Ring
Memory
288 GB
Mem BW
8 TB/s
Precision
FP4/FP6/FP8

Bandwidth Comparison

Intra-node GPU-to-GPU (per link)
NVIDIA B200
1.8 TB/s
AMD MI350X
8 TB/s (HBM3E memory bandwidth, not XGMI link bandwidth)
Inter-node (InfiniBand, 800 Gb/s per NIC)
NVIDIA
800 Gb/s
AMD
800 Gb/s
Total Bisection Bandwidth (8 GPU node)
NVIDIA DGX
14.4 TB/s
AMD OAM
~64 TB/s (aggregate HBM bandwidth, 8 × 8 TB/s; not a switch-fabric bisection figure)

Detailed Comparison

Feature | NCCL | RCCL

General

License | BSD (NVIDIA hardware only) | BSD (open source)
API header | nccl.h | nccl.h (compatible)

Interconnect

GPU-to-GPU | NVLink 5.0 | XGMI / Infinity Fabric
Switch fabric | NVSwitch (full mesh) | N/A (direct mesh)
RDMA support | GPUDirect RDMA | ROCm RDMA

Advanced Features

In-network compute | SHARP ✓ | not supported
User buffer registration | ✓ | ✓ (same API)

Code Examples

NCCL - AllReduce
#include <nccl.h>
#include <cuda_runtime.h>

// Initialize communicator (id comes from ncclGetUniqueId on
// rank 0 and is shared with all ranks out-of-band, e.g. via MPI)
ncclComm_t comm;
ncclCommInitRank(&comm, nRanks, id, myRank);

// Create CUDA stream
cudaStream_t stream;
cudaStreamCreate(&stream);

// Allocate GPU buffers
float *sendbuff, *recvbuff;
cudaMalloc(&sendbuff, size * sizeof(float));
cudaMalloc(&recvbuff, size * sizeof(float));

// Perform AllReduce (sum)
ncclAllReduce(
    sendbuff,              // send buffer
    recvbuff,              // recv buffer
    size,                  // count
    ncclFloat,             // datatype
    ncclSum,               // reduction op
    comm,                  // communicator
    stream                 // CUDA stream
);

// Wait for completion, then release the communicator
cudaStreamSynchronize(stream);
ncclCommDestroy(comm);
RCCL - AllReduce (Same API!)
#include <rccl/rccl.h>
#include <hip/hip_runtime.h>

// Initialize (same API as NCCL; id comes from ncclGetUniqueId
// on rank 0 and is shared with all ranks out-of-band)
ncclComm_t comm;
ncclCommInitRank(&comm, nRanks, id, myRank);

// Create HIP stream
hipStream_t stream;
hipStreamCreate(&stream);

// Allocate GPU buffers
float *sendbuff, *recvbuff;
hipMalloc(&sendbuff, size * sizeof(float));
hipMalloc(&recvbuff, size * sizeof(float));

// Perform AllReduce (identical call!)
ncclAllReduce(
    sendbuff,              // send buffer
    recvbuff,              // recv buffer
    size,                  // count
    ncclFloat,             // datatype
    ncclSum,               // reduction op
    comm,                  // communicator
    stream                 // HIP stream
);

// Wait for completion, then release the communicator
hipStreamSynchronize(stream);
ncclCommDestroy(comm);

Use Cases in Training

📊
Data Parallelism
AllReduce gradients after backward pass. Each GPU processes different data batches.
AllReduce
🔲
Tensor Parallelism
Split layers across GPUs. AllGather/ReduceScatter for activations.
AllGather + ReduceScatter
🔗
Pipeline Parallelism
Sequential stages across GPUs. Point-to-point between stages.
Send/Recv
💾
ZeRO Sharding
Shard optimizer/gradients/params for memory efficiency.
ReduceScatter + AllGather
🎯
Expert Parallelism
Route tokens to expert GPUs in MoE models.
All-to-All
📝
Sequence Parallelism
Split sequences for long-context training.
AllGather + ReduceScatter