Appendix B.16

GPU ↔ NVMe Data Paths

Visual Guide to Storage Architectures & PCIe Topologies


GPU ↔ NVMe Data Paths & Topologies

Traditional (bounce buffer) vs GPUDirect Storage (direct DMA) + all PCIe topologies


Key Clarification: "No CPU Involvement"

"No CPU involvement" means no CPU data movement (no copies through CPU memory). The CPU is still involved in the control path: submitting I/O, setting up DMA mappings, filesystem metadata, etc. NVIDIA documents this explicitly: drivers run on the CPU for the control path, while the data path avoids host-DRAM staging copies in the bulk transfer (platform dependent).

A) Traditional Path: CPU + Host DRAM Bounce Buffer

SLOW READ: NVMe → GPU (pread + cudaMemcpy)

NVMe SSD → [NVMe controller DMA] → Host DRAM (page cache / kernel buffers) → [CPU memcpy] → User buffer (malloc, pageable) → [staging copy] → Pinned staging buffer (if the source is pageable) → [cudaMemcpy H2D] → GPU HBM (global memory)

SLOW WRITE: GPU → NVMe (cudaMemcpy + pwrite)

GPU HBM (global memory) → [cudaMemcpy D2H] → User buffer (pinned recommended) → [kernel copy] → Host DRAM (page cache / writeback) → [NVMe DMA] → NVMe SSD

Why the traditional path is slow:
• POSIX pread/pwrite operate on CPU buffers, not GPU buffers
• 2-3 memory copies: storage → host → GPU
• Pageable buffers require an internal staging copy
• CPU memory bandwidth becomes the bottleneck

B) GPUDirect Storage: Direct DMA (No Host Bounce)

FAST READ: NVMe → GPU (cuFileRead)

Control path (CPU still involved): Application → cuFile API → nvidia-fs.ko
Data path (direct DMA, no bounce): NVMe SSD → [PCIe P2P direct DMA] → GPU HBM (target buffer); Host DRAM bypassed

cuFileRead(handle, gpu_ptr /* GPU pointer! */, size, file_offset, dev_offset);

Requirements:
• O_DIRECT flag
• 4 KB-aligned buffer
• nvidia-fs.ko loaded
• cuFileBufRegister()
• IOMMU/ACS disabled

FAST WRITE: GPU → NVMe (cuFileWrite)

Control path (CPU still involved): Application → cuFile API → nvidia-fs.ko
Data path (direct DMA, no bounce): GPU HBM (source buffer) → [PCIe P2P direct DMA] → NVMe SSD; Host DRAM bypassed

Benefits:
✓ Zero host copies
✓ Full PCIe bandwidth
✓ Lower latency
✓ CPU free for compute

One-line summary: NVMe DMA ↔ GPU HBM. The CPU sets up the mappings; data flows directly.

C) GPUDirect Storage: Queue-Based Architecture

Just as RDMA extends NVMe over a fabric, GPUDirect Storage extends NVMe to GPU memory. Software and hardware communicate through submission and completion queues, with GPU memory as the DMA target.

Diagram: GPUDirect Storage with the GPU as DMA target for NVMe.
Software: the cuFile API and nvidia-fs.ko communicate with the hardware through submission and completion queues in shared memory; the NVMe I/O queue pair still lives in host memory.
Memory: the data buffer lives in GPU HBM and is registered with cuFileBufRegister(gpu_ptr).
Hardware: the NVMe controller consumes the queues and moves data between flash and GPU HBM via PCIe P2P direct DMA.
Key: the DMA target address in the NVMe command points to GPU BAR memory, not host DRAM.

NVMe-oF (NVMe over Fabrics) with GPUDirect RDMA

For remote storage, NVMe-oF uses RDMA to extend NVMe across the network. With GPUDirect RDMA, the NIC can DMA directly to GPU memory, combining GDS with GPUDirect RDMA.

Diagram: NVMe-oF with GPUDirect RDMA.
• Native path: a native application (cuFile) goes through the NVMe transport layer either to a local NVMe device, or through an NVMe fabric initiator over RDMA (InfiniBand / RoCE / iWARP) to an NVMe fabric target and its NVMe devices.
• Traditional paths, shown for contrast: block device → SCSI → iSCSI → TCP/IP on the initiator, with target-side TCP/IP → iSCSI → SCSI target software in front of SAS/SATA devices.
• GDS + NVMe-oF data path: RDMA (GPUDirect) between the NIC and GPU HBM, then NVMe at the target. A direct data path with no host-DRAM bounce (when topology/IOMMU allow): the NIC DMAs directly to GPU HBM across the fabric.

The Complete Picture

Local NVMe: NVMe controller DMA → GPU HBM (GDS)
Remote NVMe-oF: NIC RDMA → GPU HBM (GPUDirect RDMA)

Both use the same queue-based model. The key is that the DMA target address points to GPU memory (via PCIe BAR) instead of host DRAM.

D) PCIe Topologies: All Possible Configurations

P2P capability depends on whether the NVMe controller (local) or NIC (remote) can DMA directly to GPU memory via PCIe. These topologies determine if you get true P2P or fallback bounce.

Local NVMe (Direct-Attached PCIe)

| ID | Topology | P2P? | Notes |
|----|----------|------|-------|
| A1 | GPU + NVMe under same PCIe switch | ✅ Best | Canonical P2P. DGX systems pair GPU+NVMe this way. Shortest path. |
| A2 | GPU + NVMe under different switches, same root complex | ⚠️ Maybe | Platform/ACS/IOMMU dependent. May work or fall back to bounce. |
| A3 | GPU + NVMe on different root ports (no shared switch) | ⚠️ Often problematic | Frequently becomes bounce on many servers. |
| A4 | GPU + NVMe on different CPU sockets / root complexes | ❌ No | Typically forces bounce. Cross-socket P2P rarely works. |
| A5 | NVMe behind RAID/HBA/virtualization layer | ⚠️ Depends | Driver-stack dependent. Often ends up in bounce mode. |
| A6 | Multiple NVMe in RAID0 (topology-aware pairing) | ✅ Yes | DGX-like: NVMe pairs on same switch → RAID0 groups matched to GPUs. |

Remote NVMe (NVMe-oF / RDMA / Distributed FS)

For remote storage, the DMA engine is the NIC, not NVMe controller. The NIC must RDMA to GPU memory.

| ID | Topology | P2P? | Notes |
|----|----------|------|-------|
| B1 | GPU + NIC under same PCIe switch → RDMA → NVMe-oF target | ✅ Best | NIC can RDMA directly to GPU memory. Best case for remote. |
| B2 | GPU + NIC under different switches, same root complex | ⚠️ Maybe | Same caveats: ACS/IOMMU/platform dependent. |
| B3 | GPU + NIC on different CPU sockets / root complexes | ❌ No | Commonly becomes bounce on the client side. |
| B4 | Distributed FS with GDS support (BeeGFS, Spectrum Scale, etc.) | ✅/⚠️ | Depends on NIC↔GPU topology. FS client routes I/O to GDS. |
Diagram: PCIe topology comparison, Best (A1) vs Suboptimal (A3/A4).
• A1, same PCIe switch (best P2P): GPU HBM and NVMe sit under one P2P-enabled switch below the CPU root complex; Host DRAM is bypassed. ✓ Shortest path ✓ Full PCIe bandwidth ✓ No root-complex traversal.
• A3/A4, different branches (suboptimal): GPU under switch A, NVMe under switch B, traffic goes via the root complex. ⚠️ May still work (P2P via the root) or ❌ falls back to bounce, depending on ACS/IOMMU settings.

"It Works But Bounces" Scenarios

Even with good PCIe topology, you can trigger bounce paths:

  • Managed/UVM memory: cudaMallocManaged may use internal bounce buffers
  • IOMMU enabled: Breaks P2P on many platforms
  • ACS enabled: Access Control Services can block P2P
  • Missing O_DIRECT: Falls back through page cache

Run gdscheck -p to verify your actual P2P capability!


Traditional Path Summary

NVMe → Host DRAM → CPU copy → cudaMemcpy → GPU

2-3 copies, CPU bandwidth limited


GPUDirect Storage Summary

NVMe ↔ GPU HBM (direct PCIe P2P DMA)

Zero host copies, CPU sets up DMA only