Appendix B.16

GPU ↔ NVMe Data Paths

Visual Guide to Storage Architectures & PCIe Topologies


GPU ↔ NVMe Data Paths & Topologies

Traditional (bounce buffer) vs GPUDirect Storage (direct DMA) + all PCIe topologies


Key Clarification: "No CPU Involvement"

"No CPU involvement" means no CPU data movement (no copies through CPU memory). The CPU is still involved in the control path: submitting I/O, setting up DMA mappings, filesystem metadata, etc. NVIDIA documents this explicitly: drivers run on the CPU for the control path, while the data path avoids host-DRAM staging copies in the bulk transfer (platform dependent).

A) Traditional Path: CPU + Host DRAM Bounce Buffer

SLOW READ: NVMe → GPU (pread + cudaMemcpy)

NVMe SSD → [NVMe controller DMA] → Host DRAM (page cache / kernel buffers) → [CPU memcpy] → User buffer (malloc, pageable) → [staging copy] → Pinned staging buffer (if the source is pageable) → [cudaMemcpy H2D] → GPU HBM (global memory)

SLOW WRITE: GPU → NVMe (cudaMemcpy + pwrite)

GPU HBM (global memory) → [cudaMemcpy D2H] → User buffer (pinned recommended) → [kernel copy] → Host DRAM (page cache / writeback) → [NVMe DMA] → NVMe SSD

Why the traditional path is slow:
• POSIX pread/pwrite operate on CPU buffers, not GPU buffers
• 2-3 memory copies: storage → host → GPU
• Pageable buffers require an internal staging copy
• CPU memory bandwidth becomes the bottleneck

B) GPUDirect Storage: Direct DMA (No Host Bounce)

FAST READ: NVMe → GPU (cuFileRead)

Control path (CPU still involved): Application → cuFile API → nvidia-fs.ko
Data path (direct DMA, no bounce): NVMe SSD → [PCIe P2P direct DMA] → GPU HBM (target buffer); Host DRAM bypassed

cuFileRead(handle, gpu_ptr /* GPU pointer! */, size, file_offset, dev_offset);

Requirements:
• O_DIRECT flag
• 4 KB-aligned buffer
• nvidia-fs.ko loaded
• cuFileBufRegister()
• IOMMU/ACS disabled

FAST WRITE: GPU → NVMe (cuFileWrite)

Control path (CPU still involved): Application → cuFile API → nvidia-fs.ko
Data path (direct DMA, no bounce): GPU HBM (source buffer) → [PCIe P2P direct DMA] → NVMe SSD; Host DRAM bypassed

Benefits:
✓ Zero host copies
✓ Full PCIe bandwidth
✓ Lower latency
✓ CPU free for compute

One-line summary: NVMe DMA ↔ GPU HBM. The CPU sets up the mappings; data flows directly.

C) GPUDirect Storage: Queue-Based Architecture

Just as RDMA extends NVMe over a fabric, GPUDirect Storage extends NVMe to GPU memory. Software and hardware communicate through submission and completion queues, with GPU memory as the DMA target.

Diagram: GPUDirect Storage with the GPU as DMA target for NVMe.
Software: the cuFile API and nvidia-fs.ko communicate with the hardware through submission and completion queues in shared memory; the NVMe I/O queue pair still lives in host memory.
Memory: the data buffer lives in GPU HBM and is registered with cuFileBufRegister(gpu_ptr).
Hardware: the NVMe controller consumes the queues and moves data between flash and GPU HBM via PCIe P2P direct DMA.
Key: the DMA target address in the NVMe command points to GPU BAR memory, not host DRAM.

NVMe-oF (NVMe over Fabrics) with GPUDirect RDMA

For remote storage, NVMe-oF uses RDMA to extend NVMe across the network. With GPUDirect RDMA, the NIC can DMA directly to GPU memory, combining GDS with GPUDirect RDMA.

Diagram: NVMe-oF with GPUDirect RDMA.
• Native path: a native application (cuFile) goes through the NVMe transport layer either to a local NVMe device, or through an NVMe fabric initiator over RDMA (InfiniBand / RoCE / iWARP) to an NVMe fabric target and its NVMe devices.
• Traditional paths, shown for contrast: block device → SCSI → iSCSI → TCP/IP on the initiator, with target-side TCP/IP → iSCSI → SCSI target software in front of SAS/SATA devices.
• GDS + NVMe-oF data path: RDMA (GPUDirect) between the NIC and GPU HBM, then NVMe at the target. A direct data path with no host-DRAM bounce (when topology/IOMMU allow): the NIC DMAs directly to GPU HBM across the fabric.

The Complete Picture

Local NVMe: NVMe controller DMA → GPU HBM (GDS)
Remote NVMe-oF: NIC RDMA → GPU HBM (GPUDirect RDMA)

Both use the same queue-based model. The key is that the DMA target address points to GPU memory (via PCIe BAR) instead of host DRAM.

D) PCIe Topologies: All Possible Configurations

P2P capability depends on whether the NVMe controller (local) or NIC (remote) can DMA directly to GPU memory via PCIe. These topologies determine if you get true P2P or fallback bounce.

Local NVMe (Direct-Attached PCIe)

| ID | Topology | P2P? | Notes |
|----|----------|------|-------|
| A1 | GPU + NVMe under same PCIe switch | ✅ Best | Canonical P2P. DGX systems pair GPU+NVMe this way. Shortest path. |
| A2 | GPU + NVMe under different switches, same root complex | ⚠️ Maybe | Platform/ACS/IOMMU dependent. May work or fall back to bounce. |
| A3 | GPU + NVMe on different root ports (no shared switch) | ⚠️ Often problematic | Frequently becomes bounce on many servers. |
| A4 | GPU + NVMe on different CPU sockets / root complexes | ❌ No | Typically forces bounce. Cross-socket P2P rarely works. |
| A5 | NVMe behind RAID/HBA/virtualization layer | ⚠️ Depends | Driver-stack dependent. Often ends up in bounce mode. |
| A6 | Multiple NVMe in RAID0 (topology-aware pairing) | ✅ Yes | DGX-like: NVMe pairs on same switch → RAID0 groups matched to GPUs. |

Remote NVMe (NVMe-oF / RDMA / Distributed FS)

For remote storage, the DMA engine is the NIC, not NVMe controller. The NIC must RDMA to GPU memory.

| ID | Topology | P2P? | Notes |
|----|----------|------|-------|
| B1 | GPU + NIC under same PCIe switch → RDMA → NVMe-oF target | ✅ Best | NIC can RDMA directly to GPU memory. Best case for remote. |
| B2 | GPU + NIC under different switches, same root complex | ⚠️ Maybe | Same caveats: ACS/IOMMU/platform dependent. |
| B3 | GPU + NIC on different CPU sockets / root complexes | ❌ No | Commonly becomes bounce on the client side. |
| B4 | Distributed FS with GDS support (BeeGFS, Spectrum Scale, etc.) | ✅/⚠️ | Depends on NIC↔GPU topology. FS client routes I/O to GDS. |
Diagram: PCIe topology comparison, Best (A1) vs Suboptimal (A3/A4).
• A1, same PCIe switch (best P2P): GPU HBM and NVMe sit under one P2P-enabled switch below the CPU root complex; Host DRAM is bypassed. ✓ Shortest path ✓ Full PCIe bandwidth ✓ No root-complex traversal.
• A3/A4, different branches (suboptimal): GPU under switch A, NVMe under switch B, traffic goes via the root complex. ⚠️ May still work (P2P via the root) or ❌ falls back to bounce, depending on ACS/IOMMU settings.

"It Works But Bounces" Scenarios

Even with good PCIe topology, you can trigger bounce paths:

  • Managed/UVM memory: cudaMallocManaged may use internal bounce buffers
  • IOMMU enabled: Breaks P2P on many platforms
  • ACS enabled: Access Control Services can block P2P
  • Missing O_DIRECT: Falls back through page cache

Run gdscheck -p to verify your actual P2P capability!


Traditional Path Summary

NVMe → Host DRAM → CPU copy → cudaMemcpy → GPU

2-3 copies, CPU bandwidth limited


GPUDirect Storage Summary

NVMe ↔ GPU HBM (direct PCIe P2P DMA)

Zero host copies, CPU sets up DMA only