Visual Guide to Storage Architectures & PCIe Topologies
Traditional (bounce-buffer) path vs. GPUDirect Storage (direct DMA), plus the PCIe topologies that determine which path you actually get
"No CPU involvement" means no CPU data movement (copies through CPU memory). The CPU is still involved in the control path: submitting I/O, setting up DMA mappings, filesystem metadata, etc. NVIDIA explicitly documents this: drivers run on CPU for control path, while data path avoids host-DRAM staging copies in the bulk path (platform dependent).
Just as RDMA extends NVMe over a fabric, GPUDirect Storage extends NVMe to GPU memory. Software-hardware communication still flows through Submission & Completion Queues; the only change is that GPU memory becomes the DMA target.
For remote storage, NVMe-oF uses RDMA to extend NVMe across the network. With GPUDirect RDMA, the NIC can DMA directly to GPU memory, combining GDS with GPUDirect RDMA.
Local NVMe: NVMe controller DMA → GPU HBM (GDS)
Remote NVMe-oF: NIC RDMA → GPU HBM (GPUDirect RDMA)
Both use the same queue-based model. The key is that the DMA target address points to GPU memory (via PCIe BAR) instead of host DRAM.
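The "same queue model, different DMA target" idea can be sketched as a toy model. Everything here is illustrative (the structures, field names, and addresses are hypothetical, not the real NVMe command layout or the cuFile API); the point is only that the submission entry is identical for both paths and just the DMA target address changes:

```python
from dataclasses import dataclass

@dataclass
class ReadCommand:
    lba: int          # starting logical block address on the SSD
    num_blocks: int   # transfer length in blocks
    dma_addr: int     # bus address the NVMe controller will DMA into

HOST_DRAM_BASE = 0x0000_8000_0000  # hypothetical host-DRAM buffer address
GPU_BAR1_BASE  = 0x0038_0000_0000  # hypothetical GPU BAR1 aperture address

def submit_read(lba: int, num_blocks: int, use_gds: bool) -> ReadCommand:
    """Build a submission-queue entry; GDS only changes where dma_addr points."""
    target = GPU_BAR1_BASE if use_gds else HOST_DRAM_BASE
    return ReadCommand(lba=lba, num_blocks=num_blocks, dma_addr=target)

bounce = submit_read(lba=0, num_blocks=8, use_gds=False)
direct = submit_read(lba=0, num_blocks=8, use_gds=True)
assert direct.dma_addr == GPU_BAR1_BASE  # same command, GPU memory as target
```

In the real stack, the GPU-memory bus address comes from the driver pinning GPU pages and exposing them through the PCIe BAR1 aperture; the queue mechanics are unchanged.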
P2P capability depends on whether the NVMe controller (local) or NIC (remote) can DMA directly to GPU memory via PCIe. These topologies determine whether you get true P2P or a bounce-buffer fallback.
| ID | Topology | P2P? | Notes |
|---|---|---|---|
| A1 | GPU + NVMe under same PCIe switch | ✅ Best | Canonical P2P. DGX systems pair GPU+NVMe this way. Shortest path. |
| A2 | GPU + NVMe under different switches, same root complex | ⚠️ Maybe | Platform/ACS/IOMMU dependent. May work or fall back to bounce. |
| A3 | GPU + NVMe on different root ports (no shared switch) | ⚠️ Often problematic | Frequently falls back to bounce on many servers. |
| A4 | GPU + NVMe on different CPU sockets / root complexes | ❌ No | Typically forces bounce. Cross-socket P2P rarely works. |
| A5 | NVMe behind RAID/HBA/virtualization layer | ⚠️ Depends | Driver-stack dependent. Often ends up in bounce mode. |
| A6 | Multiple NVMe in RAID0 (topology-aware pairing) | ✅ Yes | DGX-like: NVMe pairs on the same switch → RAID0 groups matched to GPUs. |
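One rough way to reason about which row applies is to compare the sysfs device paths of the GPU and the NVMe drive and count shared upstream PCI components. The sketch below is a simplification under stated assumptions: it ignores ACS, IOMMU, and virtualization (which all affect the real outcome), the function name and example paths are hypothetical, and `gdscheck -p` remains the authoritative check:

```python
def pci_chain(sysfs_path):
    """Extract the PCI domain + bus:device.function components from a
    /sys/devices path (e.g. os.path.realpath('/sys/class/nvme/nvme0/device'))."""
    parts = sysfs_path.strip("/").split("/")
    return [p for p in parts if p.startswith("pci") or p.count(":") == 2]

def classify_p2p(gpu_path, nvme_path):
    """Count shared upstream components and map them to the A-rows above
    (simplified: real platforms need ACS/IOMMU checks too)."""
    shared = 0
    for a, b in zip(pci_chain(gpu_path), pci_chain(nvme_path)):
        if a != b:
            break
        shared += 1
    if shared == 0:
        return "A4"      # different root complexes: expect bounce
    if shared == 1:
        return "A2/A3"   # same root complex, no shared switch: platform dependent
    return "A1"          # shared upstream switch/bridge: best case for P2P

gpu  = "/sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/0000:05:08.0/0000:06:00.0"
nvme = "/sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/0000:05:10.0/0000:07:00.0"
print(classify_p2p(gpu, nvme))  # shared switch at 0000:04:00.0 -> "A1"
```

`lspci -t` shows the same hierarchy graphically if you prefer to eyeball it.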
For remote storage, the DMA engine is the NIC, not the NVMe controller. The NIC must be able to RDMA directly into GPU memory.
| ID | Topology | P2P? | Notes |
|---|---|---|---|
| B1 | GPU + NIC under same PCIe switch → RDMA → NVMe-oF target | ✅ Best | NIC can RDMA directly to GPU memory. Best case for remote storage. |
| B2 | GPU + NIC under different switches, same root complex | ⚠️ Maybe | Same caveats: ACS/IOMMU/platform dependent. |
| B3 | GPU + NIC on different CPU sockets / root complexes | ❌ No | Commonly falls back to bounce on the client side. |
| B4 | Distributed FS with GDS support (BeeGFS, Spectrum Scale, etc.) | ✅/⚠️ | Depends on NIC↔GPU topology. The FS client routes I/O through GDS. |
Even with good PCIe topology, you can trigger bounce paths:
- `cudaMallocManaged` may use internal bounce buffers

Run `gdscheck -p` to verify your actual P2P capability!
- Traditional: NVMe → Host DRAM → CPU copy → `cudaMemcpy` → GPU (2-3 copies, CPU bandwidth limited)
- GDS: NVMe → GPU HBM via direct PCIe P2P DMA (zero host copies; the CPU only sets up the DMA)
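A back-of-the-envelope model shows why the direct path wins. The link speeds below are illustrative ballpark figures, not measurements, and the model serializes the two bounce-path transfers (real stacks may partially overlap them), so treat this as a sketch rather than a performance claim:

```python
def bounce_time_s(gbytes, nvme_gbps, h2d_gbps):
    # Traditional path: NVMe -> host DRAM, then cudaMemcpy host -> GPU,
    # modeled as two serialized transfers.
    return gbytes / nvme_gbps + gbytes / h2d_gbps

def direct_time_s(gbytes, nvme_gbps):
    # GDS path: a single DMA, NVMe -> GPU HBM, bounded by the NVMe link.
    return gbytes / nvme_gbps

# Illustrative link speeds: PCIe Gen4 x4 NVMe ~7 GB/s, Gen4 x16 H2D ~25 GB/s.
print(round(bounce_time_s(100, 7, 25), 2), round(direct_time_s(100, 7), 2))
```

The gap widens as the host-to-GPU link becomes more contended (other GPUs, other I/O), since the direct path never touches host DRAM bandwidth at all.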