PCIe Gen5/Gen6 Impact
⚡ Bandwidth Doubling
PCIe Gen5 doubles per-lane bandwidth vs Gen4 (16 GT/s → 32 GT/s). Gen6 doubles again to 64 GT/s using PAM4 signaling (silicon 2025). This changes the GPU-storage balance, but there are caveats.
PCIe Generation Comparison
| Generation | Per-Lane Rate | x4 NVMe BW | x16 GPU BW | Availability |
|---|---|---|---|---|
| PCIe 4.0 | 16 GT/s | ~7 GB/s | ~32 GB/s | Ubiquitous |
| PCIe 5.0 | 32 GT/s | ~14 GB/s | ~64 GB/s | Server (2023+) |
| PCIe 6.0 | 64 GT/s | ~28 GB/s | ~128 GB/s | 2025 (emerging) |
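The table's bandwidth columns follow directly from the per-lane rates. A minimal sketch of the arithmetic (encoding efficiencies are the standard published values; Gen6's FLIT/FEC overhead is approximated as 1.0 here, so these are raw link numbers, slightly above the usable figures in the table):

```python
# Per-lane signaling rate (GT/s) and encoding efficiency per generation.
# Gen4/Gen5 use 128b/130b encoding; Gen6 moves to PAM4 + FLIT mode,
# approximated here as 1.0 (before FLIT/FEC overhead).
GENS = {
    "4.0": (16, 128 / 130),
    "5.0": (32, 128 / 130),
    "6.0": (64, 1.0),
}

def link_gbs(gen: str, lanes: int) -> float:
    """Raw link bandwidth in GB/s, before protocol (TLP/DLLP) overhead."""
    gts, eff = GENS[gen]
    return gts * eff * lanes / 8  # 8 bits per byte

for gen in GENS:
    print(f"PCIe {gen}: x4 = {link_gbs(gen, 4):.1f} GB/s, "
          f"x16 = {link_gbs(gen, 16):.1f} GB/s")
```

The gap between these raw numbers (e.g., ~7.9 GB/s for Gen4 x4) and the table's usable figures (~7 GB/s) is protocol overhead: TLP headers, flow-control credits, and completion traffic.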
Benefits vs Caveats
Benefits
- Single NVMe can saturate older GPU links
- Fewer SSDs needed for same throughput
- Better GPU-to-SSD bandwidth ratio
- Lower PCIe slot count requirements
- Enables larger DMA transfers efficiently
Caveats
- NAND is still NAND—latency unchanged
- Internal SSD parallelism must increase
- Power consumption increases
- Signal integrity challenges (tighter trace-length budgets)
- Retimers may add latency
📋 Planning Guidance
- 2024: PCIe Gen4 NVMe is cost-effective. Gen5 SSDs available but premium-priced.
- 2025: Gen5 SSDs mainstream. Fewer drives, simpler topologies.
- 2026-27: Gen6 silicon expected. Single-SSD 25+ GB/s. CXL 3.0 may shift architecture.
PCIe Topology Matters
❌ Bad: Through CPU
GPU
↓ x16
CPU (Root Complex)
↓ x4
NVMe SSD
+10-20 µs latency, CPU BW consumed
✓ Good: PCIe Switch
GPU
↓ x16
PCIe Switch
↓ x4
NVMe SSD
Direct P2P DMA, lowest latency
NUMA & PCIe Topology Deep Dive
🚨 CRITICAL
Incorrect NUMA/PCIe topology is the #1 cause of unexplained performance degradation. Cross-NUMA access adds 2-5µs per I/O and can reduce throughput by 30-50%.
Understanding NUMA Topology
Bash

```bash
# Check NUMA topology with numactl
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-31
node 0 size: 256000 MB
node 1 cpus: 32-63
node 1 size: 256000 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

# Key insight: distance 21 vs 10 means ~2x latency for cross-NUMA access
```
NVIDIA GPU Topology Matrix
Bash

```bash
# nvidia-smi topo -m shows GPU-to-device relationships
$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  mlx5_0  nvme0  NUMA
GPU0     X    NV12  NV12  NV12  PIX     NODE   0
GPU1    NV12   X    NV12  NV12  NODE    NODE   0
GPU2    NV12  NV12   X    NV12  SYS     SYS    1
GPU3    NV12  NV12  NV12   X    SYS     SYS    1
nvme0   NODE  NODE  SYS   SYS   NODE    X      0

Legend:
  PIX  = Same PCIe switch        (FASTEST: <1µs)
  PHB  = PCIe host bridge        (FAST:   +1-2µs)
  NODE = Cross-NUMA node         (MEDIUM: +2-5µs)
  SYS  = Cross-socket via QPI/UPI (SLOW:  +5-10µs)
```
⚡ Reading the Matrix
Look at GPU→nvme relationships. PIX or PXB is optimal. NODE means cross-NUMA (2-5µs penalty). SYS means cross-socket (5-10µs penalty). Always map GPUs to SSDs on the same NUMA node!
Cross-NUMA Penalty
| Topology | Relationship | Latency | BW Impact | Verdict |
|---|---|---|---|---|
| Same PCIe Switch | PIX / PXB | +0.5-1µs | ~100% | ✓ Ideal |
| Same NUMA, diff switch | PHB | +1-2µs | ~95% | ✓ Good |
| Cross-NUMA (same socket) | NODE | +2-5µs | 70-85% | ⚠ Avoid |
| Cross-Socket (QPI/UPI) | SYS | +5-10µs | 50-70% | ✗ Never |
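The same-NUMA rule can be checked programmatically: the kernel exposes each PCIe device's NUMA node at `/sys/bus/pci/devices/<bdf>/numa_node`. A minimal sketch (the helper names and the `sysfs_root` parameter are mine, added so the logic is testable outside a real machine):

```python
from pathlib import Path

def numa_node(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """NUMA node of a PCIe device (by domain:bus:device.function); -1 if unknown."""
    p = Path(sysfs_root) / bdf / "numa_node"
    return int(p.read_text().strip()) if p.exists() else -1

def same_numa(gpu_bdf: str, nvme_bdf: str,
              sysfs_root: str = "/sys/bus/pci/devices") -> bool:
    """True when GPU and NVMe sit on the same (known) NUMA node."""
    g = numa_node(gpu_bdf, sysfs_root)
    return g >= 0 and g == numa_node(nvme_bdf, sysfs_root)
```

Usage would look like `same_numa("0000:41:00.0", "0000:42:00.0")`; note that `numa_node` reads -1 on single-socket systems where the kernel reports no affinity.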
PCIe ACS Configuration
Bash

```bash
# Check ACS status on PCIe bridges
$ lspci -vvv | grep -i "Access Control"
Capabilities: [148 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpFwd+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpFwd-

# If ACS blocks P2P, disable it (boot parameter)
GRUB_CMDLINE_LINUX="pcie_acs_override=downstream,multifunction"

# Verify P2P is working
$ nvidia-smi topo -p2p r
        GPU0  GPU1  nvme0
GPU0     X    OK    OK     ← Should show "OK"
nvme0   OK    OK    X
```
⚠️ ACS Security Trade-off
Disabling ACS enables P2P but weakens IOMMU isolation. In multi-tenant environments, consider dedicated PCIe switches rather than disabling ACS globally.
🚨 Production Rule
Always run nvidia-smi topo -m before deploying any GPU storage workload. If you see "SYS" between a GPU and its intended NVMe, STOP and fix the topology. A 30-50% performance loss is otherwise guaranteed.
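This rule is easy to enforce in a pre-flight script by parsing the topo matrix. A simplified sketch (the `gpu_nvme_links` helper and the whitespace-splitting assumption are mine; real `nvidia-smi topo -m` output varies slightly across driver versions, so treat this as a starting point):

```python
def gpu_nvme_links(topo_matrix: str) -> dict:
    """Parse `nvidia-smi topo -m`-style text into {gpu: {nvme: link_type}}.

    Assumes whitespace-separated columns, with the header row first.
    """
    rows = [l.split() for l in topo_matrix.strip().splitlines() if l.split()]
    headers = rows[0]
    nvme_cols = {h: i for i, h in enumerate(headers) if h.startswith("nvme")}
    links = {}
    for row in rows[1:]:
        if row[0].startswith("GPU"):
            # row[0] is the device name, so data columns are shifted by one
            links[row[0]] = {n: row[1 + i] for n, i in nvme_cols.items()}
    return links

SAMPLE = """\
        GPU0  GPU1  GPU2  GPU3  mlx5_0  nvme0  NUMA
GPU0     X    NV12  NV12  NV12  PIX     NODE   0
GPU2    NV12  NV12   X    NV12  SYS     SYS    1
"""

for gpu, nvmes in gpu_nvme_links(SAMPLE).items():
    for nvme, link in nvmes.items():
        if link in ("SYS", "NODE"):
            print(f"WARNING: {gpu} -> {nvme} is {link}; fix topology before deploying")
```

In CI, the same check can run against `subprocess.run(["nvidia-smi", "topo", "-m"], ...)` output and fail the deployment when any GPU-to-NVMe link is SYS.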
PCIe P2P BAR Mapping Deep Dive
🔧 The Actual Mechanism
GPUDirect Storage works by mapping NVMe's Base Address Registers (BARs) into GPU-accessible address space. Understanding BARs is essential for debugging P2P failures.
BAR Types in GPU-Storage P2P
| BAR | Purpose | Size | P2P Role |
|---|---|---|---|
| BAR0 | Controller registers | 16KB typical | Doorbell access for submission |
| BAR1 (GPU) | GPU framebuffer | 256MB - 64GB | Target for NVMe DMA writes |
| BAR2/BAR4 (NVMe) | Controller Memory Buffer (CMB) | 0 - 128MB | Optional: SQ/CQ in CMB |
Bash - Inspect BAR Configuration

```bash
# View GPU BAR configuration
$ lspci -vvv -s 41:00.0 | grep -A5 "Region"
Region 0: Memory at fb00000000 (64-bit, prefetchable) [size=256M]  # Config
Region 1: Memory at e000000000 (64-bit, prefetchable) [size=32G]   # BAR1 - Framebuffer
Region 3: Memory at fc02000000 (64-bit, prefetchable) [size=32M]   # BAR2

# View NVMe BAR configuration
$ lspci -vvv -s 01:00.0 | grep -A3 "Region"
Region 0: Memory at fb200000 (64-bit, non-prefetchable) [size=16K]  # Controller

# Check if GPU BAR1 is large enough for P2P
$ nvidia-smi --query-gpu=name,memory.total,bar1.total --format=csv
name, memory.total [MiB], bar1.total [MiB]
NVIDIA H100 80GB HBM3, 81559 MiB, 131072 MiB   # 128GB BAR1 = good for P2P

# Verify P2P BAR1 mapping works
$ cat /proc/driver/nvidia/gpus/0000:41:00.0/information
GPU UUID: GPU-xxxxx
BAR1 Size: 128 GB
P2P Capable: Yes
```
⚠️ BAR1 Size Matters
- Small BAR1 (256MB): Only config access, no direct P2P data transfers
- Large BAR1 (≥8GB): Required for GDS/P2P data transfers
- Resizable BAR: Enable in BIOS ("Above 4G Decoding" + "Resizable BAR")
Bash - Enable Resizable BAR

```bash
# Check current BAR1 size
$ nvidia-smi --query-gpu=bar1.total --format=csv,noheader
256 MiB    # Too small for P2P!

# Enable in BIOS:
#   1. Advanced → PCI Subsystem Settings → Above 4G Decoding: Enabled
#   2. Advanced → PCI Subsystem Settings → Resizable BAR Support: Enabled

# After reboot, verify
$ nvidia-smi --query-gpu=bar1.total --format=csv,noheader
131072 MiB   # 128GB - P2P enabled!

# Kernel parameter for older systems
GRUB_CMDLINE_LINUX="pci=realloc,assign-busses"
```
IOMMU & VFIO Deep Dive
🚨 IOMMU: Friend or Foe?
IOMMU provides memory isolation but can BLOCK P2P transfers. Understanding IOMMU groups and bypass mechanisms is critical for GDS deployments.
IOMMU Modes for GPU-Storage
| Mode | P2P Status | Security | Use Case |
|---|---|---|---|
| IOMMU Off | Works | None | Dedicated bare-metal |
| IOMMU Passthrough | Works | Partial | Recommended for GDS |
| IOMMU Strict | Blocked | Full | Multi-tenant, VMs |
| VFIO + P2P | Requires setup | Full | VM passthrough with P2P |
Bash - IOMMU Configuration

```bash
# Check IOMMU status (Intel reports DMAR, AMD reports AMD-Vi)
$ dmesg | grep -i iommu
[    0.000000] DMAR: IOMMU enabled
[    0.123456] AMD-Vi: IOMMU performance counters supported

# View IOMMU groups (devices in the same group can P2P)
$ for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=$(basename $(dirname $(dirname $d)))
    echo "IOMMU Group $n: $(lspci -nns ${d##*/})"
  done | grep -E "NVIDIA|NVMe"
IOMMU Group 15: 41:00.0 3D controller: NVIDIA H100 [10de:2330]
IOMMU Group 15: 42:00.0 NVMe: Samsung PM9A3 [144d:a80a]   # Same group = P2P OK
IOMMU Group 28: 81:00.0 NVMe: Intel P5800X [8086:0a54]    # Different group!

# Enable IOMMU passthrough mode (recommended for GDS)
# /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"   # Intel
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"     # AMD

# Apply and reboot
$ sudo update-grub && sudo reboot
```
Bash - VFIO Setup for GPU Passthrough with P2P

```bash
# Bind GPU and NVMe to VFIO (for VM passthrough)

# 1. Load VFIO modules
$ sudo modprobe vfio-pci

# 2. Unbind from native drivers
$ echo "0000:41:00.0" | sudo tee /sys/bus/pci/devices/0000:41:00.0/driver/unbind
$ echo "0000:42:00.0" | sudo tee /sys/bus/pci/devices/0000:42:00.0/driver/unbind

# 3. Bind to VFIO
$ echo "10de 2330" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id   # GPU
$ echo "144d a80a" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id   # NVMe

# 4. Verify VFIO binding
$ ls -la /dev/vfio/
total 0
drwxr-xr-x  2 root root       80 Dec 29 10:00 .
drwxr-xr-x 21 root root     4200 Dec 29 10:00 ..
crw-------  1 root root 243,   0 Dec 29 10:00 15    # IOMMU group 15
crw-rw-rw-  1 root root  10, 196 Dec 29 10:00 vfio

# 5. QEMU/KVM command for passthrough with P2P
$ qemu-system-x86_64 \
    -machine q35,accel=kvm,kernel-irqchip=split \
    -cpu host \
    -device vfio-pci,host=41:00.0,multifunction=on \
    -device vfio-pci,host=42:00.0 \
    ...
```
⚡ P2P in VMs: The Trick
- GPU and NVMe must be in the same IOMMU group
- Use pcie_acs_override if they're not (security trade-off)
- Pass both devices to the same VM
- Inside VM: standard GDS setup works
Bash - Debugging P2P Failures

```bash
# P2P diagnostic checklist

# 1. Check if P2P is possible
$ nvidia-smi topo -p2p r
        GPU0  nvme0
GPU0     X    OK      # OK = P2P possible
nvme0   OK    X       # NS = Not Supported

# 2. If "NS", check IOMMU groups
$ find /sys/kernel/iommu_groups -name "41:00.0" -o -name "42:00.0"
/sys/kernel/iommu_groups/15/devices/0000:41:00.0
/sys/kernel/iommu_groups/28/devices/0000:42:00.0   # Different groups!

# 3. Check ACS on path
$ lspci -vvv -s 00:01.0 | grep -A10 "Access Control"
ACSCtl: SrcValid+ TransBlk+   # Enabled = blocks P2P

# 4. Check BAR sizes
$ nvidia-smi --query-gpu=bar1.total --format=csv
131072 MiB   # Need >256MB for P2P

# 5. Check PCIe link
$ lspci -vvv -s 41:00.0 | grep -E "LnkCap|LnkSta"
LnkCap: Speed 16GT/s, Width x16
LnkSta: Speed 16GT/s, Width x16   # Matches = good

# 6. GDS-specific check
$ /usr/local/cuda/gds/tools/gdscheck -p
GDS cuFile configuration:
  Properties File: /etc/cufile.json
  Platform compatibility: SUPPORTED
  P2P support: ENABLED
```
Multi-Vendor GPU Storage
⚡ Beyond NVIDIA
While GDS is NVIDIA-specific, AMD and Intel GPUs have their own direct storage paths. Production deployments increasingly need vendor-neutral strategies.
AMD ROCm vs NVIDIA GDS
| Feature | NVIDIA GDS | AMD ROCm |
|---|---|---|
| Direct Storage API | cuFile API | hipMemcpy + RDMA |
| P2P DMA | GPUDirect Storage | PCIe P2P (ROCm 5.0+) |
| RDMA Support | GPUDirect RDMA | ROCm RDMA |
| Max Throughput | ~14 GB/s (H100) | ~12 GB/s (MI300X) |
Intel GPU Storage
| Intel GPU | Storage Path | Status |
|---|---|---|
| Data Center GPU Max 1550 | oneAPI Level Zero + DMAbuf | Limited support |
| Flex Series | Standard PCIe DMA | Available |
| Arc (Consumer) | System memory only | Not supported |
Vendor-Neutral Strategy
Python

```python
# RAPIDS kvikio: vendor-neutral GPU I/O library
import kvikio

# Works across GPU vendors with the appropriate backend
with kvikio.CuFile("/data/model.bin", "r") as f:
    # Automatically uses GDS on NVIDIA, falls back on others
    data = f.read(gpu_buffer)
```
GPU Interconnect Technologies
| Technology | Bandwidth | Latency | Scope | Storage Use |
|---|---|---|---|---|
| NVLink 4.0 | 900 GB/s (bidirectional) | ~0.5-1 µs | GPU-to-GPU | Tensor transfer |
| NVSwitch | 900 GB/s all-to-all | ~1 µs | Intra-node fabric | Checkpoint aggregation |
| PCIe Gen5 x16 | 64 GB/s | ~2-3 µs | GPU↔NVMe, GPU↔NIC | GDS, GPUDirect RDMA |
| InfiniBand NDR | 400 Gbps (50 GB/s) | ~1-2 µs | Inter-node | NVMe-oF, RDMA storage |
| RoCEv2 | 400 Gbps | ~2-5 µs | Inter-node (Ethernet) | NVMe-oF, cloud |
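These bandwidth figures translate directly into checkpoint-movement budgets. A back-of-envelope sketch (the 80 GB checkpoint size is an illustrative assumption, roughly one H100's HBM; the numbers are lower bounds that ignore latency and protocol overhead):

```python
# Peak bandwidths from the table above (GB/s).
LINKS = {
    "NVLink 4.0": 900,
    "PCIe Gen5 x16": 64,
    "InfiniBand NDR": 50,
}

def transfer_seconds(size_gb: float, bw_gbs: float) -> float:
    """Lower-bound transfer time: ignores latency and protocol overhead."""
    return size_gb / bw_gbs

CKPT_GB = 80  # illustrative: roughly one H100's worth of HBM
for name, bw in LINKS.items():
    print(f"{name}: {transfer_seconds(CKPT_GB, bw):.2f} s for {CKPT_GB} GB")
```

The takeaway: intra-node aggregation over NVLink is an order of magnitude cheaper than pushing the same bytes over PCIe or the network, which is why checkpoint patterns below aggregate before they write.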
GPUDirect Technology Stack
GPUDirect Storage LOCAL
Direct DMA between local NVMe and GPU memory. Minimal CPU in bulk transfer path after setup.
Use case: Checkpoint read/write, dataset loading

GPUDirect RDMA NETWORK
Direct DMA between GPU memory and a remote node via InfiniBand/RoCE.
Use case: NCCL collectives, distributed checkpointing

GPUDirect P2P INTRA-NODE
Direct GPU↔GPU transfers via NVLink or PCIe.
Use case: Tensor parallelism, pipeline parallelism
Distributed Checkpoint Patterns
Pattern 1: Aggregated Checkpointing
Rank 0 alone writes the full state to storage while the other ranks wait at a barrier. Works when each rank holds a complete copy of the state (e.g., DDP).
Python

```python
import torch
import torch.distributed as dist

def save_checkpoint_aggregated(model, optimizer, path):
    # Under DDP the full state_dict is replicated on every rank,
    # so rank 0 already holds everything it needs to write.
    state = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }
    if dist.get_rank() == 0:
        # Only rank 0 writes - avoids storage contention
        torch.save(state, path)
    dist.barrier()  # Synchronize all ranks
```
Pattern 2: Sharded Checkpointing
Each rank writes its own shard. For very large models with parallel writes.
Python

```python
import torch.distributed as dist

def save_checkpoint_sharded(model, optimizer, base_path):
    rank = dist.get_rank()
    shard_path = f"{base_path}/shard_{rank}.pt"
    # Each GPU writes its own local shard
    state = {
        'model_shard': get_local_state(model),
        'optimizer_shard': get_local_optimizer_state(optimizer),
    }
    # Use GDS for a direct GPU → NVMe write
    with cufile.open(shard_path, 'wb') as f:
        f.write(serialize(state))
    dist.barrier()
```
Cluster Storage Topology
Best Practice
Use local NVMe with GDS for training data (maximum speed), and shared storage (Lustre/WekaFS) for checkpoints (durability and accessibility across nodes).