C.4 • Deploy Phase

Architecture & Topology

PCIe Gen5/6 impact, NUMA topology deep dive, multi-vendor GPU support, GPU interconnects, and distributed checkpoint patterns.

1. PCIe Gen5/Gen6 Impact

⚡ Bandwidth Doubling PCIe Gen5 doubles bandwidth vs Gen4 (16 GT/s → 32 GT/s per lane). Gen6 doubles again to 64 GT/s using PAM4 signaling (silicon 2025). This changes the GPU-storage balance—but there are caveats.

PCIe Generation Comparison

Generation   Per-Lane Rate   x4 NVMe BW   x16 GPU BW   Availability
PCIe 4.0     16 GT/s         ~7 GB/s      ~32 GB/s     Ubiquitous
PCIe 5.0     32 GT/s         ~14 GB/s     ~64 GB/s     Server (2023+)
PCIe 6.0     64 GT/s         ~28 GB/s     ~128 GB/s    2025 (emerging)
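The table's bandwidth figures follow from the transfer rate and line encoding. A quick sanity check (Gen4/Gen5 use 128b/130b encoding; Gen6's PAM4 + FLIT efficiency differs slightly, so treat the same formula as approximate there):

```python
def lane_gb_s(gt_s, encoding_eff=128 / 130):
    """GB/s per lane per direction: transfer rate x encoding efficiency / 8 bits."""
    return gt_s * encoding_eff / 8

def link_gb_s(gt_s, lanes):
    """Raw link bandwidth before protocol (TLP/DLLP) overhead."""
    return lane_gb_s(gt_s) * lanes

print(round(link_gb_s(16, 4), 1))    # 7.9  - Gen4 x4 NVMe, ~7 GB/s after overhead
print(round(link_gb_s(32, 16), 1))   # 63.0 - Gen5 x16 GPU link, ~64 GB/s nominal
```

Protocol overhead (headers, flow control) trims a further 5-10%, which is why a Gen4 x4 SSD delivers ~7 GB/s rather than the raw 7.9 GB/s.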

Benefits vs Caveats

Benefits

  • Single NVMe can saturate older GPU links
  • Fewer SSDs needed for same throughput
  • Better GPU-to-SSD bandwidth ratio
  • Lower PCIe slot count requirements
  • Enables larger DMA transfers efficiently

Caveats

  • NAND is still NAND—latency unchanged
  • Internal SSD parallelism must increase
  • Power consumption increases
  • Signal integrity challenges (higher rates force shorter traces)
  • Retimers may add latency

📋 Planning Guidance
  • 2024: PCIe Gen4 NVMe is cost-effective. Gen5 SSDs available but premium-priced.
  • 2025: Gen5 SSDs mainstream. Fewer drives, simpler topologies.
  • 2026-27: Gen6 silicon expected. Single-SSD 25+ GB/s. CXL 3.0 may shift architecture.

PCIe Topology Matters

❌ Bad: Through CPU
GPU
↓ x16
CPU (Root Complex)
↓ x4
NVMe SSD
+10-20 µs latency, CPU BW consumed
✓ Good: PCIe Switch
GPU
↓ x16
PCIe Switch
↓ x4
NVMe SSD
Direct P2P DMA, lowest latency
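Whether a GPU and an NVMe share a switch is visible in sysfs: the resolved device path lists every bridge on the route. A sketch of the check — the sample paths and the two-shared-bridge heuristic are illustrative; real paths come from `os.path.realpath('/sys/bus/pci/devices/<bdf>')`:

```python
import re

# A BDF-shaped path component, e.g. "0000:40:01.1"
BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

def shares_pcie_switch(sysfs_path_a, sysfs_path_b):
    """True when two sysfs device paths share a bridge below the root port,
    i.e. the devices hang off a common PCIe switch."""
    def bridges(path):
        # Every BDF-shaped component except the endpoint itself is a bridge
        return [p for p in path.split("/") if BDF.match(p)][:-1]
    shared = [a for a, b in zip(bridges(sysfs_path_a), bridges(sysfs_path_b))
              if a == b]
    # One shared component is just the root port; two or more implies a
    # common switch upstream port on the route
    return len(shared) >= 2

# Hypothetical paths: GPU and SSD behind the same switch
gpu = "/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:04.0/0000:46:00.0"
ssd = "/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:08.0/0000:47:00.0"
print(shares_pcie_switch(gpu, ssd))   # True
```

If the function returns False for a GPU/SSD pair, their traffic is routed through the CPU root complex — the "Bad" topology above.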
2. NUMA & PCIe Topology Deep Dive

🚨 CRITICAL Incorrect NUMA/PCIe topology is the #1 cause of unexplained performance degradation. A path that crosses NUMA nodes adds 2-10µs per I/O and can reduce throughput by 30-50%.

Understanding NUMA Topology

Bash
# Check NUMA topology with numactl
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-31
node 0 size: 256000 MB
node 1 cpus: 32-63
node 1 size: 256000 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

# Key insight: Distance 21 vs 10 means ~2x latency for cross-NUMA

NVIDIA GPU Topology Matrix

Bash
# nvidia-smi topo -m shows GPU-to-device relationships
$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  nvme0   NUMA
GPU0     X      NV12    NV12    NV12    PIX     NODE    0
GPU1    NV12     X      NV12    NV12    NODE    NODE    0
GPU2    NV12    NV12     X      NV12    SYS     SYS     1
GPU3    NV12    NV12    NV12     X      SYS     SYS     1
nvme0   NODE    NODE    SYS     SYS     NODE     X      0

Legend:
  PIX  = Same PCIe switch (FASTEST: <1µs)
  PXB  = Multiple PCIe bridges, no host bridge (FAST: ~1µs)
  PHB  = Through the PCIe host bridge (FAST: +1-2µs)
  NODE = Same NUMA node, crosses host bridges (MEDIUM: +2-5µs)
  SYS  = Crosses the NUMA/socket interconnect via QPI/UPI (SLOW: +5-10µs)
⚡ Reading the Matrix Look at GPU→nvme relationships. PIX or PXB is optimal. NODE means the path crosses host bridges within one NUMA node (2-5µs penalty). SYS means it crosses the QPI/UPI interconnect between NUMA nodes (5-10µs penalty). Always map GPUs to SSDs on the same NUMA node!

Cross-NUMA Penalty

Topology                     Relationship   Latency    BW Impact   Verdict
Same PCIe switch             PIX / PXB      +0.5-1µs   ~100%       ✓ Ideal
Same host bridge             PHB            +1-2µs     ~95%        ✓ Good
Same NUMA, diff host bridge  NODE           +2-5µs     70-85%      ⚠ Avoid
Cross-NUMA/socket (QPI/UPI)  SYS            +5-10µs    50-70%      ✗ Never

PCIe ACS Configuration

Bash
# Check ACS status on PCIe bridges
$ sudo lspci -vvv | grep -iA2 "Access Control"
    Capabilities: [148 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpFwd+
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpFwd-
        
# If ACS blocks P2P, disable it (boot parameter)
GRUB_CMDLINE_LINUX="pcie_acs_override=downstream,multifunction"

# Verify GPU P2P is working (GPU-GPU only; the NVMe path is verified
# with gdscheck, covered later in this section)
$ nvidia-smi topo -p2p r
        GPU0    GPU1
GPU0     X      OK      ← Should show "OK"
GPU1    OK       X
⚠️ ACS Security Trade-off Disabling ACS enables P2P but weakens IOMMU isolation. In multi-tenant environments, consider dedicated PCIe switches rather than disabling ACS globally.
🚨 Production Rule Always run nvidia-smi topo -m before deploying any GPU storage workload. If you see "SYS" between a GPU and its intended NVMe, STOP and fix the topology. A 30-50% performance loss is guaranteed.
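That production rule can be enforced mechanically in a pre-deployment gate. A sketch that scans a topology matrix formatted like the sample above (the function name and the inline sample are illustrative):

```python
def flag_bad_paths(topo_matrix):
    """Return (device, peer) pairs whose link is SYS and involve an NVMe."""
    rows = [line.split() for line in topo_matrix.strip().splitlines()]
    headers = rows[0]
    bad = []
    for row in rows[1:]:
        dev = row[0]
        for peer, link in zip(headers, row[1:]):
            if link == "SYS" and ("nvme" in dev or "nvme" in peer):
                bad.append((dev, peer))
    return bad

matrix = """        GPU0    GPU1    GPU2    nvme0   NUMA
GPU0     X      NV12    NV12    NODE    0
GPU1    NV12     X      NV12    NODE    0
GPU2    NV12    NV12     X      SYS     1
nvme0   NODE    NODE    SYS      X      0"""
print(flag_bad_paths(matrix))   # [('GPU2', 'nvme0'), ('nvme0', 'GPU2')]
```

An empty result means no GPU is paired with an NVMe across the socket interconnect; a non-empty result should fail the deployment.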

PCIe P2P BAR Mapping Deep Dive

🔧 The Actual Mechanism GPUDirect Storage works by mapping NVMe's Base Address Registers (BARs) into GPU-accessible address space. Understanding BARs is essential for debugging P2P failures.

BAR Types in GPU-Storage P2P

BAR               Purpose                         Size          P2P Role
BAR0 (NVMe)       Controller registers            16KB typical  Doorbell access for submission
BAR1 (GPU)        GPU framebuffer aperture        256MB - 64GB  Target for NVMe DMA writes
BAR2/BAR4 (NVMe)  Controller Memory Buffer (CMB)  0 - 128MB     Optional: SQ/CQ in CMB
Bash - Inspect BAR Configuration
# View GPU BAR configuration
$ lspci -vvv -s 41:00.0 | grep -A5 "Region"
    Region 0: Memory at fb00000000 (64-bit, prefetchable) [size=256M]   # Config
    Region 1: Memory at e000000000 (64-bit, prefetchable) [size=32G]    # BAR1 - Framebuffer
    Region 3: Memory at fc02000000 (64-bit, prefetchable) [size=32M]    # BAR2

# View NVMe BAR configuration
$ lspci -vvv -s 01:00.0 | grep -A3 "Region"
    Region 0: Memory at fb200000 (64-bit, non-prefetchable) [size=16K]  # Controller

# Check if GPU BAR1 is large enough for P2P
$ nvidia-smi --query-gpu=name,memory.total,bar1.total --format=csv
name, memory.total [MiB], BAR1.total [MiB]
NVIDIA H100 80GB HBM3, 81559 MiB, 131072 MiB   # 128GB BAR1 = good for P2P

# Verify P2P BAR1 mapping works
$ cat /proc/driver/nvidia/gpus/0000:41:00.0/information
GPU UUID: GPU-xxxxx
BAR1 Size: 128 GB
P2P Capable: Yes
⚠️ BAR1 Size Matters
  • Small BAR1 (256MB): Only config access, no direct P2P data transfers
  • Large BAR1 (≥8GB): Required for GDS/P2P data transfers
  • Resizable BAR: Enable in BIOS ("Above 4G Decoding" + "Resizable BAR")
Bash - Enable Resizable BAR
# Check current BAR size
$ nvidia-smi --query-gpu=bar1.total --format=csv,noheader
256 MiB    # Too small for P2P!

# Enable in BIOS:
# 1. Advanced → PCI Subsystem Settings → Above 4G Decoding: Enabled
# 2. Advanced → PCI Subsystem Settings → Resizable BAR Support: Enabled

# After reboot, verify
$ nvidia-smi --query-gpu=bar1.total --format=csv,noheader
131072 MiB   # 128GB - P2P enabled!

# Kernel parameter for older systems
GRUB_CMDLINE_LINUX="pci=realloc,assign-busses"
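A small helper can turn the BAR1 query into a go/no-go check. A sketch — the 8 GB floor follows the sizing guidance above and is a deployment policy, not a hard driver limit:

```python
def bar1_sufficient(csv_value, min_mib=8192):
    """csv_value is one line of `nvidia-smi --query-gpu=bar1.total
    --format=csv,noheader` output, e.g. '131072 MiB'."""
    mib = int(csv_value.strip().split()[0])
    return mib >= min_mib

print(bar1_sufficient("256 MiB"))      # False - config-only BAR, no P2P data path
print(bar1_sufficient("131072 MiB"))   # True  - resizable BAR enabled
```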

IOMMU & VFIO Deep Dive

🚨 IOMMU: Friend or Foe? IOMMU provides memory isolation but can BLOCK P2P transfers. Understanding IOMMU groups and bypass mechanisms is critical for GDS deployments.

IOMMU Modes for GPU-Storage

Mode               P2P Status      Security  Use Case
IOMMU Off          Works           None      Dedicated bare-metal
IOMMU Passthrough  Works           Partial   Recommended for GDS
IOMMU Strict       Blocked         Full      Multi-tenant, VMs
VFIO + P2P         Requires setup  Full      VM passthrough with P2P
Bash - IOMMU Configuration
# Check IOMMU status
$ dmesg | grep -i iommu
[    0.000000] DMAR: IOMMU enabled
[    0.123456] AMD-Vi: IOMMU performance counters supported

# View IOMMU groups (devices in same group can P2P)
$ for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=$(basename $(dirname $(dirname $d)))
    echo "IOMMU Group $n: $(lspci -nns ${d##*/})"
done | grep -E "NVIDIA|NVMe"

IOMMU Group 15: 41:00.0 3D controller: NVIDIA H100 [10de:2330]
IOMMU Group 15: 42:00.0 NVMe: Samsung PM9A3 [144d:a80a]   # Same group = P2P OK
IOMMU Group 28: 81:00.0 NVMe: Intel P5800X [8086:0a54]    # Different group!

# Enable IOMMU passthrough mode (recommended for GDS)
# /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"   # Intel
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"     # AMD

# Apply and reboot
$ sudo update-grub && sudo reboot
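The group listing can also be checked programmatically. A sketch that parses lines shaped like the output above (`same_iommu_group` is a hypothetical helper, not part of any tool):

```python
import re

def same_iommu_group(listing, bdf_a, bdf_b):
    """True when both BDFs appear in the same IOMMU group - a prerequisite
    for P2P under VFIO and for passing both devices to one VM."""
    groups = {}
    for m in re.finditer(r"IOMMU Group (\d+): ([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])",
                         listing):
        groups[m.group(2)] = int(m.group(1))
    a, b = groups.get(bdf_a), groups.get(bdf_b)
    return a is not None and a == b

listing = """IOMMU Group 15: 41:00.0 3D controller: NVIDIA H100
IOMMU Group 15: 42:00.0 NVMe: Samsung PM9A3
IOMMU Group 28: 81:00.0 NVMe: Intel P5800X"""
print(same_iommu_group(listing, "41:00.0", "42:00.0"))   # True  - same group
print(same_iommu_group(listing, "41:00.0", "81:00.0"))   # False - different groups
```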
Bash - VFIO Setup for GPU Passthrough with P2P
# Bind GPU and NVMe to VFIO (for VM passthrough)

# 1. Load VFIO modules
$ sudo modprobe vfio-pci

# 2. Unbind from native drivers
$ echo "0000:41:00.0" | sudo tee /sys/bus/pci/devices/0000:41:00.0/driver/unbind
$ echo "0000:42:00.0" | sudo tee /sys/bus/pci/devices/0000:42:00.0/driver/unbind

# 3. Bind to VFIO
$ echo "10de 2330" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id  # GPU
$ echo "144d a80a" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id  # NVMe

# 4. Verify VFIO binding
$ ls -la /dev/vfio/
total 0
drwxr-xr-x  2 root root      80 Dec 29 10:00 .
drwxr-xr-x 21 root root    4200 Dec 29 10:00 ..
crw-------  1 root root 243, 0 Dec 29 10:00 15   # IOMMU group 15
crw-rw-rw-  1 root root 10, 196 Dec 29 10:00 vfio

# 5. QEMU/KVM command for passthrough with P2P
$ qemu-system-x86_64 \
    -device vfio-pci,host=41:00.0,multifunction=on \
    -device vfio-pci,host=42:00.0 \
    -machine q35,accel=kvm,kernel-irqchip=split \
    -cpu host \
    ...
⚡ P2P in VMs: The Trick
  • GPU and NVMe must be in the same IOMMU group
  • Use pcie_acs_override if they're not (security trade-off)
  • Pass both devices to the same VM
  • Inside VM: standard GDS setup works
Bash - Debugging P2P Failures
# P2P diagnostic checklist

# 1. Check if P2P is possible
$ nvidia-smi topo -p2p r
        GPU0    nvme0n1
GPU0     X      OK      # OK = P2P possible
nvme0   OK       X       # NS = Not Supported

# 2. If "NS", check IOMMU groups (sysfs uses the full 0000: domain prefix)
$ find /sys/kernel/iommu_groups -name "0000:41:00.0" -o -name "0000:42:00.0"
/sys/kernel/iommu_groups/15/devices/0000:41:00.0
/sys/kernel/iommu_groups/28/devices/0000:42:00.0   # Different groups!

# 3. Check ACS on path
$ lspci -vvv -s 00:01.0 | grep -A10 "Access Control"
    ACSCtl: SrcValid+ TransBlk+   # Enabled = blocks P2P

# 4. Check BAR sizes
$ nvidia-smi --query-gpu=bar1.total --format=csv
131072 MiB   # Need >256MB for P2P

# 5. Check PCIe link
$ lspci -vvv -s 41:00.0 | grep -E "LnkCap|LnkSta"
    LnkCap: Speed 16GT/s, Width x16
    LnkSta: Speed 16GT/s, Width x16   # Matches = good

# 6. GDS-specific check
$ /usr/local/cuda/gds/tools/gdscheck -p
GDS cuFile configuration:
    Properties File: /etc/cufile.json
    Platform compatibility: SUPPORTED
    P2P support: ENABLED
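The checklist's final gate can be scripted too. A sketch keyed to the gdscheck output fields shown above (`gds_ready` is a hypothetical helper):

```python
def gds_ready(gdscheck_output):
    """True when `gdscheck -p` output reports both platform support and P2P."""
    return ("Platform compatibility: SUPPORTED" in gdscheck_output
            and "P2P support: ENABLED" in gdscheck_output)

report = """GDS cuFile configuration:
    Properties File: /etc/cufile.json
    Platform compatibility: SUPPORTED
    P2P support: ENABLED"""
print(gds_ready(report))   # True
```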
3. Multi-Vendor GPU Storage

⚡ Beyond NVIDIA While GDS is NVIDIA-specific, AMD and Intel GPUs have their own direct storage paths. Production deployments increasingly need vendor-neutral strategies.

AMD ROCm vs NVIDIA GDS

Feature             NVIDIA GDS         AMD ROCm
Direct Storage API  cuFile API         hipMemcpy + RDMA
P2P DMA             GPUDirect Storage  PCIe P2P (ROCm 5.0+)
RDMA Support        GPUDirect RDMA     ROCm RDMA
Max Throughput      ~14 GB/s (H100)    ~12 GB/s (MI300X)

Intel GPU Storage

Intel GPU                 Storage Path                Status
Data Center GPU Max 1550  oneAPI Level Zero + DMAbuf  Limited support
Flex Series               Standard PCIe DMA           Available
Arc (Consumer)            System memory only          Not supported

Vendor-Neutral Strategy

Python
# RAPIDS kvikio: Vendor-neutral GPU I/O library
import cupy
import kvikio

# Destination buffer in GPU memory
gpu_buffer = cupy.empty(1024 ** 2, dtype=cupy.uint8)

# Automatically uses GDS on NVIDIA; falls back to a POSIX bounce path elsewhere
with kvikio.CuFile("/data/model.bin", "r") as f:
    nbytes = f.read(gpu_buffer)   # returns the number of bytes read
4. GPU Interconnect Technologies

Technology      Bandwidth                 Latency    Scope                  Storage Use
NVLink 4.0      900 GB/s (bidirectional)  ~0.5-1 µs  GPU-to-GPU             Tensor transfer
NVSwitch        900 GB/s all-to-all       ~1 µs      Intra-node fabric      Checkpoint aggregation
PCIe Gen5 x16   64 GB/s                   ~2-3 µs    GPU↔NVMe, GPU↔NIC      GDS, GPUDirect RDMA
InfiniBand NDR  400 Gbps (50 GB/s)        ~1-2 µs    Inter-node             NVMe-oF, RDMA storage
RoCEv2          400 Gbps                  ~2-5 µs    Inter-node (Ethernet)  NVMe-oF, cloud
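The table's storage implications become concrete with a back-of-envelope transfer time. A sketch — the 80 GiB shard size is illustrative and protocol overhead is ignored:

```python
def transfer_seconds(size_gib, link_gb_s):
    """Lower-bound time to move size_gib GiB over a link_gb_s GB/s link."""
    return size_gib * (1024 ** 3) / (link_gb_s * 1e9)

# Moving an 80 GiB checkpoint shard over each fabric:
print(round(transfer_seconds(80, 900), 2))   # 0.1  - NVLink 4.0
print(round(transfer_seconds(80, 64), 2))    # 1.34 - PCIe Gen5 x16
print(round(transfer_seconds(80, 50), 2))    # 1.72 - InfiniBand NDR
```

This is why checkpoint aggregation over NVLink/NVSwitch followed by a single GDS write often beats having every rank push its state over the inter-node fabric.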
5. GPUDirect Technology Stack

GPUDirect Storage LOCAL

Direct DMA between local NVMe and GPU memory. Minimal CPU in bulk transfer path after setup.

Use case: Checkpoint read/write, dataset loading

GPUDirect RDMA NETWORK

Direct DMA between GPU memory and remote node via InfiniBand/RoCE.

Use case: NCCL collectives, distributed checkpointing

GPUDirect P2P INTRA-NODE

Direct GPU↔GPU transfers via NVLink or PCIe.

Use case: Tensor parallelism, pipeline parallelism
6. Distributed Checkpoint Patterns

Pattern 1: Aggregated Checkpointing

Rank 0 holds (or gathers) the full model state and is the only writer. Simple and avoids storage contention; best when the full state fits on one rank, e.g. DDP-replicated models.

Python
import torch
import torch.distributed as dist

def save_checkpoint_aggregated(model, optimizer, path):
    # Under DDP the state is replicated, so rank 0's copy is complete
    state = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }

    if dist.get_rank() == 0:
        # Only rank 0 writes - avoids storage contention
        torch.save(state, path)

    dist.barrier()  # Hold all ranks until the write completes

Pattern 2: Sharded Checkpointing

Each rank writes its own shard in parallel, scaling write bandwidth with rank count. Suited to very large models whose state is already sharded (e.g., FSDP or tensor parallelism).

Python
import torch.distributed as dist
import kvikio  # Python binding for cuFile/GDS

def save_checkpoint_sharded(model, optimizer, base_path):
    rank = dist.get_rank()
    shard_path = f"{base_path}/shard_{rank}.pt"

    # get_local_state / get_local_optimizer_state are placeholders for
    # framework-specific shard extraction (e.g., FSDP sharded state dicts)
    state = {
        'model_shard': get_local_state(model),
        'optimizer_shard': get_local_optimizer_state(optimizer),
    }

    # Direct GPU → NVMe write; serialize() is a placeholder that must
    # produce a device buffer for the GDS path to apply
    with kvikio.CuFile(shard_path, "w") as f:
        f.write(serialize(state))

    dist.barrier()
7. Cluster Storage Topology

[Diagram: Node 1 and Node 2, each with 8x H100 GPUs and local NVMe, linked over IB NDR to shared storage (Lustre / WekaFS as an NVMe-oF target). Training: local NVMe + GDS for speed. Checkpoints: shared storage for durability.]
Best Practice Use local NVMe with GDS for training data (maximum speed), and shared storage (Lustre/WekaFS) for checkpoints (durability and accessibility across nodes).