C.4 • Deploy Phase

Architecture & Topology

PCIe Gen5/6 impact, NUMA topology deep dive, multi-vendor GPU support, GPU interconnects, and distributed checkpoint patterns.

1. PCIe Gen5/Gen6 Impact

⚡ Bandwidth Doubling PCIe Gen5 doubles bandwidth vs Gen4 (16 GT/s → 32 GT/s per lane). Gen6 doubles again to 64 GT/s using PAM4 signaling (silicon 2025). This changes the GPU-storage balance—but there are caveats.

PCIe Generation Comparison

Generation   Per-Lane Rate   x4 NVMe BW   x16 GPU BW   Availability
PCIe 4.0     16 GT/s         ~7 GB/s      ~32 GB/s     Ubiquitous
PCIe 5.0     32 GT/s         ~14 GB/s     ~64 GB/s     Server (2023+)
PCIe 6.0     64 GT/s         ~28 GB/s     ~128 GB/s    2025 (emerging)
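The table's bandwidth figures follow from the transfer rate and line encoding. A quick sanity check (Gen4/Gen5 use 128b/130b encoding; Gen6's PAM4 + FLIT efficiency differs slightly, so treat the same formula as approximate there):

```python
def lane_gb_s(gt_s, encoding_eff=128 / 130):
    """GB/s per lane per direction: transfer rate x encoding efficiency / 8 bits."""
    return gt_s * encoding_eff / 8

def link_gb_s(gt_s, lanes):
    """Raw link bandwidth before protocol (TLP/DLLP) overhead."""
    return lane_gb_s(gt_s) * lanes

print(round(link_gb_s(16, 4), 1))    # 7.9  - Gen4 x4 NVMe, ~7 GB/s after overhead
print(round(link_gb_s(32, 16), 1))   # 63.0 - Gen5 x16 GPU link, ~64 GB/s nominal
```

Protocol overhead (headers, flow control) trims a further 5-10%, which is why a Gen4 x4 SSD delivers ~7 GB/s rather than the raw 7.9 GB/s.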

Benefits vs Caveats

Benefits

  • Single NVMe can saturate older GPU links
  • Fewer SSDs needed for same throughput
  • Better GPU-to-SSD bandwidth ratio
  • Lower PCIe slot count requirements
  • Enables larger DMA transfers efficiently

Caveats

  • NAND is still NAND—latency unchanged
  • Internal SSD parallelism must increase
  • Power consumption increases
  • Signal integrity challenges (higher rates force shorter traces)
  • Retimers may add latency

📋 Planning Guidance
  • 2024: PCIe Gen4 NVMe is cost-effective. Gen5 SSDs available but premium-priced.
  • 2025: Gen5 SSDs mainstream. Fewer drives, simpler topologies.
  • 2026-27: Gen6 silicon expected. Single-SSD 25+ GB/s. CXL 3.0 may shift architecture.

PCIe Topology Matters

❌ Bad: Through CPU
GPU
↓ x16
CPU (Root Complex)
↓ x4
NVMe SSD
+10-20 µs latency, CPU BW consumed
✓ Good: PCIe Switch
GPU
↓ x16
PCIe Switch
↓ x4
NVMe SSD
Direct P2P DMA, lowest latency
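Whether a GPU and an NVMe share a switch is visible in sysfs: the resolved device path lists every bridge on the route. A sketch of the check — the sample paths and the two-shared-bridge heuristic are illustrative; real paths come from `os.path.realpath('/sys/bus/pci/devices/<bdf>')`:

```python
import re

# A BDF-shaped path component, e.g. "0000:40:01.1"
BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

def shares_pcie_switch(sysfs_path_a, sysfs_path_b):
    """True when two sysfs device paths share a bridge below the root port,
    i.e. the devices hang off a common PCIe switch."""
    def bridges(path):
        # Every BDF-shaped component except the endpoint itself is a bridge
        return [p for p in path.split("/") if BDF.match(p)][:-1]
    shared = [a for a, b in zip(bridges(sysfs_path_a), bridges(sysfs_path_b))
              if a == b]
    # One shared component is just the root port; two or more implies a
    # common switch upstream port on the route
    return len(shared) >= 2

# Hypothetical paths: GPU and SSD behind the same switch
gpu = "/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:04.0/0000:46:00.0"
ssd = "/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:08.0/0000:47:00.0"
print(shares_pcie_switch(gpu, ssd))   # True
```

If the function returns False for a GPU/SSD pair, their traffic is routed through the CPU root complex — the "Bad" topology above.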
2. NUMA & PCIe Topology Deep Dive

🚨 CRITICAL Incorrect NUMA/PCIe topology is the #1 cause of unexplained performance degradation. A path that crosses NUMA nodes adds 2-10µs per I/O and can reduce throughput by 30-50%.

Understanding NUMA Topology

Bash
# Check NUMA topology with numactl
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-31
node 0 size: 256000 MB
node 1 cpus: 32-63
node 1 size: 256000 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

# Key insight: Distance 21 vs 10 means ~2x latency for cross-NUMA

NVIDIA GPU Topology Matrix

Bash
# nvidia-smi topo -m shows GPU-to-device relationships
$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  nvme0   NUMA
GPU0     X      NV12    NV12    NV12    PIX     NODE    0
GPU1    NV12     X      NV12    NV12    NODE    NODE    0
GPU2    NV12    NV12     X      NV12    SYS     SYS     1
GPU3    NV12    NV12    NV12     X      SYS     SYS     1
nvme0   NODE    NODE    SYS     SYS     NODE     X      0

Legend:
  PIX  = Same PCIe switch (FASTEST: <1µs)
  PXB  = Multiple PCIe bridges, no host bridge (FAST: ~1µs)
  PHB  = Through the PCIe host bridge (FAST: +1-2µs)
  NODE = Same NUMA node, crosses host bridges (MEDIUM: +2-5µs)
  SYS  = Crosses the NUMA/socket interconnect via QPI/UPI (SLOW: +5-10µs)
⚡ Reading the Matrix Look at GPU→nvme relationships. PIX or PXB is optimal. NODE means the path crosses host bridges within one NUMA node (2-5µs penalty). SYS means it crosses the QPI/UPI interconnect between NUMA nodes (5-10µs penalty). Always map GPUs to SSDs on the same NUMA node!

Cross-NUMA Penalty

Topology                     Relationship   Latency    BW Impact   Verdict
Same PCIe switch             PIX / PXB      +0.5-1µs   ~100%       ✓ Ideal
Same host bridge             PHB            +1-2µs     ~95%        ✓ Good
Same NUMA, diff host bridge  NODE           +2-5µs     70-85%      ⚠ Avoid
Cross-NUMA/socket (QPI/UPI)  SYS            +5-10µs    50-70%      ✗ Never

PCIe ACS Configuration

Bash
# Check ACS status on PCIe bridges
$ sudo lspci -vvv | grep -iA2 "Access Control"
    Capabilities: [148 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpFwd+
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpFwd-
        
# If ACS blocks P2P, disable it (boot parameter)
GRUB_CMDLINE_LINUX="pcie_acs_override=downstream,multifunction"

# Verify GPU P2P is working (GPU-GPU only; the NVMe path is verified
# with gdscheck, covered later in this section)
$ nvidia-smi topo -p2p r
        GPU0    GPU1
GPU0     X      OK      ← Should show "OK"
GPU1    OK       X
⚠️ ACS Security Trade-off Disabling ACS enables P2P but weakens IOMMU isolation. In multi-tenant environments, consider dedicated PCIe switches rather than disabling ACS globally.
🚨 Production Rule Always run nvidia-smi topo -m before deploying any GPU storage workload. If you see "SYS" between a GPU and its intended NVMe, STOP and fix the topology. A 30-50% performance loss is guaranteed.
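That production rule can be enforced mechanically in a pre-deployment gate. A sketch that scans a topology matrix formatted like the sample above (the function name and the inline sample are illustrative):

```python
def flag_bad_paths(topo_matrix):
    """Return (device, peer) pairs whose link is SYS and involve an NVMe."""
    rows = [line.split() for line in topo_matrix.strip().splitlines()]
    headers = rows[0]
    bad = []
    for row in rows[1:]:
        dev = row[0]
        for peer, link in zip(headers, row[1:]):
            if link == "SYS" and ("nvme" in dev or "nvme" in peer):
                bad.append((dev, peer))
    return bad

matrix = """        GPU0    GPU1    GPU2    nvme0   NUMA
GPU0     X      NV12    NV12    NODE    0
GPU1    NV12     X      NV12    NODE    0
GPU2    NV12    NV12     X      SYS     1
nvme0   NODE    NODE    SYS      X      0"""
print(flag_bad_paths(matrix))   # [('GPU2', 'nvme0'), ('nvme0', 'GPU2')]
```

An empty result means no GPU is paired with an NVMe across the socket interconnect; a non-empty result should fail the deployment.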

PCIe P2P BAR Mapping Deep Dive

🔧 The Actual Mechanism GPUDirect Storage works by mapping NVMe's Base Address Registers (BARs) into GPU-accessible address space. Understanding BARs is essential for debugging P2P failures.

BAR Types in GPU-Storage P2P

BAR               Purpose                         Size          P2P Role
BAR0 (NVMe)       Controller registers            16KB typical  Doorbell access for submission
BAR1 (GPU)        GPU framebuffer aperture        256MB - 64GB  Target for NVMe DMA writes
BAR2/BAR4 (NVMe)  Controller Memory Buffer (CMB)  0 - 128MB     Optional: SQ/CQ in CMB
Bash - Inspect BAR Configuration
# View GPU BAR configuration
$ lspci -vvv -s 41:00.0 | grep -A5 "Region"
    Region 0: Memory at fb00000000 (64-bit, prefetchable) [size=256M]   # Config
    Region 1: Memory at e000000000 (64-bit, prefetchable) [size=32G]    # BAR1 - Framebuffer
    Region 3: Memory at fc02000000 (64-bit, prefetchable) [size=32M]    # BAR2

# View NVMe BAR configuration
$ lspci -vvv -s 01:00.0 | grep -A3 "Region"
    Region 0: Memory at fb200000 (64-bit, non-prefetchable) [size=16K]  # Controller

# Check if GPU BAR1 is large enough for P2P
$ nvidia-smi --query-gpu=name,memory.total,bar1.total --format=csv
name, memory.total [MiB], BAR1.total [MiB]
NVIDIA H100 80GB HBM3, 81559 MiB, 131072 MiB   # 128GB BAR1 = good for P2P

# Verify P2P BAR1 mapping works
$ cat /proc/driver/nvidia/gpus/0000:41:00.0/information
GPU UUID: GPU-xxxxx
BAR1 Size: 128 GB
P2P Capable: Yes
⚠️ BAR1 Size Matters
  • Small BAR1 (256MB): Only config access, no direct P2P data transfers
  • Large BAR1 (≥8GB): Required for GDS/P2P data transfers
  • Resizable BAR: Enable in BIOS ("Above 4G Decoding" + "Resizable BAR")
Bash - Enable Resizable BAR
# Check current BAR size
$ nvidia-smi --query-gpu=bar1.total --format=csv,noheader
256 MiB    # Too small for P2P!

# Enable in BIOS:
# 1. Advanced → PCI Subsystem Settings → Above 4G Decoding: Enabled
# 2. Advanced → PCI Subsystem Settings → Resizable BAR Support: Enabled

# After reboot, verify
$ nvidia-smi --query-gpu=bar1.total --format=csv,noheader
131072 MiB   # 128GB - P2P enabled!

# Kernel parameter for older systems
GRUB_CMDLINE_LINUX="pci=realloc,assign-busses"
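A small helper can turn the BAR1 query into a go/no-go check. A sketch — the 8 GB floor follows the sizing guidance above and is a deployment policy, not a hard driver limit:

```python
def bar1_sufficient(csv_value, min_mib=8192):
    """csv_value is one line of `nvidia-smi --query-gpu=bar1.total
    --format=csv,noheader` output, e.g. '131072 MiB'."""
    mib = int(csv_value.strip().split()[0])
    return mib >= min_mib

print(bar1_sufficient("256 MiB"))      # False - config-only BAR, no P2P data path
print(bar1_sufficient("131072 MiB"))   # True  - resizable BAR enabled
```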

IOMMU & VFIO Deep Dive

🚨 IOMMU: Friend or Foe? IOMMU provides memory isolation but can BLOCK P2P transfers. Understanding IOMMU groups and bypass mechanisms is critical for GDS deployments.

IOMMU Modes for GPU-Storage

Mode               P2P Status      Security  Use Case
IOMMU Off          Works           None      Dedicated bare-metal
IOMMU Passthrough  Works           Partial   Recommended for GDS
IOMMU Strict       Blocked         Full      Multi-tenant, VMs
VFIO + P2P         Requires setup  Full      VM passthrough with P2P
Bash - IOMMU Configuration
# Check IOMMU status
$ dmesg | grep -i iommu
[    0.000000] DMAR: IOMMU enabled
[    0.123456] AMD-Vi: IOMMU performance counters supported

# View IOMMU groups (devices in same group can P2P)
$ for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=$(basename $(dirname $(dirname $d)))
    echo "IOMMU Group $n: $(lspci -nns ${d##*/})"
done | grep -E "NVIDIA|NVMe"

IOMMU Group 15: 41:00.0 3D controller: NVIDIA H100 [10de:2330]
IOMMU Group 15: 42:00.0 NVMe: Samsung PM9A3 [144d:a80a]   # Same group = P2P OK
IOMMU Group 28: 81:00.0 NVMe: Intel P5800X [8086:0a54]    # Different group!

# Enable IOMMU passthrough mode (recommended for GDS)
# /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"   # Intel
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"     # AMD

# Apply and reboot
$ sudo update-grub && sudo reboot
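The group listing can also be checked programmatically. A sketch that parses lines shaped like the output above (`same_iommu_group` is a hypothetical helper, not part of any tool):

```python
import re

def same_iommu_group(listing, bdf_a, bdf_b):
    """True when both BDFs appear in the same IOMMU group - a prerequisite
    for P2P under VFIO and for passing both devices to one VM."""
    groups = {}
    for m in re.finditer(r"IOMMU Group (\d+): ([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])",
                         listing):
        groups[m.group(2)] = int(m.group(1))
    a, b = groups.get(bdf_a), groups.get(bdf_b)
    return a is not None and a == b

listing = """IOMMU Group 15: 41:00.0 3D controller: NVIDIA H100
IOMMU Group 15: 42:00.0 NVMe: Samsung PM9A3
IOMMU Group 28: 81:00.0 NVMe: Intel P5800X"""
print(same_iommu_group(listing, "41:00.0", "42:00.0"))   # True  - same group
print(same_iommu_group(listing, "41:00.0", "81:00.0"))   # False - different groups
```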
Bash - VFIO Setup for GPU Passthrough with P2P
# Bind GPU and NVMe to VFIO (for VM passthrough)

# 1. Load VFIO modules
$ sudo modprobe vfio-pci

# 2. Unbind from native drivers
$ echo "0000:41:00.0" | sudo tee /sys/bus/pci/devices/0000:41:00.0/driver/unbind
$ echo "0000:42:00.0" | sudo tee /sys/bus/pci/devices/0000:42:00.0/driver/unbind

# 3. Bind to VFIO
$ echo "10de 2330" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id  # GPU
$ echo "144d a80a" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id  # NVMe

# 4. Verify VFIO binding
$ ls -la /dev/vfio/
total 0
drwxr-xr-x  2 root root      80 Dec 29 10:00 .
drwxr-xr-x 21 root root    4200 Dec 29 10:00 ..
crw-------  1 root root 243, 0 Dec 29 10:00 15   # IOMMU group 15
crw-rw-rw-  1 root root 10, 196 Dec 29 10:00 vfio

# 5. QEMU/KVM command for passthrough with P2P
$ qemu-system-x86_64 \
    -device vfio-pci,host=41:00.0,multifunction=on \
    -device vfio-pci,host=42:00.0 \
    -machine q35,accel=kvm,kernel-irqchip=split \
    -cpu host \
    ...
⚡ P2P in VMs: The Trick
  • GPU and NVMe must be in the same IOMMU group
  • Use pcie_acs_override if they're not (security trade-off)
  • Pass both devices to the same VM
  • Inside VM: standard GDS setup works
Bash - Debugging P2P Failures
# P2P diagnostic checklist

# 1. Check if P2P is possible
$ nvidia-smi topo -p2p r
        GPU0    nvme0n1
GPU0     X      OK      # OK = P2P possible
nvme0   OK       X       # NS = Not Supported

# 2. If "NS", check IOMMU groups (sysfs uses the full 0000: domain prefix)
$ find /sys/kernel/iommu_groups -name "0000:41:00.0" -o -name "0000:42:00.0"
/sys/kernel/iommu_groups/15/devices/0000:41:00.0
/sys/kernel/iommu_groups/28/devices/0000:42:00.0   # Different groups!

# 3. Check ACS on path
$ lspci -vvv -s 00:01.0 | grep -A10 "Access Control"
    ACSCtl: SrcValid+ TransBlk+   # Enabled = blocks P2P

# 4. Check BAR sizes
$ nvidia-smi --query-gpu=bar1.total --format=csv
131072 MiB   # Need >256MB for P2P

# 5. Check PCIe link
$ lspci -vvv -s 41:00.0 | grep -E "LnkCap|LnkSta"
    LnkCap: Speed 16GT/s, Width x16
    LnkSta: Speed 16GT/s, Width x16   # Matches = good

# 6. GDS-specific check
$ /usr/local/cuda/gds/tools/gdscheck -p
GDS cuFile configuration:
    Properties File: /etc/cufile.json
    Platform compatibility: SUPPORTED
    P2P support: ENABLED
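The checklist's final gate can be scripted too. A sketch keyed to the gdscheck output fields shown above (`gds_ready` is a hypothetical helper):

```python
def gds_ready(gdscheck_output):
    """True when `gdscheck -p` output reports both platform support and P2P."""
    return ("Platform compatibility: SUPPORTED" in gdscheck_output
            and "P2P support: ENABLED" in gdscheck_output)

report = """GDS cuFile configuration:
    Properties File: /etc/cufile.json
    Platform compatibility: SUPPORTED
    P2P support: ENABLED"""
print(gds_ready(report))   # True
```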
3. Multi-Vendor GPU Storage

⚡ Beyond NVIDIA While GDS is NVIDIA-specific, AMD and Intel GPUs have their own direct storage paths. Production deployments increasingly need vendor-neutral strategies.

AMD ROCm vs NVIDIA GDS

Feature             NVIDIA GDS         AMD ROCm
Direct Storage API  cuFile API         hipMemcpy + RDMA
P2P DMA             GPUDirect Storage  PCIe P2P (ROCm 5.0+)
RDMA Support        GPUDirect RDMA     ROCm RDMA
Max Throughput      ~14 GB/s (H100)    ~12 GB/s (MI300X)

Intel GPU Storage

Intel GPU                 Storage Path                Status
Data Center GPU Max 1550  oneAPI Level Zero + DMAbuf  Limited support
Flex Series               Standard PCIe DMA           Available
Arc (Consumer)            System memory only          Not supported

Vendor-Neutral Strategy

Python
# RAPIDS kvikio: Vendor-neutral GPU I/O library
import cupy
import kvikio

# Destination buffer in GPU memory
gpu_buffer = cupy.empty(1024 ** 2, dtype=cupy.uint8)

# Automatically uses GDS on NVIDIA; falls back to a POSIX bounce path elsewhere
with kvikio.CuFile("/data/model.bin", "r") as f:
    nbytes = f.read(gpu_buffer)   # returns the number of bytes read
4. GPU Interconnect Technologies

Technology      Bandwidth                 Latency    Scope                  Storage Use
NVLink 4.0      900 GB/s (bidirectional)  ~0.5-1 µs  GPU-to-GPU             Tensor transfer
NVSwitch        900 GB/s all-to-all       ~1 µs      Intra-node fabric      Checkpoint aggregation
PCIe Gen5 x16   64 GB/s                   ~2-3 µs    GPU↔NVMe, GPU↔NIC      GDS, GPUDirect RDMA
InfiniBand NDR  400 Gbps (50 GB/s)        ~1-2 µs    Inter-node             NVMe-oF, RDMA storage
RoCEv2          400 Gbps                  ~2-5 µs    Inter-node (Ethernet)  NVMe-oF, cloud
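The table's storage implications become concrete with a back-of-envelope transfer time. A sketch — the 80 GiB shard size is illustrative and protocol overhead is ignored:

```python
def transfer_seconds(size_gib, link_gb_s):
    """Lower-bound time to move size_gib GiB over a link_gb_s GB/s link."""
    return size_gib * (1024 ** 3) / (link_gb_s * 1e9)

# Moving an 80 GiB checkpoint shard over each fabric:
print(round(transfer_seconds(80, 900), 2))   # 0.1  - NVLink 4.0
print(round(transfer_seconds(80, 64), 2))    # 1.34 - PCIe Gen5 x16
print(round(transfer_seconds(80, 50), 2))    # 1.72 - InfiniBand NDR
```

This is why checkpoint aggregation over NVLink/NVSwitch followed by a single GDS write often beats having every rank push its state over the inter-node fabric.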
5. GPUDirect Technology Stack

GPUDirect Storage LOCAL

Direct DMA between local NVMe and GPU memory. Minimal CPU in bulk transfer path after setup.

Use case: Checkpoint read/write, dataset loading

GPUDirect RDMA NETWORK

Direct DMA between GPU memory and remote node via InfiniBand/RoCE.

Use case: NCCL collectives, distributed checkpointing

GPUDirect P2P INTRA-NODE

Direct GPU↔GPU transfers via NVLink or PCIe.

Use case: Tensor parallelism, pipeline parallelism
6. Distributed Checkpoint Patterns

Pattern 1: Aggregated Checkpointing

Rank 0 holds (or gathers) the full model state and is the only writer. Simple and avoids storage contention; best when the full state fits on one rank, e.g. DDP-replicated models.

Python
import torch
import torch.distributed as dist

def save_checkpoint_aggregated(model, optimizer, path):
    # Under DDP the state is replicated, so rank 0's copy is complete
    state = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }

    if dist.get_rank() == 0:
        # Only rank 0 writes - avoids storage contention
        torch.save(state, path)

    dist.barrier()  # Hold all ranks until the write completes

Pattern 2: Sharded Checkpointing

Each rank writes its own shard in parallel, scaling write bandwidth with rank count. Suited to very large models whose state is already sharded (e.g., FSDP or tensor parallelism).

Python
import torch.distributed as dist
import kvikio  # Python binding for cuFile/GDS

def save_checkpoint_sharded(model, optimizer, base_path):
    rank = dist.get_rank()
    shard_path = f"{base_path}/shard_{rank}.pt"

    # get_local_state / get_local_optimizer_state are placeholders for
    # framework-specific shard extraction (e.g., FSDP sharded state dicts)
    state = {
        'model_shard': get_local_state(model),
        'optimizer_shard': get_local_optimizer_state(optimizer),
    }

    # Direct GPU → NVMe write; serialize() is a placeholder that must
    # produce a device buffer for the GDS path to apply
    with kvikio.CuFile(shard_path, "w") as f:
        f.write(serialize(state))

    dist.barrier()
7. Cluster Storage Topology

[Diagram: Node 1 and Node 2, each with 8x H100 GPUs and local NVMe, linked over IB NDR to shared storage (Lustre / WekaFS as an NVMe-oF target). Training: local NVMe + GDS for speed. Checkpoints: shared storage for durability.]
Best Practice Use local NVMe with GDS for training data (maximum speed), and shared storage (Lustre/WekaFS) for checkpoints (durability and accessibility across nodes).