
The PCIe Hierarchy

[Diagram: a CPU with integrated root complex feeds two PCIe switches. GPU 0 and NVMe 0 sit under one switch, GPU 1 and NVMe 1 under the other. P2P works between devices on the same switch (✓); cross-switch P2P must route through the CPU (✗).]
Root Complex (RC)
CPU-integrated PCIe controller. All transactions originate or terminate here unless using P2P.
PCIe Switch
Expands PCIe lanes. Critical feature: can route P2P transactions between ports WITHOUT going to CPU.
Endpoint
Leaf device (GPU, NVMe, NIC). Has BARs (Base Address Registers) for memory-mapped I/O.
Upstream Port
Switch port facing toward the root complex (CPU).
Downstream Port
Switch port facing toward endpoints (devices).
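An endpoint's BARs are visible from userspace via sysfs without any vendor tooling: each line of a device's `resource` file holds a start address, end address, and flags word. A minimal sketch of decoding one (the BDF and the address values below are placeholders, not from a real device):

```shell
# One line of /sys/bus/pci/devices/<BDF>/resource has the form "start end flags".
# Placeholder line for a 16 MiB memory BAR:
line="0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200"

set -- $line
start=$1; end=$2
# BAR size = end - start + 1 (shell arithmetic accepts the 0x prefix)
size=$(( end - start + 1 ))
echo "BAR size: $(( size / 1048576 )) MiB"   # prints: BAR size: 16 MiB
```

On a live system, replace the placeholder line with `head -1 /sys/bus/pci/devices/0000:3a:00.0/resource` (or whichever BAR index you care about).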

Peer-to-Peer (P2P) Transfers

P2P allows devices to transfer data directly, avoiding extra copies through CPU/system memory. This is critical for GPUDirect Storage.

❌ Without P2P
NVMe → CPU Memory → GPU

2 PCIe traversals
2 memory copies
CPU involved
✓ With P2P
NVMe → PCIe Switch → GPU

1 PCIe traversal
0 memory copies
CPU bypassed in bulk data path (control stays on CPU)
✓ P2P Works When:
  • GPU and NVMe are under the same PCIe switch
  • ACS (Access Control Services) is disabled on the path
  • IOMMU allows the transaction (or is disabled)
  • Both devices support P2P (most modern ones do)
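The first condition can be checked from sysfs: a device's resolved sysfs path encodes the whole bridge chain above it, so two endpoints on the same bus share a parent directory. A sketch, with hypothetical BDFs (substitute the addresses `lspci` reports for your GPU and NVMe; for a multi-port switch, where each downstream port is its own bridge, compare one level further up):

```shell
# Hypothetical BDFs -- replace with your own from lspci
gpu=0000:02:00.0
nvme=0000:02:01.0

# Resolved paths look like:
#   /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/0000:02:00.0
# The dirname is the bridge directly above the endpoint.
gpu_parent=$(dirname "$(readlink -f /sys/bus/pci/devices/$gpu)")
nvme_parent=$(dirname "$(readlink -f /sys/bus/pci/devices/$nvme)")

if [ "$gpu_parent" = "$nvme_parent" ]; then
    echo "same downstream bus: P2P candidate"
else
    echo "different buses: P2P may route through the root complex"
fi
```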

ACS: The P2P Killer

⚠️ Access Control Services (ACS)

ACS can prevent endpoint-to-endpoint P2P on some platforms depending on ACSCtl bits and topology.

Check ACS Status

$ lspci -vvv | grep -iA2 "access control"
        Capabilities: [100] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

ACSCtl showing - for every flag means ACS is disabled on that port (good for P2P). Note that ACSCap only advertises what the hardware supports; it is ACSCtl that matters. Any + in ACSCtl means the port can redirect peer requests up to the root complex, which blocks or severely degrades direct P2P.
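Since the rule is simply "any + in ACSCtl is bad for P2P", the check scripts easily. A sketch using an embedded sample line (on a real machine, feed it the ACSCtl line from `lspci -vvv` for each port on the path):

```shell
# Sample ACSCtl line as printed by lspci -vvv (ACS fully disabled)
ctl="ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-"

# Any "+" after a flag name means that ACS control is enabled on the port
if echo "$ctl" | grep -q '+'; then
    echo "ACS active: P2P likely redirected through the root complex"
else
    echo "ACS disabled: P2P path clear"
fi
```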

Disable ACS (If Needed)

# Add to the kernel command line (GRUB)
pci=noacs

# Or disable at runtime for a specific device
$ setpci -s 0000:3a:00.0 ECAP_ACS+6.w=0000
💡 Production Note

In virtualized environments (VMs, containers with device passthrough), you may need ACS for security. This creates a fundamental tension: security vs. performance. Some organizations use dedicated bare-metal nodes for GPU training to avoid this tradeoff.

Discovering Your Topology

nvidia-smi topo

$ nvidia-smi topo -m
        GPU0   GPU1   GPU2   GPU3   NVMe0  NVMe1  CPU
GPU0    X      NV12   SYS    SYS    PHB    NODE   SYS
GPU1    NV12   X      SYS    SYS    NODE   PHB    SYS
GPU2    SYS    SYS    X      NV12   SYS    SYS    PHB
GPU3    SYS    SYS    NV12   X      SYS    SYS    PHB
PHB
PCIe Host Bridge — same CPU socket, different root port. P2P possible.
NODE
Same NUMA node. May involve PCIe switch. Check with lspci.
SYS
Cross-socket via QPI/UPI. P2P unlikely to work efficiently.
NV##
NVLink connection (GPU-to-GPU only). The number counts the bonded NVLink links between the pair; NV12 = a bond of 12 links.
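The matrix parses cleanly with standard tools, which is handy in placement scripts. A sketch using an embedded sample row in the layout above (real input comes from `nvidia-smi topo -m`):

```shell
# Sample GPU0 row from the matrix above (placeholder data)
row="GPU0  X  NV12  SYS  SYS  PHB  NODE  SYS"

# Field 1 is the device name, field 2 the 'X' self-entry,
# so field 3 is the GPU0<->GPU1 link type.
link=$(echo "$row" | awk '{print $3}')
case "$link" in
  NV*)      echo "GPU0-GPU1: NVLink ($link)";;
  PHB|NODE) echo "GPU0-GPU1: PCIe path ($link)";;
  SYS)      echo "GPU0-GPU1: cross-socket";;
esac
# prints: GPU0-GPU1: NVLink (NV12)
```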

lspci Tree View

$ lspci -tv
-[0000:00]-+-00.0  Intel Root Complex
           +-1c.0-[01-02]----00.0  PCIe Switch
           |           \-[02]--+-00.0  NVIDIA GPU
           |                   \-01.0  Samsung NVMe
           \-1d.0-[03]----00.0  Intel NIC

GPU and NVMe under same switch [01-02] = P2P should work (if ACS disabled).

NUMA and Storage Affinity

In multi-socket systems, each CPU has its own PCIe lanes. Accessing storage on the "wrong" NUMA node adds latency.

✓ Local Access

GPU → Local PCIe → NVMe

~2-3 μs latency

✗ Remote Access

GPU → QPI/UPI → Remote PCIe → NVMe

~5-8 μs latency (+100-200%)

# Check the NUMA node for a device
$ cat /sys/bus/pci/devices/0000:3a:00.0/numa_node
0

# nvidia-smi topo -m also reports per-GPU NUMA/CPU affinity
$ nvidia-smi topo -m
💡 Best Practice

For GPUDirect Storage, ensure GPU and NVMe are on the same NUMA node AND under the same PCIe switch. This is a hardware/BIOS configuration decision — plan it before deployment.
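The NUMA half of this check can be automated with the sysfs `numa_node` attribute shown earlier. A sketch, with hypothetical BDFs to substitute for your own (a value of -1 means the platform did not report an affinity):

```shell
# Hypothetical BDFs -- replace with your GPU and NVMe addresses
gpu=0000:3a:00.0
nvme=0000:3b:00.0

gpu_node=$(cat /sys/bus/pci/devices/$gpu/numa_node)
nvme_node=$(cat /sys/bus/pci/devices/$nvme/numa_node)

if [ "$gpu_node" = "$nvme_node" ]; then
    echo "GPU and NVMe share NUMA node $gpu_node: good for GDS"
else
    echo "NUMA mismatch ($gpu_node vs $nvme_node): expect QPI/UPI hops"
fi
```

Run it for every GPU/NVMe pairing you intend to use; combined with the same-switch check, it tells you whether a node meets both placement conditions before any workload lands on it.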