
The PCIe Hierarchy

[Diagram: a CPU with integrated root complex feeds two PCIe switches. GPU 0 and NVMe 0 sit under one switch, GPU 1 and NVMe 1 under the other. P2P works between devices on the same switch (✓); cross-switch P2P must route through the CPU (✗).]
Root Complex (RC)
CPU-integrated PCIe controller. All transactions originate or terminate here unless using P2P.
PCIe Switch
Expands PCIe lanes. Critical feature: can route P2P transactions between ports WITHOUT going to CPU.
Endpoint
Leaf device (GPU, NVMe, NIC). Has BARs (Base Address Registers) for memory-mapped I/O.
Upstream Port
Switch port facing toward the root complex (CPU).
Downstream Port
Switch port facing toward endpoints (devices).
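An endpoint's BARs are visible from userspace via sysfs without any vendor tooling: each line of a device's `resource` file holds a start address, end address, and flags word. A minimal sketch of decoding one (the BDF and the address values below are placeholders, not from a real device):

```shell
# One line of /sys/bus/pci/devices/<BDF>/resource has the form "start end flags".
# Placeholder line for a 16 MiB memory BAR:
line="0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200"

set -- $line
start=$1; end=$2
# BAR size = end - start + 1 (shell arithmetic accepts the 0x prefix)
size=$(( end - start + 1 ))
echo "BAR size: $(( size / 1048576 )) MiB"   # prints: BAR size: 16 MiB
```

On a live system, replace the placeholder line with `head -1 /sys/bus/pci/devices/0000:3a:00.0/resource` (or whichever BAR index you care about).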

Peer-to-Peer (P2P) Transfers

P2P allows devices to transfer data directly, avoiding extra copies through CPU/system memory. This is critical for GPUDirect Storage.

❌ Without P2P
NVMe → CPU Memory → GPU

2 PCIe traversals
2 memory copies
CPU involved
✓ With P2P
NVMe → PCIe Switch → GPU

1 PCIe traversal
0 memory copies
CPU bypassed in bulk data path (control stays on CPU)
✓ P2P Works When:
  • GPU and NVMe are under the same PCIe switch
  • ACS (Access Control Services) is disabled on the path
  • IOMMU allows the transaction (or is disabled)
  • Both devices support P2P (most modern ones do)
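The first condition can be checked from sysfs: a device's resolved sysfs path encodes the whole bridge chain above it, so two endpoints on the same bus share a parent directory. A sketch, with hypothetical BDFs (substitute the addresses `lspci` reports for your GPU and NVMe; for a multi-port switch, where each downstream port is its own bridge, compare one level further up):

```shell
# Hypothetical BDFs -- replace with your own from lspci
gpu=0000:02:00.0
nvme=0000:02:01.0

# Resolved paths look like:
#   /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/0000:02:00.0
# The dirname is the bridge directly above the endpoint.
gpu_parent=$(dirname "$(readlink -f /sys/bus/pci/devices/$gpu)")
nvme_parent=$(dirname "$(readlink -f /sys/bus/pci/devices/$nvme)")

if [ "$gpu_parent" = "$nvme_parent" ]; then
    echo "same downstream bus: P2P candidate"
else
    echo "different buses: P2P may route through the root complex"
fi
```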

ACS: The P2P Killer

⚠️ Access Control Services (ACS)

ACS can prevent endpoint-to-endpoint P2P on some platforms depending on ACSCtl bits and topology.

Check ACS Status

$ lspci -vvv | grep -iA2 "access control"
        Capabilities: [100] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

ACSCtl showing - for every flag means ACS is disabled on that port (good for P2P). Note that ACSCap only advertises what the hardware supports; it is ACSCtl that matters. Any + in ACSCtl means the port can redirect peer requests up to the root complex, which blocks or severely degrades direct P2P.
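Since the rule is simply "any + in ACSCtl is bad for P2P", the check scripts easily. A sketch using an embedded sample line (on a real machine, feed it the ACSCtl line from `lspci -vvv` for each port on the path):

```shell
# Sample ACSCtl line as printed by lspci -vvv (ACS fully disabled)
ctl="ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-"

# Any "+" after a flag name means that ACS control is enabled on the port
if echo "$ctl" | grep -q '+'; then
    echo "ACS active: P2P likely redirected through the root complex"
else
    echo "ACS disabled: P2P path clear"
fi
```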

Disable ACS (If Needed)

# Add to the kernel command line (GRUB)
pci=noacs

# Or disable at runtime for a specific device
$ setpci -s 0000:3a:00.0 ECAP_ACS+6.w=0000
💡 Production Note

In virtualized environments (VMs, containers with device passthrough), you may need ACS for security. This creates a fundamental tension: security vs. performance. Some organizations use dedicated bare-metal nodes for GPU training to avoid this tradeoff.

Discovering Your Topology

nvidia-smi topo

$ nvidia-smi topo -m
        GPU0   GPU1   GPU2   GPU3   NVMe0  NVMe1  CPU
GPU0    X      NV12   SYS    SYS    PHB    NODE   SYS
GPU1    NV12   X      SYS    SYS    NODE   PHB    SYS
GPU2    SYS    SYS    X      NV12   SYS    SYS    PHB
GPU3    SYS    SYS    NV12   X      SYS    SYS    PHB
PHB
PCIe Host Bridge — same CPU socket, different root port. P2P possible.
NODE
Same NUMA node. May involve PCIe switch. Check with lspci.
SYS
Cross-socket via QPI/UPI. P2P unlikely to work efficiently.
NV##
NVLink connection (GPU-to-GPU only). The number counts the bonded NVLink links between the pair; NV12 = a bond of 12 links.
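The matrix parses cleanly with standard tools, which is handy in placement scripts. A sketch using an embedded sample row in the layout above (real input comes from `nvidia-smi topo -m`):

```shell
# Sample GPU0 row from the matrix above (placeholder data)
row="GPU0  X  NV12  SYS  SYS  PHB  NODE  SYS"

# Field 1 is the device name, field 2 the 'X' self-entry,
# so field 3 is the GPU0<->GPU1 link type.
link=$(echo "$row" | awk '{print $3}')
case "$link" in
  NV*)      echo "GPU0-GPU1: NVLink ($link)";;
  PHB|NODE) echo "GPU0-GPU1: PCIe path ($link)";;
  SYS)      echo "GPU0-GPU1: cross-socket";;
esac
# prints: GPU0-GPU1: NVLink (NV12)
```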

lspci Tree View

$ lspci -tv
-[0000:00]-+-00.0  Intel Root Complex
           +-1c.0-[01-02]----00.0  PCIe Switch
           |           \-[02]--+-00.0  NVIDIA GPU
           |                   \-01.0  Samsung NVMe
           \-1d.0-[03]----00.0  Intel NIC

GPU and NVMe under same switch [01-02] = P2P should work (if ACS disabled).

NUMA and Storage Affinity

In multi-socket systems, each CPU has its own PCIe lanes. Accessing storage on the "wrong" NUMA node adds latency.

✓ Local Access

GPU → Local PCIe → NVMe

~2-3 μs latency

✗ Remote Access

GPU → QPI/UPI → Remote PCIe → NVMe

~5-8 μs latency (+100-200%)

# Check the NUMA node for a device
$ cat /sys/bus/pci/devices/0000:3a:00.0/numa_node
0

# nvidia-smi topo -m also reports per-GPU NUMA/CPU affinity
$ nvidia-smi topo -m
💡 Best Practice

For GPUDirect Storage, ensure GPU and NVMe are on the same NUMA node AND under the same PCIe switch. This is a hardware/BIOS configuration decision — plan it before deployment.
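The NUMA half of this check can be automated with the sysfs `numa_node` attribute shown earlier. A sketch, with hypothetical BDFs to substitute for your own (a value of -1 means the platform did not report an affinity):

```shell
# Hypothetical BDFs -- replace with your GPU and NVMe addresses
gpu=0000:3a:00.0
nvme=0000:3b:00.0

gpu_node=$(cat /sys/bus/pci/devices/$gpu/numa_node)
nvme_node=$(cat /sys/bus/pci/devices/$nvme/numa_node)

if [ "$gpu_node" = "$nvme_node" ]; then
    echo "GPU and NVMe share NUMA node $gpu_node: good for GDS"
else
    echo "NUMA mismatch ($gpu_node vs $nvme_node): expect QPI/UPI hops"
fi
```

Run it for every GPU/NVMe pairing you intend to use; combined with the same-switch check, it tells you whether a node meets both placement conditions before any workload lands on it.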