1. NVMe over Fabrics (NVMe-oF)
🚨 Why This Matters
Hyperscalers (Meta, Google, Microsoft) increasingly disaggregate storage from compute with NVMe-oF rather than relying on local NVMe alone. If you're building at scale, you need to understand fabric-attached storage.
NVMe-oF Transport Options
| Transport | Latency | Throughput | CPU | GPU-Direct | Use Case |
|---|---|---|---|---|---|
| NVMe/RDMA (RoCEv2) | ~10-20 µs | 100-400 Gbps | Very Low | Yes | High-perf AI clusters |
| NVMe/RDMA (IB) | ~5-15 µs | 200-400 Gbps | Very Low | Yes | HPC, premium AI |
| NVMe/TCP | ~50-100 µs | 25-100 Gbps | High | Limited | Cost-sensitive |
| NVMe/FC | ~30-50 µs | 32-64 Gbps | Medium | No | Legacy FC infra |
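Whichever transport the table points you to, the initiator side is driven by nvme-cli. A minimal sketch — the addresses and NQN below are illustrative, not from a real deployment:

```shell
# Discover what a target exports
nvme discover -t rdma -a 192.168.1.10 -s 4420

# Connect over RDMA (RoCEv2 or InfiniBand)
nvme connect -t rdma -a 192.168.1.10 -s 4420 -n nqn.2024-01.com.vendor:array01

# Same target over TCP: no RNIC required, but higher latency and CPU cost
nvme connect -t tcp -a 192.168.1.10 -s 4420 -n nqn.2024-01.com.vendor:array01
```

`-t` selects the transport, `-a` the target address (traddr), `-s` the service ID (trsvcid, conventionally 4420), and `-n` the subsystem NQN.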
ANA Multipathing
⚡ Production Requirement
Any serious NVMe-oF deployment needs multipathing for HA. ANA provides path states (optimized, non-optimized, inaccessible) so the initiator can choose the best path.
```bash
# Check NVMe-oF multipath status
$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2024-01.com.vendor:array01
\
 +- nvme0 rdma traddr=192.168.1.10 trsvcid=4420 live optimized
 +- nvme1 rdma traddr=192.168.1.11 trsvcid=4420 live non-optimized

# ANA states:
# - optimized:     preferred path, lowest latency
# - non-optimized: functional but higher latency
# - inaccessible:  path down, don't use

# Linux native NVMe multipath (dm-multipath not needed)
$ cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
round-robin   # or: numa, queue-depth
```
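The I/O policy can be switched at runtime through the same sysfs attribute. A sketch — the subsystem name is illustrative, and the queue-depth policy requires a recent kernel:

```shell
# Change the native NVMe multipath I/O policy (run as root)
echo "round-robin" > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy

# Verify; valid values are numa (default), round-robin, and queue-depth
cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
```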
GDS over NVMe-oF Configuration
```json
// cuFile.json configuration for fabric storage
// (cuFile.json tolerates //-style comments)
{
  "logging": { "level": 2 },
  "nvfs": {
    "rdma": {
      "enable": true,
      "devices": ["mlx5_0", "mlx5_1"],
      "poll_mode": true,
      "max_direct_io_size_kb": 16384
    }
  },
  "properties": {
    "max_device_cache_size_kb": 131072,
    "max_device_pinned_mem_size_kb": 33554432
  }
}
```
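The config above only takes effect once an application drives I/O through the cuFile API. A minimal sketch of reading a file directly into GPU memory — it assumes CUDA, libcufile, and a GDS-capable mount, and trims error handling for brevity:

```c
// Sketch: DMA a file region straight into GPU memory via cuFile (GDS).
// Requires a GPU, libcufile, and a file system from the GDS support matrix.
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t read_to_gpu(const char *path, void *dev_ptr, size_t size, off_t file_off) {
    CUfileError_t st = cuFileDriverOpen();      // loads nvidia-fs, reads cuFile.json
    if (st.err != CU_FILE_SUCCESS) return -1;

    int fd = open(path, O_RDONLY | O_DIRECT);   // O_DIRECT is mandatory for GDS

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);
    cuFileBufRegister(dev_ptr, size, 0);        // pin the GPU buffer for DMA

    // DMA path: SSD -> GPU, bypassing the CPU bounce buffer
    ssize_t n = cuFileRead(fh, dev_ptr, size, file_off, 0);

    cuFileBufDeregister(dev_ptr);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return n;
}
```

`dev_ptr` is assumed to be a `cudaMalloc`-allocated device pointer; registering it once with `cuFileBufRegister` and reusing it across reads avoids per-I/O pinning cost.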
2. Linux Storage Stack Deep Dive
⚠️ The Hidden Bottleneck
You optimized your NVMe, enabled GDS, bought expensive SSDs... but every I/O still traverses the kernel. The Linux storage stack adds roughly 10-25 µs of latency per I/O, plus significant CPU overhead.
Storage Stack Layers
Linux Storage Stack: Latency at Each Layer

| Layer | Typical Latency |
|---|---|
| Application (cuFile) | ~1 µs |
| VFS (Virtual File System) | ~2-5 µs |
| File System (XFS/ext4) | ~3-10 µs |
| Block Layer (blk-mq) | ~2-5 µs |
| NVMe Driver | ~1-2 µs |
| NVMe SSD | ~80-100 µs |

Total kernel overhead: ~10-25 µs (10-25% of total I/O time)
io_uring: Modern Async I/O
✅ io_uring Benefits
Submission Queue (SQ) and Completion Queue (CQ) in shared memory. Zero-copy. Batched submissions. Kernel-poll mode can reduce syscalls/context switches.
Traditional I/O
- 1 syscall per I/O operation
- Context switch overhead (~1-2 µs)
- Data copy: user → kernel → device
- Completion: poll or signal
- ~500K IOPS max per core
io_uring
- Batch N ops, 1 syscall (or 0 with SQPOLL)
- Shared memory SQ/CQ, no copies
- Kernel polling mode: zero syscalls
- Registered buffers: avoids extra memcpy (direct DMA when possible)
- ~2-3M IOPS per core
```c
// io_uring with registered buffers for GPU-like access patterns
struct io_uring ring;
io_uring_queue_init(256, &ring, IORING_SETUP_SQPOLL);  // kernel-side submission polling

// Register fixed buffers once (avoids per-I/O pinning; note that pre-5.11
// kernels additionally require io_uring_register_files() under SQPOLL)
struct iovec iovecs[16];
for (int i = 0; i < 16; i++) {
    iovecs[i].iov_base = aligned_alloc(4096, BUFFER_SIZE);
    iovecs[i].iov_len = BUFFER_SIZE;
}
io_uring_register_buffers(&ring, iovecs, 16);

// Submit batched reads, one SQE per chunk of the file
for (int i = 0; i < batch_size; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, iovecs[i % 16].iov_base, size,
                             offset + (off_t)i * size, i % 16);
    sqe->user_data = i;
}
io_uring_submit(&ring);  // one syscall for the whole batch (zero if sq_thread is awake)
```
⚠️ SQPOLL Requirements
- CAP_SYS_NICE required: SQPOLL spawns a kernel thread that busy-polls
- CPU dedication: The sq_thread pins to a CPU core (100% utilized)
- Idle timeout: After sq_thread_idle ms, thread sleeps (default 1000ms)
- Not always faster: For bursty GPU checkpoints, regular io_uring may match SQPOLL
io_uring Modes Comparison
| Mode | Syscalls/batch | CPU Overhead | Best For |
|---|---|---|---|
| Regular | 1 submit + 1 wait | Low | General purpose |
| SQPOLL | 0 | High (dedicated core) | Sustained high IOPS |
| IOPOLL | 1 submit, 0 complete | Medium | NVMe with polling |
| SQPOLL + IOPOLL | 0 | Very High | Ultra-low latency |
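The modes in the table map to setup flags at ring creation. A sketch, assuming liburing — SQPOLL needs CAP_SYS_NICE (or a 5.13+ kernel), and IOPOLL only works with O_DIRECT file descriptors on devices with poll queues enabled:

```c
#include <liburing.h>

// Sketch: ring configured for the "SQPOLL + IOPOLL" row of the table
int make_polled_ring(struct io_uring *ring) {
    struct io_uring_params p = {0};
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL;
    p.sq_thread_idle = 2000;  // ms before sq_thread sleeps (default 1000)
    return io_uring_queue_init_params(256, ring, &p);
}
```

Dropping either flag from `p.flags` gives the corresponding single-poll row; `flags = 0` is the regular mode.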
SPDK: Complete Kernel Bypass
🔧 SPDK (Storage Performance Development Kit)
User-space NVMe driver. Completely bypasses the kernel. Polls NVMe completion queues directly. Used by Ceph, DAOS, and HFT systems.
Kernel Path vs SPDK Path

```
Kernel Path                          SPDK Path
-----------                          ---------
Application                          Application
     | syscall                            | function call
Kernel (VFS / FS / block layer)      SPDK NVMe driver (user space)
     |                                    | VFIO/UIO
NVMe driver                          NVMe SSD
     |
NVMe SSD

~100-120 µs total                    ~85-95 µs total
```
⚠️ SPDK + GPU Caveat
SPDK doesn't natively integrate with GDS. You'd need to manage GPU memory registration yourself. For most GPU workloads, GDS with io_uring is more practical.
3. Kubernetes & CSI Integration
🚨 Reality Check
80%+ of AI workloads deploy in Kubernetes. If GDS doesn't work in your container orchestration, it doesn't work in production.
GDS in Containers: Challenges
| Requirement | Challenge | Solution |
|---|---|---|
| GPU Access | Container needs GPU device | NVIDIA Device Plugin |
| NVMe Access | Raw NVMe access for GDS | Privileged OR device plugin |
| RDMA Access | RDMA device for GPUDirect | RDMA device plugin, host network |
| Huge Pages | GDS uses huge pages for DMA | hugePages resource request |
| File System | GDS needs specific mount opts | CSI driver with GDS provisioning |
Pod Spec for GDS Workloads
```yaml
# GDS-enabled AI training pod
apiVersion: v1
kind: Pod
metadata:
  name: gds-training-pod
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        hugepages-2Mi: 4Gi              # required for GDS DMA
        rdma/rdma_shared_device_a: 1    # GPUDirect RDMA
    volumeMounts:
    - name: training-data
      mountPath: /data
    - name: nvme-direct                 # raw NVMe for GDS
      mountPath: /dev/nvme0n1
    securityContext:
      privileged: true                  # required for raw device access
      capabilities:                     # redundant when privileged; listed for
        add: ["SYS_ADMIN", "IPC_LOCK"]  # the non-privileged variant
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: gds-pvc
  - name: nvme-direct
    hostPath:
      path: /dev/nvme0n1
      type: BlockDevice
```
CSI Drivers for GPU Storage
NVIDIA GPUDirect Storage CSI (Recommended)
Official NVIDIA CSI driver with GDS support. Handles device registration, huge pages, and mount options automatically.
- Auto-registers NVMe with GDS
- Handles cuFile config injection
- Works with local and NVMe-oF
- Requires NVIDIA GPU Operator
Dell CSI PowerScale (Enterprise)
Enterprise storage arrays with NVMe-oF backend. CSI driver handles multipathing and GDS compatibility.
- NVMe-oF/RDMA backend
- ANA multipathing built-in
- Snapshot and clone support
- Enterprise support contract
Pure Storage CSI (Enterprise)
FlashArray and FlashBlade with NVMe-oF. DirectPath for kernel bypass.
- DirectPath I/O (reduced latency)
- NVMe/RoCE and NVMe/TCP
- GPUDirect Storage certified
- Kubernetes-native management
OpenEBS Mayastor (Open Source)
Cloud-native storage with NVMe-oF backend. Good for on-prem Kubernetes clusters.
- NVMe-oF/TCP based
- Replication and snapshots
- No GDS optimization (yet)
- CNCF Sandbox project
StorageClass for GDS
```yaml
# StorageClass optimized for GDS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gds-nvme
provisioner: csi.nvidia.com
parameters:
  type: nvme-local
  fsType: xfs
  mkfsOptions: "-K"          # don't discard blocks during mkfs
mountOptions:
- noatime                    # don't update access times
- nodiratime
- logbufs=8                  # XFS log buffers
# note: the old 'nobarrier' option was removed in kernel 4.19; XFS rejects it
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
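A workload consumes this class through an ordinary claim. A minimal sketch — the claim name and size are illustrative:

```yaml
# PVC bound to the gds-nvme StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gds-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gds-nvme
  resources:
    requests:
      storage: 2Ti
```

With `WaitForFirstConsumer` binding, the volume is provisioned only once a pod referencing `gds-pvc` is scheduled, so the NVMe lands on the same node as the GPUs.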
4. File Systems for GDS
GDS Compatibility Matrix
| File System | GDS Support | Notes |
|---|---|---|
| ext4 | ✓ Full Support | Most common. Mount: -o noatime,data=ordered (the old nobarrier option was removed in kernel 4.19) |
| XFS | ✓ Full Support | Better for large files, parallel I/O. Recommended for checkpoints. |
| Lustre | ✓ Full Support | Parallel FS. Requires lustre-client-gds. HPC standard. |
| GPFS / Spectrum Scale | ✓ Full Support | Enterprise parallel FS. Native GDS (v5.1+). IBM AI infra. |
| WekaFS | ✓ Full Support | Purpose-built for AI. Native GDS. Highest performance. |
| BeeGFS | ⚠ Partial | Requires tuning. Check version compatibility. |
| NFS | ⚠ Partial | No GDS path over standard NFS/TCP; NFS over RDMA (NFSoRDMA) is supported with a compatible RDMA stack. |
| CIFS/SMB | ✗ Not Supported | Windows protocol. No GDS path. |
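A quick way to verify what this matrix claims on a given host is NVIDIA's gdscheck tool, which ships with the CUDA GDS package (the exact path and script name vary by CUDA version):

```shell
# Print GDS platform support: driver status, supported file systems,
# and PCIe topology between GPUs and NVMe devices
/usr/local/cuda/gds/tools/gdscheck.py -p
```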
O_DIRECT Requirement
⚠️ Critical
GDS requires O_DIRECT to bypass the page cache. Files must be opened with O_DIRECT flag, and I/O must be aligned.
```c
// O_DIRECT requirements: buffer address, file offset, and transfer size
// must all be aligned (typically to 4 KB)
int fd = open(path, O_RDONLY | O_DIRECT);

size_t alignment = 4096;

// Round the transfer size up to a multiple of the alignment first...
size = (size + alignment - 1) & ~(alignment - 1);

// ...then allocate an aligned buffer of that (rounded) size
void *buffer;
posix_memalign(&buffer, alignment, size);
```
💡 Tip: Use cuFileDriverGetProperties() to query the required alignment for your system.
Parallel Filesystem Tuning
```bash
# Lustre tuning for GDS
lctl set_param llite.*.max_read_ahead_mb=0    # disable client readahead (GDS handles it)
lctl set_param osc.*.max_pages_per_rpc=1024   # larger RPCs
lctl set_param osc.*.max_rpcs_in_flight=32    # more parallelism

# XFS tuning
mkfs.xfs -d su=1m,sw=4 /dev/nvme0n1           # stripe unit/width for RAID
mount -o noatime,nodiratime,logbufs=8 /dev/nvme0n1 /data
```