1. NVMe over Fabrics (NVMe-oF)
🚨 Why This Matters
Hyperscalers (Meta, Google, Microsoft) increasingly disaggregate storage from compute with NVMe-oF rather than relying on local NVMe alone. If you're building at scale, you need to understand fabric-attached storage.
NVMe-oF Transport Options
| Transport | Latency | Throughput | CPU | GPU-Direct | Use Case |
|---|---|---|---|---|---|
| NVMe/RDMA (RoCEv2) | ~10-20 µs | 100-400 Gbps | Very Low | Yes | High-perf AI clusters |
| NVMe/RDMA (IB) | ~5-15 µs | 200-400 Gbps | Very Low | Yes | HPC, premium AI |
| NVMe/TCP | ~50-100 µs | 25-100 Gbps | High | Limited | Cost-sensitive |
| NVMe/FC | ~30-50 µs | 32-64 Gbps | Medium | No | Legacy FC infra |
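Whichever transport the table points you to, the initiator side is driven by nvme-cli. A minimal sketch — the addresses and NQN below are illustrative, not from a real deployment:

```shell
# Discover what a target exports
nvme discover -t rdma -a 192.168.1.10 -s 4420

# Connect over RDMA (RoCEv2 or InfiniBand)
nvme connect -t rdma -a 192.168.1.10 -s 4420 -n nqn.2024-01.com.vendor:array01

# Same target over TCP: no RNIC required, but higher latency and CPU cost
nvme connect -t tcp -a 192.168.1.10 -s 4420 -n nqn.2024-01.com.vendor:array01
```

`-t` selects the transport, `-a` the target address (traddr), `-s` the service ID (trsvcid, conventionally 4420), and `-n` the subsystem NQN.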
ANA Multipathing
⚡ Production Requirement
Any serious NVMe-oF deployment needs multipathing for HA. ANA provides path states (optimized, non-optimized, inaccessible) so the initiator can choose the best path.
```bash
# Check NVMe-oF multipath status
$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2024-01.com.vendor:array01
\
 +- nvme0 rdma traddr=192.168.1.10 trsvcid=4420 live optimized
 +- nvme1 rdma traddr=192.168.1.11 trsvcid=4420 live non-optimized

# ANA states:
# - optimized:     preferred path, lowest latency
# - non-optimized: functional but higher latency
# - inaccessible:  path down, don't use

# Linux native NVMe multipath (dm-multipath not needed)
$ cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
round-robin   # or: numa, queue-depth
```
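The I/O policy can be switched at runtime through the same sysfs attribute. A sketch — the subsystem name is illustrative, and the queue-depth policy requires a recent kernel:

```shell
# Change the native NVMe multipath I/O policy (run as root)
echo "round-robin" > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy

# Verify; valid values are numa (default), round-robin, and queue-depth
cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
```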
GDS over NVMe-oF Configuration
```json
// cuFile.json configuration for fabric storage
// (cuFile.json tolerates //-style comments)
{
  "logging": { "level": 2 },
  "nvfs": {
    "rdma": {
      "enable": true,
      "devices": ["mlx5_0", "mlx5_1"],
      "poll_mode": true,
      "max_direct_io_size_kb": 16384
    }
  },
  "properties": {
    "max_device_cache_size_kb": 131072,
    "max_device_pinned_mem_size_kb": 33554432
  }
}
```
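The config above only takes effect once an application drives I/O through the cuFile API. A minimal sketch of reading a file directly into GPU memory — it assumes CUDA, libcufile, and a GDS-capable mount, and trims error handling for brevity:

```c
// Sketch: DMA a file region straight into GPU memory via cuFile (GDS).
// Requires a GPU, libcufile, and a file system from the GDS support matrix.
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t read_to_gpu(const char *path, void *dev_ptr, size_t size, off_t file_off) {
    CUfileError_t st = cuFileDriverOpen();      // loads nvidia-fs, reads cuFile.json
    if (st.err != CU_FILE_SUCCESS) return -1;

    int fd = open(path, O_RDONLY | O_DIRECT);   // O_DIRECT is mandatory for GDS

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);
    cuFileBufRegister(dev_ptr, size, 0);        // pin the GPU buffer for DMA

    // DMA path: SSD -> GPU, bypassing the CPU bounce buffer
    ssize_t n = cuFileRead(fh, dev_ptr, size, file_off, 0);

    cuFileBufDeregister(dev_ptr);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return n;
}
```

`dev_ptr` is assumed to be a `cudaMalloc`-allocated device pointer; registering it once with `cuFileBufRegister` and reusing it across reads avoids per-I/O pinning cost.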
2. Linux Storage Stack Deep Dive
⚠️ The Hidden Bottleneck
You optimized your NVMe, enabled GDS, bought expensive SSDs... but every I/O still traverses the kernel. The Linux storage stack adds roughly 10-25 µs of latency per I/O, plus significant CPU overhead.
Storage Stack Layers
Linux Storage Stack: Latency at Each Layer

| Layer | Typical Latency |
|---|---|
| Application (cuFile) | ~1 µs |
| VFS (Virtual File System) | ~2-5 µs |
| File System (XFS/ext4) | ~3-10 µs |
| Block Layer (blk-mq) | ~2-5 µs |
| NVMe Driver | ~1-2 µs |
| NVMe SSD | ~80-100 µs |

Total kernel overhead: ~10-25 µs (10-25% of total I/O time)
io_uring: Modern Async I/O
✅ io_uring Benefits
Submission Queue (SQ) and Completion Queue (CQ) in shared memory. Zero-copy. Batched submissions. Kernel-poll mode can reduce syscalls/context switches.
Traditional I/O
- 1 syscall per I/O operation
- Context switch overhead (~1-2 µs)
- Data copy: user → kernel → device
- Completion: poll or signal
- ~500K IOPS max per core
io_uring
- Batch N ops, 1 syscall (or 0 with SQPOLL)
- Shared memory SQ/CQ, no copies
- Kernel polling mode: zero syscalls
- Registered buffers: avoids extra memcpy (direct DMA when possible)
- ~2-3M IOPS per core
```c
// io_uring with registered buffers for GPU-like access patterns
struct io_uring ring;
io_uring_queue_init(256, &ring, IORING_SETUP_SQPOLL);  // kernel-side submission polling

// Register fixed buffers once (avoids per-I/O pinning; note that pre-5.11
// kernels additionally require io_uring_register_files() under SQPOLL)
struct iovec iovecs[16];
for (int i = 0; i < 16; i++) {
    iovecs[i].iov_base = aligned_alloc(4096, BUFFER_SIZE);
    iovecs[i].iov_len = BUFFER_SIZE;
}
io_uring_register_buffers(&ring, iovecs, 16);

// Submit batched reads, one SQE per chunk of the file
for (int i = 0; i < batch_size; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, iovecs[i % 16].iov_base, size,
                             offset + (off_t)i * size, i % 16);
    sqe->user_data = i;
}
io_uring_submit(&ring);  // one syscall for the whole batch (zero if sq_thread is awake)
```
⚠️ SQPOLL Requirements
- CAP_SYS_NICE required: SQPOLL spawns a kernel thread that busy-polls
- CPU dedication: The sq_thread pins to a CPU core (100% utilized)
- Idle timeout: After sq_thread_idle ms, thread sleeps (default 1000ms)
- Not always faster: For bursty GPU checkpoints, regular io_uring may match SQPOLL
io_uring Modes Comparison
| Mode | Syscalls/batch | CPU Overhead | Best For |
|---|---|---|---|
| Regular | 1 submit + 1 wait | Low | General purpose |
| SQPOLL | 0 | High (dedicated core) | Sustained high IOPS |
| IOPOLL | 1 submit, 0 complete | Medium | NVMe with polling |
| SQPOLL + IOPOLL | 0 | Very High | Ultra-low latency |
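The modes in the table map to setup flags at ring creation. A sketch, assuming liburing — SQPOLL needs CAP_SYS_NICE (or a 5.13+ kernel), and IOPOLL only works with O_DIRECT file descriptors on devices with poll queues enabled:

```c
#include <liburing.h>

// Sketch: ring configured for the "SQPOLL + IOPOLL" row of the table
int make_polled_ring(struct io_uring *ring) {
    struct io_uring_params p = {0};
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL;
    p.sq_thread_idle = 2000;  // ms before sq_thread sleeps (default 1000)
    return io_uring_queue_init_params(256, ring, &p);
}
```

Dropping either flag from `p.flags` gives the corresponding single-poll row; `flags = 0` is the regular mode.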
SPDK: Complete Kernel Bypass
🔧 SPDK (Storage Performance Development Kit)
User-space NVMe driver. Completely bypasses the kernel. Polls NVMe completion queues directly. Used by Ceph, DAOS, and HFT systems.
Kernel Path vs SPDK Path

```
Kernel Path                          SPDK Path
-----------                          ---------
Application                          Application
     | syscall                            | function call
Kernel (VFS / FS / block layer)      SPDK NVMe driver (user space)
     |                                    | VFIO/UIO
NVMe driver                          NVMe SSD
     |
NVMe SSD

~100-120 µs total                    ~85-95 µs total
```
⚠️ SPDK + GPU Caveat
SPDK doesn't natively integrate with GDS. You'd need to manage GPU memory registration yourself. For most GPU workloads, GDS with io_uring is more practical.
3. Kubernetes & CSI Integration
🚨 Reality Check
80%+ of AI workloads deploy in Kubernetes. If GDS doesn't work in your container orchestration, it doesn't work in production.
GDS in Containers: Challenges
| Requirement | Challenge | Solution |
|---|---|---|
| GPU Access | Container needs GPU device | NVIDIA Device Plugin |
| NVMe Access | Raw NVMe access for GDS | Privileged OR device plugin |
| RDMA Access | RDMA device for GPUDirect | RDMA device plugin, host network |
| Huge Pages | GDS uses huge pages for DMA | hugePages resource request |
| File System | GDS needs specific mount opts | CSI driver with GDS provisioning |
Pod Spec for GDS Workloads
```yaml
# GDS-enabled AI training pod
apiVersion: v1
kind: Pod
metadata:
  name: gds-training-pod
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        hugepages-2Mi: 4Gi              # required for GDS DMA
        rdma/rdma_shared_device_a: 1    # GPUDirect RDMA
    volumeMounts:
    - name: training-data
      mountPath: /data
    - name: nvme-direct                 # raw NVMe for GDS
      mountPath: /dev/nvme0n1
    securityContext:
      privileged: true                  # required for raw device access
      capabilities:                     # redundant when privileged; listed for
        add: ["SYS_ADMIN", "IPC_LOCK"]  # the non-privileged variant
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: gds-pvc
  - name: nvme-direct
    hostPath:
      path: /dev/nvme0n1
      type: BlockDevice
```
CSI Drivers for GPU Storage
NVIDIA GPUDirect Storage CSI (Recommended)
Official NVIDIA CSI driver with GDS support. Handles device registration, huge pages, and mount options automatically.
- Auto-registers NVMe with GDS
- Handles cuFile config injection
- Works with local and NVMe-oF
- Requires NVIDIA GPU Operator
Dell CSI PowerScale (Enterprise)
Enterprise storage arrays with NVMe-oF backend. CSI driver handles multipathing and GDS compatibility.
- NVMe-oF/RDMA backend
- ANA multipathing built-in
- Snapshot and clone support
- Enterprise support contract
Pure Storage CSI (Enterprise)
FlashArray and FlashBlade with NVMe-oF. DirectPath for kernel bypass.
- DirectPath I/O (reduced latency)
- NVMe/RoCE and NVMe/TCP
- GPUDirect Storage certified
- Kubernetes-native management
OpenEBS Mayastor (Open Source)
Cloud-native storage with NVMe-oF backend. Good for on-prem Kubernetes clusters.
- NVMe-oF/TCP based
- Replication and snapshots
- No GDS optimization (yet)
- CNCF Sandbox project
StorageClass for GDS
```yaml
# StorageClass optimized for GDS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gds-nvme
provisioner: csi.nvidia.com
parameters:
  type: nvme-local
  fsType: xfs
  mkfsOptions: "-K"          # don't discard blocks during mkfs
mountOptions:
- noatime                    # don't update access times
- nodiratime
- logbufs=8                  # XFS log buffers
# note: the old 'nobarrier' option was removed in kernel 4.19; XFS rejects it
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
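A workload consumes this class through an ordinary claim. A minimal sketch — the claim name and size are illustrative:

```yaml
# PVC bound to the gds-nvme StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gds-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gds-nvme
  resources:
    requests:
      storage: 2Ti
```

With `WaitForFirstConsumer` binding, the volume is provisioned only once a pod referencing `gds-pvc` is scheduled, so the NVMe lands on the same node as the GPUs.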
4. File Systems for GDS
GDS Compatibility Matrix
| File System | GDS Support | Notes |
|---|---|---|
| ext4 | ✓ Full Support | Most common. Mount: -o noatime,data=ordered (the old nobarrier option was removed in kernel 4.19) |
| XFS | ✓ Full Support | Better for large files, parallel I/O. Recommended for checkpoints. |
| Lustre | ✓ Full Support | Parallel FS. Requires lustre-client-gds. HPC standard. |
| GPFS / Spectrum Scale | ✓ Full Support | Enterprise parallel FS. Native GDS (v5.1+). IBM AI infra. |
| WekaFS | ✓ Full Support | Purpose-built for AI. Native GDS. Highest performance. |
| BeeGFS | ⚠ Partial | Requires tuning. Check version compatibility. |
| NFS | ⚠ Partial | No GDS path over standard NFS/TCP; NFS over RDMA (NFSoRDMA) is supported with a compatible RDMA stack. |
| CIFS/SMB | ✗ Not Supported | Windows protocol. No GDS path. |
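A quick way to verify what this matrix claims on a given host is NVIDIA's gdscheck tool, which ships with the CUDA GDS package (the exact path and script name vary by CUDA version):

```shell
# Print GDS platform support: driver status, supported file systems,
# and PCIe topology between GPUs and NVMe devices
/usr/local/cuda/gds/tools/gdscheck.py -p
```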
O_DIRECT Requirement
⚠️ Critical
GDS requires O_DIRECT to bypass the page cache. Files must be opened with O_DIRECT flag, and I/O must be aligned.
```c
// O_DIRECT requirements: buffer address, file offset, and transfer size
// must all be aligned (typically to 4 KB)
int fd = open(path, O_RDONLY | O_DIRECT);

size_t alignment = 4096;

// Round the transfer size up to a multiple of the alignment first...
size = (size + alignment - 1) & ~(alignment - 1);

// ...then allocate an aligned buffer of that (rounded) size
void *buffer;
posix_memalign(&buffer, alignment, size);
```
💡 Tip: Use cuFileDriverGetProperties() to query the required alignment for your system.
Parallel Filesystem Tuning
```bash
# Lustre tuning for GDS
lctl set_param llite.*.max_read_ahead_mb=0    # disable client readahead (GDS handles it)
lctl set_param osc.*.max_pages_per_rpc=1024   # larger RPCs
lctl set_param osc.*.max_rpcs_in_flight=32    # more parallelism

# XFS tuning
mkfs.xfs -d su=1m,sw=4 /dev/nvme0n1           # stripe unit/width for RAID
mount -o noatime,nodiratime,logbufs=8 /dev/nvme0n1 /data
```