C.3 • Deploy Phase

Infrastructure & Deployment

NVMe-oF fabrics, Linux storage stack deep dive, io_uring optimization, Kubernetes CSI integration, and filesystem compatibility.

1 • NVMe over Fabrics (NVMe-oF)

🚨 Why This Matters Hyperscalers (Meta, Google, Microsoft) don't use local NVMe. They use NVMe-oF to disaggregate storage from compute. If you're building at scale, you need to understand fabric-attached storage.

NVMe-oF Transport Options

| Transport | Latency | Throughput | CPU | GPU-Direct | Use Case |
| --- | --- | --- | --- | --- | --- |
| NVMe/RDMA (RoCEv2) | ~10-20 µs | 100-400 Gbps | Very Low | Yes | High-perf AI clusters |
| NVMe/RDMA (IB) | ~5-15 µs | 200-400 Gbps | Very Low | Yes | HPC, premium AI |
| NVMe/TCP | ~50-100 µs | 25-100 Gbps | High | Limited | Cost-sensitive |
| NVMe/FC | ~30-50 µs | 32-64 Gbps | Medium | No | Legacy FC infra |
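
The table above can be encoded as a small selection helper. This is a toy sketch: the latency figures are the table's approximations, not measurements, and NVMe/TCP's "Limited" GPU-Direct support is conservatively treated as unavailable.

```python
# Transport properties from the table above (approximate figures).
TRANSPORTS = {
    "NVMe/RDMA (RoCEv2)": {"latency_us": (10, 20),  "gpu_direct": True},
    "NVMe/RDMA (IB)":     {"latency_us": (5, 15),   "gpu_direct": True},
    "NVMe/TCP":           {"latency_us": (50, 100), "gpu_direct": False},
    "NVMe/FC":            {"latency_us": (30, 50),  "gpu_direct": False},
}

def pick_transports(max_latency_us, need_gpu_direct):
    """Transports whose worst-case latency fits the budget."""
    return [name for name, t in TRANSPORTS.items()
            if t["latency_us"][1] <= max_latency_us
            and (t["gpu_direct"] or not need_gpu_direct)]

# A 25 µs budget with GPUDirect leaves only the RDMA transports
print(pick_transports(25, need_gpu_direct=True))
```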

ANA Multipathing

⚡ Production Requirement Any serious NVMe-oF deployment needs multipathing for HA. ANA provides path states (optimized, non-optimized, inaccessible) so the initiator can choose the best path.
Bash
# Check NVMe-oF multipath status
$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2024-01.com.vendor:array01
\
 +- nvme0 rdma traddr=192.168.1.10 trsvcid=4420 live optimized
 +- nvme1 rdma traddr=192.168.1.11 trsvcid=4420 live non-optimized

# ANA states:
# - optimized: preferred path, lowest latency
# - non-optimized: functional but higher latency
# - inaccessible: path down, don't use

# Linux native multipath (dm-multipath not needed)
$ cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
round-robin   # or numa, queue-depth
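
The initiator-side decision those ANA states drive can be sketched as follows. This is illustrative Python, not kernel code; the state names are the real ANA states, and the path data mirrors the `nvme list-subsys` output above.

```python
# Lower rank = more preferred path state.
ANA_RANK = {"optimized": 0, "non-optimized": 1, "inaccessible": 2}

def usable_paths(paths):
    """Return the names of the best usable paths: optimized if any exist,
    otherwise non-optimized; inaccessible paths are never used."""
    live = [p for p in paths if p["state"] != "inaccessible"]
    if not live:
        return []
    best = min(ANA_RANK[p["state"]] for p in live)
    return [p["name"] for p in live if ANA_RANK[p["state"]] == best]

paths = [
    {"name": "nvme0", "state": "optimized"},
    {"name": "nvme1", "state": "non-optimized"},
]
print(usable_paths(paths))  # only the optimized path is selected
```

If the optimized path fails, the same logic falls back to the non-optimized one, which is exactly the behavior multipathing buys you.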

GDS over NVMe-oF Configuration

JSON
// cuFile.json configuration for fabric storage
{
  "logging": { "level": 2 },
  "nvfs": {
    "rdma": {
      "enable": true,
      "devices": ["mlx5_0", "mlx5_1"],
      "poll_mode": true,
      "max_direct_io_size_kb": 16384
    }
  },
  "properties": {
    "max_device_cache_size_kb": 131072,
    "max_device_pinned_mem_size_kb": 33554432
  }
}
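
The sizes in cuFile.json are in KiB; a quick sanity check (purely illustrative arithmetic) of what the values above work out to:

```python
KiB = 1024
# max_direct_io_size_kb: 16384 KiB = 16 MiB per direct I/O
assert 16384 * KiB == 16 * 2**20
# max_device_cache_size_kb: 131072 KiB = 128 MiB device cache
assert 131072 * KiB == 128 * 2**20
# max_device_pinned_mem_size_kb: 33554432 KiB = 32 GiB pinned GPU memory cap
assert 33554432 * KiB == 32 * 2**30
print("config sizes: 16 MiB direct I/O, 128 MiB cache, 32 GiB pinned")
```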
2 • Linux Storage Stack Deep Dive

⚠️ The Hidden Bottleneck You optimized your NVMe, enabled GDS, and bought expensive SSDs... but every I/O still traverses the kernel. The Linux storage stack adds roughly 5-50 µs of latency and significant CPU overhead.

Storage Stack Layers

Linux Storage Stack: Latency at Each Layer

  Application (cuFile)         ~1 µs
  VFS (Virtual File System)    ~2-5 µs
  File System (XFS/ext4)       ~3-10 µs
  Block Layer (blk-mq)         ~2-5 µs
  NVMe Driver                  ~1-2 µs
  NVMe SSD                     ~80-100 µs

Total kernel overhead: ~10-25 µs (10-25% of total I/O time)
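
A back-of-envelope check of those layer figures (a toy model, not a measurement) shows why the kernel share is meaningful but not dominant on flash:

```python
# Per-layer (min, max) latency in µs, from the stack diagram above.
LAYERS = {
    "app (cuFile)":   (1, 1),
    "VFS":            (2, 5),
    "filesystem":     (3, 10),
    "block (blk-mq)": (2, 5),
    "nvme driver":    (1, 2),
    "nvme ssd":       (80, 100),
}
KERNEL = {"VFS", "filesystem", "block (blk-mq)", "nvme driver"}

lo = sum(r[0] for r in LAYERS.values())          # best-case total
hi = sum(r[1] for r in LAYERS.values())          # worst-case total
k_lo = sum(LAYERS[k][0] for k in KERNEL)         # kernel-only, best case
k_hi = sum(LAYERS[k][1] for k in KERNEL)         # kernel-only, worst case
print(f"total: {lo}-{hi} µs, kernel: {k_lo}-{k_hi} µs "
      f"({100*k_lo//hi}-{100*k_hi//lo}% of total)")
```

Summing the ranges gives roughly 8-22 µs of kernel time out of ~90-125 µs end to end, consistent with the "10-25%" figure above.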

io_uring: Modern Async I/O

✅ io_uring Benefits Submission Queue (SQ) and Completion Queue (CQ) in shared memory. Zero-copy. Batched submissions. Kernel-poll mode can reduce syscalls/context switches.

Traditional I/O

  • 1 syscall per I/O operation
  • Context switch overhead (~1-2 µs)
  • Data copy: user → kernel → device
  • Completion: poll or signal
  • ~500K IOPS max per core

io_uring

  • Batch N ops, 1 syscall (or 0 with SQPOLL)
  • Shared memory SQ/CQ, no copies
  • Kernel polling mode: zero syscalls
  • Registered buffers: avoids extra memcpy (direct DMA when possible)
  • ~2-3M IOPS per core
C
// io_uring with registered buffers for GPU-like access patterns
#include <liburing.h>
#include <stdlib.h>

#define NUM_BUFS    16
#define BUFFER_SIZE (1 << 20)   /* 1 MiB: a multiple of the 4 KiB O_DIRECT alignment */

struct io_uring ring;
io_uring_queue_init(256, &ring, IORING_SETUP_SQPOLL);  // kernel-side submission polling

// Register fixed buffers once (avoids per-I/O buffer mapping/pinning)
struct iovec iovecs[NUM_BUFS];
for (int i = 0; i < NUM_BUFS; i++) {
    iovecs[i].iov_base = aligned_alloc(4096, BUFFER_SIZE);
    iovecs[i].iov_len  = BUFFER_SIZE;
}
io_uring_register_buffers(&ring, iovecs, NUM_BUFS);

// Submit batched reads; each I/O targets its own file offset
for (int i = 0; i < batch_size; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, iovecs[i % NUM_BUFS].iov_base,
                             BUFFER_SIZE, (off_t)i * BUFFER_SIZE, i % NUM_BUFS);
    sqe->user_data = i;
}
io_uring_submit(&ring);  // one syscall for the batch (zero while the SQPOLL thread is awake)
⚠️ SQPOLL Requirements
  • CAP_SYS_NICE required: SQPOLL spawns a kernel thread that busy-polls
  • CPU dedication: The sq_thread pins to a CPU core (100% utilized)
  • Idle timeout: After sq_thread_idle ms, thread sleeps (default 1000ms)
  • Not always faster: For bursty GPU checkpoints, regular io_uring may match SQPOLL

io_uring Modes Comparison

| Mode | Syscalls/batch | CPU Overhead | Best For |
| --- | --- | --- | --- |
| Regular | 1 submit + 1 wait | Low | General purpose |
| SQPOLL | 0 | High (dedicated core) | Sustained high IOPS |
| IOPOLL | 1 submit, 0 complete | Medium | NVMe with polling |
| SQPOLL + IOPOLL | 0 | Very High | Ultra-low latency |
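
The syscall column can be expressed as a tiny model. This is idealized: it assumes the SQPOLL thread is awake and completions are ready when reaped, so real counts vary.

```python
def syscalls_per_batch(n, mode):
    """Idealized syscall count for a batch of n I/Os, per the table above."""
    return {
        "traditional":   n,   # synchronous read/pread: one blocking syscall per op
        "regular":       2,   # one submit + one wait for the whole batch
        "sqpoll":        0,   # kernel thread consumes the submission queue
        "iopoll":        1,   # one submit; completions reaped by polling
        "sqpoll+iopoll": 0,
    }[mode]

for mode in ("traditional", "regular", "sqpoll", "iopoll"):
    print(mode, syscalls_per_batch(64, mode))
```

For a 64-op batch that is 64 syscalls vs 2 vs 0: the batching, not any single trick, is where most of the CPU savings come from.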

SPDK: Complete Kernel Bypass

🔧 SPDK (Storage Performance Development Kit) User-space NVMe driver. Completely bypasses the kernel. Polls NVMe completion queues directly. Used by Ceph, DAOS, and HFT systems.
Kernel Path:
  Application
    ↓ syscall
  Kernel (VFS / FS / block layer)
  NVMe driver
  NVMe SSD
  Total: ~100-120 µs

SPDK Path:
  Application
    ↓ function call
  SPDK NVMe driver (user space)
    ↓ VFIO/UIO
  NVMe SSD
  Total: ~85-95 µs
⚠️ SPDK + GPU Caveat SPDK doesn't natively integrate with GDS. You'd need to manage GPU memory registration yourself. For most GPU workloads, GDS with io_uring is more practical.
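
Bounding the bypass saving with the figures above (illustrative arithmetic only):

```python
kernel = (100, 120)   # kernel-path latency range, µs
spdk   = (85, 95)     # SPDK-path latency range, µs

# Best case: slow kernel path vs fast SPDK; worst case: the reverse.
saving = (kernel[0] - spdk[1], kernel[1] - spdk[0])
print(f"saving: {saving[0]}-{saving[1]} µs")
print(f"as % of kernel path: {100*saving[0]//kernel[1]}-{100*saving[1]//kernel[0]}%")
```

On a ~90 µs flash device the saving is a modest fraction of end-to-end latency, which is why SPDK pays off mainly on very fast media or when the CPU efficiency of polling (no interrupts, no context switches) matters more than raw latency.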
3 • Kubernetes & CSI Integration

🚨 Reality Check 80%+ of AI workloads deploy in Kubernetes. If GDS doesn't work in your container orchestration, it doesn't work in production.

GDS in Containers: Challenges

| Requirement | Challenge | Solution |
| --- | --- | --- |
| GPU Access | Container needs GPU devices | NVIDIA Device Plugin |
| NVMe Access | Raw NVMe access for GDS | Privileged mode OR device plugin |
| RDMA Access | RDMA device for GPUDirect | RDMA device plugin, host network |
| Huge Pages | GDS uses huge pages for DMA | hugePages resource request |
| File System | GDS needs specific mount opts | CSI driver with GDS provisioning |

Pod Spec for GDS Workloads

YAML
# GDS-enabled AI training pod
apiVersion: v1
kind: Pod
metadata:
  name: gds-training-pod
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        hugepages-2Mi: 4Gi         # Required for GDS DMA
        rdma/rdma_shared_device_a: 1  # GPUDirect RDMA
    volumeMounts:
    - name: training-data
      mountPath: /data
    - name: nvme-direct        # Raw NVMe for GDS
      mountPath: /dev/nvme0n1
    securityContext:
      privileged: true         # Required for device access
      capabilities:
        add: [SYS_ADMIN, IPC_LOCK]
  
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: gds-pvc
  - name: nvme-direct
    hostPath:
      path: /dev/nvme0n1
      type: BlockDevice

CSI Drivers for GPU Storage

NVIDIA GPUDirect Storage CSI Recommended

Official NVIDIA CSI driver with GDS support. Handles device registration, huge pages, and mount options automatically.

  • Auto-registers NVMe with GDS
  • Handles cuFile config injection
  • Works with local and NVMe-oF
  • Requires NVIDIA GPU Operator

Dell CSI PowerScale Enterprise

Enterprise storage arrays with NVMe-oF backend. CSI driver handles multipathing and GDS compatibility.

  • NVMe-oF/RDMA backend
  • ANA multipathing built-in
  • Snapshot and clone support
  • Enterprise support contract

Pure Storage CSI Enterprise

FlashArray and FlashBlade with NVMe-oF. DirectPath for kernel bypass.

  • DirectPath I/O (reduced latency)
  • NVMe/RoCE and NVMe/TCP
  • GPUDirect Storage certified
  • Kubernetes-native management

OpenEBS Mayastor Open Source

Cloud-native storage with NVMe-oF backend. Good for on-prem Kubernetes clusters.

  • NVMe-oF/TCP based
  • Replication and snapshots
  • No GDS optimization (yet)
  • CNCF Sandbox project

StorageClass for GDS

YAML
# StorageClass optimized for GDS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gds-nvme
provisioner: csi.nvidia.com
parameters:
  type: nvme-local
  fsType: xfs
  mkfsOptions: "-K"           # Don't discard on mkfs
mountOptions:
  - noatime                        # Don't update access times
  - nodiratime
  - logbufs=8                      # XFS log buffers
  # Note: "nobarrier" was removed from XFS in kernel 4.19; modern kernels reject it
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
4 • File Systems for GDS

GDS Compatibility Matrix

| File System | GDS Support | Notes |
| --- | --- | --- |
| ext4 | ✓ Full Support | Most common. data=ordered (the default); avoid the deprecated nobarrier option. |
| XFS | ✓ Full Support | Better for large files, parallel I/O. Recommended for checkpoints. |
| Lustre | ✓ Full Support | Parallel FS. Requires lustre-client-gds. HPC standard. |
| GPFS / Spectrum Scale | ✓ Full Support | Enterprise parallel FS. Native GDS (v5.1+). IBM AI infra. |
| WekaFS | ✓ Full Support | Purpose-built for AI. Native GDS. Highest performance. |
| BeeGFS | ⚠ Partial | Requires tuning. Check version compatibility. |
| NFS | ⚠ Limited | Standard NFS has no GDS path; NFS over RDMA is supported on some vendor stacks. Otherwise use NVMe-oF. |
| CIFS/SMB | ✗ Not Supported | Windows protocol. No GDS path. |

O_DIRECT Requirement

⚠️ Critical GDS requires O_DIRECT to bypass the page cache. Files must be opened with O_DIRECT flag, and I/O must be aligned.
C
// O_DIRECT requirements
int fd = open(path, O_RDONLY | O_DIRECT);

// Alignment requirements (typically 4KB)
size_t alignment = 4096;
void* buffer;
posix_memalign(&buffer, alignment, size);

// Size must be multiple of alignment
size = (size + alignment - 1) & ~(alignment - 1);
💡 Tip Use cuFileDriverGetProperties() to query the required alignment for your system.
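
The same round-up trick, checked in Python (it works for any power-of-two alignment):

```python
def align_up(size, alignment=4096):
    """Round size up to the next multiple of a power-of-two alignment,
    mirroring the C expression (size + alignment - 1) & ~(alignment - 1)."""
    assert alignment & (alignment - 1) == 0, "alignment must be a power of two"
    return (size + alignment - 1) & ~(alignment - 1)

print(align_up(5000))   # 8192
print(align_up(4096))   # 4096  (already aligned sizes are unchanged)
print(align_up(1))      # 4096
```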

Parallel Filesystem Tuning

Bash
# Lustre tuning for GDS
lctl set_param llite.*.max_read_ahead_mb=0      # Disable readahead (GDS handles)
lctl set_param osc.*.max_pages_per_rpc=1024     # Larger RPCs
lctl set_param osc.*.max_rpcs_in_flight=32      # More parallelism

# XFS tuning
mkfs.xfs -d su=1m,sw=4 /dev/nvme0n1              # Stripe unit/width for RAID
mount -o noatime,nodiratime,logbufs=8 /dev/nvme0n1 /data