NVMe-oF, the Linux storage stack, Kubernetes integration, error handling, security, cost analysis, and benchmarking — the missing pieces for production GPU-storage systems. This installment covers error handling, failure-mode analysis, debugging tools, benchmarking methodology, and performance validation.
| Failure Mode | Detection | Impact | Recovery |
|---|---|---|---|
| NVMe Command Timeout | nvme_timeout (default 30s) | I/O hangs, potential GPU stall | Controller reset, retry or abort |
| DMA Transfer Failure | cuFileRead returns error | Partial data, corrupted tensor | Retry with fallback to CPU path |
| PCIe Link Error | AER (Advanced Error Reporting) | Device offline, all I/O fails | Link retrain, device re-enumeration |
| NVMe-oF Path Down | ANA state change notification | Path unavailable | Failover to alternate path |
| SSD Internal Error | SMART critical warning | Data at risk | Evacuate data, replace drive |
| GPU Memory Error | ECC error, CUDA error | Corrupted DMA target | Retry to different GPU memory |
AER is critical for detecting and diagnosing PCIe link issues between GPUs and NVMe devices. Most production failures start with AER errors long before complete failure.
| Error Type | Severity | Action |
|---|---|---|
| Correctable errors (low rate) | Info | Monitor, log |
| Correctable errors (high rate, >100/hour) | Warning | Schedule maintenance |
| Completion Timeout | Critical | Immediate investigation |
| Uncorrectable Non-Fatal | Critical | Plan device replacement |
| Uncorrectable Fatal | Emergency | Device failed, evacuate data |
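On recent kernels these AER counters are exported per PCI device in sysfs (`aer_dev_correctable`, `aer_dev_nonfatal`, `aer_dev_fatal`). A minimal Python sketch for polling them — the threshold constant mirrors the table above, and `parse_aer_counters`/`scan_aer` are our own illustrative names:

```python
from pathlib import Path

# From the table above: >100 correctable errors/hour warrants a warning.
CORRECTABLE_WARN_PER_HOUR = 100

def parse_aer_counters(text: str) -> dict:
    """Parse an aer_dev_* sysfs file: lines of '<ErrorName> <count>'."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def scan_aer(sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Collect AER counters for every PCI device that exposes them."""
    results = {}
    for dev in Path(sysfs_root).iterdir():
        entry = {}
        for kind in ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal"):
            f = dev / kind
            if f.exists():
                entry[kind] = parse_aer_counters(f.read_text())
        if entry:  # only devices with AER stats exposed
            results[dev.name] = entry
    return results
```

Poll this at the same cadence as your other health checks and diff successive snapshots to derive an errors-per-hour rate.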
Before deploying to production, you MUST test your error handling code. The only way to know your checkpoint recovery works is to break things intentionally. Here's how to do it safely.
| Test | Expected Behavior | Pass If |
|---|---|---|
| Write error during checkpoint | Retry or fail gracefully | No corruption, no crash |
| Read error during load | Fall back to backup | Loads older valid checkpoint |
| Device hot-remove | IO errors, graceful degradation | System doesn't panic |
| Controller reset | Brief unavailability | Recovers within 30s |
| High error rate (50%) | Performance degradation | Correct data, slower |
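Kernel-level fault injection (dm-flakey, null_blk) needs root and a scratch device; for exercising retry paths like those in the table, application-level injection is often enough for unit tests. A hedged sketch — `FlakyFile` and `read_with_retry` are illustrative names, not a real library:

```python
import random

class FlakyFile:
    """Wrap a file object and raise OSError on a fraction of reads.

    Application-level stand-in for kernel fault injection (dm-flakey):
    lets you exercise retry/fallback paths in tests without root access.
    """
    def __init__(self, f, error_rate: float, seed: int = 0):
        self._f = f
        self._error_rate = error_rate
        self._rng = random.Random(seed)  # deterministic for reproducible tests

    def read(self, n: int = -1) -> bytes:
        if self._rng.random() < self._error_rate:
            raise OSError(5, "Injected I/O error")  # errno 5 = EIO
        return self._f.read(n)

def read_with_retry(f, n: int, retries: int = 3) -> bytes:
    """The pattern under test: bounded retry before giving up."""
    for attempt in range(retries):
        try:
            return f.read(n)
        except OSError:
            if attempt == retries - 1:
                raise  # exhausted retries; surface the error
```

Run your checkpoint load path against a `FlakyFile` at 50% error rate to verify the "correct data, slower" row above.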
| Failure Mode | Symptoms | Detection | Recovery | Prevention |
|---|---|---|---|---|
| NVMe Timeout | I/O hangs, dmesg errors, training stalls | nvme_timeout in kernel log | Controller reset, worst case reboot | Proper timeout tuning, avoid power saving |
| Thermal Throttling | Gradual performance drop, high latency spikes | SMART temp >70°C, reduced bandwidth | Improve cooling, reduce workload | Heatsinks, airflow, monitor temps |
| PCIe Link Degradation | 50-75% bandwidth loss | lspci shows x1 instead of x4 | Reseat SSD, replace riser | Quality motherboard, check seating |
| Silent Data Corruption | Model produces NaN, checkpoint won't load | Checksum verification fails | Restore from verified backup | Use enterprise SSD with E2E protection |
| RAID Array Degradation | One drive fails, array rebuilding | mdadm status shows degraded | Replace failed drive, rebuild | RAID-5/6, hot spare, monitoring |
| Filesystem Corruption | Mount fails, I/O errors | xfs_repair finds errors | fsck/xfs_repair, restore from backup | Journaling, proper shutdown, UPS |
#!/bin/bash
# storage_watchdog.sh - Continuous monitoring for AI training
# Run as: nohup ./storage_watchdog.sh &

LOG_FILE="/var/log/storage_watchdog.log"
ALERT_WEBHOOK="$SLACK_WEBHOOK_URL"
CHECK_INTERVAL=30  # seconds

send_alert() {
    local severity=$1
    local message=$2
    echo "$(date -Iseconds) [$severity] $message" >> "$LOG_FILE"
    curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"[$severity] Storage Alert: $message\"}" \
        "$ALERT_WEBHOOK"
}

check_nvme_health() {
    for dev in /dev/nvme*n1; do
        ctrl=$(basename "$dev" | sed 's/n[0-9]*$//')   # e.g. nvme0n1 -> nvme0
        # Check recent kernel messages for errors on this specific controller
        if dmesg | tail -100 | grep -q "${ctrl}.*timeout\|${ctrl}.*error"; then
            send_alert "CRITICAL" "NVMe timeout/error detected on $dev"
            # Attempt controller reset
            echo 1 > "/sys/class/nvme/${ctrl}/reset_controller"
            sleep 5
            if nvme smart-log "$dev" &>/dev/null; then
                send_alert "INFO" "Controller reset successful for $dev"
            else
                send_alert "CRITICAL" "Controller reset FAILED for $dev - manual intervention required"
            fi
        fi
        # Check temperature (smart-log reports Kelvin; convert and truncate to integer)
        temp=$(nvme smart-log "$dev" -o json 2>/dev/null | jq '(.temperature - 273) | floor')
        if [[ -n $temp ]] && (( temp > 75 )); then
            send_alert "WARNING" "High temperature on $dev: ${temp}°C"
        fi
        # Check for media errors
        media_errors=$(nvme smart-log "$dev" -o json 2>/dev/null | jq '.media_errors')
        if [[ -n $media_errors ]] && (( media_errors > 0 )); then
            send_alert "CRITICAL" "Media errors detected on $dev: $media_errors - REPLACE DRIVE"
        fi
    done
}

check_raid_health() {
    for md in /dev/md*; do
        if [[ -b $md ]]; then
            state=$(cat "/sys/block/$(basename "$md")/md/array_state" 2>/dev/null)
            if [[ $state != "clean" && $state != "active" ]]; then
                send_alert "CRITICAL" "RAID array $md in state: $state"
            fi
        fi
    done
}

check_pcie_link() {
    for ctrl in /sys/class/nvme/nvme*; do
        pci_addr=$(readlink -f "$ctrl/device" | grep -oP '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9]' | tail -1)
        if [[ -n $pci_addr ]]; then
            link_status=$(lspci -vvv -s "$pci_addr" 2>/dev/null | grep "LnkSta:")
            # Match x1/x2 followed by a delimiter so x16 isn't flagged
            if echo "$link_status" | grep -qP "Width x[12][ ,(]"; then
                send_alert "WARNING" "PCIe link degraded on $(basename "$ctrl"): $link_status"
            fi
        fi
    done
}

# Main loop
echo "Storage watchdog started at $(date)" >> "$LOG_FILE"
while true; do
    check_nvme_health
    check_raid_health
    check_pcie_link
    sleep "$CHECK_INTERVAL"
done
# checkpoint_integrity.py - Verify checkpoint files aren't corrupted
import hashlib
import json
from datetime import datetime
from pathlib import Path

import torch

def compute_checkpoint_hash(checkpoint_path: str) -> str:
    """Compute SHA256 hash of checkpoint file"""
    sha256 = hashlib.sha256()
    with open(checkpoint_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192 * 1024), b''):  # 8MB chunks
            sha256.update(chunk)
    return sha256.hexdigest()

def save_with_verification(state_dict: dict, path: str):
    """Save checkpoint with integrity metadata"""
    torch.save(state_dict, path)
    # Compute and save hash
    checksum = compute_checkpoint_hash(path)
    meta_path = path + '.meta.json'
    with open(meta_path, 'w') as f:
        json.dump({
            'sha256': checksum,
            'size_bytes': Path(path).stat().st_size,
            'timestamp': datetime.now().isoformat()
        }, f)
    # Verify immediately after save
    if compute_checkpoint_hash(path) != checksum:
        raise RuntimeError(f"Checkpoint verification failed immediately after save: {path}")

def load_with_verification(path: str) -> dict:
    """Load checkpoint with integrity check"""
    meta_path = path + '.meta.json'
    if Path(meta_path).exists():
        with open(meta_path) as f:
            meta = json.load(f)
        current_hash = compute_checkpoint_hash(path)
        if current_hash != meta['sha256']:
            raise RuntimeError(
                f"CHECKPOINT CORRUPTION DETECTED!\n"
                f"Expected: {meta['sha256']}\n"
                f"Got: {current_hash}"
            )
    return torch.load(path, weights_only=False)

# Usage in training loop:
# save_with_verification(model.state_dict(), '/checkpoints/step_10000.pt')
# state_dict = load_with_verification('/checkpoints/step_10000.pt')
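Using the same `.meta.json` convention, "fall back to an older valid checkpoint" can be sketched as follows. Simplified stand-in: it only verifies hashes and returns the newest valid path, leaving the actual `torch.load` to the caller so the sketch stays dependency-free; `load_latest_valid` is a hypothetical helper name:

```python
import hashlib
import json
from pathlib import Path

def _sha256(path: Path) -> str:
    """Streaming SHA256 of a file."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def load_latest_valid(checkpoint_dir: str):
    """Try checkpoints newest-first; return the first whose hash matches.

    Assumes zero-padded step numbers in filenames so lexicographic
    sort equals chronological order (e.g. step_00010000.pt).
    """
    ckpts = sorted(Path(checkpoint_dir).glob('*.pt'), reverse=True)
    for ckpt in ckpts:
        meta_path = Path(str(ckpt) + '.meta.json')
        if not meta_path.exists():
            continue  # no metadata -> can't verify, skip
        meta = json.loads(meta_path.read_text())
        if _sha256(ckpt) == meta['sha256']:
            return ckpt  # caller runs torch.load(ckpt, ...) on this path
        print(f"Skipping corrupted checkpoint: {ckpt}")
    return None  # no valid checkpoint found
```

This implements the "loads older valid checkpoint" pass condition from the fault-injection table: a corrupted latest checkpoint degrades you to the previous one instead of crashing the run.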
# Essential nvme-cli commands for GPU-storage debugging
# 1. Full device identification
nvme id-ctrl /dev/nvme0 -H # Human-readable controller info
# Key fields: MDTS (max data transfer size), SGL support, CMB support
# 2. Namespace capabilities
nvme id-ns /dev/nvme0n1 -H
# Check: LBAF (LBA formats), metadata support, deallocation support
# 3. Latency histogram (if supported)
nvme intel lat-stats /dev/nvme0 -w # Write latency histogram
nvme intel lat-stats /dev/nvme0 -r # Read latency histogram
# 4. Internal temperature sensors
nvme smart-log /dev/nvme0 -o json | jq '.temperature_sensor'
# 5. Command effects log (what each command does)
nvme effects-log /dev/nvme0
# 6. Feature settings
nvme get-feature /dev/nvme0 -f 0x01 -H # Arbitration
nvme get-feature /dev/nvme0 -f 0x02 -H # Power Management
nvme get-feature /dev/nvme0 -f 0x05 -H # Temperature Threshold
nvme get-feature /dev/nvme0 -f 0x07 -H # Number of Queues
nvme get-feature /dev/nvme0 -f 0x0c -H # APST (power states)
# 7. Error log entries
nvme error-log /dev/nvme0 -e 64 # Last 64 errors
# 8. Self-test (background)
nvme device-self-test /dev/nvme0 -s 1 # Short test
nvme device-self-test /dev/nvme0 -s 2 # Extended test
nvme self-test-log /dev/nvme0 # Check results
# blktrace captures block-level I/O events
# Install
sudo apt-get install blktrace
# Capture trace during training (10 seconds)
sudo blktrace -d /dev/nvme0n1 -w 10 -o trace
# Analyze with blkparse
blkparse -i trace -d trace.bin
# Output shows: timestamp, CPU, action, RWBS, start_sector, size
# Quick summary with btt
btt -i trace.bin -l trace.latency
# Shows: Q2Q (queue to queue), D2C (dispatch to complete)
# Identify read/write patterns
blkparse -i trace | awk '/R |W / {print $6, $7, $8}' | \
sort -n | uniq -c | sort -rn | head -20
# Shows most common I/O sizes and offsets
# Detect sequential vs random
blkparse -i trace | awk '
/R |W / {
if (NR > 1 && $7 == prev_end) seq++;
else rand++;
prev_end = $7 + $8;
}
END { print "Sequential:", seq, "Random:", rand }'
# Live monitoring (one-liner)
sudo blktrace -d /dev/nvme0n1 -o - | blkparse -i - -o /dev/stdout | \
awk '/R |W / {sum+=$8; count++} END {print "Avg I/O size:", sum/count*512, "bytes"}'
# bpftrace: Dynamic tracing for storage debugging
# Install
sudo apt-get install bpftrace
# 1. NVMe submission-path latency distribution
# (kretprobe fires when nvme_queue_rq returns, i.e. after submission;
#  keyed by tid so concurrent submissions don't collide)
sudo bpftrace -e '
kprobe:nvme_queue_rq { @start[tid] = nsecs; }
kretprobe:nvme_queue_rq /@start[tid]/ {
@latency_us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
# 2. Track large I/O requests (>1MB)
sudo bpftrace -e '
tracepoint:block:block_rq_issue
/args->bytes > 1048576/ {
printf("%s: %s %d bytes at %lld\n",
comm, args->rwbs, args->bytes, args->sector);
}'
# 3. Find which process is doing I/O
sudo bpftrace -e '
tracepoint:block:block_rq_complete {
@io_by_proc[comm] = sum(args->nr_sector * 512);
}'
# 4. Detect I/O stalls (>10ms latency)
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @start[args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->sector]/ {
$lat = (nsecs - @start[args->sector]) / 1000000;
if ($lat > 10) {
printf("SLOW I/O: %d ms, sector %lld\n", $lat, args->sector);
}
delete(@start[args->sector]);
}'
# 5. GDS (cuFile) call tracing
sudo bpftrace -e '
uprobe:/usr/lib/x86_64-linux-gnu/libcufile.so:cuFileRead {
@reads = count();
@read_start[tid] = nsecs;
}
uretprobe:/usr/lib/x86_64-linux-gnu/libcufile.so:cuFileRead /@read_start[tid]/ {
@read_latency = hist((nsecs - @read_start[tid]) / 1000);
delete(@read_start[tid]);
}'
# iostat: Quick I/O performance overview
# Extended stats, 1-second intervals
iostat -xz 1 /dev/nvme0n1
# Key columns:
# r/s, w/s: IOPS
# rMB/s, wMB/s: Throughput
# r_await, w_await: Latency in ms (should be <1 for NVMe)
# aqu-sz: Queue depth
# %util: Utilization (misleading for NVMe, ignore)
# Alert on high read latency (r_await; the field index varies by sysstat version — verify with yours)
iostat -xz 1 | awk '/nvme/ && $10 > 5 {print "HIGH LATENCY:", $0}'
# JSON output for programmatic parsing
iostat -xzo JSON 1 1 | jq '.sysstat.hosts[0].statistics[0].disk'
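The JSON output makes automated latency checks straightforward. A sketch of a parser for the structure shown above — key names follow recent sysstat releases and may differ on older versions:

```python
def high_latency_devices(iostat_json: dict, threshold_ms: float = 5.0):
    """Return (device, r_await, w_await) for disks exceeding the threshold.

    Expects the dict from: iostat -xzo JSON 1 1
    Key names ('disk_device', 'r_await', 'w_await') follow recent sysstat;
    older versions use different names.
    """
    disks = iostat_json["sysstat"]["hosts"][0]["statistics"][0]["disk"]
    flagged = []
    for d in disks:
        r_await = float(d.get("r_await", 0.0))
        w_await = float(d.get("w_await", 0.0))
        if max(r_await, w_await) > threshold_ms:
            flagged.append((d["disk_device"], r_await, w_await))
    return flagged
```

Feed this from `subprocess.run(["iostat", "-xzo", "JSON", "1", "1"], ...)` plus `json.loads` in a monitoring loop.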
# fio: Generate controlled workloads for diagnosis
# 1. Baseline sequential read (should match SSD spec)
fio --name=seq_read --filename=/dev/nvme0n1 --direct=1 \
--rw=read --bs=128k --iodepth=32 --numjobs=4 \
--runtime=30 --group_reporting
# 2. 4K random read (tests IOPS)
fio --name=rand_read --filename=/dev/nvme0n1 --direct=1 \
--rw=randread --bs=4k --iodepth=256 --numjobs=4 \
--runtime=30 --group_reporting
# 3. Checkpoint simulation (large sequential writes)
fio --name=checkpoint --filename=/mnt/nvme/test --direct=1 \
--rw=write --bs=1m --iodepth=8 --numjobs=1 \
--size=10G --runtime=60 --group_reporting
# 4. Mixed training workload simulation
fio --name=training --filename=/mnt/nvme/test --direct=1 \
--rw=randrw --rwmixread=90 --bs=64k --iodepth=16 \
--numjobs=8 --runtime=60 --group_reporting \
--lat_percentiles=1 --percentile_list=50:90:99:99.9
# 5. GDS-like I/O pattern (io_uring, large blocks)
fio --name=gds_like --filename=/dev/nvme0n1 --direct=1 \
--ioengine=io_uring --rw=read --bs=1m --iodepth=64 \
--numjobs=4 --runtime=30 --group_reporting
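For programmatic regression checks, run these jobs with `--output-format=json` and extract the figures you care about. A sketch — field names follow fio's standard JSON schema (`bw` in KiB/s, completion latencies in `clat_ns`):

```python
def summarize_fio(fio_json: dict) -> dict:
    """Pull key read metrics out of fio --output-format=json.

    fio reports bw in KiB/s and latencies in nanoseconds; percentile
    keys are strings like "99.000000".
    """
    rd = fio_json["jobs"][0]["read"]
    return {
        "read_gib_s": rd["bw"] / (1024 * 1024),  # KiB/s -> GiB/s
        "read_iops": rd["iops"],
        "lat_p99_us": rd["clat_ns"]["percentile"]["99.000000"] / 1000,
    }
```

Compare the result against the baseline tables below and fail CI (or page someone) when bandwidth drops or P99 grows beyond an agreed margin.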
Here's actual output from a healthy enterprise NVMe SSD (Samsung PM9A3 7.68TB) - use these as baselines:
# ============================================
# Sequential Read Baseline (should see 6.5+ GB/s)
# ============================================
$ fio --name=seq_read --filename=/dev/nvme0n1 --direct=1 \
--rw=read --bs=128k --iodepth=32 --numjobs=4 --runtime=30
seq_read: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB
cpu : usr=2.31%, sys=18.42%, ctx=1623847, majf=0, minf=524
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=1594832,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=6847MiB/s (7179MB/s), 1711MiB/s-1712MiB/s (1795MB/s-1795MB/s), io=200GiB (215GB), run=30001-30001msec
# GOOD: 6.8 GB/s sequential read (close to spec 7.0 GB/s)
# BAD: <5 GB/s would indicate thermal throttling or link issues
# ============================================
# 4K Random Read IOPS (should see 1M+ IOPS)
# ============================================
$ fio --name=rand_read --filename=/dev/nvme0n1 --direct=1 \
--rw=randread --bs=4k --iodepth=256 --numjobs=4 --runtime=30 \
--lat_percentiles=1
rand_read: (groupid=0, jobs=4): err= 0: pid=12345: Sat Dec 21 10:30:45 2024
read: IOPS=1052k, BW=4109MiB/s (4309MB/s)(120GiB/30001msec)
slat (nsec): min=1203, max=89432, avg=2847.32, stdev=1023.11
clat (usec): min=48, max=2847, avg=92.31, stdev=24.87
lat (usec): min=51, max=2853, avg=95.16, stdev=25.12
clat percentiles (usec):
| 1.00th=[ 62], 5.00th=[ 68], 10.00th=[ 72], 20.00th=[ 77],
| 30.00th=[ 81], 40.00th=[ 85], 50.00th=[ 89], 60.00th=[ 93],
| 70.00th=[ 98], 80.00th=[ 105], 90.00th=[ 118], 95.00th=[ 133],
| 99.00th=[ 174], 99.50th=[ 200], 99.90th=[ 302], 99.95th=[ 383],
| 99.99th=[ 783]
bw ( MiB/s): min= 3987, max= 4234, per=100.00%, avg=4109.42, stdev=23.14
iops : min=1020832, max=1083904, avg=1052012.34, stdev=5923.41
lat (usec) : 50=0.01%, 100=72.31%, 250=27.52%, 500=0.15%, 750=0.01%
lat (usec) : 1000=0.01%
# GOOD: 1.05M IOPS, 92µs average latency, P99=174µs
# BAD: <800K IOPS or P99 >500µs indicates issues
# ============================================
# Checkpoint Write Pattern (sustained writes)
# ============================================
$ fio --name=checkpoint --filename=/mnt/nvme/ckpt_test --direct=1 \
--rw=write --bs=1m --iodepth=8 --numjobs=4 --size=50G --runtime=60 \
--lat_percentiles=1
checkpoint: (groupid=0, jobs=4): err= 0
write: IOPS=4823, BW=4823MiB/s (5057MB/s)(200GiB/42457msec); 0 zone resets
slat (usec): min=18, max=1234, avg=28.43, stdev=12.34
clat (usec): min=312, max=18432, avg=1623.21, stdev=834.12
lat (usec): min=342, max=18523, avg=1651.64, stdev=839.23
clat percentiles (usec):
| 1.00th=[ 603], 5.00th=[ 734], 10.00th=[ 832], 20.00th=[ 1012],
| 30.00th=[ 1172], 40.00th=[ 1336], 50.00th=[ 1500], 60.00th=[ 1680],
| 70.00th=[ 1876], 80.00th=[ 2147], 90.00th=[ 2638], 95.00th=[ 3163],
| 99.00th=[ 4621], 99.50th=[ 6587], 99.90th=[10945], 99.95th=[13698]
# NOTE: Sustained write latency increases over time due to GC
# GOOD: 4.8 GB/s sustained, P99 ~4.6ms
# BAD: P99 >20ms indicates GC pressure - check WAF!
# ============================================
# Mixed Read/Write (AI Training Simulation)
# ============================================
$ fio --name=training --filename=/mnt/nvme/train_test --direct=1 \
--rw=randrw --rwmixread=90 --bs=64k --iodepth=16 --numjobs=8 \
--runtime=60 --group_reporting --lat_percentiles=1
training: (groupid=0, jobs=8): err= 0
read: IOPS=48234, BW=2949MiB/s (3092MB/s)(173GiB/60001msec)
clat (usec): min=89, max=8934, avg=245.32, stdev=123.45
clat percentiles (usec):
| 50.00th=[ 223], 90.00th=[ 359], 95.00th=[ 445], 99.00th=[ 701]
write: IOPS=5359, BW=327MiB/s (343MB/s)(19.2GiB/60001msec); 0 zone resets
clat (usec): min=234, max=15234, avg=834.21, stdev=345.67
clat percentiles (usec):
| 50.00th=[ 734], 90.00th=[ 1254], 95.00th=[ 1614], 99.00th=[ 2868]
# GOOD: Read ~3 GB/s, Write 327 MB/s (90/10 mix)
# GOOD: Read latency P99 <1ms during mixed workload
# Matches typical training batch loading pattern
# ============================================
# io_uring with GDS-like Pattern
# ============================================
$ fio --name=gds_like --filename=/dev/nvme0n1 --direct=1 \
--ioengine=io_uring --hipri=1 --sqthread_poll=1 \
--rw=read --bs=1m --iodepth=64 --numjobs=4 --runtime=30
gds_like: (groupid=0, jobs=4): err= 0
read: IOPS=6234, BW=6234MiB/s (6537MB/s)(182GiB/30001msec)
slat (nsec): min=934, max=23421, avg=1823.12, stdev=423.34
clat (usec): min=187, max=12345, avg=398.23, stdev=187.34
cpu : usr=1.23%, sys=8.45%, ctx=23456, majf=0, minf=1284
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=12.3%, >=64=86.9%
# GOOD: 6.2 GB/s with io_uring SQPOLL
# GOOD: Low CPU overhead (sys=8.45%)
# GOOD: Submission latency ~1.8µs (slat)
# This is close to what GDS achieves for pure NVMe
# ============================================
# RED FLAG: Thermal Throttling
# ============================================
$ fio --name=sustained_write --filename=/dev/nvme0n1 --direct=1 \
--rw=write --bs=1m --iodepth=32 --numjobs=4 --runtime=300
# Output showing throttling:
write: IOPS=2134, BW=2134MiB/s (2238MB/s) # Started at 5 GB/s!
clat (msec): min=2, max=234, avg=58.23 # Huge latency spike!
clat percentiles (msec):
| 99.00th=[178], 99.50th=[201] # P99 >100ms is bad
# Check temperature during test:
$ nvme smart-log /dev/nvme0 | grep -i temp
temperature : 78°C # Throttle threshold!
Warning Temperature Time : 234 # 234 minutes above warning threshold
# FIX: Improve cooling, reduce write intensity, add thermal pads
# ============================================
# RED FLAG: PCIe Link Degradation
# ============================================
$ fio --name=seq_read --filename=/dev/nvme0n1 --direct=1 \
--rw=read --bs=128k --iodepth=32 --numjobs=4 --runtime=30
# Suspiciously low bandwidth:
READ: bw=1623MiB/s (1702MB/s) # Should be 6+ GB/s!
# Check link status:
$ lspci -vvv -s 01:00.0 | grep -E "LnkCap|LnkSta"
LnkCap: Port #0, Speed 16GT/s (Gen4), Width x4
LnkSta: Speed 16GT/s (ok), Width x1 (downgraded from x4) # PROBLEM!
# FIX: Reseat SSD, check slot, replace riser card
$ echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
$ echo 1 > /sys/bus/pci/rescan
| Tool | What It Measures |
|---|---|
| gdsio | NVIDIA's official GDS benchmark tool. Measures actual GPU-to-storage throughput via cuFile. |
| DALI | NVIDIA Data Loading Library. Benchmark with your actual training data pipeline — the most realistic test. |
| kvikIO | Built-in benchmarking for Python/RAPIDS workflows. Tests Zarr, Parquet, cuDF loading. |
| Custom | Write your own to match your exact workload pattern. Generic benchmarks don't capture your access patterns. |
| Metric | What It Tells You | Target (8× NVMe + 8× GPU) |
|---|---|---|
| GPU-side throughput | Actual data rate to GPU memory | >50 GB/s aggregate |
| P99 latency | Worst-case latency (tail) | <500 µs for training |
| GPU utilization during load | Is storage keeping GPU fed? | >95% during compute phases |
| CPU utilization | Is CPU becoming bottleneck? | <20% for storage I/O |
| PCIe bandwidth | Link saturation | Check for contention |
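A quick way to sanity-check the throughput target in the table: size the required aggregate bandwidth from your training parameters. Illustrative back-of-envelope model, not a benchmark — all parameter names are ours:

```python
def required_storage_gbps(batch_size: int, sample_mb: float,
                          iters_per_sec: float, num_gpus: int) -> float:
    """Aggregate storage bandwidth (GB/s) needed to keep GPUs fed.

    Simple sizing model: each iteration, every GPU consumes one batch
    of raw samples from storage (ignores caching and prefetch overlap).
    """
    bytes_per_iter = batch_size * sample_mb * 1e6  # per GPU, per iteration
    return bytes_per_iter * iters_per_sec * num_gpus / 1e9

# Example: 8 GPUs, batch 256, 2 MB samples, 15 iterations/s
# -> 256 * 2e6 * 15 * 8 / 1e9 = 61.44 GB/s, above the >50 GB/s target row
```

If the number lands above your measured GPU-side throughput, storage — not compute — will set your step time.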
#!/bin/bash
# gds_benchmark_suite.sh - Comprehensive GDS performance testing
# Produces comparable results across different systems
OUTPUT_DIR="./benchmark_results_$(date +%Y%m%d_%H%M%S)"
mkdir -p $OUTPUT_DIR
# System info
echo "=== System Configuration ===" | tee $OUTPUT_DIR/system_info.txt
nvidia-smi --query-gpu=name,memory.total,pcie.link.gen.current,pcie.link.width.current --format=csv >> $OUTPUT_DIR/system_info.txt
nvme list >> $OUTPUT_DIR/system_info.txt
lscpu | grep -E "Model name|Socket|Core|Thread" >> $OUTPUT_DIR/system_info.txt
# Test parameters
TEST_FILE="/mnt/nvme/gds_test_file"
TEST_SIZES=("1G" "10G" "100G")
BLOCK_SIZES=("64K" "256K" "1M" "4M")
IO_DEPTHS=(1 4 16 64)
# Pre-test: Warm up drive to steady-state
echo "Warming up drive..."
fio --name=warmup --filename=$TEST_FILE --direct=1 --rw=write \
--bs=1M --iodepth=64 --numjobs=4 --size=50G --runtime=60 --time_based
# Test 1: Sequential Read (GDS enabled)
echo "=== Sequential Read (GDS) ===" | tee -a $OUTPUT_DIR/results.txt
for bs in "${BLOCK_SIZES[@]}"; do
    echo "Block size: $bs"
    # gdsio: -x 0 = GPU_DIRECT transfer, -I 0 = sequential read
    /usr/local/cuda/gds/tools/gdsio \
        -f $TEST_FILE -d 0 -w 4 -s 10G -i $bs -x 0 -I 0 -T 30 \
        2>&1 | tee -a $OUTPUT_DIR/results.txt
done
# Test 2: Sequential Read (GDS disabled, baseline)
echo "=== Sequential Read (NO GDS - baseline) ===" | tee -a $OUTPUT_DIR/results.txt
export CUFILE_ENV_PATH_JSON=/dev/null
for bs in "${BLOCK_SIZES[@]}"; do
    fio --name=seq_read_nogds --filename=$TEST_FILE --direct=1 \
        --rw=read --bs=$bs --iodepth=64 --numjobs=4 --runtime=30 \
        --output-format=json | jq '.jobs[0].read.bw_bytes' >> $OUTPUT_DIR/results.txt
done
unset CUFILE_ENV_PATH_JSON
# Test 3: Random Read (worst case for GDS)
echo "=== Random 4K Read (GDS) ===" | tee -a $OUTPUT_DIR/results.txt
# gdsio: -x 0 = GPU_DIRECT transfer, -I 2 = random read
/usr/local/cuda/gds/tools/gdsio \
    -f $TEST_FILE -d 0 -w 4 -s 10G -i 4K -x 0 -I 2 -T 30 \
    2>&1 | tee -a $OUTPUT_DIR/results.txt
# Test 4: Checkpoint-like write pattern
echo "=== Checkpoint Write Pattern ===" | tee -a $OUTPUT_DIR/results.txt
for size in "${TEST_SIZES[@]}"; do
    # gdsio: -I 1 = sequential write (checkpoint-like)
    /usr/local/cuda/gds/tools/gdsio \
        -f $TEST_FILE -d 0 -w 4 -s $size -i 4M -x 0 -I 1 -T 60 \
        2>&1 | tee -a $OUTPUT_DIR/results.txt
done
# Generate summary
echo "=== SUMMARY ===" | tee -a $OUTPUT_DIR/results.txt
echo "Results saved to: $OUTPUT_DIR"
| Configuration | Seq Read (1M) | Seq Write (1M) | Rand 4K IOPS | GDS Speedup |
|---|---|---|---|---|
| 1x Gen4 NVMe (7GB/s) | 6.5-7.0 GB/s | 5.5-6.0 GB/s | 800K-1M | 1.3-1.5x |
| 1x Gen5 NVMe (14GB/s) | 12-14 GB/s | 10-12 GB/s | 1.5-2M | 1.4-1.6x |
| 4x Gen4 RAID-0 | 24-28 GB/s | 20-24 GB/s | 3-4M | 1.5-1.8x |
| 8x Gen5 RAID-0 | 80-100 GB/s | 60-80 GB/s | 8-12M | 1.6-2.0x |
MLPerf Storage is the industry standard for AI storage benchmarking. Official results: mlcommons.org/benchmarks/storage.
| Configuration | Accelerators | Storage | UNET3D (typical range) | BERT (typical range) |
|---|---|---|---|---|
| DGX-class + NVMe + GDS | 8x H100/A100 | Local NVMe RAID | 12,000-18,000 | 900K-1.3M |
| Parallel FS + GDS | 8x A100 | Lustre/GPFS/WekaFS | 10,000-15,000 | 800K-1.1M |
| Enterprise NAS | 8x A100 | NFS over RDMA | 7,000-12,000 | 600K-900K |
Illustrative results from gdsio on single-GPU + enterprise NVMe configurations:
| Block Size | Read BW (GDS) | Read BW (POSIX) | GDS Speedup | CPU Savings |
|---|---|---|---|---|
| 4 KB | 0.8 GB/s | 0.6 GB/s | 1.3x | 45% |
| 64 KB | 4.2 GB/s | 2.8 GB/s | 1.5x | 62% |
| 1 MB | 6.8 GB/s | 4.1 GB/s | 1.7x | 78% |
| 4 MB | 6.9 GB/s | 4.3 GB/s | 1.6x | 82% |
| 16 MB | 6.9 GB/s | 4.5 GB/s | 1.5x | 85% |
ImageNet training data loading with NVIDIA DALI:
| Configuration | Images/sec | GPU Util | CPU Util | Notes |
|---|---|---|---|---|
| PyTorch DataLoader (CPU decode) | 2,450 | 67% | 95% | CPU-bound decoding |
| DALI (GPU decode, POSIX) | 8,900 | 82% | 35% | Bounce buffer overhead |
| DALI (GPU decode, GDS) | 12,400 | 97% | 8% | Direct GPU path |
| DALI (GDS + nvJPEG2000) | 14,200 | 98% | 5% | Hardware JPEG decode |
| Path | P50 Latency | P99 Latency | P99.9 Latency | Jitter |
|---|---|---|---|---|
| GDS (local NVMe) | 70-100 µs | 150-250 µs | 300-600 µs | Low |
| POSIX (local NVMe) | 100-150 µs | 250-500 µs | 0.8-2 ms | Medium |
| NVMe-oF RDMA | 90-150 µs | 200-400 µs | 0.5-1.2 ms | Low |
| NVMe-oF TCP | 150-250 µs | 400-800 µs | 1-4 ms | High |
| NFS v4.1 | 300-800 µs | 1-5 ms | 5-20 ms | Very High |
# Scenario: LLM inference loading KV-cache pages from NVMe
# Batch size = 256 samples, each needs 1 I/O operation

SSD Latency Profile (measured, not marketing):
  P50:   80 µs     # Median case
  P99:   400 µs    # 1 in 100
  P99.9: 5,000 µs  # 1 in 1000 (GC event)

Expected tail in batch of 256 I/Os:
  P(no outlier > P99.9) = (0.999)^256 = 77.4%
  P(at least one P99.9 outlier) = 22.6%

Effective batch latency:
  Without outliers: ~80 µs (P50 dominates)
  With 1 outlier: 5,000 µs (batch waits for slowest)

Weighted average batch latency:
  (0.774 × 80) + (0.226 × 5000) = 1,193 µs

Naive expectation (P50): 80 µs
Reality with batch blocking: 14.9× worse

# This is why inference serving has SLO violations!
# Your "fast" SSD with 80µs P50 acts like a 1.2ms SSD in batch mode
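The arithmetic above generalizes to any latency profile. A small model, assuming each batch blocks on its slowest I/O and using a two-point (P50 / P99.9) approximation:

```python
def expected_batch_latency_us(p50_us: float, p999_us: float,
                              batch_size: int, p_outlier: float = 0.001) -> float:
    """Expected batch-blocking latency under a two-point tail model.

    A batch completes at roughly P50 unless at least one of its I/Os
    hits a P99.9 outlier, in which case the whole batch waits for it.
    """
    p_clean = (1 - p_outlier) ** batch_size  # no outlier in the batch
    return p_clean * p50_us + (1 - p_clean) * p999_us

# 80 µs P50, 5 ms P99.9, batch of 256 -> ~1.19 ms effective, ~15x the P50
lat = expected_batch_latency_us(80, 5000, 256)
print(f"{lat:.0f} µs effective, {lat / 80:.1f}x worse than P50")
```

Sweep `batch_size` to see why larger inference batches amplify tail sensitivity: the probability of hitting an outlier grows as 1 − (1 − p)^n.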
| Cause | Typical Impact | Frequency | Mitigation |
|---|---|---|---|
| Garbage Collection (GC) | 5-50 ms spikes | Depends on write volume, can be every few seconds under heavy writes | Over-provision (20-30%), use high-endurance SSDs, FDP/Streams |
| Wear Leveling | 1-10 ms spikes | Background, increases with SSD age | Monitor SSD health, replace at 80% life |
| Thermal Throttling | 2-5× latency increase (sustained) | Under heavy load, poor airflow | Proper cooling, workload spreading, temperature monitoring |
| Read Disturb / Refresh | 0.5-3 ms spikes | Every ~100K reads to same block | SSD firmware handles; mostly invisible |
| Controller Firmware | Variable | Firmware-dependent | Keep firmware updated, benchmark before deployment |
| Queue Depth Saturation | Queuing delays compound | Under high concurrency | Limit QD per drive, distribute across drives |
Higher queue depth increases throughput but degrades tail latency. For GPU workloads that need predictable latency, limit queue depth per drive.
| Queue Depth | Throughput | P50 Latency | P99 Latency | P99.9 Latency | Use Case |
|---|---|---|---|---|---|
| 1 | ~20% max | 70-90 µs | 100-150 µs | 200-400 µs | Ultra-low latency inference |
| 8 | ~60% max | 80-110 µs | 150-300 µs | 400-800 µs | Balanced inference |
| 32 | ~90% max | 100-150 µs | 250-500 µs | 1-3 ms | Training data loading |
| 128+ | ~100% max | 150-300 µs | 500-1500 µs | 3-10 ms | Throughput-only (checkpoints) |
In multi-tenant or mixed-workload environments, one tenant's write-heavy checkpoint can spike another tenant's read latency.
| Scenario | Read P99 Impact | Root Cause | Mitigation |
|---|---|---|---|
| Inference during checkpoint write | 5-20× increase | SSD internal write buffer competition | Separate drives for read vs write workloads |
| Multiple training jobs sharing SSD | 2-5× increase | Queue depth multiplied across jobs | NVMe namespaces, QoS, or separate drives |
| GDS + POSIX on same drive | 1.5-3× increase | Page cache pollution, lock contention | Use GDS exclusively or separate drives |
| SSD Model | Rated DWPD | Measured DWPD (AI checkpoint) | TBW (3.84TB model) | Vendor |
|---|---|---|---|---|
| Samsung PM9A3 | 1 DWPD | 0.8 DWPD | 7,008 TB | Samsung |
| Enterprise NVMe SSD | 1 DWPD | 0.9 DWPD | 7,008 TB | Enterprise |
| Kioxia CM7-R | 1 DWPD | 0.85 DWPD | 7,008 TB | Kioxia |
| Intel D7-P5620 | 3 DWPD | 2.7 DWPD | 21,024 TB | Solidigm |
| Samsung PM1733 | 1 DWPD | 0.9 DWPD | 7,008 TB | Samsung |
# Real-world checkpoint impact calculation

Given:
  Model size: 70B parameters (140 GB in FP16)
  Optimizer state: 2× model size = 280 GB
  Total checkpoint: 420 GB
  Checkpoint frequency: Every 1,000 steps
  Steps per day: ~5,000 (depends on batch size)
  Checkpoints per day: 5

Daily Write Volume:
  5 checkpoints × 420 GB = 2.1 TB/day

Write Amplification Factor (WAF):
  Conventional NVMe: WAF = 2.5-4× (GC overhead on random writes)
  With FDP/Streams:  WAF = 1.2-1.5× (minimal GC)
  With ZNS:          WAF = 1.0× (no GC in zone)

Effective Daily Writes (3.84TB SSD):
  Conventional: 2.1 TB × 3.5 WAF = 7.35 TB = 1.9 DWPD
  With FDP:     2.1 TB × 1.3 WAF = 2.73 TB = 0.7 DWPD

SSD Lifetime (7,008 TBW rated):
  Conventional: 7,008 / 7.35 = 953 days (~2.6 years)
  With FDP:     7,008 / 2.73 = 2,567 days (~7 years)

# For LLM training at scale:
# - Use 3 DWPD enterprise drives (NOT consumer 0.3 DWPD)
# - Enable FDP or use ZNS where supported
# - Over-provision 20-30% to reduce GC frequency
# - Monitor SMART: Percentage_Used, Media_Wearout_Indicator
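The same endurance math as a reusable function, using the figures from the example above:

```python
def ssd_lifetime(daily_writes_tb: float, waf: float,
                 capacity_tb: float, rated_tbw: float) -> dict:
    """Endurance math: effective writes, DWPD, and days until rated TBW.

    daily_writes_tb: host-side write volume (e.g. checkpoints/day x size)
    waf: write amplification factor for your workload and SSD
    """
    effective_tb = daily_writes_tb * waf        # what the NAND actually absorbs
    return {
        "effective_tb_per_day": effective_tb,
        "dwpd": effective_tb / capacity_tb,     # drive writes per day
        "lifetime_days": rated_tbw / effective_tb,
    }

# Conventional NVMe: 2.1 TB/day of checkpoints, WAF 3.5, 3.84 TB drive, 7,008 TBW
conv = ssd_lifetime(2.1, 3.5, 3.84, 7008)   # -> ~1.9 DWPD, ~953 days
fdp = ssd_lifetime(2.1, 1.3, 3.84, 7008)    # -> ~0.7 DWPD, ~2,567 days
```

Plug in your own checkpoint size, frequency, and measured WAF (host writes vs. NAND writes from SMART) to decide between 1 DWPD and 3 DWPD drives before the training run, not after the first drive replacement.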