PRODUCTION DEPLOYMENT

Troubleshooting & Performance

NVMe-oF, Linux storage stack, Kubernetes integration, error handling, security, cost analysis, and benchmarking — the missing pieces for production GPU-storage systems.

C.6 TROUBLESHOOTING

Troubleshooting & Performance

Error handling, failure mode analysis, debugging tools, benchmarking methodology, and performance validation.

1. Error Handling & Recovery

🚨 Production Killer: What happens when an NVMe SSD fails during a GPU DMA transfer? When a PCIe link flaps? When an NVMe-oF path goes down? If you can't answer these questions, you're not ready for production.

Failure Modes and Recovery

Failure Mode         | Detection                      | Impact                          | Recovery
---------------------|--------------------------------|---------------------------------|-------------------------------------
NVMe Command Timeout | nvme_timeout (default 30s)     | I/O hangs, potential GPU stall  | Controller reset, retry or abort
DMA Transfer Failure | cuFileRead returns error       | Partial data, corrupted tensor  | Retry with fallback to CPU path
PCIe Link Error      | AER (Advanced Error Reporting) | Device offline, all I/O fails   | Link retrain, device re-enumeration
NVMe-oF Path Down    | ANA state change notification  | Path unavailable                | Failover to alternate path
SSD Internal Error   | SMART critical warning         | Data at risk                    | Evacuate data, replace drive
GPU Memory Error     | ECC error, CUDA error          | Corrupted DMA target            | Retry to different GPU memory
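The SMART-related rows of this table can be folded into a small health classifier. The sketch below operates on a dict parsed from `nvme smart-log /dev/nvme0 -o json`; the field names (`critical_warning`, `avail_spare`, `spare_thresh`, `percent_used`, `media_errors`) and Kelvin-based `temperature` follow recent nvme-cli JSON output, but verify against your version.

```python
# Classify NVMe SMART health per the failure-mode table above.
# Field names assume nvme-cli's `smart-log -o json` output; verify locally.
def classify_smart(smart: dict) -> list:
    """Return (severity, message) pairs for a parsed smart-log JSON dict."""
    alerts = []
    if smart.get('critical_warning', 0) != 0:
        alerts.append(('CRITICAL', 'SMART critical warning set - evacuate data'))
    if smart.get('media_errors', 0) > 0:
        alerts.append(('CRITICAL', 'Media/data integrity errors - replace drive'))
    if smart.get('avail_spare', 100) < smart.get('spare_thresh', 10):
        alerts.append(('CRITICAL', 'Available spare below threshold'))
    temp_c = smart.get('temperature', 0) - 273  # nvme-cli reports Kelvin
    if temp_c > 70:
        alerts.append(('WARNING', f'High temperature: {temp_c}°C'))
    if smart.get('percent_used', 0) > 90:
        alerts.append(('WARNING', 'Endurance nearly consumed'))
    return alerts

# A healthy drive (42°C, no errors) produces no alerts:
healthy = {'critical_warning': 0, 'temperature': 315, 'avail_spare': 100,
           'spare_thresh': 10, 'percent_used': 2, 'media_errors': 0}
print(classify_smart(healthy))  # []
```

Feed it with `json.loads(subprocess.getoutput('nvme smart-log /dev/nvme0 -o json'))` and wire the CRITICAL alerts into whatever paging path you already use.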

GDS Error Handling Code

// Robust GDS read with error handling and fallback.
// Assumes <cufile.h>, <cuda_runtime.h>, <stdlib.h>, <unistd.h> and
// project-local log_error()/log_warning() helpers.
// Note: cuFile has no API to recover the POSIX fd from a CUfileHandle_t,
// so the caller must also pass the fd it registered the handle with.

ssize_t fallback_cpu_read(int fd, void* gpu_buf, size_t size, off_t offset);

ssize_t robust_gds_read(CUfileHandle_t handle, int fd, void* gpu_buf,
                        size_t size, off_t offset, int max_retries) {
    ssize_t ret;
    int retries = 0;

    while (retries < max_retries) {
        ret = cuFileRead(handle, gpu_buf, size, offset, 0);
        if (ret == (ssize_t)size) {
            return ret;  // Success
        }
        if (ret < 0) {
            // cuFileRead reports cuFile errors as negative CUfileOpError values
            CUfileOpError err = (CUfileOpError)(-ret);
            switch (err) {
            case CU_FILE_DRIVER_NOT_INITIALIZED:
                // GDS driver issue - reinitialize
                cuFileDriverClose();
                cuFileDriverOpen();
                break;
            case CU_FILE_IO_ERROR:
                // Storage I/O error - might be transient
                usleep(1000 * (retries + 1));  // Backoff
                break;
            case CU_FILE_INVALID_MAPPING:
                // Buffer registration issue - fall back to CPU
                return fallback_cpu_read(fd, gpu_buf, size, offset);
            case CU_FILE_CUDA_DRIVER_ERROR: {
                // GPU error - check CUDA status
                cudaError_t cuda_err = cudaGetLastError();
                log_error("CUDA error: %s", cudaGetErrorString(cuda_err));
                if (cuda_err == cudaErrorECCUncorrectable) {
                    return -1;  // GPU memory error - fatal
                }
                break;
            }
            default:
                log_error("Unknown GDS error: %d", err);
                break;
            }
        }
        retries++;
    }

    // All retries failed - use CPU fallback
    log_warning("GDS failed after %d retries, using CPU path", max_retries);
    return fallback_cpu_read(fd, gpu_buf, size, offset);
}

// CPU fallback: read into host memory, then copy to GPU
ssize_t fallback_cpu_read(int fd, void* gpu_buf, size_t size, off_t offset) {
    void* host_buf = aligned_alloc(4096, size);
    if (!host_buf) return -1;
    ssize_t ret = pread(fd, host_buf, size, offset);
    if (ret > 0) {
        cudaMemcpy(gpu_buf, host_buf, ret, cudaMemcpyHostToDevice);
    }
    free(host_buf);
    return ret;
}

NVMe Health Monitoring

# Check NVMe SMART health
$ nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0      <-- Should be 0
temperature                         : 42°C
available_spare                     : 100%   <-- Watch for < 10%
available_spare_threshold           : 10%
percentage_used                     : 2%     <-- Endurance consumed
media_and_data_integrity_errors     : 0      <-- Must be 0
num_err_log_entries                 : 0

# Watch for critical warnings
$ nvme get-feature /dev/nvme0 -f 0x0b
Asynchronous Event Configuration (feature id 0x0b):
Critical Warnings: Available Spare, Temperature, Reliability, Read-only, Backup

# Set up persistent event log monitoring
$ nvme persistent-event-log /dev/nvme0 -a 1 | head -50

PCIe Advanced Error Reporting (AER)

AER is critical for detecting and diagnosing PCIe link issues between GPUs and NVMe devices. Most production failures start with AER errors long before complete failure.

# Check AER status and error counters
# Enable AER if not already (usually done at boot)
$ lspci -vvv -s $(lspci | grep -i nvme | head -1 | cut -d' ' -f1) | grep -A 20 "Advanced Error"
Capabilities: [100 v2] Advanced Error Reporting
    UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- ...
    UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- ...
    CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
    CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-

# Key AER error types to watch:
# UESta (Uncorrectable Errors) - CRITICAL, usually cause device failure
#   DLP:       Data Link Protocol Error - PCIe link layer issue
#   TLP:       Transaction Layer Protocol Error - Malformed packet
#   CmpltTO:   Completion Timeout - Device not responding
#   CmpltAbrt: Completion Abort - Transaction rejected
# CESta (Correctable Errors) - Warning signs, investigate if frequent
#   RxErr:           Receiver Error - Bit errors being corrected
#   BadTLP/BadDLLP:  CRC errors - Signal integrity issue

# Real-time AER monitoring via kernel
$ dmesg -w | grep -E "(AER|PCIe|nvme.*error)"
[12345.678] pcieport 0000:00:03.0: AER: Corrected error received: 0000:04:00.0
[12345.679] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer

# Check error counters in sysfs
$ cat /sys/bus/pci/devices/0000:04:00.0/aer_dev_correctable
RxErr 0
BadTLP 0
BadDLLP 3        <-- Warning: Some DLLP errors
Rollover 0
Timeout 0
NonFatalErr 0
CorrIntErr 0
HeaderOF 0
TOTAL_ERR_COR 3

$ cat /sys/bus/pci/devices/0000:04:00.0/aer_dev_fatal
TOTAL_ERR_FATAL 0    <-- Must be 0!

PCIe Link Degradation Detection

# Check if PCIe link is running at expected speed/width
$ lspci -vvv -s 0000:04:00.0 | grep -E "(LnkCap|LnkSta)"
LnkCap: Port #0, Speed 32GT/s, Width x4, ...   <-- Capability (what it CAN do)
LnkSta: Speed 32GT/s, Width x4, ...            <-- Status (what it IS doing)

# CRITICAL: If LnkSta doesn't match LnkCap, you have degradation!
# Example degraded link:
LnkCap: Port #0, Speed 32GT/s, Width x4
LnkSta: Speed 16GT/s, Width x2                 <-- DEGRADED! 4x slower

# Automated link health check script
import re
import subprocess

# Map lspci link speed (GT/s) to PCIe generation
PCIE_GEN = {2: 1, 5: 2, 8: 3, 16: 4, 32: 5, 64: 6}

def check_pcie_link_health(device):
    """Check if PCIe link is at full speed/width"""
    result = subprocess.run(
        ['lspci', '-vvv', '-s', device],
        capture_output=True, text=True
    )
    cap_match = re.search(r'LnkCap:.*Speed (\d+)GT/s.*Width x(\d+)', result.stdout)
    sta_match = re.search(r'LnkSta:.*Speed (\d+)GT/s.*Width x(\d+)', result.stdout)
    if cap_match and sta_match:
        cap_speed, cap_width = int(cap_match.group(1)), int(cap_match.group(2))
        sta_speed, sta_width = int(sta_match.group(1)), int(sta_match.group(2))
        degraded = sta_speed < cap_speed or sta_width < cap_width
        return {
            'device': device,
            'capability': f"Gen{PCIE_GEN.get(cap_speed, '?')} x{cap_width}",
            'status': f"Gen{PCIE_GEN.get(sta_speed, '?')} x{sta_width}",
            'degraded': degraded,
            'bandwidth_loss': 1 - (sta_speed * sta_width) / (cap_speed * cap_width)
        }
    return None

# Check all NVMe devices
for line in subprocess.getoutput('lspci | grep -i nvme').split('\n'):
    device = line.split()[0]
    health = check_pcie_link_health(device)
    if health and health['degraded']:
        print(f"{device}: DEGRADED - {health['status']} (should be {health['capability']})")
        print(f"  Bandwidth loss: {health['bandwidth_loss']*100:.0f}%")

PCIe Link Recovery

# Force PCIe link retrain (may recover degraded link)

# Method 1: Trigger secondary bus reset (disruptive)
$ echo 1 > /sys/bus/pci/devices/0000:00:03.0/reset
# WARNING: This will reset ALL devices under this bridge!

# Method 2: Remove and rescan (less disruptive)
$ echo 1 > /sys/bus/pci/devices/0000:04:00.0/remove
$ echo 1 > /sys/bus/pci/rescan

# Method 3: Use setpci to trigger retrain (least disruptive)
# Read Link Control register, set retrain bit
$ setpci -s 0000:00:03.0 CAP_EXP+10.w=0020

# Verify link retrained successfully
$ sleep 1 && lspci -vvv -s 0000:04:00.0 | grep LnkSta

# Common causes of link degradation:
# 1. Loose riser card or M.2 slot
# 2. Thermal issues causing signal integrity problems
# 3. Dust/debris in connector
# 4. Incompatible PCIe gen negotiation
# 5. Marginal power supply
🚨 AER Error Action Matrix:
Error Type                                | Severity  | Action
------------------------------------------|-----------|------------------------------
Correctable errors (low rate)             | Info      | Monitor, log
Correctable errors (high rate, >100/hour) | Warning   | Schedule maintenance
Completion Timeout                        | Critical  | Immediate investigation
Uncorrectable Non-Fatal                   | Critical  | Plan device replacement
Uncorrectable Fatal                       | Emergency | Device failed, evacuate data
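The matrix above can be expressed as a small policy function for automation. The sketch below takes counter values you would scrape from the `aer_dev_correctable`/`aer_dev_nonfatal`/`aer_dev_fatal` sysfs files shown earlier; the 100/hour threshold mirrors the table and is a tunable assumption.

```python
# The AER action matrix above as code: map counter state to (severity, action).
# Thresholds mirror the table; tune them for your fleet.
def aer_action(corr_per_hour: float, nonfatal: int, fatal: int,
               completion_timeouts: int = 0) -> tuple:
    """Return (severity, action) for the current AER counter state."""
    if fatal > 0:
        return ('EMERGENCY', 'Device failed, evacuate data')
    if nonfatal > 0:
        return ('CRITICAL', 'Plan device replacement')
    if completion_timeouts > 0:
        return ('CRITICAL', 'Immediate investigation')
    if corr_per_hour > 100:
        return ('WARNING', 'Schedule maintenance')
    if corr_per_hour > 0:
        return ('INFO', 'Monitor, log')
    return ('OK', 'No action')
```

Note the ordering: severity is evaluated worst-first, so a device with both fatal and correctable errors reports the fatal condition.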

Error Injection Testing for GPU-NVMe Systems

Before deploying to production, you MUST test your error handling code. The only way to know your checkpoint recovery works is to break things intentionally. Here's how to do it safely.

  Test Environment Only! Error injection can cause data loss. Never run on production systems. Use a dedicated test NVMe with no valuable data.

NVMe Fault Injection via Linux Kernel

# Enable NVMe fault injection (requires CONFIG_FAULT_INJECTION in kernel)
# Check if fault injection is available
$ ls /sys/kernel/debug/nvme*/fault_inject/ 2>/dev/null
/sys/kernel/debug/nvme0/fault_inject/
dont_retry  probability  space  status  task-filter  times  verbose

# Basic setup: Inject failures on 1% of commands
$ echo 1 > /sys/kernel/debug/nvme0/fault_inject/probability   # 1% chance
$ echo 100 > /sys/kernel/debug/nvme0/fault_inject/times       # Max 100 injections
$ echo 1 > /sys/kernel/debug/nvme0/fault_inject/verbose       # Log to dmesg
$ echo 0x1 > /sys/kernel/debug/nvme0/fault_inject/status      # Inject generic error

# NVMe Status Codes to inject (from NVMe spec):
# 0x00: Generic Command Failed
# 0x01: Invalid Command Opcode
# 0x02: Invalid Field in Command
# 0x80: Write Fault
# 0x81: Unrecovered Read Error
# 0x82: ECC Guard Check Error
# 0x83: Write Protection Error
# 0x85: Compare Failure

# Test specific error: Unrecovered Read Error (0x81)
$ echo 0x281 > /sys/kernel/debug/nvme0/fault_inject/status    # 0x2 = DNR bit, 0x81 = read error

# Run your test workload
$ fio --name=error_test --filename=/dev/nvme0n1p1 --direct=1 \
    --rw=randread --bs=4k --numjobs=4 --runtime=60 --time_based

# Check injected errors in dmesg
$ dmesg | grep -i "fault_inject\|nvme.*error"
[12345.678] nvme0: fault_inject: injecting status 0x281
[12345.679] nvme0n1: I/O error, dev nvme0n1, sector 12345678 op 0x0:(READ)
[12345.680] nvme0: fault_inject: injecting status 0x281

# Disable fault injection
$ echo 0 > /sys/kernel/debug/nvme0/fault_inject/times

Testing GPU Checkpoint Recovery

# Python test framework for GPU checkpoint error recovery
import os
import subprocess
import time

import torch

class CheckpointErrorInjector:
    """Test checkpoint save/load under simulated errors"""

    def __init__(self, nvme_device='nvme0'):
        self.fault_path = f'/sys/kernel/debug/{nvme_device}/fault_inject'
        self.verify_access()

    def verify_access(self):
        """Ensure we can access fault injection"""
        if not os.path.exists(self.fault_path):
            raise RuntimeError(
                "Fault injection not available. Check CONFIG_FAULT_INJECTION and debugfs mount"
            )

    def enable_write_errors(self, probability=5, max_errors=10):
        """Enable write fault injection for checkpoint save testing"""
        self._write_sysfs('probability', probability)
        self._write_sysfs('times', max_errors)
        self._write_sysfs('status', '0x280')  # Write Fault with DNR
        self._write_sysfs('verbose', '1')
        print(f"Write errors enabled: {probability}% chance, max {max_errors}")

    def enable_read_errors(self, probability=5, max_errors=10):
        """Enable read fault injection for checkpoint load testing"""
        self._write_sysfs('probability', probability)
        self._write_sysfs('times', max_errors)
        self._write_sysfs('status', '0x281')  # Read Error with DNR
        self._write_sysfs('verbose', '1')
        print(f"Read errors enabled: {probability}% chance, max {max_errors}")

    def disable(self):
        """Disable all fault injection"""
        self._write_sysfs('times', '0')
        print("Fault injection disabled")

    def _write_sysfs(self, name, value):
        path = f"{self.fault_path}/{name}"
        subprocess.run(['sh', '-c', f'echo {value} > {path}'], check=True)

def test_checkpoint_save_recovery(model, checkpoint_path):
    """Test that checkpoint save handles write errors gracefully"""
    injector = CheckpointErrorInjector()
    try:
        # Enable write errors at 10% rate
        injector.enable_write_errors(probability=10, max_errors=5)

        # Attempt checkpoint save with retries
        max_retries = 3
        for attempt in range(max_retries):
            try:
                temp_path = f"{checkpoint_path}.tmp"
                torch.save(model.state_dict(), temp_path)

                # Verify checkpoint is valid
                test_load = torch.load(temp_path, map_location='cpu')
                del test_load

                # Atomic rename
                os.rename(temp_path, checkpoint_path)
                print(f"Checkpoint saved successfully on attempt {attempt+1}")
                return True
            except (IOError, OSError) as e:
                print(f"Save attempt {attempt+1} failed: {e}")
                if os.path.exists(temp_path):
                    os.remove(temp_path)
                time.sleep(0.5)  # Brief pause before retry

        print("All save attempts failed - error handling WORKING correctly")
        return False
    finally:
        injector.disable()

def test_checkpoint_load_recovery(checkpoint_path, backup_path):
    """Test that checkpoint load handles read errors and falls back to backup"""
    injector = CheckpointErrorInjector()
    try:
        # Enable read errors at 50% rate (aggressive)
        injector.enable_read_errors(probability=50, max_errors=20)

        # Primary attempt
        try:
            state_dict = torch.load(checkpoint_path, map_location='cuda:0')
            print("Primary checkpoint loaded")
            return state_dict
        except (IOError, RuntimeError) as e:
            print(f"Primary load failed: {e}")

        # Fallback to backup
        try:
            state_dict = torch.load(backup_path, map_location='cuda:0')
            print("Backup checkpoint loaded - recovery WORKING")
            return state_dict
        except (IOError, RuntimeError) as e:
            print(f"Both primary and backup failed: {e}")
            raise
    finally:
        injector.disable()

PCIe Link Failure Simulation

# Test how your system handles sudden NVMe disconnection
# WARNING: This will crash your NVMe - only for dedicated test systems!

# Method 1: Hot remove via sysfs (safest)
$ echo 1 > /sys/bus/pci/devices/0000:04:00.0/remove
[ Expect: Device disappears, I/O errors on any pending operations ]

# Re-scan to bring it back
$ echo 1 > /sys/bus/pci/rescan

# Method 2: Force link down via setpci (more aggressive)
$ setpci -s 0000:04:00.0 COMMAND=0x0000   # Disable device
[ Expect: Controller timeout, dmesg full of errors ]

# Recovery
$ setpci -s 0000:04:00.0 COMMAND=0x0007   # Re-enable
$ echo 1 > /sys/bus/pci/devices/0000:04:00.0/reset

# Method 3: nvme-cli controller reset
$ nvme reset /dev/nvme0
[ Expect: Brief unavailability, pending I/Os may fail, device comes back ]

# Verify recovery
$ nvme smart-log /dev/nvme0 | head -5
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 45°C
available_spare : 100%
Error Injection Test Checklist:
Test                          | Expected Behavior                 | Pass If
------------------------------|-----------------------------------|------------------------------
Write error during checkpoint | Retry or fail gracefully          | No corruption, no crash
Read error during load        | Fall back to backup               | Loads older valid checkpoint
Device hot-remove             | I/O errors, graceful degradation  | System doesn't panic
Controller reset              | Brief unavailability              | Recovers within 30s
High error rate (50%)         | Performance degradation           | Correct data, slower
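The "recovers within 30s" checks above boil down to polling a health probe against a deadline. Here is a small, hedged helper for that pattern; the `probe` callable is injectable, so the same function works whether the probe shells out to `nvme smart-log` or just tests `os.path.exists('/dev/nvme0n1')`.

```python
import time

# Poll an injectable probe until it succeeds or the deadline passes.
# Used to assert "device recovers within N seconds" in the checklist above.
def wait_for_recovery(probe, timeout_s=30.0, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Return True if probe() becomes truthy within timeout_s seconds."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return bool(probe())  # one last check at the deadline
```

Passing `clock` and `sleep` as parameters keeps the helper unit-testable without real 30-second waits.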

2. Failure Mode Analysis & Recovery

🚨 WHEN (NOT IF) THINGS GO WRONG: Storage failures during multi-day training runs are inevitable. The difference between losing hours vs. days of work is how well you've prepared.

Common Failure Modes in AI Training

Failure Mode           | Symptoms                                      | Detection                          | Recovery                             | Prevention
-----------------------|-----------------------------------------------|------------------------------------|--------------------------------------|----------------------------------------
NVMe Timeout           | I/O hangs, dmesg errors, training stalls      | nvme_timeout in kernel log         | Controller reset, worst case reboot  | Proper timeout tuning, avoid power saving
Thermal Throttling     | Gradual performance drop, high latency spikes | SMART temp >70°C, reduced bandwidth| Improve cooling, reduce workload     | Heatsinks, airflow, monitor temps
PCIe Link Degradation  | 50-75% bandwidth loss                         | lspci shows x1 instead of x4       | Reseat SSD, replace riser            | Quality motherboard, check seating
Silent Data Corruption | Model produces NaN, checkpoint won't load     | Checksum verification fails        | Restore from verified backup         | Use enterprise SSD with E2E protection
RAID Array Degradation | One drive fails, array rebuilding             | mdadm status shows degraded        | Replace failed drive, rebuild        | RAID-5/6, hot spare, monitoring
Filesystem Corruption  | Mount fails, I/O errors                       | xfs_repair finds errors            | fsck/xfs_repair, restore from backup | Journaling, proper shutdown, UPS

Failure Detection & Auto-Recovery Scripts

#!/bin/bash
# storage_watchdog.sh - Continuous monitoring for AI training
# Run as: nohup ./storage_watchdog.sh &

LOG_FILE="/var/log/storage_watchdog.log"
ALERT_WEBHOOK="$SLACK_WEBHOOK_URL"
CHECK_INTERVAL=30  # seconds

send_alert() {
    local severity=$1
    local message=$2
    echo "$(date -Iseconds) [$severity] $message" >> $LOG_FILE
    curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"[$severity] Storage Alert: $message\"}" \
        $ALERT_WEBHOOK
}

check_nvme_health() {
    for dev in /dev/nvme*n1; do
        # Check for controller errors
        if dmesg | tail -100 | grep -q "nvme.*timeout\|nvme.*error"; then
            send_alert "CRITICAL" "NVMe timeout/error detected on $dev"
            
            # Attempt controller reset
            echo 1 > /sys/block/$(basename $dev)/device/reset_controller
            sleep 5
            
            if nvme smart-log $dev &>/dev/null; then
                send_alert "INFO" "Controller reset successful for $dev"
            else
                send_alert "CRITICAL" "Controller reset FAILED for $dev - manual intervention required"
            fi
        fi
        
        # Check temperature
        temp=$(nvme smart-log $dev -o json 2>/dev/null | jq '.temperature - 273')
        if (( temp > 75 )); then
            send_alert "WARNING" "High temperature on $dev: ${temp}°C"
        fi
        
        # Check for media errors
        media_errors=$(nvme smart-log $dev -o json 2>/dev/null | jq '.media_errors')
        if (( media_errors > 0 )); then
            send_alert "CRITICAL" "Media errors detected on $dev: $media_errors - REPLACE DRIVE"
        fi
    done
}

check_raid_health() {
    for md in /dev/md*; do
        if [[ -b $md ]]; then
            state=$(cat /sys/block/$(basename $md)/md/array_state 2>/dev/null)
            if [[ $state != "clean" && $state != "active" ]]; then
                send_alert "CRITICAL" "RAID array $md in state: $state"
            fi
        fi
    done
}

check_pcie_link() {
    for dev in /dev/nvme*; do
        pci_addr=$(readlink -f /sys/class/nvme/$(basename $dev)/device | grep -oP '[0-9a-f:]+\.[0-9]')
        if [[ -n $pci_addr ]]; then
            link_status=$(lspci -vvv -s $pci_addr 2>/dev/null | grep "LnkSta:")
            if echo $link_status | grep -qP "Width x[12],"; then
                send_alert "WARNING" "PCIe link degraded on $dev: $link_status"
            fi
        fi
    done
}

# Main loop
echo "Storage watchdog started at $(date)" >> $LOG_FILE
while true; do
    check_nvme_health
    check_raid_health
    check_pcie_link
    sleep $CHECK_INTERVAL
done

Checkpoint Integrity Verification

# checkpoint_integrity.py - Verify checkpoint files aren't corrupted
import hashlib
import json
from datetime import datetime
from pathlib import Path

import torch

def compute_checkpoint_hash(checkpoint_path: str) -> str:
    """Compute SHA256 hash of checkpoint file"""
    sha256 = hashlib.sha256()
    with open(checkpoint_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192 * 1024), b''):  # 8MB chunks
            sha256.update(chunk)
    return sha256.hexdigest()

def save_with_verification(state_dict: dict, path: str):
    """Save checkpoint with integrity metadata"""
    torch.save(state_dict, path)
    
    # Compute and save hash
    checksum = compute_checkpoint_hash(path)
    meta_path = path + '.meta.json'
    with open(meta_path, 'w') as f:
        json.dump({
            'sha256': checksum,
            'size_bytes': Path(path).stat().st_size,
            'timestamp': datetime.now().isoformat()
        }, f)
    
    # Verify immediately after save
    if compute_checkpoint_hash(path) != checksum:
        raise RuntimeError(f"Checkpoint verification failed immediately after save: {path}")

def load_with_verification(path: str) -> dict:
    """Load checkpoint with integrity check"""
    meta_path = path + '.meta.json'
    
    if Path(meta_path).exists():
        with open(meta_path) as f:
            meta = json.load(f)
        
        current_hash = compute_checkpoint_hash(path)
        if current_hash != meta['sha256']:
            raise RuntimeError(
                f"CHECKPOINT CORRUPTION DETECTED!\n"
                f"Expected: {meta['sha256']}\n"
                f"Got: {current_hash}"
            )
    
    return torch.load(path, weights_only=False)

# Usage in training loop:
# save_with_verification(model.state_dict(), '/checkpoints/step_10000.pt')
# state_dict = load_with_verification('/checkpoints/step_10000.pt')

3. Debugging & Tracing Tools

🎖️ Debug Like a Pro: When GPU training stalls and you don't know why, these tools are your friends. I've diagnosed more storage issues with blktrace and bpftrace than with any fancy monitoring dashboard. Low-level visibility wins.

nvme-cli Deep Dive Commands

# Essential nvme-cli commands for GPU-storage debugging

# 1. Full device identification
nvme id-ctrl /dev/nvme0 -H  # Human-readable controller info
# Key fields: MDTS (max data transfer size), SGL support, CMB support

# 2. Namespace capabilities
nvme id-ns /dev/nvme0n1 -H
# Check: LBAF (LBA formats), metadata support, deallocation support

# 3. Latency histogram (if supported)
nvme intel lat-stats /dev/nvme0 -w  # Write latency histogram
nvme intel lat-stats /dev/nvme0 -r  # Read latency histogram

# 4. Internal temperature sensors
nvme smart-log /dev/nvme0 -o json | jq '.temperature_sensor'

# 5. Command effects log (what each command does)
nvme effects-log /dev/nvme0

# 6. Feature settings
nvme get-feature /dev/nvme0 -f 0x01 -H  # Arbitration
nvme get-feature /dev/nvme0 -f 0x02 -H  # Power Management
nvme get-feature /dev/nvme0 -f 0x05 -H  # Temperature Threshold
nvme get-feature /dev/nvme0 -f 0x07 -H  # Number of Queues
nvme get-feature /dev/nvme0 -f 0x0c -H  # APST (power states)

# 7. Error log entries
nvme error-log /dev/nvme0 -e 64  # Last 64 errors

# 8. Self-test (background)
nvme device-self-test /dev/nvme0 -s 1  # Short test
nvme device-self-test /dev/nvme0 -s 2  # Extended test
nvme self-test-log /dev/nvme0         # Check results

blktrace for I/O Pattern Analysis

# blktrace captures block-level I/O events

# Install
sudo apt-get install blktrace

# Capture trace during training (10 seconds)
sudo blktrace -d /dev/nvme0n1 -w 10 -o trace

# Analyze with blkparse
blkparse -i trace -d trace.bin
# Output shows: timestamp, CPU, action, RWBS, start_sector, size

# Quick summary with btt
btt -i trace.bin -l trace.latency
# Shows: Q2Q (queue to queue), D2C (dispatch to complete)

# Identify read/write patterns
blkparse -i trace | awk '/R |W / {print $6, $7, $8}' | \
    sort -n | uniq -c | sort -rn | head -20
# Shows most common I/O sizes and offsets

# Detect sequential vs random
blkparse -i trace | awk '
    /R |W / {
        if (NR > 1 && $7 == prev_end) seq++;
        else rand++;
        prev_end = $7 + $8;
    }
    END { print "Sequential:", seq, "Random:", rand }'

# Live monitoring (one-liner)
sudo blktrace -d /dev/nvme0n1 -o - | blkparse -i - -o /dev/stdout | \
    awk '/R |W / {sum+=$8; count++} END {print "Avg I/O size:", sum/count*512, "bytes"}'
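The awk sequentiality check above translates directly to Python for offline analysis of a blkparse dump: an I/O counts as sequential if it starts at the sector where the previous one ended, and the first event counts as random, matching the awk logic.

```python
# Sequential-vs-random classification of block I/O events, mirroring the
# blkparse awk one-liner above.
def classify_io(events):
    """events: iterable of (start_sector, num_sectors). Returns (seq, rand) counts."""
    seq = rand = 0
    prev_end = None
    for sector, nsectors in events:
        if prev_end is not None and sector == prev_end:
            seq += 1
        else:
            rand += 1  # first event, or a jump in the sector stream
        prev_end = sector + nsectors
    return seq, rand
```

Feed it (sector, size) pairs parsed from `blkparse` output; a high sequential ratio during dataloading usually means readahead is doing its job.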

bpftrace for Deep Kernel Analysis

# bpftrace: Dynamic tracing for storage debugging

# Install
sudo apt-get install bpftrace

# 1. NVMe command latency distribution
sudo bpftrace -e '
kprobe:nvme_queue_rq { @start[arg0] = nsecs; }
kretprobe:nvme_queue_rq /@start[arg0]/ {
    @latency_us = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}'

# 2. Track large I/O requests (>1MB)
sudo bpftrace -e '
tracepoint:block:block_rq_issue
/args->bytes > 1048576/ {
    printf("%s: %s %d bytes at %lld\n", 
           comm, args->rwbs, args->bytes, args->sector);
}'

# 3. Find which process is doing I/O
sudo bpftrace -e '
tracepoint:block:block_rq_complete {
    @io_by_proc[comm] = sum(args->nr_sector * 512);
}'

# 4. Detect I/O stalls (>10ms latency)
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @start[args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->sector]/ {
    $lat = (nsecs - @start[args->sector]) / 1000000;
    if ($lat > 10) {
        printf("SLOW I/O: %d ms, sector %lld\n", $lat, args->sector);
    }
    delete(@start[args->sector]);
}'

# 5. GDS (cuFile) call tracing
sudo bpftrace -e '
uprobe:/usr/lib/x86_64-linux-gnu/libcufile.so:cuFileRead {
    @reads = count();
    @read_start[tid] = nsecs;
}
uretprobe:/usr/lib/x86_64-linux-gnu/libcufile.so:cuFileRead /@read_start[tid]/ {
    @read_latency = hist((nsecs - @read_start[tid]) / 1000);
    delete(@read_start[tid]);
}'

iostat for Quick Health Checks

# iostat: Quick I/O performance overview

# Extended stats, 1-second intervals
iostat -xz 1 /dev/nvme0n1
# Key columns:
# r/s, w/s:     IOPS
# rMB/s, wMB/s: Throughput
# r_await, w_await: Latency in ms (should be <1 for NVMe)
# aqu-sz:       Queue depth
# %util:        Utilization (misleading for NVMe, ignore)

# Alert on high latency
iostat -xz 1 | awk '/nvme/ && $10 > 5 {print "HIGH LATENCY:", $0}'

# JSON output for programmatic parsing
iostat -xzo JSON 1 1 | jq '.sysstat.hosts[0].statistics[0].disk'
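The disk list extracted by the jq command above can be screened programmatically. The sketch below assumes the key names (`disk_device`, `r_await`, `w_await`) emitted by recent sysstat releases; check yours against the actual JSON before relying on it.

```python
# Flag high-latency devices from iostat's JSON disk list.
# Key names assume recent sysstat output; verify with the jq command above.
def high_latency_disks(disks, threshold_ms=1.0):
    """Return [(device, r_await, w_await)] for disks above threshold_ms."""
    flagged = []
    for d in disks:
        r, w = d.get('r_await', 0.0), d.get('w_await', 0.0)
        if r > threshold_ms or w > threshold_ms:
            flagged.append((d.get('disk_device', '?'), r, w))
    return flagged
```

The 1 ms default matches the "should be <1 for NVMe" rule of thumb noted above.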

fio Diagnostic Workloads

# fio: Generate controlled workloads for diagnosis

# 1. Baseline sequential read (should match SSD spec)
fio --name=seq_read --filename=/dev/nvme0n1 --direct=1 \
    --rw=read --bs=128k --iodepth=32 --numjobs=4 \
    --runtime=30 --group_reporting

# 2. 4K random read (tests IOPS)
fio --name=rand_read --filename=/dev/nvme0n1 --direct=1 \
    --rw=randread --bs=4k --iodepth=256 --numjobs=4 \
    --runtime=30 --group_reporting

# 3. Checkpoint simulation (large sequential writes)
fio --name=checkpoint --filename=/mnt/nvme/test --direct=1 \
    --rw=write --bs=1m --iodepth=8 --numjobs=1 \
    --size=10G --runtime=60 --group_reporting

# 4. Mixed training workload simulation
fio --name=training --filename=/mnt/nvme/test --direct=1 \
    --rw=randrw --rwmixread=90 --bs=64k --iodepth=16 \
    --numjobs=8 --runtime=60 --group_reporting \
    --lat_percentiles=1 --percentile_list=50:90:99:99.9

# 5. GDS-like I/O pattern (io_uring, large blocks)
fio --name=gds_like --filename=/dev/nvme0n1 --direct=1 \
    --ioengine=io_uring --rw=read --bs=1m --iodepth=64 \
    --numjobs=4 --runtime=30 --group_reporting
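All of these jobs can also be run with `--output-format=json` so results are comparable across runs. The parser below assumes fio's JSON layout (`jobs[].read.bw` in KiB/s, `clat_ns` percentiles keyed like `"99.000000"`), which matches recent fio releases; verify against your version's output.

```python
# Extract headline numbers from a parsed `fio --output-format=json` result.
# Field layout assumed from recent fio releases; verify locally.
def summarize_fio(result: dict) -> dict:
    """Summarize the first job: bandwidth, IOPS, and P99 latency per direction."""
    job = result['jobs'][0]
    out = {}
    for direction in ('read', 'write'):
        d = job.get(direction, {})
        if d.get('iops', 0) or d.get('bw', 0):
            pcts = d.get('clat_ns', {}).get('percentile', {})
            out[direction] = {
                'bw_MBps': d['bw'] / 1024,                     # fio reports KiB/s
                'iops': d['iops'],
                'p99_lat_us': pcts.get('99.000000', 0) / 1000,  # ns -> µs
            }
    return out
```

Storing these summaries per run makes regressions (thermal throttling, link degradation) show up as simple numeric diffs instead of eyeballed output.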

Real fio Output: What Good Looks Like

Here's actual output from a healthy enterprise NVMe SSD (Samsung PM9A3 7.68TB) - use these as baselines:

# ============================================
# Sequential Read Baseline (should see 6.5+ GB/s)
# ============================================
$ fio --name=seq_read --filename=/dev/nvme0n1 --direct=1 \
    --rw=read --bs=128k --iodepth=32 --numjobs=4 --runtime=30

seq_read: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB
  cpu          : usr=2.31%, sys=18.42%, ctx=1623847, majf=0, minf=524
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1594832,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=6847MiB/s (7179MB/s), 1711MiB/s-1712MiB/s (1795MB/s-1795MB/s), io=200GiB (215GB), run=30001-30001msec

#   GOOD: 6.8 GB/s sequential read (close to spec 7.0 GB/s)
#   BAD:  <5 GB/s would indicate thermal throttling or link issues
# ============================================
# 4K Random Read IOPS (should see 1M+ IOPS)
# ============================================
$ fio --name=rand_read --filename=/dev/nvme0n1 --direct=1 \
    --rw=randread --bs=4k --iodepth=256 --numjobs=4 --runtime=30 \
    --lat_percentiles=1

rand_read: (groupid=0, jobs=4): err= 0: pid=12345: Sat Dec 21 10:30:45 2024
  read: IOPS=1052k, BW=4109MiB/s (4309MB/s)(120GiB/30001msec)
    slat (nsec): min=1203, max=89432, avg=2847.32, stdev=1023.11
    clat (usec): min=48, max=2847, avg=92.31, stdev=24.87
     lat (usec): min=51, max=2853, avg=95.16, stdev=25.12
    clat percentiles (usec):
     |  1.00th=[   62],  5.00th=[   68], 10.00th=[   72], 20.00th=[   77],
     | 30.00th=[   81], 40.00th=[   85], 50.00th=[   89], 60.00th=[   93],
     | 70.00th=[   98], 80.00th=[  105], 90.00th=[  118], 95.00th=[  133],
     | 99.00th=[  174], 99.50th=[  200], 99.90th=[  302], 99.95th=[  383],
     | 99.99th=[  783]
   bw (  MiB/s): min= 3987, max= 4234, per=100.00%, avg=4109.42, stdev=23.14
   iops        : min=1020832, max=1083904, avg=1052012.34, stdev=5923.41
  lat (usec)   : 50=0.01%, 100=72.31%, 250=27.52%, 500=0.15%, 750=0.01%
  lat (usec)   : 1000=0.01%

#   GOOD: 1.05M IOPS, 92µs average latency, P99=174µs
#   BAD:  <800K IOPS or P99 >500µs indicates issues
# ============================================
# Checkpoint Write Pattern (sustained writes)
# ============================================
$ fio --name=checkpoint --filename=/mnt/nvme/ckpt_test --direct=1 \
    --rw=write --bs=1m --iodepth=8 --numjobs=4 --size=50G --runtime=60 \
    --lat_percentiles=1

checkpoint: (groupid=0, jobs=4): err= 0
  write: IOPS=4823, BW=4823MiB/s (5057MB/s)(200GiB/42457msec); 0 zone resets
    slat (usec): min=18, max=1234, avg=28.43, stdev=12.34
    clat (usec): min=312, max=18432, avg=1623.21, stdev=834.12
     lat (usec): min=342, max=18523, avg=1651.64, stdev=839.23
    clat percentiles (usec):
     |  1.00th=[  603],  5.00th=[  734], 10.00th=[  832], 20.00th=[ 1012],
     | 30.00th=[ 1172], 40.00th=[ 1336], 50.00th=[ 1500], 60.00th=[ 1680],
     | 70.00th=[ 1876], 80.00th=[ 2147], 90.00th=[ 2638], 95.00th=[ 3163],
     | 99.00th=[ 4621], 99.50th=[ 6587], 99.90th=[10945], 99.95th=[13698]

#   NOTE: Sustained write latency increases over time due to GC
#   GOOD: 4.8 GB/s sustained, P99 ~4.6ms
#   BAD:  P99 >20ms indicates GC pressure - check WAF!
# ============================================
# Mixed Read/Write (AI Training Simulation)
# ============================================
$ fio --name=training --filename=/mnt/nvme/train_test --direct=1 \
    --rw=randrw --rwmixread=90 --bs=64k --iodepth=16 --numjobs=8 \
    --runtime=60 --group_reporting --lat_percentiles=1

training: (groupid=0, jobs=8): err= 0
  read: IOPS=48234, BW=2949MiB/s (3092MB/s)(173GiB/60001msec)
    clat (usec): min=89, max=8934, avg=245.32, stdev=123.45
    clat percentiles (usec):
     | 50.00th=[  223], 90.00th=[  359], 95.00th=[  445], 99.00th=[  701]
  write: IOPS=5359, BW=327MiB/s (343MB/s)(19.2GiB/60001msec); 0 zone resets
    clat (usec): min=234, max=15234, avg=834.21, stdev=345.67
    clat percentiles (usec):
     | 50.00th=[  734], 90.00th=[ 1254], 95.00th=[ 1614], 99.00th=[ 2868]

#   GOOD: Read ~3 GB/s, Write 327 MB/s (90/10 mix)
#   GOOD: Read latency P99 <1ms during mixed workload
# Matches typical training batch loading pattern
# ============================================
# io_uring with GDS-like Pattern
# ============================================
$ fio --name=gds_like --filename=/dev/nvme0n1 --direct=1 \
    --ioengine=io_uring --hipri=1 --sqthread_poll=1 \
    --rw=read --bs=1m --iodepth=64 --numjobs=4 --runtime=30

gds_like: (groupid=0, jobs=4): err= 0
  read: IOPS=6234, BW=6234MiB/s (6537MB/s)(182GiB/30001msec)
    slat (nsec): min=934, max=23421, avg=1823.12, stdev=423.34
    clat (usec): min=187, max=12345, avg=398.23, stdev=187.34
  cpu          : usr=1.23%, sys=8.45%, ctx=23456, majf=0, minf=1284
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=12.3%, >=64=86.9%

#   GOOD: 6.2 GB/s with io_uring SQPOLL
#   GOOD: Low CPU overhead (sys=8.45%)
#   GOOD: Submission latency ~1.8µs (slat)
# This is close to what GDS achieves for pure NVMe
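A quick way to judge results like these is against the link's theoretical ceiling. PCIe Gen3 and later use 128b/130b encoding, so usable bandwidth is roughly 98.5% of the line rate per lane, minus further protocol overhead; the helper below computes that ceiling, which is why ~6.8-7.0 GB/s is healthy for a Gen4 x4 drive.

```python
# Approximate usable PCIe bandwidth (data-link layer, before protocol overhead).
PCIE_GTS = {3: 8, 4: 16, 5: 32}  # line rate in GT/s per lane

def pcie_max_gbps(gen: int, width: int) -> float:
    """Usable GB/s for a PCIe link: GT/s * 128/130 encoding / 8 bits, per lane."""
    return PCIE_GTS[gen] * (128 / 130) / 8 * width

print(f"Gen4 x4: {pcie_max_gbps(4, 4):.2f} GB/s")  # 7.88 GB/s
```

If measured sequential-read bandwidth is far below this ceiling and the SSD spec, suspect the link (width/speed downgrade) before suspecting the drive.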

What Bad Output Looks Like (Red Flags)

# ============================================
# RED FLAG: Thermal Throttling
# ============================================
$ fio --name=sustained_write --filename=/dev/nvme0n1 --direct=1 \
    --rw=write --bs=1m --iodepth=32 --numjobs=4 --runtime=300

# Output showing throttling:
  write: IOPS=2134, BW=2134MiB/s (2238MB/s)   # Started at 5 GB/s!
    clat (msec): min=2, max=234, avg=58.23  # Huge latency spike!
    clat percentiles (msec):
     | 99.00th=[178], 99.50th=[201]             # P99 >100ms is bad

# Check temperature during test:
$ nvme smart-log /dev/nvme0 | grep -i temp
temperature                         : 78°C    # Throttle threshold!
Warning Temperature Time            : 234     # 234 minutes above warning threshold

# FIX: Improve cooling, reduce write intensity, add thermal pads
# ============================================
# RED FLAG: PCIe Link Degradation
# ============================================
$ fio --name=seq_read --filename=/dev/nvme0n1 --direct=1 \
    --rw=read --bs=128k --iodepth=32 --numjobs=4 --runtime=30

# Suspiciously low bandwidth:
   READ: bw=1623MiB/s (1702MB/s)   # Should be 6+ GB/s!

# Check link status:
$ lspci -vvv -s 01:00.0 | grep -E "LnkCap|LnkSta"
LnkCap: Port #0, Speed 16GT/s (Gen4), Width x4
LnkSta: Speed 16GT/s (ok), Width x1 (downgraded from x4)   # PROBLEM!

# FIX: Reseat SSD, check slot, replace riser card
$ echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
$ echo 1 > /sys/bus/pci/rescan
⚡ Debug Decision Tree:
  • High latency spikes? → blktrace + btt for latency breakdown
  • Low throughput? → fio baseline to check SSD health
  • Random vs sequential? → blkparse pattern analysis
  • Which process? → bpftrace by-process I/O
  • GDS not working? → uprobe cuFile functions
  • Power state issues? → nvme get-feature 0x0c
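When bpftrace isn't available, per-process I/O can still be attributed from `/proc/<pid>/io` with nothing but the standard library. This is a minimal sketch: the field names (`read_bytes`, `write_bytes`) are the kernel's, but the helper names `parse_proc_io` and `io_delta` are ours, and it assumes Linux plus permission to read the target pid.

```python
def parse_proc_io(text: str) -> dict:
    """Parse the 'key: value' lines of /proc/<pid>/io into ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip().isdigit():
            stats[key.strip()] = int(value.strip())
    return stats

def io_delta(pid: int, interval: float = 1.0) -> dict:
    """Sample actual storage I/O (read_bytes/write_bytes) over an interval."""
    import time
    path = f"/proc/{pid}/io"
    with open(path) as f:
        before = parse_proc_io(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_io(f.read())
    return {k: after[k] - before[k] for k in before}

if __name__ == "__main__":
    import os
    if os.path.exists(f"/proc/{os.getpid()}/io"):
        # Self-sample: deltas over 100ms for this process
        print(io_delta(os.getpid(), interval=0.1))
```

Note that `read_bytes`/`write_bytes` count actual block-device traffic, while `rchar`/`wchar` include page-cache hits — the gap between them is itself a useful caching diagnostic.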

4. Benchmarking Methodology

🚨 fio Alone Isn't Enough for GPU Storage: Traditional storage benchmarks (fio, dd) measure the CPU-to-storage path. They don't exercise the GPU-to-storage path, which is what matters for AI workloads. You need GPU-aware tools as well.

GPU Storage Benchmarking Tools

gdsio (NVIDIA)

Official

NVIDIA's official GDS benchmark tool. Measures actual GPU-to-storage throughput via cuFile.

  • Multi-GPU, multi-file testing
  • Read/write/mixed patterns
  • Block size variations
  • Reports GPU-side throughput

DALI Data Loading

Real Workload

NVIDIA Data Loading Library. Benchmark with actual training data pipeline — the most realistic test.

  • Image/video decode + load
  • Prefetching behavior
  • Real augmentation pipeline
  • Per-epoch throughput

kvikIO Benchmark

RAPIDS

kvikIO's built-in benchmarking for Python/RAPIDS workflows. Tests Zarr, Parquet, cuDF loading.

  • Python-native
  • cuDF integration
  • Zarr/Parquet formats
  • Multi-threaded loading

Custom Micro-benchmark

Essential

Write your own to match your exact workload pattern. Generic benchmarks don't capture your access patterns.

  • Your I/O sizes
  • Your access patterns
  • Your concurrency
  • Your GPU count

gdsio Usage

# Basic GDS throughput test
$ gdsio -f /mnt/nvme/testfile -d 0 -s 10G -i 1M -x 0 -I 1

GPU 0: Tesla V100-SXM2-32GB
File: /mnt/nvme/testfile
Transfer size: 1048576
Read throughput: 12.3 GB/s
Read latency avg: 85 µs, p99: 120 µs

# Multi-GPU test
$ gdsio -f /mnt/nvme/testfile -d 0,1,2,3 -s 40G -i 4M -x 0 -I 1 -T 4

# Mixed read/write test
$ gdsio -f /mnt/nvme/testfile -d 0 -s 10G -i 1M -x 2 -w 30 -I 1
# -x 2: random I/O
# -w 30: 30% writes, 70% reads

Metrics to Capture

Metric What It Tells You Target (8× NVMe + 8× GPU)
GPU-side throughput Actual data rate to GPU memory >50 GB/s aggregate
P99 latency Worst-case latency (tail) <500 µs for training
GPU utilization during load Is storage keeping GPU fed? >95% during compute phases
CPU utilization Is CPU becoming bottleneck? <20% for storage I/O
PCIe bandwidth Link saturation Check for contention
# Monitor during benchmark

# GPU utilization and memory
$ nvidia-smi dmon -s u -d 1

# NVMe device stats
$ watch -n1 "nvme smart-log /dev/nvme0 | grep -E 'host_read|host_write'"

# PCIe bandwidth (requires perf)
$ perf stat -e 'uncore_iio_*/event=0x83,umask=0x04/' -a sleep 10

# CPU utilization breakdown
$ mpstat -P ALL 1

5. Reproducible Benchmarking Methodology

⚠️ STOP GUESSING: "It feels faster" is not a benchmark. Here are reproducible scripts that give you real numbers you can compare across configurations.

Standardized GDS Benchmark Suite

#!/bin/bash
# gds_benchmark_suite.sh - Comprehensive GDS performance testing
# Produces comparable results across different systems

OUTPUT_DIR="./benchmark_results_$(date +%Y%m%d_%H%M%S)"
mkdir -p $OUTPUT_DIR

# System info
echo "=== System Configuration ===" | tee $OUTPUT_DIR/system_info.txt
nvidia-smi --query-gpu=name,memory.total,pcie.link.gen.current,pcie.link.width.current --format=csv >> $OUTPUT_DIR/system_info.txt
nvme list >> $OUTPUT_DIR/system_info.txt
lscpu | grep -E "Model name|Socket|Core|Thread" >> $OUTPUT_DIR/system_info.txt

# Test parameters
TEST_FILE="/mnt/nvme/gds_test_file"
TEST_SIZES=("1G" "10G" "100G")
BLOCK_SIZES=("64K" "256K" "1M" "4M")
IO_DEPTHS=(1 4 16 64)

# Pre-test: Warm up drive to steady-state
echo "Warming up drive..."
fio --name=warmup --filename=$TEST_FILE --direct=1 --rw=write \
    --bs=1M --iodepth=64 --numjobs=4 --size=50G --runtime=60 --time_based

# Test 1: Sequential Read (GDS enabled)
echo "=== Sequential Read (GDS) ===" | tee -a $OUTPUT_DIR/results.txt
for bs in "${BLOCK_SIZES[@]}"; do
    echo "Block size: $bs"
    /usr/local/cuda/gds/tools/gdsio \
        -f $TEST_FILE -d 0 -w 4 -s 10G -i $bs -x 0 -I 1 -T 30 \
        2>&1 | tee -a $OUTPUT_DIR/results.txt
done

# Test 2: Sequential Read (GDS disabled, baseline)
echo "=== Sequential Read (NO GDS - baseline) ===" | tee -a $OUTPUT_DIR/results.txt
export CUFILE_ENV_PATH_JSON=/dev/null   # fio uses the POSIX path anyway; this just ensures no cuFile config is picked up
for bs in "${BLOCK_SIZES[@]}"; do
    fio --name=seq_read_nogds --filename=$TEST_FILE --direct=1 \
        --rw=read --bs=$bs --iodepth=64 --numjobs=4 --runtime=30 \
        --output-format=json | jq '.jobs[0].read.bw_bytes' >> $OUTPUT_DIR/results.txt
done
unset CUFILE_ENV_PATH_JSON

# Test 3: Random Read (worst case for GDS)
echo "=== Random 4K Read (GDS) ===" | tee -a $OUTPUT_DIR/results.txt
/usr/local/cuda/gds/tools/gdsio \
    -f $TEST_FILE -d 0 -w 4 -s 10G -i 4K -x 1 -I 1 -T 30 \
    2>&1 | tee -a $OUTPUT_DIR/results.txt

# Test 4: Checkpoint-like write pattern
echo "=== Checkpoint Write Pattern ===" | tee -a $OUTPUT_DIR/results.txt
for size in "${TEST_SIZES[@]}"; do
    /usr/local/cuda/gds/tools/gdsio \
        -f $TEST_FILE -d 0 -w 4 -s $size -i 4M -x 0 -I 0 -T 60 \
        2>&1 | tee -a $OUTPUT_DIR/results.txt
done

# Generate summary
echo "=== SUMMARY ===" | tee -a $OUTPUT_DIR/results.txt
echo "Results saved to: $OUTPUT_DIR"

Expected Results Reference Table

Configuration Seq Read (1M) Seq Write (1M) Rand 4K IOPS GDS Speedup
1x Gen4 NVMe (7GB/s) 6.5-7.0 GB/s 5.5-6.0 GB/s 800K-1M 1.3-1.5x
1x Gen5 NVMe (14GB/s) 12-14 GB/s 10-12 GB/s 1.5-2M 1.4-1.6x
4x Gen4 RAID-0 24-28 GB/s 20-24 GB/s 3-4M 1.5-1.8x
8x Gen5 RAID-0 80-100 GB/s 60-80 GB/s 8-12M 1.6-2.0x
💡 If Your Results Don't Match:
  • <50% of expected: Check PCIe link width, NUMA placement, power states
  • 50-80% of expected: Check filesystem alignment, queue depth settings
  • 80-100% of expected: Normal - you're in the right ballpark
  • >100% of expected: Caching effects - increase test size/duration
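The sanity buckets above can be automated against fio's JSON output. A hedged sketch: the thresholds mirror the bullets, but the function names (`classify_result`, `fio_json_read_gbps`) are ours, not part of fio or gdsio; only the `jobs[0].read.bw_bytes` JSON path is fio's.

```python
def classify_result(measured_gbps: float, expected_gbps: float) -> str:
    """Map a measured/expected bandwidth ratio to the triage buckets above."""
    ratio = measured_gbps / expected_gbps
    if ratio < 0.5:
        return "check PCIe link width, NUMA placement, power states"
    if ratio < 0.8:
        return "check filesystem alignment, queue depth settings"
    if ratio <= 1.0:
        return "normal - right ballpark"
    return "caching effects - increase test size/duration"

def fio_json_read_gbps(fio_json: dict) -> float:
    """Extract read bandwidth (GB/s) from fio --output-format=json data."""
    return fio_json["jobs"][0]["read"]["bw_bytes"] / 1e9

if __name__ == "__main__":
    # e.g. a Gen4 drive expected at 6.5 GB/s measuring only 3.2 GB/s
    print(classify_result(3.2, 6.5))
```

Wiring this into the benchmark script's summary step turns "results.txt" into an automatic pass/fail report per configuration.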

6. Benchmark Performance Ranges (Illustrative)

  Important: Numbers in this section are illustrative ranges based on typical performance characteristics, not direct citations of official results. For certified MLPerf Storage results, see the official MLCommons results.

MLPerf Storage Benchmark Results

MLPerf Storage is the industry standard for AI storage benchmarking. Official results: mlcommons.org/benchmarks/storage.

  Illustrative Ranges (Not Direct Citations): The numbers below represent typical performance ranges synthesized from published MLPerf Storage v0.5/v1.0 submissions. For specific submission data, consult the MLCommons GitHub repository. Example vendors with public submissions: NVIDIA, DDN, Weka, VAST Data, Pure Storage. Results vary 2-3× based on tuning, dataset placement, and software versions.
Configuration Accelerators Storage UNET3D (typical range) BERT (typical range)
DGX-class + NVMe + GDS 8x H100/A100 Local NVMe RAID 12,000-18,000 900K-1.3M
Parallel FS + GDS 8x A100 Lustre/GPFS/WekaFS 10,000-15,000 800K-1.1M
Enterprise NAS 8x A100 NFS over RDMA 7,000-12,000 600K-900K
Key Insight: GDS-enabled configurations consistently achieve 20-40% higher throughput than traditional NFS/CIFS paths. The gap widens with larger batch sizes where CPU becomes the bottleneck in non-GDS configurations.

gdsio Micro-Benchmark Results

Illustrative results from gdsio on single-GPU + enterprise NVMe configurations:

Methodology Note: Results below are illustrative of typical GDS vs POSIX performance ratios. Your results will vary based on: SSD model/generation, PCIe configuration, GPU model, filesystem, kernel version, and GDS version. Always benchmark your specific configuration.
Block Size Read BW (GDS) Read BW (POSIX) GDS Speedup CPU Savings
4 KB 0.8 GB/s 0.6 GB/s 1.3x 45%
64 KB 4.2 GB/s 2.8 GB/s 1.5x 62%
1 MB 6.8 GB/s 4.1 GB/s 1.7x 78%
4 MB 6.9 GB/s 4.3 GB/s 1.6x 82%
16 MB 6.9 GB/s 4.5 GB/s 1.5x 85%
# Reproduce these benchmarks on your hardware
$ gdsio -f /mnt/nvme/testfile -d 0 -w 4 -s 16G -x 0 -I 1 -i 1M -D

# Output interpretation:
# IoType: READ, Threads: 4, DataSetSize: 17179869184
# IOSize: 1048576, Bandwidth: 6.847 GiB/s, IOPS: 7012.35
# Avg-Latency: 569.23us, 99%-Latency: 1247.89us
Source Note: Numbers above are illustrative, synthesized from NVIDIA GDS documentation, industry benchmarks and vendor documentation. Your mileage will vary—always benchmark your specific configuration. For reproducibility, use the gdsio command above with your hardware.

DALI Data Pipeline Benchmark

ImageNet training data loading with NVIDIA DALI:

Configuration Images/sec GPU Util CPU Util Notes
PyTorch DataLoader (CPU decode) 2,450 67% 95% CPU-bound decoding
DALI (GPU decode, POSIX) 8,900 82% 35% Bounce buffer overhead
DALI (GPU decode, GDS) 12,400 97% 8% Direct GPU path
DALI (GDS + nvJPEG2000) 14,200 98% 5% Hardware JPEG decode

Latency Distribution (P50/P99/P99.9)

Note: Latency values below are typical ranges observed in benchmarking. Actual latency depends heavily on: I/O size, queue depth, SSD model, PCIe generation, network configuration (for NVMe-oF), and background activities (GC, wear leveling).
Path P50 Latency P99 Latency P99.9 Latency Jitter
GDS (local NVMe) 70-100 µs 150-250 µs 300-600 µs Low
POSIX (local NVMe) 100-150 µs 250-500 µs 0.8-2 ms Medium
NVMe-oF RDMA 90-150 µs 200-400 µs 0.5-1.2 ms Low
NVMe-oF TCP 150-250 µs 400-800 µs 1-4 ms High
NFS v4.1 300-800 µs 1-5 ms 5-20 ms Very High
Production Guidance: For latency-sensitive inference workloads, target P99 < 500µs. This typically requires local NVMe with GDS or NVMe-oF RDMA. TCP-based protocols struggle to meet this SLA under load.
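Checking a latency trace against that P99 target is a one-liner once you have a percentile function. A minimal stdlib sketch using nearest-rank percentiles (the helper names `percentile` and `meets_inference_sla` are ours):

```python
import math

def percentile(samples_us, pct):
    """Nearest-rank percentile: pct in (0, 100], samples in microseconds."""
    s = sorted(samples_us)
    rank = max(1, math.ceil(pct / 100 * len(s)))
    return s[rank - 1]

def meets_inference_sla(samples_us, p99_target_us=500):
    return percentile(samples_us, 99) < p99_target_us

# 1,000 fast I/Os plus 5 GC-style outliers: P99 looks clean,
# but P99.9 catches the 5ms spikes.
samples = [90] * 1000 + [5000] * 5
```

This is also why the next section insists on P99.9: in this synthetic trace `percentile(samples, 99)` is 90 µs while `percentile(samples, 99.9)` is 5,000 µs — the SLA check passes even though outliers exist.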

  Tail Latency Deep Dive: Why P99.9 Kills GPU Pipelines

Critical Insight: GPU tensor operations process batches. A single slow I/O stalls an entire batch—potentially hundreds of samples. One 10ms outlier in a batch of 256 samples means 256× the expected latency impact. Tail latency is not a statistics problem; it's a batch-blocking problem.

📊 Batch Amplification: Why P99.9 Matters More Than P50

Batch Completion Time Analysis (Real Math)
# Scenario: LLM inference loading KV-cache pages from NVMe
# Batch size = 256 samples, each needs 1 I/O operation

SSD Latency Profile (measured, not marketing):
  P50:   80 µs    # Median case
  P99:   400 µs   # 1 in 100
  P99.9: 5,000 µs # 1 in 1000 (GC event)

Expected tail in batch of 256 I/Os:
  - P(no outlier > P99.9) = (0.999)^256 = 77.4%
  - P(at least one P99.9 outlier) = 22.6%
  
Effective batch latency:
  Without outliers: ~80 µs (P50 dominates)
  With 1 outlier:   5,000 µs (batch waits for slowest)
  
Weighted average batch latency:
  (0.774 × 80) + (0.226 × 5000) ≈ 1,192 µs
  
  Naive expectation (P50): 80 µs
  Reality with batch blocking: 14.9× worse

# This is why inference serving has SLO violations!
# Your "fast" SSD with 80µs P50 acts like a 1.2ms SSD in batch mode
                
🎯 Production Rule: For batch sizes > 100, design for P99.9 not P50. Either (a) reduce GC spikes via over-provisioning + FDP, or (b) use async I/O with timeout + retry to prevent one slow I/O from blocking the batch.
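The batch-blocking arithmetic above can be rerun for any latency profile. A small sketch of the same two-point model (every I/O lands at either P50 or the P99.9 outlier value; the function name is ours):

```python
def batch_blocking_latency_us(p50_us, p999_us, batch_size, p_outlier_per_io=0.001):
    """Effective batch latency when the whole batch waits for its slowest I/O.

    Two-point model: each I/O is either 'clean' (P50) or a P99.9 outlier.
    Returns (expected batch latency in µs, probability of >=1 outlier).
    """
    p_clean = (1 - p_outlier_per_io) ** batch_size  # no outlier in the batch
    p_hit = 1 - p_clean                             # at least one outlier
    return p_clean * p50_us + p_hit * p999_us, p_hit

if __name__ == "__main__":
    eff, p_hit = batch_blocking_latency_us(80, 5000, 256)
    print(f"P(outlier in batch) = {p_hit:.1%}")       # ~22.6%
    print(f"Effective batch latency = {eff:.0f} µs")  # ~1,192 µs, ~14.9x the P50
```

Sweeping `batch_size` shows the knee: at batch 32 the outlier probability is ~3%, at 256 it is ~23%, at 1024 it is ~64% — which is exactly why the rule above keys off batch sizes above ~100.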

Root Causes of Tail Latency Spikes

Cause Typical Impact Frequency Mitigation
Garbage Collection (GC) 5-50 ms spikes Depends on write volume, can be every few seconds under heavy writes Over-provision (20-30%), use high-endurance SSDs, FDP/Streams
Wear Leveling 1-10 ms spikes Background, increases with SSD age Monitor SSD health, replace at 80% life
Thermal Throttling 2-5× latency increase (sustained) Under heavy load, poor airflow Proper cooling, workload spreading, temperature monitoring
Read Disturb / Refresh 0.5-3 ms spikes Every ~100K reads to same block SSD firmware handles; mostly invisible
Controller Firmware Variable Firmware-dependent Keep firmware updated, benchmark before deployment
Queue Depth Saturation Queuing delays compound Under high concurrency Limit QD per drive, distribute across drives

Queue Depth vs Tail Latency Trade-off

Higher queue depth increases throughput but degrades tail latency. For GPU workloads that need predictable latency, limit queue depth per drive.

Queue Depth Throughput P50 Latency P99 Latency P99.9 Latency Use Case
1 ~20% max 70-90 µs 100-150 µs 200-400 µs Ultra-low latency inference
8 ~60% max 80-110 µs 150-300 µs 400-800 µs Balanced inference
32 ~90% max 100-150 µs 250-500 µs 1-3 ms Training data loading
128+ ~100% max 150-300 µs 500-1500 µs 3-10 ms Throughput-only (checkpoints)
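The saturation pattern in the table follows from Little's law (N = X · R): achievable IOPS is concurrency divided by per-I/O latency, so throughput grows sublinearly once latency inflates at high queue depth. This is a toy model, not a measurement — the latency figures below are rough midpoints of the table's P50 ranges, and the function name is ours:

```python
def iops_little(queue_depth, avg_latency_us):
    """Little's law: throughput X = concurrency N / residence time R."""
    return queue_depth / (avg_latency_us * 1e-6)

# QD -> assumed P50 latency (µs), midpoints of the table above
profile = {1: 80, 8: 95, 32: 125, 128: 225}
for qd, lat in profile.items():
    print(f"QD {qd:>3}: ~{iops_little(qd, lat):,.0f} IOPS at P50 {lat} µs")
```

Going from QD 1 to QD 128 multiplies concurrency by 128 but throughput by only ~45x in this model, because latency nearly triples — and the tail percentiles degrade far faster than the average.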

Mixed Workload Interference (Noisy Neighbor)

In multi-tenant or mixed-workload environments, one tenant's write-heavy checkpoint can spike another tenant's read latency.

Scenario Read P99 Impact Root Cause Mitigation
Inference during checkpoint write 5-20× increase SSD internal write buffer competition Separate drives for read vs write workloads
Multiple training jobs sharing SSD 2-5× increase Queue depth multiplied across jobs NVMe namespaces, QoS, or separate drives
GDS + POSIX on same drive 1.5-3× increase Page cache pollution, lock contention Use GDS exclusively or separate drives

Tail Latency Monitoring Checklist

# 1. Enable NVMe latency histograms (if supported)
nvme intel lat-stats /dev/nvme0 -r

# 2. Monitor with iostat (look at await, not just throughput)
iostat -x 1 | awk '/nvme/ {print $1, "await:", $10, "ms"}'

# 3. Use blktrace for detailed latency analysis
blktrace -d /dev/nvme0n1 -o trace &
sleep 10
kill %1
blkparse -i trace -d trace.bin -o trace.txt   # -d emits the binary stream btt needs
btt -i trace.bin -l trace.latency

# 3b. fio P99.9 measurement (GPU batch I/O simulation)
# This measures what a 256-sample batch actually experiences
fio --name=p999-test --filename=/dev/nvme0n1 --direct=1 \
    --rw=randread --bs=4k --ioengine=io_uring --iodepth=256 \
    --numjobs=1 --time_based --runtime=60 \
    --lat_percentiles=1 \
    --percentile_list=50:90:99:99.9:99.99
# Look at clat percentiles (completion latency), not slat
# If P99.9 > 10x P50, you have a GC/tail latency problem

# 4. Prometheus alerting rule for P99 spikes
- alert: NVMeHighP99Latency
  expr: histogram_quantile(0.99, rate(nvme_read_latency_bucket[5m])) > 0.001
  for: 5m
  annotations:
    summary: "NVMe P99 latency > 1ms"

# 5. Check SSD temperature (throttling often starts at 70°C)
nvme smart-log /dev/nvme0 | grep -i temp
Production Rule of Thumb:
  • Inference SLA: If your P99.9 needs to be < 1ms, use QD ≤ 8 per drive and separate drives from write workloads
  • Training: Tail latency matters less; optimize for throughput with QD 32-128
  • Checkpoints: Schedule during inference idle windows or use separate drive pool
  • Over-provision: Keep SSDs at < 80% capacity to reduce GC frequency

Endurance Benchmark: DWPD Validation

SSD Model Rated DWPD Measured DWPD (AI checkpoint) TBW (3.84TB model) Vendor
Samsung PM9A3 1 DWPD 0.8 DWPD 7,008 TB Samsung
Enterprise NVMe SSD 1 DWPD 0.9 DWPD 7,008 TB Enterprise
Kioxia CM7-R 1 DWPD 0.85 DWPD 7,008 TB Kioxia
Intel D7-P5620 3 DWPD 2.7 DWPD 21,024 TB Solidigm
Samsung PM1733 1 DWPD 0.9 DWPD 7,008 TB Samsung
📝 Methodology Note: DWPD measurements based on 30-day checkpoint simulation with 80% sequential writes (model state) and 20% random writes (optimizer state). Actual workloads may vary. Always validate with your specific checkpoint pattern.

📊 Checkpoint Write Amplification Calculator

ML Training SSD Lifetime Planning
# Real-world checkpoint impact calculation

Given:
  Model size:           70B parameters (140 GB in FP16)
  Optimizer state:      2× model size = 280 GB  
  Total checkpoint:     420 GB
  Checkpoint frequency: Every 1,000 steps
  Steps per day:        ~5,000 (depends on batch size)
  Checkpoints per day:  5

Daily Write Volume:
  5 checkpoints × 420 GB = 2.1 TB/day

Write Amplification Factor (WAF):
  Conventional NVMe:    WAF = 2.5-4× (GC overhead on random writes)
  With FDP/Streams:     WAF = 1.2-1.5× (minimal GC)
  With ZNS:             WAF = 1.0× (no GC in zone)

Effective Daily Writes (3.84TB SSD):
  Conventional: 2.1 TB × 3.5 WAF = 7.35 TB = 1.9 DWPD
  With FDP:     2.1 TB × 1.3 WAF = 2.73 TB = 0.7 DWPD

SSD Lifetime (7,008 TBW rated):
  Conventional: 7,008 / 7.35 = 953 days (~2.6 years)
  With FDP:     7,008 / 2.73 = 2,567 days (~7 years)

# For LLM training at scale:
# - Use 3 DWPD enterprise drives (NOT consumer 0.3 DWPD)
# - Enable FDP or use ZNS where supported
# - Over-provision 20-30% to reduce GC frequency
# - Monitor SMART: Percentage_Used, Media_Wearout_Indicator
                
  Real Failure Story: A major AI lab lost 40% of checkpoint SSDs in 18 months. Root cause: Consumer-grade 0.3 DWPD SSDs used for 4× DWPD workload (no FDP, WAF ~4×). Fix: Migrated to 3 DWPD enterprise drives + enabled FDP. Projected lifetime: 5+ years.
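The checkpoint arithmetic above is worth keeping as a reusable function so it can be rerun for a different model size, cadence, or WAF. A sketch (function name is ours; the 7,008 TBW figure matches the table's 1 DWPD × 3.84 TB × 5-year rating, i.e. 3.84 × 365 × 5):

```python
def ssd_lifetime_days(model_gb, opt_mult, ckpts_per_day, waf, ssd_tb, tbw_tb):
    """Return (projected SSD lifetime in days, effective DWPD).

    model_gb:  checkpointed model state size in GB
    opt_mult:  optimizer state as a multiple of model size (2x for Adam-style)
    waf:       write amplification factor (conventional ~3.5, FDP ~1.3)
    """
    checkpoint_tb = model_gb * (1 + opt_mult) / 1000      # GB -> TB per checkpoint
    daily_writes_tb = checkpoint_tb * ckpts_per_day * waf  # physical writes/day
    dwpd = daily_writes_tb / ssd_tb
    return tbw_tb / daily_writes_tb, dwpd

if __name__ == "__main__":
    # 70B FP16 model (140 GB), 2x optimizer state, 5 checkpoints/day, 3.84 TB SSD
    days, dwpd = ssd_lifetime_days(140, 2, 5, 3.5, 3.84, 7008)
    print(f"Conventional: {dwpd:.1f} DWPD, ~{days:.0f} days ({days/365:.1f} years)")
    days, dwpd = ssd_lifetime_days(140, 2, 5, 1.3, 3.84, 7008)
    print(f"With FDP:     {dwpd:.1f} DWPD, ~{days:.0f} days ({days/365:.1f} years)")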