SSD Endurance, Security Hardening, Failure Modes, NUMA Topology, and Reproducible Benchmarking. The gaps that can kill your AI training infrastructure.
SSD endurance, namespace strategies, security hardening, power management, firmware updates, and production monitoring with Prometheus/Grafana.
| Workload Type | I/O Pattern | Typical WAF | DWPD Impact | Risk Level |
|---|---|---|---|---|
| Checkpoint Writes | Large sequential (GB-TB) | 1.0 - 1.2 | Low | Safe |
| Dataset Reads | Sequential reads | N/A (reads) | None | Safe |
| ZeRO-2 Gradient Offload | Mixed 64KB-1MB | 1.5 - 2.5 | Medium | Monitor |
| ZeRO-3 Param Offload | Random 4KB-256KB | 2.0 - 4.0 | High | Danger |
| ZeRO-Infinity | Random 4KB demand paging | 3.0 - 10.0 | Very High | 🔴 Critical |
| KV Cache Offload | Random small writes | 5.0 - 15.0 | Extreme | 🔴 Critical |
#!/usr/bin/env python3
"""
SSD Endurance Calculator for AI Training Workloads
Run this BEFORE deploying to estimate SSD lifespan
"""
def calculate_ssd_lifespan(
    ssd_capacity_tb: float,
    ssd_dwpd: float,                 # Drive Writes Per Day rating
    ssd_warranty_years: float,
    daily_checkpoint_gb: float,
    checkpoint_waf: float = 1.1,
    daily_zero_offload_gb: float = 0,
    zero_waf: float = 3.0,
    daily_kv_cache_gb: float = 0,
    kv_waf: float = 8.0
) -> dict:
    """Calculate expected SSD lifespan for AI workloads"""
    # Total NAND writes per day (accounting for WAF)
    nand_writes_per_day_tb = (
        (daily_checkpoint_gb * checkpoint_waf / 1024) +
        (daily_zero_offload_gb * zero_waf / 1024) +
        (daily_kv_cache_gb * kv_waf / 1024)
    )
    # SSD's rated endurance (TBW = Terabytes Written)
    rated_tbw = ssd_capacity_tb * ssd_dwpd * 365 * ssd_warranty_years
    # Expected lifespan
    lifespan_days = rated_tbw / nand_writes_per_day_tb if nand_writes_per_day_tb > 0 else float('inf')
    lifespan_months = lifespan_days / 30.44
    # Effective DWPD being used
    effective_dwpd = nand_writes_per_day_tb / ssd_capacity_tb
    return {
        'nand_writes_per_day_tb': round(nand_writes_per_day_tb, 2),
        'effective_dwpd': round(effective_dwpd, 2),
        'rated_dwpd': ssd_dwpd,
        'rated_tbw': round(rated_tbw, 0),
        'lifespan_days': round(lifespan_days, 0),
        'lifespan_months': round(lifespan_months, 1),
        'within_warranty': lifespan_months >= (ssd_warranty_years * 12),
        'risk_level': ('LOW' if effective_dwpd < ssd_dwpd * 0.5 else
                       'MEDIUM' if effective_dwpd < ssd_dwpd else 'HIGH')
    }

# Example: 70B model training with ZeRO-3
result = calculate_ssd_lifespan(
    ssd_capacity_tb=3.84,            # Samsung PM9A3 3.84TB
    ssd_dwpd=1.0,                    # 1 DWPD rating
    ssd_warranty_years=5,
    daily_checkpoint_gb=500,         # 500GB checkpoints/day
    daily_zero_offload_gb=2000,      # 2TB ZeRO-3 swapping/day
    zero_waf=3.5,
    daily_kv_cache_gb=0              # Training, not inference
)

print(f"""
SSD Endurance Analysis:
=======================
NAND Writes/Day: {result['nand_writes_per_day_tb']} TB
Effective DWPD:  {result['effective_dwpd']} (rated: {result['rated_dwpd']})
Expected Life:   {result['lifespan_months']} months
Risk Level:      {result['risk_level']}
Within Warranty: {result['within_warranty']}
""")
# Output:
# NAND Writes/Day: 7.37 TB
# Effective DWPD:  1.92 (rated: 1.0) ← EXCEEDS RATING!
# Expected Life:   31.2 months ← Will fail before the 60-month warranty
# Risk Level:      HIGH
# Within Warranty: False
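The same arithmetic runs usefully in reverse when sizing a purchase: given a daily write budget, what DWPD class keeps you in the LOW-risk band? A small sketch (`min_dwpd_required` and the 0.5 headroom factor are my conventions, mirroring the risk thresholds used in the calculator above):

```python
def min_dwpd_required(daily_writes_tb: float, capacity_tb: float,
                      headroom: float = 0.5) -> float:
    """DWPD rating needed so the workload stays at or below `headroom`
    fraction of rated endurance (0.5 = the LOW-risk band)."""
    effective_dwpd = daily_writes_tb / capacity_tb
    return effective_dwpd / headroom

# The 7.37 TB/day ZeRO-3 example needs roughly a 3.84-DWPD-class drive
# to stay LOW-risk on a 3.84 TB SSD:
print(round(min_dwpd_required(7.37, 3.84), 2))  # 3.84
```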
| Feature | Consumer (970 EVO) | Prosumer (990 PRO) | Enterprise (PM9A3) | AI-Optimized (D7-P5520) |
|---|---|---|---|---|
| DWPD Rating | 0.3 | 0.6 | 1.0 | 1.0 - 3.0 |
| DRAM Buffer | 512MB - 1GB | 1-2GB | 4-8GB | 8GB+ |
| Over-provisioning | ~7% | ~7% | ~28% | ~28% |
| Power Loss Protection | No | No | Yes (PLP) | Yes (PLP) |
| End-to-End Protection | No | Partial | Yes (T10 DIF) | Yes (T10 DIF) |
| ZeRO-3 Suitability | 6-12 months life | 12-18 months life | 3-5 years life | 5+ years life |
| Cost (3.84TB) | $200-300 | $300-400 | $400-600 | $600-900 |
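Endurance, not raw capacity, is often what you are actually buying. A rough dollars-per-petabyte-written comparison using the table's midpoint prices and DWPD ratings (a 5-year warranty is assumed for every class here, which flatters the consumer drives; the numbers are illustrative, not quotes):

```python
def cost_per_pbw(price_usd: float, capacity_tb: float, dwpd: float,
                 warranty_years: float = 5) -> float:
    """Dollars per petabyte of rated write endurance (TBW-based)."""
    tbw = capacity_tb * dwpd * 365 * warranty_years
    return price_usd / (tbw / 1000)

for name, price, dwpd in [('970 EVO', 250, 0.3), ('990 PRO', 350, 0.6),
                          ('PM9A3', 500, 1.0), ('D7-P5520', 750, 2.0)]:
    print(f"{name}: ${cost_per_pbw(price, 3.84, dwpd):.0f}/PBW")
```

The enterprise drives cost roughly 2x up front but deliver rated write endurance at well under half the price per petabyte.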
#!/bin/bash
# ssd_health_monitor.sh - Run daily via cron
for dev in /dev/nvme*n1; do
    echo "=== $dev ==="
    # Get SMART data
    smart=$(nvme smart-log "$dev" -o json)
    # Critical metrics
    pct_used=$(echo "$smart" | jq '.percent_used')
    avail_spare=$(echo "$smart" | jq '.avail_spare')
    data_written_tb=$(echo "$smart" | jq '.data_units_written * 512000 / 1e12')
    media_errors=$(echo "$smart" | jq '.media_errors')
    echo "Percent Used: ${pct_used}%"
    echo "Available Spare: ${avail_spare}%"
    echo "Data Written: ${data_written_tb} TB"
    echo "Media Errors: ${media_errors}"
    # Alert thresholds
    if (( $(echo "$pct_used > 80" | bc -l) )); then
        echo "🔴 CRITICAL: SSD life nearly exhausted!"
        # Send alert to monitoring system
        curl -X POST "$ALERTMANAGER_URL" -d "{\"alert\":\"ssd_endurance_critical\",\"device\":\"$dev\"}"
    elif (( $(echo "$pct_used > 50" | bc -l) )); then
        echo "🟡 WARNING: SSD past 50% life"
    fi
    if (( $(echo "$avail_spare < 10" | bc -l) )); then
        echo "🔴 CRITICAL: Spare blocks nearly exhausted!"
    fi
    if (( media_errors > 0 )); then
        echo "🔴 CRITICAL: Media errors detected - replace drive!"
    fi
    echo ""
done
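The `percent_used` counter the script reads also supports a crude remaining-life projection: linear extrapolation of the observed wear rate over power-on time. A sketch (the example values are invented for illustration; real wear is rarely perfectly linear):

```python
def project_remaining_days(percent_used: float, power_on_hours: float) -> float:
    """Linear extrapolation: if `percent_used` wear accrued over
    `power_on_hours`, estimate days until 100% (vendor wear-out)."""
    if percent_used <= 0:
        return float('inf')
    hours_per_percent = power_on_hours / percent_used
    return hours_per_percent * (100 - percent_used) / 24

# 30% worn after ~8760 power-on hours (one year) -> roughly 852 more days
print(round(project_remaining_days(30, 8760)))
```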
- Wear isolation: separate high-wear workloads (ZeRO offload) from low-wear ones (checkpoints), and keep random writes from fragmenting sequential write areas.
- Latency isolation: garbage collection in one namespace doesn't impact latency in another. Critical for latency-sensitive inference.
- Capacity isolation: prevent one workload from consuming all space; checkpoints get dedicated capacity.
#!/bin/bash
# setup_ai_namespaces.sh - Configure NVMe for AI training
DEVICE="/dev/nvme0"
# Check current namespace configuration
nvme list-ns $DEVICE
nvme id-ctrl $DEVICE | grep -E "nn|tnvmcap"
# Delete existing namespaces (DESTRUCTIVE!)
# nvme delete-ns $DEVICE -n 1
# Get total capacity in 512-byte blocks
TOTAL_BLOCKS=$(nvme id-ctrl $DEVICE | grep tnvmcap | awk '{print $3}')
# Namespace allocation strategy for 3.84TB SSD:
# NS1: 2TB - Training data (sequential reads)
# NS2: 1TB - Checkpoints (sequential writes)
# NS3: 500GB - ZeRO offload (random read/write)
# NS4: 340GB - Scratch/temp (expendable)
# Create namespaces (sizes in 512-byte LBAs; use decimal TB/GB so the four
# namespaces fit within the drive's 3.84 TB ≈ 7.5 billion-LBA capacity)
NS1_SIZE=3906250000  # 2 TB
NS2_SIZE=1953125000  # 1 TB
NS3_SIZE=976562500   # 500 GB
NS4_SIZE=665000000   # ~340 GB (remainder; check against TOTAL_BLOCKS)
# Create NS1: Training Data
nvme create-ns $DEVICE \
--nsze=$NS1_SIZE \
--ncap=$NS1_SIZE \
--flbas=0 \
--dps=0 \
--nmic=0
nvme attach-ns $DEVICE --namespace-id=1 --controllers=0
# Create NS2: Checkpoints
nvme create-ns $DEVICE \
--nsze=$NS2_SIZE \
--ncap=$NS2_SIZE \
--flbas=0 \
--dps=0
nvme attach-ns $DEVICE --namespace-id=2 --controllers=0
# Create NS3: ZeRO Offload
# --dps=1 enables T10 DIF Type 1 protection for data integrity
nvme create-ns $DEVICE \
--nsze=$NS3_SIZE \
--ncap=$NS3_SIZE \
--flbas=0 \
--dps=1 \
--nmic=0
nvme attach-ns $DEVICE --namespace-id=3 --controllers=0
# Create NS4: Scratch
nvme create-ns $DEVICE \
--nsze=$NS4_SIZE \
--ncap=$NS4_SIZE \
--flbas=0 \
--dps=0
nvme attach-ns $DEVICE --namespace-id=4 --controllers=0
# Rescan to see new namespaces
nvme ns-rescan $DEVICE
# Format and mount
mkfs.xfs -f /dev/nvme0n1 # Training data
mkfs.xfs -f /dev/nvme0n2 # Checkpoints
mkfs.xfs -f /dev/nvme0n3 # ZeRO offload
mkfs.xfs -f /dev/nvme0n4 # Scratch
mkdir -p /mnt/nvme/{data,checkpoints,zero_offload,scratch}
mount /dev/nvme0n1 /mnt/nvme/data
mount /dev/nvme0n2 /mnt/nvme/checkpoints
mount /dev/nvme0n3 /mnt/nvme/zero_offload
mount /dev/nvme0n4 /mnt/nvme/scratch
echo "Namespace configuration complete!"
nvme list
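Deriving those `--nsze` values by hand invites exactly the decimal-vs-binary mistake that over-commits a 3.84 TB (decimal) drive. A small helper, assuming the 512-byte LBA format selected by `--flbas=0` in the script above:

```python
def gb_to_blocks(size_gb: float, block_size: int = 512) -> int:
    """Decimal GB -> LBA count for nvme create-ns --nsze/--ncap."""
    return int(size_gb * 1_000_000_000 // block_size)

print(gb_to_blocks(2000))  # 2 TB  -> 3906250000 blocks
print(gb_to_blocks(500))   # 500 GB -> 976562500 blocks
```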
ZNS divides the SSD into zones that must be written sequentially. This can significantly reduce write amplification from garbage collection, which is ideal for checkpoint workloads.
Checkpoints are large sequential writes. ZNS WAF = 1.0 (theoretical minimum). No background GC = predictable latency during training.
GDS + ZNS integration is experimental. Requires zone-aware applications. Limited vendor support (Western Digital, Samsung).
| Feature | Conventional NVMe | ZNS NVMe | AI Checkpoint Impact |
|---|---|---|---|
| Write Pattern | Random allowed | Sequential only (per zone) | Matches checkpoint pattern |
| WAF (Typical) | 1.5 - 4.0 | 1.0 (no GC) | 2-4x endurance improvement |
| Latency Variance | High (GC spikes) | Low (no GC) | Predictable checkpoint time |
| Over-provisioning | 7-28% | 0% needed | More usable capacity |
| GDS Support | Full | Experimental | Requires zone-aware code |
#!/bin/bash
# ZNS configuration for AI checkpoints
# Check if device supports ZNS
nvme id-ns /dev/nvme0n1 -H | grep -i "zoned"
# Zoned Namespace Command Set Identifier: Zoned Namespace
# List zones
nvme zns report-zones /dev/nvme0n1 -d 0
# Zone 0: slba 0x0, wp 0x0, state EMPTY, type SEQ_WRITE_REQUIRED
# Zone 1: slba 0x80000, wp 0x80000, state EMPTY, type SEQ_WRITE_REQUIRED
# Zone capacity (typically 256MB - 2GB per zone)
nvme zns id-ns /dev/nvme0n1 | grep -i "zone"
# Zone Size: 524288 blocks (256MB)
# Zone Capacity: 524288 blocks
# Reset a zone (required before rewriting)
nvme zns reset-zone /dev/nvme0n1 -s 0x0 # Reset zone 0
nvme zns reset-zone /dev/nvme0n1 -a # Reset ALL zones
# Zone append (atomic sequential write)
# Used by f2fs, btrfs ZNS support, or direct nvme-cli
nvme zns zone-append /dev/nvme0n1 -s 0x0 -z 4096 -d checkpoint.bin
# Filesystem options for ZNS
# Option 1: f2fs (native ZNS support)
mkfs.f2fs -m /dev/nvme0n1
mount -t f2fs /dev/nvme0n1 /mnt/zns_checkpoints
# Option 2: dm-zoned (exposes as conventional block device)
# Adds random write support with minimal overhead
dmzadm --format /dev/nvme0n1
dmzadm --start /dev/nvme0n1
mkfs.xfs /dev/dm-0
mount /dev/dm-0 /mnt/zns_checkpoints
# Option 1: LUKS encryption (software, ~5-10% overhead)
cryptsetup luksFormat /dev/nvme0n1
cryptsetup open /dev/nvme0n1 nvme_encrypted
mkfs.xfs /dev/mapper/nvme_encrypted
mount /dev/mapper/nvme_encrypted /mnt/secure_nvme
# Option 2: NVMe SED (Self-Encrypting Drive, ~0% overhead)
# Check if drive supports TCG Opal
sedutil-cli --scan
# Initialize Opal locking
sedutil-cli --initialSetup <password> /dev/nvme0
sedutil-cli --enableLockingRange 0 <password> /dev/nvme0
sedutil-cli --setLockingRange 0 RW <password> /dev/nvme0
# Enable pre-boot authentication (PBA) for full protection
sedutil-cli --loadPBAimage <password> /path/to/pba.img /dev/nvme0
sedutil-cli --setMBREnable on <password> /dev/nvme0
# Enable DH-HMAC-CHAP authentication for NVMe-oF
# Server (target) side:
nvme gen-dhchap-key --hmac 1 --nqn nqn.2024-01.com.company:storage
# Output: DHHC-1:00:xxxxx
# Configure target with authentication (nvmet configfs; the DH-CHAP key is
# set per allowed host - exact paths vary by kernel version)
echo "DHHC-1:00:xxxxx" > /sys/kernel/config/nvmet/hosts/<host-nqn>/dhchap_key
# Client (host) side:
nvme connect \
-t tcp \
-a 192.168.1.100 \
-s 4420 \
-n nqn.2024-01.com.company:storage \
--dhchap-secret=DHHC-1:00:xxxxx
# CRITICAL: Before disposing of or returning SSDs with sensitive data
# Check sanitize capabilities
nvme id-ctrl /dev/nvme0 | grep -i sanitize
# Option 1: Cryptographic Erase (fastest, ~seconds)
# Destroys encryption key, making data unrecoverable
nvme sanitize /dev/nvme0 --sanact=4 # Crypto Erase
# Option 2: Block Erase (~minutes)
nvme sanitize /dev/nvme0 --sanact=2 # Block Erase
# Option 3: Overwrite (slowest, ~hours, most thorough)
nvme sanitize /dev/nvme0 --sanact=3 --ovrpat=0xDEADBEEF # Overwrite (sanact 3; sanact 1 is "exit failure mode")
# Monitor sanitize progress
nvme sanitize-log /dev/nvme0
# Verify completion
nvme sanitize-log /dev/nvme0 | grep -i "Sanitize Status"
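The progress check above can be automated. A polling sketch that assumes nvme-cli's JSON field names (`sprog`, `sstat`), which can vary across nvme-cli versions, so treat it as a template rather than a drop-in:

```python
import json
import subprocess
import time

def wait_for_sanitize(device: str, poll_s: int = 10) -> None:
    """Poll the sanitize log until the operation completes."""
    while True:
        out = subprocess.run(['nvme', 'sanitize-log', device, '-o', 'json'],
                             capture_output=True, text=True, check=True)
        log = json.loads(out.stdout)
        sprog = log.get('sprog', 65535)  # 0xFFFF when no sanitize in progress
        if sprog >= 65535:
            print(f"{device}: sanitize finished (sstat={log.get('sstat')})")
            return
        print(f"{device}: {sprog / 655.35:.1f}% complete")
        time.sleep(poll_s)
```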
#!/bin/bash
# disable_nvme_power_management.sh
# 1. Disable APST (Autonomous Power State Transitions)
for dev in /dev/nvme[0-9]; do   # controller devices, not namespaces
    nvme set-feature "$dev" -f 0x0c -v 0
    echo "Disabled APST on $dev"
done
# 2. Kernel-level disable
echo 0 > /sys/module/nvme_core/parameters/default_ps_max_latency_us
# 3. Make persistent across reboots
cat >> /etc/modprobe.d/nvme.conf << EOF
options nvme_core default_ps_max_latency_us=0
EOF
# 4. Verify power state is PS0
for dev in /dev/nvme[0-9]; do
    echo "=== $dev ==="
    nvme get-feature "$dev" -f 0x0c -H # Should show "Autonomous Power State Transition Enable (APSTE): Disabled"
    # Check current PCI power state (should be D0)
    cat "/sys/class/nvme/$(basename "$dev")/device/power_state"
done
# 5. Re-run the verification loop periodically; transitions should be absent.
# (Per-transition counters, where available, live in vendor-specific log pages.)
| Power State | Entry Latency | Exit Latency | Impact at 1M IOPS | Recommendation |
|---|---|---|---|---|
| PS0 (Active) | 0 | 0 | None | Use This |
| PS1 (Idle) | ~100μs | ~100μs | ~100 ops lost | Avoid |
| PS2 (Light Sleep) | ~1ms | ~1ms | ~1000 ops lost | Disable |
| PS3/PS4 (Deep Sleep) | ~5-50ms | ~5-50ms | ~5000-50000 ops lost | Disable |
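The "Impact at 1M IOPS" column is simple arithmetic: the operations that would have completed during one power-state wakeup.

```python
def ops_lost_per_wakeup(iops: float, exit_latency_s: float) -> float:
    """Operations forgone while waiting out one power-state exit."""
    return iops * exit_latency_s

print(ops_lost_per_wakeup(1_000_000, 100e-6))  # PS1: 100.0
print(ops_lost_per_wakeup(1_000_000, 1e-3))    # PS2: 1000.0
print(ops_lost_per_wakeup(1_000_000, 50e-3))   # PS4 worst case: 50000.0
```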
# Check current firmware versions
for dev in /dev/nvme[0-9]; do
    echo "=== $dev ==="
    nvme id-ctrl "$dev" | grep -E "^fr |^mn "
done
# Download firmware from vendor (example: Samsung)
# Always verify checksum!
wget https://semiconductor.samsung.com/resources/software/PM9A3_GDC5602Q.enc
sha256sum PM9A3_GDC5602Q.enc # Verify matches vendor-provided hash
# Update firmware (REQUIRES PLANNING!)
# Option 1: Online update (if supported, no reboot needed)
nvme fw-download /dev/nvme0 --fw=PM9A3_GDC5602Q.enc
nvme fw-commit /dev/nvme0 --slot=1 --action=3 # Commit and activate immediately (no reset)
# Option 2: Offline update (safer, requires reboot)
nvme fw-download /dev/nvme0 --fw=PM9A3_GDC5602Q.enc
nvme fw-commit /dev/nvme0 --slot=1 --action=2 # Activate on next reset
# Then reboot
# Verify update
nvme id-ctrl /dev/nvme0 | grep "^fr "
#!/bin/bash
# rolling_firmware_update.sh - Update RAID without downtime
RAID_DEVICE="/dev/md0"
FIRMWARE_FILE="PM9A3_GDC5602Q.enc"
# Get member drives
MEMBERS=$(mdadm --detail $RAID_DEVICE | grep '/dev/nvme' | awk '{print $NF}')
for member in $MEMBERS; do
    echo "=== Updating $member ==="
    # 1. Mark drive as faulty and remove from array
    mdadm --manage $RAID_DEVICE --fail $member
    mdadm --manage $RAID_DEVICE --remove $member
    # 2. Wait for array to stabilize
    sleep 10
    # 3. Get controller device from namespace device
    CTRL_DEV=$(echo $member | sed 's/n[0-9]*$//')
    # 4. Update firmware
    nvme fw-download $CTRL_DEV --fw=$FIRMWARE_FILE
    nvme fw-commit $CTRL_DEV --slot=1 --action=1 # Commit; activates at the controller reset below
    # 5. Reset controller to apply firmware
    nvme reset $CTRL_DEV
    sleep 5
    # 6. Verify new firmware
    nvme id-ctrl $CTRL_DEV | grep "^fr "
    # 7. Re-add to array
    mdadm --manage $RAID_DEVICE --add $member
    # 8. Wait for rebuild before proceeding to next drive
    echo "Waiting for rebuild..."
    while grep -q "recovery" /proc/mdstat; do
        sleep 30
        cat /proc/mdstat
    done
    echo "$member updated successfully"
done
echo "All drives updated!"
mdadm --detail $RAID_DEVICE
# prometheus/gpu_storage_rules.yml
groups:
  - name: gpu_storage_alerts
    interval: 15s
    rules:
      # NVMe health alerts
      - alert: NVMeHighLatency
        expr: nvme_read_latency_p99_us > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NVMe P99 read latency > 500µs"
          description: "Drive {{ $labels.device }} showing high latency. Check for thermal throttling or wear."
      - alert: NVMeCriticalWear
        expr: nvme_percentage_used > 90
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "NVMe drive > 90% lifetime wear"
          description: "{{ $labels.device }} at {{ $value }}% wear. Plan replacement within 30 days."
      - alert: NVMeAvailableSpareLow
        expr: nvme_available_spare_percent < 10
        labels:
          severity: critical
        annotations:
          summary: "NVMe spare capacity critically low"
      # GPU-storage bandwidth alerts
      - alert: GDSBandwidthDegraded
        expr: rate(gds_bytes_read_total[5m]) < 5e9  # < 5 GB/s
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GDS bandwidth below expected threshold"
          description: "GPU {{ $labels.gpu_id }} GDS throughput at {{ $value | humanize }}B/s"
      - alert: PCIeBandwidthBottleneck
        expr: pcie_bandwidth_utilization > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PCIe bandwidth > 85% utilized"
      # Training job alerts
      - alert: CheckpointWriteSlow
        expr: histogram_quantile(0.99, rate(checkpoint_write_seconds_bucket[5m])) > 300
        labels:
          severity: warning
        annotations:
          summary: "Checkpoint writes taking > 5 minutes"
      - alert: DataLoadStall
        expr: |
          rate(dataloader_samples_total[1m]) == 0
          and on(job) training_step_in_progress == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Data loading stalled - GPUs likely idle"
  # Recording rules for dashboard efficiency
  - name: gpu_storage_recording
    rules:
      - record: job:nvme_iops:rate5m
        expr: sum(rate(nvme_read_commands_total[5m]) + rate(nvme_write_commands_total[5m])) by (device)
      - record: job:gds_bandwidth:rate5m
        expr: sum(rate(gds_bytes_read_total[5m]) + rate(gds_bytes_written_total[5m])) by (gpu_id)
      - record: job:storage_efficiency:ratio
        expr: |
          sum(rate(gds_bytes_read_total[5m])) /
          sum(rate(nvme_bytes_read_total[5m]))
# gpu_storage_exporter.py - Custom Prometheus exporter
from prometheus_client import start_http_server, Gauge, Counter, Histogram
import subprocess
import json
import time

# NVMe metrics
nvme_read_latency = Histogram(
    'nvme_read_latency_seconds',
    'NVMe read latency distribution',
    ['device'],
    buckets=[.0001, .0005, .001, .005, .01, .05, .1, .5, 1]
)
nvme_percentage_used = Gauge(
    'nvme_percentage_used',
    'NVMe lifetime percentage used',
    ['device', 'serial']
)
nvme_temperature = Gauge(
    'nvme_temperature_celsius',
    'NVMe temperature',
    ['device', 'sensor']
)
nvme_available_spare = Gauge(
    'nvme_available_spare_percent',
    'NVMe available spare capacity',
    ['device']
)

# GDS metrics
gds_bytes_read = Counter(
    'gds_bytes_read_total',
    'Total bytes read via GDS',
    ['gpu_id']
)
gds_read_latency = Histogram(
    'gds_read_latency_seconds',
    'GDS read latency',
    ['gpu_id', 'operation_size'],
    buckets=[.0001, .0005, .001, .002, .005, .01, .02, .05, .1]
)

def collect_nvme_metrics():
    """Collect metrics from nvme-cli"""
    result = subprocess.run(
        ['nvme', 'list', '-o', 'json'],
        capture_output=True, text=True
    )
    devices = json.loads(result.stdout)['Devices']
    for dev in devices:
        device_path = dev['DevicePath']
        # Get SMART data
        smart = subprocess.run(
            ['nvme', 'smart-log', device_path, '-o', 'json'],
            capture_output=True, text=True
        )
        smart_data = json.loads(smart.stdout)
        nvme_percentage_used.labels(
            device=device_path,
            serial=dev['SerialNumber']
        ).set(smart_data['percent_used'])
        nvme_temperature.labels(
            device=device_path,
            sensor='composite'
        ).set(smart_data['temperature'] - 273)  # Kelvin to Celsius
        nvme_available_spare.labels(
            device=device_path
        ).set(smart_data['avail_spare'])

def collect_dcgm_metrics():
    """Collect GPU metrics from DCGM"""
    import pydcgm
    import dcgm_fields
    dcgm_handle = pydcgm.DcgmHandle()
    group = dcgm_handle.GetDefaultGroup()
    # Collect PCIe throughput (indicates storage traffic)
    field_ids = [
        dcgm_fields.DCGM_FI_DEV_PCIE_TX_THROUGHPUT,
        dcgm_fields.DCGM_FI_DEV_PCIE_RX_THROUGHPUT,
        dcgm_fields.DCGM_FI_DEV_GPU_UTIL,
        dcgm_fields.DCGM_FI_DEV_MEM_COPY_UTIL,
    ]
    values = group.samples.GetLatest(field_ids)
    # ... export to Prometheus gauges

if __name__ == '__main__':
    start_http_server(9090)
    while True:
        collect_nvme_metrics()
        collect_dcgm_metrics()
        time.sleep(15)
# grafana/dashboards/gpu_storage_overview.json (key panels)
{
  "title": "GPU-Storage Performance Overview",
  "panels": [
    {
      "title": "GDS Bandwidth by GPU",
      "type": "timeseries",
      "targets": [{
        "expr": "sum(rate(gds_bytes_read_total[5m])) by (gpu_id)",
        "legendFormat": "GPU {{ gpu_id }}"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "Bps",
          "thresholds": {
            "steps": [
              {"value": 0, "color": "red"},
              {"value": 3e9, "color": "yellow"},
              {"value": 5e9, "color": "green"}
            ]
          }
        }
      }
    },
    {
      "title": "NVMe Latency Heatmap",
      "type": "heatmap",
      "targets": [{
        "expr": "sum(rate(nvme_read_latency_seconds_bucket[1m])) by (le)",
        "format": "heatmap"
      }]
    },
    {
      "title": "Storage Efficiency Ratio",
      "type": "gauge",
      "description": "GDS bytes / Total NVMe bytes (higher = more direct GPU access)",
      "targets": [{
        "expr": "job:storage_efficiency:ratio"
      }],
      "fieldConfig": {
        "defaults": {
          "min": 0,
          "max": 1,
          "thresholds": {
            "steps": [
              {"value": 0, "color": "red"},
              {"value": 0.5, "color": "yellow"},
              {"value": 0.8, "color": "green"}
            ]
          }
        }
      }
    },
    {
      "title": "NVMe Drive Health",
      "type": "table",
      "targets": [
        {"expr": "nvme_percentage_used", "legendFormat": "% Used"},
        {"expr": "nvme_available_spare_percent", "legendFormat": "Spare %"},
        {"expr": "nvme_temperature_celsius", "legendFormat": "Temp °C"}
      ]
    }
  ]
}
# DCGM setup for GPU-storage correlation monitoring
# Install DCGM
sudo apt-get install datacenter-gpu-manager
# Start DCGM daemon
sudo systemctl enable nvidia-dcgm
sudo systemctl start nvidia-dcgm
# Create field group for storage-related metrics
dcgmi group -c storage_monitoring
dcgmi group -g 1 -a 0,1,2,3,4,5,6,7 # Add GPUs 0-7
# Key fields for storage correlation
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT - PCIe TX (GPU → storage for writes)
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT - PCIe RX (storage → GPU for reads)
# DCGM_FI_DEV_PCIE_REPLAY_COUNTER - PCIe errors (indicates link issues)
# DCGM_FI_DEV_GPU_UTIL - GPU utilization (low = data starvation)
# DCGM_FI_DEV_MEM_COPY_UTIL - Memory copy utilization
# Start Prometheus-compatible exporter
dcgm-exporter --collectors /etc/dcgm-exporter/dcp-metrics-included.csv \
--address :9400 \
--collect-interval 5000 # 5 second collection
# nvtop installation and usage
# Install
sudo apt-get install nvtop
# Launch with storage-focused view
nvtop --no-plot # Disable GPU plot to see more processes
# Key metrics to watch for storage issues:
# - MEM: If GPU memory is full, cannot prefetch data
# - GPU%: Low utilization often indicates I/O bottleneck
# - PCIe RX: Direct indicator of data flowing to GPU
# Programmatic monitoring with py3nvml
import py3nvml.py3nvml as nvml
nvml.nvmlInit()
handle = nvml.nvmlDeviceGetHandleByIndex(0)
# Get PCIe throughput (NVML reports these counters in KB/s)
tx = nvml.nvmlDeviceGetPcieThroughput(handle, nvml.NVML_PCIE_UTIL_TX_BYTES)
rx = nvml.nvmlDeviceGetPcieThroughput(handle, nvml.NVML_PCIE_UTIL_RX_BYTES)
print(f"PCIe TX: {tx / 1e6:.2f} GB/s")
print(f"PCIe RX: {rx / 1e6:.2f} GB/s")
# If RX is low during training → storage bottleneck
# If GPU util is low + RX is low → data loading issue
# NVMe Autonomous Power State Transition (APST)
# List supported power states
nvme id-ctrl /dev/nvme0 | grep -A 20 "ps "
# ps 0: max_power: 25W, entry_lat: 0µs, exit_lat: 0µs (operational)
# ps 1: max_power: 18W, entry_lat: 0µs, exit_lat: 0µs (operational)
# ps 2: max_power: 12W, entry_lat: 0µs, exit_lat: 0µs (operational)
# ps 3: max_power: 5mW, entry_lat: 1000µs, exit_lat: 2000µs (non-op)
# ps 4: max_power: 3mW, entry_lat: 5000µs, exit_lat: 10000µs (non-op)
# For GPU workloads: DISABLE low power states
# The exit latency (10ms for PS4) will kill performance
# Disable APST entirely
nvme set-feature /dev/nvme0 -f 0x0c -v 0
# Or cap the allowed transition latency instead of disabling APST outright.
# The value is a latency tolerance in microseconds (not a power-state number):
# it excludes the deep non-operational states while keeping the fast ones.
echo 100 > /sys/block/nvme0n1/device/power/pm_qos_latency_tolerance_us
# Linux kernel APST configuration
# /etc/modprobe.d/nvme.conf
options nvme_core default_ps_max_latency_us=0 # Disable all non-operational states
# For training workloads (bursty I/O):
# allow shallow states but not deep sleep (balance power vs latency)
options nvme_core default_ps_max_latency_us=100
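Choosing that latency value from the drive's own power-state table can be mechanized. The kernel permits APST transitions whose combined entry and exit latency fits within `default_ps_max_latency_us`; the sketch below encodes that rule (verify against your kernel's nvme driver), using the example `id-ctrl` listing above as input:

```python
def max_latency_for_states(states, allow_up_to: int) -> int:
    """default_ps_max_latency_us value that permits states 0..allow_up_to."""
    allowed = states[:allow_up_to + 1]
    return max(ent + ex for _, ent, ex in allowed)

# (state, entry_lat_us, exit_lat_us) from the id-ctrl listing above
ps_table = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 1000, 2000), (4, 5000, 10000)]
print(max_latency_for_states(ps_table, 2))  # 0    -> operational states only
print(max_latency_for_states(ps_table, 3))  # 3000 -> also permits PS3
```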
# Production monitoring for power state issues
def check_nvme_power_state(device):
    """Monitor NVMe power state transitions"""
    import subprocess
    result = subprocess.run(
        ['nvme', 'get-feature', device, '-f', '0x02', '-H'],
        capture_output=True, text=True
    )
    # Parse current power state from the human-readable output
    current_ps = int(result.stdout.split('Power State:')[1].split()[0])
    if current_ps > 2:
        print(f"WARNING: {device} in low-power state PS{current_ps}")
        print("         Next I/O will have high latency!")
    return current_ps
# NVMe thermal monitoring and throttling detection
# Get thermal thresholds
nvme smart-log /dev/nvme0 | grep -i temp
# temperature: 45°C
# warning_temp_threshold: 70°C
# critical_temp_threshold: 80°C
# Thermal Management Temperature (TMT) - when throttling kicks in
nvme id-ctrl /dev/nvme0 | grep -i tmt
# TMT1: 0 (Light Throttling Temperature)
# TMT2: 0 (Heavy Throttling Temperature)
# Set thermal throttling thresholds (if configurable). Feature 0x10 packs
# both values into a single dword: TMT1 in bits 31:16, TMT2 in bits 15:0 (Kelvin)
nvme set-feature /dev/nvme0 -f 0x10 -v $(( (343 << 16) | 353 )) # TMT1=70°C, TMT2=80°C
# Continuous thermal monitoring
import json
import time
import subprocess

class NVMeThermalMonitor:
    """Monitor NVMe thermals with throttling detection"""

    def __init__(self, device, warning_temp=65, critical_temp=75):
        self.device = device
        self.warning_temp = warning_temp
        self.critical_temp = critical_temp
        self.throttle_count = 0
        self.last_temp = 0

    def check(self):
        result = subprocess.run(
            ['nvme', 'smart-log', self.device, '-o', 'json'],
            capture_output=True, text=True
        )
        smart = json.loads(result.stdout)
        temp = smart['temperature'] - 273  # Kelvin to Celsius
        throttle_events = smart.get('thm_temp1_trans_count', 0)
        status = 'OK'
        if temp >= self.critical_temp:
            status = 'CRITICAL'
        elif temp >= self.warning_temp:
            status = 'WARNING'
        if throttle_events > self.throttle_count:
            print(f"ALERT: {self.device} thermal throttling detected!")
            self.throttle_count = throttle_events
        self.last_temp = temp
        return {'temp': temp, 'status': status, 'throttle_events': throttle_events}
# GPU proximity thermal considerations
"""
Physical layout matters:

BAD:
+---------------------------------------------+
|  GPU (700W)  ->  exhaust  ->  NVMe (70°C)   |   NVMe blasted by GPU exhaust
+---------------------------------------------+

GOOD:
+---------------------------------------------+
|  NVMe (45°C)  ->  airflow  ->  GPU (700W)   |   NVMe upstream, cool intake air
+---------------------------------------------+

• SSDs should be UPSTREAM of GPU airflow
• Minimum 2 slots spacing if possible
• Use NVMe heatsinks (they actually help)
• Monitor ambient temp in server room
"""
# AWS P5/P4d instance storage characteristics
# P5.48xlarge (H100 × 8)
# - Instance storage: 8 × 3.84 TB NVMe (30.72 TB total)
# - Storage bandwidth: ~200 GB/s aggregate
# - GDS: Supported with NVIDIA driver ≥ 525
# Optimal AWS P5 storage configuration
# 1. RAID0 the instance NVMe drives for max bandwidth
sudo mdadm --create /dev/md0 --level=0 --raid-devices=8 \
/dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
/dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1
# 2. Format with XFS (better for large files than ext4)
sudo mkfs.xfs -f -d su=256k,sw=8 /dev/md0
sudo mount -o noatime,nodiratime,discard /dev/md0 /mnt/nvme
# 3. Enable GDS
# AWS AMI with NVIDIA drivers should have GDS ready
sudo modprobe nvidia_fs
# 4. Verify GDS is working
/usr/local/cuda/gds/tools/gdscheck -p
# GDS Driver Status: OK
# Platform compatibility: SUPPORTED
# AWS-specific limitations:
# - Cannot change NVMe firmware
# - Cannot access NVMe-MI
# - Instance store is EPHEMERAL (lost on stop/terminate)
# - EBS volumes don't support GDS well (network latency)
# Best practice: Use instance store for training data, EBS for checkpoints
# /mnt/nvme - Training data (fast, ephemeral)
# /mnt/efs - Checkpoints (slower, durable)
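To sanity-check the RAID0 payoff, here is rough checkpoint-drain arithmetic under stated assumptions: ~6 GB/s sustained write per drive (so ~48 GB/s across eight), and ~16 bytes/param for bf16 weights plus fp32 Adam state. Both figures are assumptions for illustration, not measurements.

```python
def checkpoint_seconds(params_billion: float, bytes_per_param: float,
                       write_gbps: float) -> float:
    """Seconds to flush one full checkpoint at a sustained write rate."""
    size_gb = params_billion * bytes_per_param  # GB, since params are in billions
    return size_gb / write_gbps

# 70B params * 16 B/param = 1120 GB; at 48 GB/s aggregate -> ~23 s
print(round(checkpoint_seconds(70, 16, 48), 1))
```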
# Azure ND H100 v5 series
# Standard_ND96isr_H100_v5 (H100 × 8)
# - Temp storage: 7.5 TB NVMe
# - InfiniBand: 400 Gb/s NDR
# Azure-specific storage setup
# 1. Find NVMe devices (Azure uses different naming)
lsblk -d | grep nvme
# 2. Azure Managed Lustre for distributed training
# Much better than local storage for multi-node
sudo lustre_client_install.sh
sudo mount -t lustre <mgs-ip>@tcp:/lustre /mnt/lustre
# 3. Azure Blob with BlobFuse2 for checkpoint storage
# Better durability than local NVMe
blobfuse2 mount /mnt/blob \
--config-file=/etc/blobfuse2.yaml \
--disable-writeback-cache=true \
--file-cache-timeout-in-seconds=0
# Azure limitations:
# - Temp storage is local NVMe but capacity varies
# - No direct GDS support officially documented
# - Best for IB-connected cluster storage (Lustre)
# GCP A3 (H100 × 8) and A2 (A100 × 16) instances
# a3-highgpu-8g (H100 × 8)
# - Local SSD: Up to 6 TB (16 × 375 GB devices)
# - Network: 200 Gbps
# GCP local SSD setup for GPU workloads
# 1. Create instance with local SSDs
gcloud compute instances create gpu-training \
--machine-type=a3-highgpu-8g \
--zone=us-central1-c \
--local-ssd=interface=NVME \
--local-ssd=interface=NVME \
--local-ssd=interface=NVME \
--local-ssd=interface=NVME \
--image-family=pytorch-latest-gpu \
--image-project=deeplearning-platform-release
# 2. RAID the local SSDs
# GCP presents them as /dev/nvme0n*
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
/dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4
# 3. GCS FUSE for durable storage
gcsfuse --implicit-dirs \
--file-mode=666 \
--dir-mode=777 \
--stat-cache-capacity=1000000 \
--type-cache-max-size-mb=1024 \
my-training-bucket /mnt/gcs
# GCP-specific advantages:
# - NVIDIA T4/V100/A100/H100 all available
# - Good GCS integration for checkpoints
# - Filestore (managed NFS) good for shared data
# GCP limitations:
# - Local SSD count/size fixed at instance creation
# - No GDS official support in documentation
| Feature | AWS P5 | Azure NDv5 | GCP A3 |
|---|---|---|---|
| Local NVMe | 8 × 3.84 TB | ~7.5 TB | Up to 6 TB |
| GDS Support | Yes (verified) | Unofficial | Unofficial |
| Network Storage | EBS, EFS, FSx | Managed Lustre, Blob | Filestore, GCS |
| Best For | Large-scale training | IB cluster training | Flexible workloads |
| Specification | H100 (Current) | B100/B200 (Shipping to partners) |
|---|---|---|
| HBM Bandwidth | 3.35 TB/s | 8 TB/s |
| HBM Capacity | 80 GB | 192 GB |
| NVLink Bandwidth | 900 GB/s | 1.8 TB/s |
| Storage Implication | 10 GB/s adequate | 20-50+ GB/s needed (multi-SSD or parallel FS) |
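The storage-implication row is bandwidth arithmetic: the time to drain one HBM-sized snapshot to storage, which is roughly what a synchronous checkpoint of device state costs.

```python
def hbm_drain_seconds(hbm_gb: float, storage_gbps: float) -> float:
    """Seconds to write a full HBM snapshot at a given storage bandwidth."""
    return hbm_gb / storage_gbps

print(hbm_drain_seconds(80, 10))   # H100 at 10 GB/s -> 8.0 s
print(hbm_drain_seconds(192, 10))  # B200 at 10 GB/s -> 19.2 s
print(hbm_drain_seconds(192, 40))  # B200 at 40 GB/s -> 4.8 s
```

Keeping drain time constant as HBM grows 2.4x is what drives the 20-50+ GB/s requirement in the table.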
# AMD MI325X (Available) and MI400 (expected 2026) planning
# MI325X specifications (shipping 2024)
MI325X = {
'hbm_capacity': '256 GB', # Massive! Reduces storage pressure
'hbm_bandwidth': '6 TB/s',
'infinity_fabric': '896 GB/s', # GPU-GPU interconnect
'pcie': 'Gen5 x16', # 64 GB/s to storage
'storage_interface': 'ROCm + native NVMe',
}
# AMD advantage: 256 GB HBM means
# - Entire 70B model fits in single GPU HBM
# - Less reliance on NVMe offload
# - Checkpointing becomes main storage bottleneck
# MI400 (CDNA 4, expected 2026)
# - Expected 288-384 GB HBM4
# - CXL 3.0 support likely
# - UALink interconnect (alternative to NVLink)
# Storage planning for AMD:
# 1. ROCm doesn't have GDS equivalent (yet)
# 2. Focus on large sequential checkpoint writes
# 3. RAID NVMe arrays still essential
# 4. Watch for AMD's CXL storage announcements
- PCIe Gen5 SSDs: 14+ GB/s sequential, critical for current H100/MI300 deployments. Examples: Samsung PM9A3, Solidigm D7-P5520, Kioxia CM7.
- CXL 2.0 memory expansion: DRAM-backed CXL devices shipping, ~200ns latency (Samsung CMM-D, CXL expanders). Good for GPU memory expansion, not storage.
- PCIe Gen6 SSDs: ~28 GB/s sequential will help close the GPU-storage gap. Watch for Samsung and Kioxia announcements.
- CXL 3.x: may enable new memory-tier devices, though block storage over CXL.io keeps command semantics; sub-µs storage access for GPUs remains speculative and medium-dependent.
- UALink: industry alternative to NVLink. Storage integration is unclear, but it could eventually enable direct GPU-storage protocols.