
"I have framed the GPU-storage problem space with publication-quality technical documentation. The 14 challenges taxonomy is genuinely useful for architects designing AI infrastructure. This is among the best GPU-storage integration documentation outside of internal NVIDIA/Micron engineering docs."

— Sam Pooni, 30-Year Storage Industry Veteran, HPC/AI/Storage
📋 ASSUMPTIONS & SCOPE
Focus is GPU-centric AI/ML training pipelines. The CPU still owns the control plane today. Many NVMe features are optional or vendor-specific, and the CXL and UEC coverage reflects emerging standards—validate vendor support before making deployment decisions.
GPU-NVMe Technical Documentation

Storage is the Bottleneck
A GPU-NVMe Technical Deep Dive

🎮 GPU Architecture 💾 NVMe Protocol ⚡ Performance Critical
  • NVMe scales well with host parallelism via many submission/completion queues, but it assumes a host-driven control plane (doorbells, completions, polling/interrupts) that is largely CPU-mediated.
  • Modern GPU workloads can consume data extremely fast, and when the input/checkpoint pipeline isn't carefully engineered, latency and control-plane overhead become visible bottlenecks.
  • Evolving GPU-centric storage is less about raw SSD peak GB/s and more about reducing submission/completion overhead, improving async/batched I/O, and aligning storage semantics with GPU pipelines.
4 Main Chapters • 5 GPU/CUDA Sections • 13 NVMe Sections • 1MB+ Documentation

GPU-NVMe-Fabric Data Flow Architecture


[Figure: GPU-NVMe-fabric data flow. An NVIDIA B200 Blackwell GPU (≈18,000 CUDA cores, 192 GB HBM3e at 8 TB/s) connects over PCIe Gen5 x16 to an NVMe SSD (controller + NAND, ~14 GB/s) and to a 400 Gb/s RDMA NIC (InfiniBand/RoCE) reaching remote storage. Host memory holds the NVMe SQ/CQ and RDMA SQ/RQ. Flow steps: (1) Fabrics command, (2) data fetch, (3) NVMe command submission, (4) doorbell write, (5) command + data transfer, (6) completion, (7) Fabrics response, (8) GPUDirect data path. Legend: doorbell/control, data path, completion, fabric/RDMA.]

Main Chapters

Understanding why GPU-NVMe integration requires fundamental changes

01 · Motivation: AI & Storage

Why storage is the critical bottleneck for AI infrastructure

  • GPU-centric AI infrastructure
  • Training pipeline data flows
  • CPU-mediated vs GPUDirect paths

02 · Implementation Challenges

Current NVMe limitations & GPU-optimized recommendations

  • NVMe assumes a CPU-mediated control plane
  • Doorbell serialization crisis
  • Interrupt-driven I/O vs GPU polling

03 · Solutions Architecture

Technology roadmap & emerging standards

  • GPUDirect Storage deep dive
  • CXL memory semantics
  • NVMe protocol enhancements (shadow doorbells, batched submission)

04 · Advanced & Hard Truths

Real-world architecture & honest assessments

  • What actually works today
  • Production deployment patterns
  • Cost vs performance tradeoffs

Technical Appendices

Deep-dive reference documentation with interactive visualizations

Key Topics

Technical Sources

SNIA SDC 2025 — Micron Technology

"Why does NVMe need to evolve for efficient storage access from GPUs?"

Chandra Guda (SMTS), Suresh Rajgopal (DMTS), Pierre Labat (SMTS)
SNIA Developer Conference, Hyatt Regency Santa Clara, CA
September 15-17, 2025

www.sniadeveloper.org

NVMe Specification

NVM Express Base Specification covering queue architecture, doorbell mechanisms, and command structures.

NVIDIA Documentation

CUDA Programming Guide, GPUDirect Storage documentation, and GPU architecture whitepapers.

Research Papers

BaM (Big Accelerator Memory), GPU-initiated I/O research, and PCIe topology analysis.