
The Storage Evolution

Understanding where we came from helps explain why NVMe exists and why it's still not enough for GPUs.

1990s - 2000s
PATA/IDE Era
Parallel ATA. 133 MB/s max. Single drive per cable. The "ribbon cable" days.
2000s - 2010s
SATA + AHCI
Serial ATA. 600 MB/s max. The AHCI command set was designed for spinning disks: one command queue, 32 entries deep. Still used today for HDDs.
2010s
SAS for Enterprise
Serial Attached SCSI. 12 Gb/s. Dual-port reliability. Enterprise standard but expensive.
2011 - Present
NVMe Revolution
Direct PCIe connection. 64K queues × 64K depth. Purpose-built for flash. The current standard for performance storage.

Why NVMe Won

SATA/AHCI: 1 command queue, queue depth 32, ~0.6 GB/s max bandwidth (SATA 6 Gb/s)
SAS: 1 command queue, queue depth 254, ~2.4 GB/s max bandwidth (dual-port 12 Gb/s)
NVMe: 65,535 command queues, queue depth 65,536 each, ~16 GB/s (PCIe Gen5 x4)
💡 The Key Insight

AHCI was designed when a single HDD could do ~100 IOPS. Modern NVMe SSDs can do 1,000,000+ IOPS. The old command model became the bottleneck, not the storage media.
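The arithmetic behind that insight can be sketched with Little's law: sustained IOPS ≈ outstanding commands ÷ per-command latency. The 100 μs flash read latency and the "queues actually in use" figure below are illustrative assumptions, not measurements:

```python
# Little's law: sustained IOPS ceiling = outstanding commands / per-command latency.
# The latency and queue-usage figures are illustrative assumptions.

def max_iops(queue_depth: int, latency_s: float) -> float:
    """Protocol-level upper bound on IOPS for one device."""
    return queue_depth / latency_s

# AHCI: one queue, depth 32, against an assumed ~100 us flash read.
ahci = max_iops(32, 100e-6)          # 320,000 IOPS ceiling

# NVMe: say 64 queues of depth 1024 in active use (the spec allows far more).
nvme = max_iops(64 * 1024, 100e-6)   # ~655 million IOPS ceiling

print(f"AHCI ceiling: {ahci:,.0f} IOPS")
print(f"NVMe ceiling: {nvme:,.0f} IOPS")
```

In practice flash and controller limits bite long before either ceiling, but the point stands: AHCI's protocol ceiling sits below the 1,000,000+ IOPS a single modern SSD can actually deliver.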

PCIe Fundamentals

NVMe rides on PCIe (Peripheral Component Interconnect Express). Understanding PCIe is essential for GPU-storage optimization.

PCIe Gen3: ~1 GB/s per lane (x4 = 4 GB/s, x16 = 16 GB/s)
PCIe Gen4: ~2 GB/s per lane (x4 = 8 GB/s, x16 = 32 GB/s)
PCIe Gen5: ~4 GB/s per lane (x4 = 16 GB/s, x16 = 64 GB/s)
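These per-lane figures fall straight out of the signaling rate and the 128b/130b line encoding that PCIe uses from Gen3 onward; a quick check in Python:

```python
# Per-lane PCIe bandwidth = transfer rate (GT/s) x encoding efficiency / 8 bits.
# Gen3 onward use 128b/130b encoding (Gen1/2 used the less efficient 8b/10b).

GENS = {"Gen3": 8.0, "Gen4": 16.0, "Gen5": 32.0}  # GT/s per lane

per_lane = {name: gt * (128 / 130) / 8 for name, gt in GENS.items()}  # GB/s

for name, gb in per_lane.items():
    print(f"{name}: {gb:.2f} GB/s/lane, x4 = {4 * gb:.1f} GB/s, "
          f"x16 = {16 * gb:.1f} GB/s")
```

The exact values (e.g. ~3.94 GB/s per Gen5 lane, ~15.75 GB/s for x4) round to the headline numbers above.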

PCIe Topology Basics

CPU / Root Complex
The "root" of all PCIe. Contains memory controller, PCIe lanes.
PCIe Switch
Expands lanes. Enables peer-to-peer (P2P) between devices.
Endpoints (GPU, NVMe)
The actual devices. Each gets BAR (Base Address Register) mappings.

NVMe Architecture Overview

Controller: The NVMe device's processor. Handles commands, manages flash.
Namespace: A logical partition of storage. Like a "virtual drive" on one physical SSD.
Submission Queue (SQ): Where the host places commands. A circular buffer in host memory.
Completion Queue (CQ): Where the controller places results. Also in host memory.
Doorbell: An MMIO register that notifies the controller of new commands. This is the bottleneck for GPUs.
Admin Queue: A special queue pair for management commands (create queues, identify, etc.).
I/O Queues: Queue pairs for actual read/write operations. Up to 65,535 of them.
PRPs / SGLs: Physical Region Pages / Scatter-Gather Lists. How NVMe knows where to DMA data.
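To make the SQ/CQ/doorbell mechanics concrete, here is a toy Python model of one I/O queue pair. The field names and dict-based "commands" are simplified illustrations, not the real 64-byte NVMe command format:

```python
# Toy model of one NVMe I/O queue pair (illustrative, not the wire format).

class ToyQueuePair:
    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth   # submission queue: circular buffer in host memory
        self.sq_tail = 0           # host-owned write index
        self.sq_head = 0           # controller-owned read index
        self.doorbell = 0          # in real hardware: an MMIO register on the device
        self.cq = []               # completion queue (also in host memory)

    def submit(self, cmd):
        """Host side: place a command in the SQ, then ring the doorbell."""
        self.sq[self.sq_tail] = cmd
        self.sq_tail = (self.sq_tail + 1) % self.depth
        self.doorbell = self.sq_tail   # one MMIO write announces the new tail

    def controller_step(self):
        """Device side: consume every entry between head and the doorbell."""
        while self.sq_head != self.doorbell:
            cmd = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq.append({"cid": cmd["cid"], "status": 0})  # 0 = success

qp = ToyQueuePair(depth=8)
qp.submit({"opcode": "READ", "cid": 1, "lba": 0, "nlb": 8})
qp.submit({"opcode": "READ", "cid": 2, "lba": 8, "nlb": 8})
qp.controller_step()
print([c["cid"] for c in qp.cq])  # → [1, 2]
```

Note that the host can batch many `submit` calls behind a single doorbell write; that batching is exactly what a sea of independent GPU threads cannot easily do.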
⚠️ The GPU Problem Preview

NVMe scales across many CPU cores via multiple queues, but assumes a CPU-managed control plane (MMIO doorbells, queue pointer management, memory ordering). GPUs have ~100,000+ threads. When thousands of GPU threads try to ring the same doorbell register, they serialize. This is the core problem we'll explore in later chapters.
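A crude serialization model shows the scale of the problem. The 500 ns cost per uncached MMIO write is an assumed figure for illustration only:

```python
# Rough model: every doorbell ring is one serialized, uncached MMIO write.
MMIO_WRITE_NS = 500   # assumed cost of one posted write over PCIe (illustrative)

def doorbell_time_us(n_rings: int) -> float:
    """Total time spent ringing if every submission gets its own MMIO write."""
    return n_rings * MMIO_WRITE_NS / 1_000

# One CPU thread batching 4096 commands behind a single ring, versus
# 4096 GPU threads each ringing the shared doorbell for their own command.
print(f"1 batched ring:      {doorbell_time_us(1):8.1f} us")
print(f"4096 separate rings: {doorbell_time_us(4096):8.1f} us")
```

The absolute numbers are made up; the linear growth is the point. Rings to a single register cannot proceed in parallel, so per-thread submission cost scales with thread count.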

Latency Numbers Every Engineer Should Know

L1 Cache: ~1 ns
L3 Cache: ~10 ns
DRAM: ~100 ns
NVMe SSD: ~10-100 μs
HDD: ~5-10 ms

Note: NVMe is ~100-1000× faster than HDD, but still ~1000× slower than DRAM. For GPU workloads processing data at TB/s, even NVMe becomes a bottleneck.
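To put rough numbers on that bottleneck, a back-of-envelope sketch. The 1 TB/s GPU ingest rate and the 14 GB/s sustained NVMe read rate are illustrative assumptions:

```python
# Back-of-envelope: feeding a fast GPU from NVMe. Both rates are assumptions.
GPU_DEMAND_GB_S = 1_000   # assumed GPU ingest rate, ~1 TB/s (HBM-class memory)
NVME_GB_S = 14            # assumed sustained read for one Gen5 x4 SSD

drives = GPU_DEMAND_GB_S / NVME_GB_S
print(f"Drives needed to keep the GPU fed: ~{drives:.0f}")

# Latency side: a mid-range NVMe access (~50 us) vs a DRAM access (~100 ns).
print(f"NVMe/DRAM latency ratio: ~{50_000 / 100:.0f}x")
```

Tens of drives per GPU on the bandwidth side, and a latency gap of several hundred times: both gaps motivate the GPU-direct storage techniques covered in later chapters.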