NVMe, PCIe, and the evolution from spinning rust to flash — the foundation you need before diving into GPU storage challenges.
Understanding where we came from helps explain why NVMe exists and why it's still not enough for GPUs.
AHCI was designed when a single HDD could do ~100 IOPS, and its command model reflects that: one queue, 32 outstanding commands. Modern NVMe SSDs can do 1,000,000+ IOPS, and NVMe allows up to ~64K queues with up to ~64K commands each. The old command model, not the storage media, became the bottleneck.
NVMe rides on PCIe (Peripheral Component Interconnect Express), the same interconnect GPUs attach to, so storage and accelerators share both the fabric and its limits. Understanding PCIe is essential for GPU-storage optimization.
NVMe scales across many CPU cores via multiple queues, but assumes a CPU-managed control plane (MMIO doorbells, queue pointer management, memory ordering). GPUs have ~100,000+ threads. When thousands of GPU threads try to ring the same doorbell register, they serialize. This is the core problem we'll explore in later chapters.
Note: NVMe is ~100-1000× faster than HDD, but still roughly ~1000× slower than DRAM in both latency (tens of microseconds vs ~100 ns) and bandwidth. For GPU workloads that consume data at TB/s, even NVMe becomes a bottleneck.