
The Storage Evolution

Understanding where we came from helps explain why NVMe exists and why it's still not enough for GPUs.

1990s - 2000s
PATA/IDE Era
Parallel ATA. 133 MB/s max. Single drive per cable. The "ribbon cable" days.
2000s - 2010s
SATA + AHCI
Serial ATA. 600 MB/s max. The AHCI command set was designed for spinning disks: one command queue, 32 entries deep. Still used today for HDDs.
2010s
SAS for Enterprise
Serial Attached SCSI. 12 Gb/s. Dual-port reliability. Enterprise standard but expensive.
2011 - Present
NVMe Revolution
Direct PCIe connection. 64K queues × 64K depth. Purpose-built for flash. The current standard for performance storage.

Why NVMe Won

SATA/AHCI: 1 command queue, queue depth 32, ~0.6 GB/s max bandwidth (SATA 6 Gb/s)
SAS: 1 command queue, queue depth 254, ~2.4 GB/s max bandwidth (dual-port 12 Gb/s)
NVMe: 65,535 command queues, queue depth 65,536 each, ~16 GB/s (PCIe Gen5 x4)
💡 The Key Insight

AHCI was designed when a single HDD could do ~100 IOPS. Modern NVMe SSDs can do 1,000,000+ IOPS. The old command model became the bottleneck, not the storage media.
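The arithmetic behind that insight can be sketched with Little's law: sustained IOPS ≈ outstanding commands ÷ per-command latency. The 100 μs flash read latency and the "queues actually in use" figure below are illustrative assumptions, not measurements:

```python
# Little's law: sustained IOPS ceiling = outstanding commands / per-command latency.
# The latency and queue-usage figures are illustrative assumptions.

def max_iops(queue_depth: int, latency_s: float) -> float:
    """Protocol-level upper bound on IOPS for one device."""
    return queue_depth / latency_s

# AHCI: one queue, depth 32, against an assumed ~100 us flash read.
ahci = max_iops(32, 100e-6)          # 320,000 IOPS ceiling

# NVMe: say 64 queues of depth 1024 in active use (the spec allows far more).
nvme = max_iops(64 * 1024, 100e-6)   # ~655 million IOPS ceiling

print(f"AHCI ceiling: {ahci:,.0f} IOPS")
print(f"NVMe ceiling: {nvme:,.0f} IOPS")
```

In practice flash and controller limits bite long before either ceiling, but the point stands: AHCI's protocol ceiling sits below the 1,000,000+ IOPS a single modern SSD can actually deliver.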

PCIe Fundamentals

NVMe rides on PCIe (Peripheral Component Interconnect Express). Understanding PCIe is essential for GPU-storage optimization.

PCIe Gen3: ~1 GB/s per lane (x4 = 4 GB/s, x16 = 16 GB/s)
PCIe Gen4: ~2 GB/s per lane (x4 = 8 GB/s, x16 = 32 GB/s)
PCIe Gen5: ~4 GB/s per lane (x4 = 16 GB/s, x16 = 64 GB/s)
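These per-lane figures fall straight out of the signaling rate and the 128b/130b line encoding that PCIe uses from Gen3 onward; a quick check in Python:

```python
# Per-lane PCIe bandwidth = transfer rate (GT/s) x encoding efficiency / 8 bits.
# Gen3 onward use 128b/130b encoding (Gen1/2 used the less efficient 8b/10b).

GENS = {"Gen3": 8.0, "Gen4": 16.0, "Gen5": 32.0}  # GT/s per lane

per_lane = {name: gt * (128 / 130) / 8 for name, gt in GENS.items()}  # GB/s

for name, gb in per_lane.items():
    print(f"{name}: {gb:.2f} GB/s/lane, x4 = {4 * gb:.1f} GB/s, "
          f"x16 = {16 * gb:.1f} GB/s")
```

The exact values (e.g. ~3.94 GB/s per Gen5 lane, ~15.75 GB/s for x4) round to the headline numbers above.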

PCIe Topology Basics

CPU / Root Complex
The "root" of all PCIe. Contains memory controller, PCIe lanes.
PCIe Switch
Expands lanes. Enables peer-to-peer (P2P) between devices.
Endpoints (GPU, NVMe)
The actual devices. Each gets BAR (Base Address Register) mappings.

NVMe Architecture Overview

Controller: The NVMe device's processor. Handles commands, manages flash.
Namespace: A logical partition of storage. Like a "virtual drive" on one physical SSD.
Submission Queue (SQ): Where the host places commands. A circular buffer in host memory.
Completion Queue (CQ): Where the controller places results. Also in host memory.
Doorbell: An MMIO register that notifies the controller of new commands. This is the bottleneck for GPUs.
Admin Queue: A special queue pair for management commands (create queues, identify, etc.).
I/O Queues: Queue pairs for actual read/write operations. Up to 65,535 of them.
PRPs / SGLs: Physical Region Pages / Scatter-Gather Lists. How NVMe knows where to DMA data.
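To make the SQ/CQ/doorbell mechanics concrete, here is a toy Python model of one I/O queue pair. The field names and dict-based "commands" are simplified illustrations, not the real 64-byte NVMe command format:

```python
# Toy model of one NVMe I/O queue pair (illustrative, not the wire format).

class ToyQueuePair:
    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth   # submission queue: circular buffer in host memory
        self.sq_tail = 0           # host-owned write index
        self.sq_head = 0           # controller-owned read index
        self.doorbell = 0          # in real hardware: an MMIO register on the device
        self.cq = []               # completion queue (also in host memory)

    def submit(self, cmd):
        """Host side: place a command in the SQ, then ring the doorbell."""
        self.sq[self.sq_tail] = cmd
        self.sq_tail = (self.sq_tail + 1) % self.depth
        self.doorbell = self.sq_tail   # one MMIO write announces the new tail

    def controller_step(self):
        """Device side: consume every entry between head and the doorbell."""
        while self.sq_head != self.doorbell:
            cmd = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq.append({"cid": cmd["cid"], "status": 0})  # 0 = success

qp = ToyQueuePair(depth=8)
qp.submit({"opcode": "READ", "cid": 1, "lba": 0, "nlb": 8})
qp.submit({"opcode": "READ", "cid": 2, "lba": 8, "nlb": 8})
qp.controller_step()
print([c["cid"] for c in qp.cq])  # → [1, 2]
```

Note that the host can batch many `submit` calls behind a single doorbell write; that batching is exactly what a sea of independent GPU threads cannot easily do.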
⚠️ The GPU Problem Preview

NVMe scales across many CPU cores via multiple queues, but assumes a CPU-managed control plane (MMIO doorbells, queue pointer management, memory ordering). GPUs have ~100,000+ threads. When thousands of GPU threads try to ring the same doorbell register, they serialize. This is the core problem we'll explore in later chapters.
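A crude serialization model shows the scale of the problem. The 500 ns cost per uncached MMIO write is an assumed figure for illustration only:

```python
# Rough model: every doorbell ring is one serialized, uncached MMIO write.
MMIO_WRITE_NS = 500   # assumed cost of one posted write over PCIe (illustrative)

def doorbell_time_us(n_rings: int) -> float:
    """Total time spent ringing if every submission gets its own MMIO write."""
    return n_rings * MMIO_WRITE_NS / 1_000

# One CPU thread batching 4096 commands behind a single ring, versus
# 4096 GPU threads each ringing the shared doorbell for their own command.
print(f"1 batched ring:      {doorbell_time_us(1):8.1f} us")
print(f"4096 separate rings: {doorbell_time_us(4096):8.1f} us")
```

The absolute numbers are made up; the linear growth is the point. Rings to a single register cannot proceed in parallel, so per-thread submission cost scales with thread count.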

Latency Numbers Every Engineer Should Know

L1 Cache: ~1 ns
L3 Cache: ~10 ns
DRAM: ~100 ns
NVMe SSD: ~10-100 μs
HDD: ~5-10 ms

Note: NVMe is ~100-1000× faster than HDD, but still ~1000× slower than DRAM. For GPU workloads processing data at TB/s, even NVMe becomes a bottleneck.
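To put rough numbers on that bottleneck, a back-of-envelope sketch. The 1 TB/s GPU ingest rate and the 14 GB/s sustained NVMe read rate are illustrative assumptions:

```python
# Back-of-envelope: feeding a fast GPU from NVMe. Both rates are assumptions.
GPU_DEMAND_GB_S = 1_000   # assumed GPU ingest rate, ~1 TB/s (HBM-class memory)
NVME_GB_S = 14            # assumed sustained read for one Gen5 x4 SSD

drives = GPU_DEMAND_GB_S / NVME_GB_S
print(f"Drives needed to keep the GPU fed: ~{drives:.0f}")

# Latency side: a mid-range NVMe access (~50 us) vs a DRAM access (~100 ns).
print(f"NVMe/DRAM latency ratio: ~{50_000 / 100:.0f}x")
```

Tens of drives per GPU on the bandwidth side, and a latency gap of several hundred times: both gaps motivate the GPU-direct storage techniques covered in later chapters.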