The Bandwidth-Compute Gap

Why Modern AI Infrastructure Starves for Data

NVIDIA B200
Compute (FP16 Tensor, w/ sparsity)
4,500
TFLOPS
HBM3e Bandwidth
8
TB/s
HBM Capacity
192
GB
THE PROBLEM: Attention is Memory-Bound

Transformer attention: dominated by KV-cache reads

Few FLOPs per byte loaded, far below the GPU's compute-to-bandwidth ratio → GPU stalls waiting for data
Llama-70B KV-Cache Size
80 layers × 8 KV-heads × 128 dim × seq_len × batch × 4 bytes (K + V at FP16)
128K Context
Batch 1
41 GB
Batch 32
1.31 TB
4K Context
Batch 1
1.3 GB
Batch 32
41 GB
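The table above follows directly from the formula. A minimal sketch (the 4-byte factor covers both the K and V tensors at 2 bytes each in FP16; sizes reported in GiB, so rounding differs slightly from the figures above):

```python
def llama70b_kv_cache_bytes(seq_len: int, batch: int) -> int:
    """KV-cache size for Llama-70B: 80 layers, GQA with 8 KV heads of dim 128."""
    n_layers, n_kv_heads, head_dim = 80, 8, 128
    bytes_k_and_v = 4  # 2 tensors (K and V) x 2 bytes/element (FP16)
    return n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_k_and_v

GiB = 2**30
print(llama70b_kv_cache_bytes(128 * 1024, 1) / GiB)   # 128K ctx, batch 1  -> 40.0
print(llama70b_kv_cache_bytes(128 * 1024, 32) / GiB)  # 128K ctx, batch 32 -> 1280.0
print(llama70b_kv_cache_bytes(4 * 1024, 32) / GiB)    # 4K ctx,   batch 32 -> 40.0
```

Note the symmetry: 128K context at batch 1 and 4K context at batch 32 have identical working sets, since size scales linearly in both seq_len and batch.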
CAPACITY WALL
KV-Cache (Batch 32)
1,280 GB
÷
HBM Capacity
192 GB
Working set exceeds on-chip memory by
~7×
BANDWIDTH WALL
HBM3e Bandwidth
8,000 GB/s
÷
PCIe 5.0 Sustained
51 GB/s
Off-chip access penalty
~157×
LATENCY IMPACT: 40 GB KV-Cache Read
From HBM
5 ms
40 GB ÷ 8,000 GB/s
From PCIe
780 ms
40 GB ÷ 51 GB/s
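Both walls and the latency figures are single divisions over the headline B200 numbers; a quick sanity-check sketch (the 51 GB/s sustained PCIe 5.0 x16 figure is taken from the section above):

```python
# Headline numbers from the sections above.
hbm_capacity_gb = 192   # B200 HBM3e capacity
hbm_bw_gbps = 8_000     # B200 HBM3e bandwidth, GB/s
pcie_bw_gbps = 51       # sustained PCIe 5.0 x16, one direction
kv_cache_gb = 1_280     # Llama-70B KV-cache: 128K context, batch 32

capacity_gap = kv_cache_gb / hbm_capacity_gb  # ~6.7x: working set vs on-chip memory
bandwidth_gap = hbm_bw_gbps / pcie_bw_gbps    # ~157x: off-chip access penalty

read_gb = 40  # one full KV-cache read
print(f"from HBM:  {read_gb / hbm_bw_gbps * 1000:.0f} ms")   # from HBM:  5 ms
print(f"from PCIe: {read_gb / pcie_bw_gbps * 1000:.0f} ms")  # from PCIe: 784 ms
```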
THE BOTTLENECK
The GPU doesn't lack compute. It lacks data.
Capacity gap (~7×) forces off-chip storage. Bandwidth gap (~157×) makes off-chip access catastrophic.
Sources
B200 specs: NVIDIA Datasheet | Llama-70B: NVIDIA NeMo / Meta