The Bandwidth-Compute Gap

Why Modern AI Infrastructure Starves for Data

NVIDIA B200
Compute (FP16 Tensor, w/ sparsity)
4,500
TFLOPS
HBM3e Bandwidth
8
TB/s
HBM Capacity
192
GB
THE PROBLEM: Attention is Memory-Bound

Transformer attention: dominated by KV-cache reads

Few FLOPs per byte loaded, far below the GPU's compute-to-bandwidth ratio → GPU stalls waiting for data
Llama-70B KV-Cache Size
80 layers × 8 KV-heads × 128 dim × seq_len × batch × 4 bytes (K + V at FP16)
128K Context
Batch 1
41 GB
Batch 32
1.31 TB
4K Context
Batch 1
1.3 GB
Batch 32
41 GB
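The table above follows directly from the formula. A minimal sketch (the 4-byte factor covers both the K and V tensors at 2 bytes each in FP16; sizes reported in GiB, so rounding differs slightly from the figures above):

```python
def llama70b_kv_cache_bytes(seq_len: int, batch: int) -> int:
    """KV-cache size for Llama-70B: 80 layers, GQA with 8 KV heads of dim 128."""
    n_layers, n_kv_heads, head_dim = 80, 8, 128
    bytes_k_and_v = 4  # 2 tensors (K and V) x 2 bytes/element (FP16)
    return n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_k_and_v

GiB = 2**30
print(llama70b_kv_cache_bytes(128 * 1024, 1) / GiB)   # 128K ctx, batch 1  -> 40.0
print(llama70b_kv_cache_bytes(128 * 1024, 32) / GiB)  # 128K ctx, batch 32 -> 1280.0
print(llama70b_kv_cache_bytes(4 * 1024, 32) / GiB)    # 4K ctx,   batch 32 -> 40.0
```

Note the symmetry: 128K context at batch 1 and 4K context at batch 32 have identical working sets, since size scales linearly in both seq_len and batch.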
CAPACITY WALL
KV-Cache (Batch 32)
1,280 GB
÷
HBM Capacity
192 GB
Working set exceeds on-chip memory by
~7×
BANDWIDTH WALL
HBM3e Bandwidth
8,000 GB/s
÷
PCIe 5.0 Sustained
51 GB/s
Off-chip access penalty
~157×
LATENCY IMPACT: 40 GB KV-Cache Read
From HBM
5 ms
40 GB ÷ 8,000 GB/s
From PCIe
780 ms
40 GB ÷ 51 GB/s
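Both walls and the latency figures are single divisions over the headline B200 numbers; a quick sanity-check sketch (the 51 GB/s sustained PCIe 5.0 x16 figure is taken from the section above):

```python
# Headline numbers from the sections above.
hbm_capacity_gb = 192   # B200 HBM3e capacity
hbm_bw_gbps = 8_000     # B200 HBM3e bandwidth, GB/s
pcie_bw_gbps = 51       # sustained PCIe 5.0 x16, one direction
kv_cache_gb = 1_280     # Llama-70B KV-cache: 128K context, batch 32

capacity_gap = kv_cache_gb / hbm_capacity_gb  # ~6.7x: working set vs on-chip memory
bandwidth_gap = hbm_bw_gbps / pcie_bw_gbps    # ~157x: off-chip access penalty

read_gb = 40  # one full KV-cache read
print(f"from HBM:  {read_gb / hbm_bw_gbps * 1000:.0f} ms")   # from HBM:  5 ms
print(f"from PCIe: {read_gb / pcie_bw_gbps * 1000:.0f} ms")  # from PCIe: 784 ms
```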
THE BOTTLENECK
The GPU doesn't lack compute. It lacks data.
Capacity gap (~7×) forces off-chip storage. Bandwidth gap (~157×) makes off-chip access catastrophic.
Sources
B200 specs: NVIDIA Datasheet | Llama-70B: NVIDIA NeMo / Meta