Cache Management Placement

Where do the 640 LRU queues + EMA tracking live?

System Architecture

Memory tiers, fastest to slowest (link labels as shown in the diagram):

A · 🎮 GPU HBM (192 GB · 8 TB/s)
    ↕ NVLink / PCIe 5.0
B · 🖥 Host CPU (DDR5 · ~400 GB/s)
C · 🔌 CXL Controller (Type 2/3 · ~64 GB/s)
    ↕ CXL.mem / NVMe-oF
D · 💾 Computational Storage (NVMe SSD · ~14 GB/s)
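To make the bandwidth gaps concrete, here is a back-of-envelope sketch of how long a KV block takes to fetch from each tier. The bandwidths come from the diagram; the per-token KV size is an assumption derived from Llama-70B's shape (80 layers, 8 KV heads, head_dim 128, fp16), not stated in the figure.

```python
# Back-of-envelope transfer times across the tiers above.
# Bandwidths are from the diagram; KV_BYTES_PER_TOKEN is an assumption
# (Llama-70B fp16: 80 layers x 8 KV heads x 128 head_dim x 2 tensors x 2 B).

BANDWIDTH_BPS = {
    "gpu_hbm": 8_000e9,   # 8 TB/s
    "host_ddr5": 400e9,   # ~400 GB/s
    "cxl": 64e9,          # ~64 GB/s
    "nvme": 14e9,         # ~14 GB/s
}

KV_BYTES_PER_TOKEN = 80 * 8 * 128 * 2 * 2  # 327,680 B

def fetch_us(tokens: int, tier: str) -> float:
    """Bandwidth-only time (microseconds) to read `tokens` worth of KV from `tier`."""
    return tokens * KV_BYTES_PER_TOKEN / BANDWIDTH_BPS[tier] * 1e6

# A 1,024-token block: ~24 ms from NVMe vs. ~5 ms over CXL, which is why
# eviction/prefetch decisions must run ahead of the compute stream.
```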
Placement Options

A · GPU-Side
Where: CUDA kernel / HBM reservation
  + Lowest-latency decisions
  + Attention scores already on the GPU
  − Burns HBM for metadata (640 MB)
  − Kernel complexity for async I/O

B · Host CPU
Where: userspace daemon / driver
  + Easy to implement and debug
  + Flexible policy changes
  − PCIe round-trip per decision
  − CPU becomes a bottleneck at scale

D · Computational Storage
Where: NVMe SSD controller ARM cores
  + Offloads the host entirely
  + Near-storage prefetch decisions
  − Higher-latency path
  − Limited compute on the SSD controller
Metadata Footprint (Llama-70B, 128K context)

LRU queues:         640       (8 KV-heads × 80 layers)
Entries per queue:  131,072   (max sequence length)
Bytes per entry:    8 B       (pos + ema + last_access)
Total metadata:     640 MB    (~1.6% of KV cache)
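The table's arithmetic can be checked in a few lines. The fp16 KV-cache size used for the percentage assumes head_dim = 128 (Llama-70B's value), which the table does not state:

```python
# Reproduce the metadata-footprint numbers above (Llama-70B, 128K context).
num_queues = 8 * 80              # 8 KV-heads x 80 layers = 640 LRU queues
entries_per_queue = 128 * 1024   # 131,072 = max sequence length
bytes_per_entry = 8              # packed pos + ema + last_access

total_bytes = num_queues * entries_per_queue * bytes_per_entry
print(total_bytes // 2**20, "MiB")  # 640 MiB of tracking metadata

# Full fp16 KV cache: 80 layers x 8 KV heads x 128 head_dim x 2 tensors x 2 B per token.
kv_bytes = 80 * 8 * 128 * 2 * 2 * entries_per_queue
print(f"{total_bytes / kv_bytes:.1%} of KV cache")  # ~1.6%
```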
Recommended: Hybrid Approach

GPU (EMA updates) → CXL Controller (eviction + prefetch) → NVMe (cold storage)

The GPU computes attention and streams the scores to the CXL controller; the controller updates the EMAs, makes eviction and prefetch decisions, and issues async NVMe reads.
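A minimal sketch of the controller-side logic, assuming one exponential moving average per cached position and a fixed hot-set capacity per queue. The decay constant, capacity, and all names here are illustrative, not from the source.

```python
# Controller-side loop sketch: fold streamed attention scores into per-position
# EMAs, then demote the coldest positions once the hot set exceeds capacity.
from dataclasses import dataclass, field

ALPHA = 0.1        # EMA decay constant (assumption)
HOT_CAPACITY = 4   # HBM-resident positions per queue (tiny, for the demo)

@dataclass
class Queue:
    ema: dict[int, float] = field(default_factory=dict)  # position -> score EMA

    def update(self, scores: dict[int, float]) -> list[int]:
        """Fold one batch of streamed attention scores into the EMAs; return
        the positions to demote to cold storage (the async NVMe write that
        would follow is not shown)."""
        for pos, s in scores.items():
            prev = self.ema.get(pos, s)  # seed new entries with their first score
            self.ema[pos] = (1 - ALPHA) * prev + ALPHA * s
        # Demote the lowest-EMA positions beyond the hot-set capacity.
        evict = sorted(self.ema, key=self.ema.get)[: max(0, len(self.ema) - HOT_CAPACITY)]
        for pos in evict:
            del self.ema[pos]
        return evict
```

For example, after `Queue().update({0: 0.9, 1: 0.1, 2: 0.5, 3: 0.4})` fills the hot set, a subsequent `update({4: 0.8})` demotes position 1, the coldest entry.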