Cache Management Placement

Where do the 640 LRU queues + EMA tracking live?

System Architecture

Memory tiers, fastest to slowest (link labels as shown in the diagram):

A · 🎮 GPU HBM (192 GB · 8 TB/s)
    ↕ NVLink / PCIe 5.0
B · 🖥 Host CPU (DDR5 · ~400 GB/s)
C · 🔌 CXL Controller (Type 2/3 · ~64 GB/s)
    ↕ CXL.mem / NVMe-oF
D · 💾 Computational Storage (NVMe SSD · ~14 GB/s)
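To make the bandwidth gaps concrete, here is a back-of-envelope sketch of how long a KV block takes to fetch from each tier. The bandwidths come from the diagram; the per-token KV size is an assumption derived from Llama-70B's shape (80 layers, 8 KV heads, head_dim 128, fp16), not stated in the figure.

```python
# Back-of-envelope transfer times across the tiers above.
# Bandwidths are from the diagram; KV_BYTES_PER_TOKEN is an assumption
# (Llama-70B fp16: 80 layers x 8 KV heads x 128 head_dim x 2 tensors x 2 B).

BANDWIDTH_BPS = {
    "gpu_hbm": 8_000e9,   # 8 TB/s
    "host_ddr5": 400e9,   # ~400 GB/s
    "cxl": 64e9,          # ~64 GB/s
    "nvme": 14e9,         # ~14 GB/s
}

KV_BYTES_PER_TOKEN = 80 * 8 * 128 * 2 * 2  # 327,680 B

def fetch_us(tokens: int, tier: str) -> float:
    """Bandwidth-only time (microseconds) to read `tokens` worth of KV from `tier`."""
    return tokens * KV_BYTES_PER_TOKEN / BANDWIDTH_BPS[tier] * 1e6

# A 1,024-token block: ~24 ms from NVMe vs. ~5 ms over CXL, which is why
# eviction/prefetch decisions must run ahead of the compute stream.
```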
Placement Options

A · GPU-Side
Where: CUDA kernel / HBM reservation
  + Lowest-latency decisions
  + Attention scores already on the GPU
  − Burns HBM for metadata (640 MB)
  − Kernel complexity for async I/O

B · Host CPU
Where: userspace daemon / driver
  + Easy to implement and debug
  + Flexible policy changes
  − PCIe round-trip per decision
  − CPU becomes a bottleneck at scale

D · Computational Storage
Where: NVMe SSD controller ARM cores
  + Offloads the host entirely
  + Near-storage prefetch decisions
  − Higher-latency path
  − Limited compute on the SSD controller
Metadata Footprint (Llama-70B, 128K context)

LRU queues:         640       (8 KV-heads × 80 layers)
Entries per queue:  131,072   (max sequence length)
Bytes per entry:    8 B       (pos + ema + last_access)
Total metadata:     640 MB    (~1.6% of KV cache)
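The table's arithmetic can be checked in a few lines. The fp16 KV-cache size used for the percentage assumes head_dim = 128 (Llama-70B's value), which the table does not state:

```python
# Reproduce the metadata-footprint numbers above (Llama-70B, 128K context).
num_queues = 8 * 80              # 8 KV-heads x 80 layers = 640 LRU queues
entries_per_queue = 128 * 1024   # 131,072 = max sequence length
bytes_per_entry = 8              # packed pos + ema + last_access

total_bytes = num_queues * entries_per_queue * bytes_per_entry
print(total_bytes // 2**20, "MiB")  # 640 MiB of tracking metadata

# Full fp16 KV cache: 80 layers x 8 KV heads x 128 head_dim x 2 tensors x 2 B per token.
kv_bytes = 80 * 8 * 128 * 2 * 2 * entries_per_queue
print(f"{total_bytes / kv_bytes:.1%} of KV cache")  # ~1.6%
```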
Recommended: Hybrid Approach

GPU (EMA updates) → CXL Controller (eviction + prefetch) → NVMe (cold storage)

The GPU computes attention and streams the scores to the CXL controller; the controller updates the EMAs, makes eviction and prefetch decisions, and issues async NVMe reads.
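A minimal sketch of the controller-side logic, assuming one exponential moving average per cached position and a fixed hot-set capacity per queue. The decay constant, capacity, and all names here are illustrative, not from the source.

```python
# Controller-side loop sketch: fold streamed attention scores into per-position
# EMAs, then demote the coldest positions once the hot set exceeds capacity.
from dataclasses import dataclass, field

ALPHA = 0.1        # EMA decay constant (assumption)
HOT_CAPACITY = 4   # HBM-resident positions per queue (tiny, for the demo)

@dataclass
class Queue:
    ema: dict[int, float] = field(default_factory=dict)  # position -> score EMA

    def update(self, scores: dict[int, float]) -> list[int]:
        """Fold one batch of streamed attention scores into the EMAs; return
        the positions to demote to cold storage (the async NVMe write that
        would follow is not shown)."""
        for pos, s in scores.items():
            prev = self.ema.get(pos, s)  # seed new entries with their first score
            self.ema[pos] = (1 - ALPHA) * prev + ALPHA * s
        # Demote the lowest-EMA positions beyond the hot-set capacity.
        evict = sorted(self.ema, key=self.ema.get)[: max(0, len(self.ema) - HOT_CAPACITY)]
        for pos in evict:
            del self.ema[pos]
        return evict
```

For example, after `Queue().update({0: 0.9, 1: 0.1, 2: 0.5, 3: 0.4})` fills the hot set, a subsequent `update({4: 0.8})` demotes position 1, the coldest entry.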