KV Cache Offloading: Market Landscape

What exists today vs. what my research addresses

Available Today: Commercial Hardware
CMM-D / CMM-B
Samsung
Hardware
512 GB-2 TB per module · 60 GB/s · 596 ns latency
CXL 2.0 Type-3 memory expander. Passive DRAM: no compute, no intelligent caching.
Gap: Passive memory only; the host must manage all eviction and prefetch
Apollo CXL Switch
XConn + MemVerge
Hardware
100 TiB pools · CXL 3.1 · Multi-host
Memory pooling with GISMO software. Integrated with NVIDIA Dynamo for KV cache.
Gap: No per-head tracking, no attention-aware eviction
Niagara 2.0
Astera Labs
Hardware
CXL Type-3 · 640 ns latency · 10.1 GB/s
Used in recent TraCT research for rack-scale KV cache sharing.
Gap: Basic memory tier, no intelligent prefetch
Emerging Software Frameworks
vLLM + LMCache
UC Berkeley / Open Source
Software
PagedAttention · CPU offload · Prefix caching
Block-level KV management. Offloads to CPU/SSD. No CXL-specific optimization.
Gap: Coarse-grained eviction, no per-head policies
Mooncake
Moonshot AI
Software
Disaggregated prefill/decode · Global KV store
Distributed KV cache across CPU + SSD. Network-based transfers.
Gap: Network latency, not CXL-native
FlexGen
Stanford
Software
GPU/CPU/SSD offload · Throughput-oriented
Linear-programming scheduler for offloading decisions. Bottlenecked by the PCIe path (8-12 GB/s).
Gap: Not attention-aware, slow PCIe path
Research (2024-2025): Academic Papers
PNM-KV / PnG-KV
Park et al., Oct 2025
Paper
CXL + Processing-Near-Memory · 1M tokens · 21.9× throughput
Offloads token selection to a processing-near-memory accelerator inside the CXL device. Uses a steady-token mechanism.
Gap: Custom silicon, no per-head EMA tracking
CXL-SpecKV
Dec 2024
Paper
FPGA + CXL · Speculative prefetch · Compression
Predicts future token accesses and prefetches them speculatively. FPGA-based prototype.
Gap: Speculative (not attention-aware), FPGA prototype only
TraCT
Yoon et al., Dec 2025
Paper
CXL shared memory · Rack-scale · GPU-CXL DMA
Uses CXL shared memory as the KV transfer substrate, bypassing the NIC. Prefix-aware caching.
Gap: Focus on sharing, not intelligent eviction
🎯 What Nobody Has Yet
Per-KV-Head Tracking
All existing work treats the KV cache uniformly. None maintains independent per-KV-head LRU queues (640 for Llama-70B) that respect GQA structure.
Attention-Aware EMA Eviction
No system uses smoothed attention scores as the eviction priority. Current approaches rely on LRU, random, or static policies.
RoPE-Aware Prefetch
Speculative prefetch exists, but none exploits RoPE's locality bias for predictive loading.
CXL Controller Intelligence
All CXL products are passive memory. No commercial device has on-controller cache management logic.
🔬 My Research Contribution
GQA-Aware Architecture
640 queues matching Llama-70B's KV-head × layer structure
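A minimal sketch of the queue layout (my own illustrative code, not an existing API; the class and method names are hypothetical). Llama-70B uses grouped-query attention with 8 KV heads in each of its 80 layers, so 80 × 8 = 640 independent queues:

```python
from collections import OrderedDict

# Illustrative sketch: one LRU queue per (layer, kv_head) pair.
# Llama-70B: 80 layers x 8 KV heads (GQA) = 640 queues.
NUM_LAYERS = 80
NUM_KV_HEADS = 8

class PerHeadLRU:
    def __init__(self):
        # Each queue maps token position -> dummy value, oldest touch first.
        self.queues = {(l, h): OrderedDict()
                       for l in range(NUM_LAYERS)
                       for h in range(NUM_KV_HEADS)}

    def touch(self, layer, head, pos):
        """Mark a cached position as most recently used for one KV head."""
        q = self.queues[(layer, head)]
        q.pop(pos, None)
        q[pos] = True

    def evict(self, layer, head):
        """Pop the least recently used position for one KV head."""
        pos, _ = self.queues[(layer, head)].popitem(last=False)
        return pos

cache = PerHeadLRU()
cache.touch(0, 0, 10)
cache.touch(0, 0, 11)
cache.touch(0, 0, 10)        # re-touch: position 10 becomes most recent
print(len(cache.queues))     # 640
print(cache.evict(0, 0))     # 11 is now the least recently used
```

Because each (layer, head) pair has its own queue, a head that rarely attends to old positions can evict them without disturbing heads that still need those positions.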
EMA Score Tracking
Per-position attention smoothing for stable eviction signals
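The smoothing itself is a one-line exponential moving average. A sketch under assumed notation (ALPHA and the dict-based layout are my illustrative choices, not fixed design parameters):

```python
# Illustrative EMA eviction sketch; ALPHA and all names are hypothetical.
ALPHA = 0.1  # smaller alpha = heavier smoothing, more stable eviction signal

def update_ema(ema, attn):
    """Fold one decode step's attention weights into per-position EMA scores."""
    for pos, w in attn.items():
        prev = ema.get(pos, w)           # initialize to the first observation
        ema[pos] = ALPHA * w + (1 - ALPHA) * prev
    return ema

def eviction_candidate(ema):
    """Lowest smoothed score = first candidate for offload/eviction."""
    return min(ema, key=ema.get)

scores = {}
update_ema(scores, {0: 0.9, 1: 0.05, 2: 0.05})   # early sink token dominates
update_ema(scores, {0: 0.8, 1: 0.15, 2: 0.05})
print(eviction_candidate(scores))                # position 2
```

The EMA damps single-step spikes: position 1 briefly receives more attention in the second step, but its smoothed score stays close to position 2's, so one noisy step does not flip the eviction order dramatically.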
RoPE Prefetch Window
Locality-aware [P-W, P+W] prefetch exploiting position encoding
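The window computation can be sketched as follows (the block size, function name, and clamping behavior are my assumptions for illustration): for current decode position P and radius W, prefetch the KV blocks covering positions [P - W, P + W].

```python
BLOCK_TOKENS = 16  # hypothetical paging granularity (tokens per KV block)

def rope_prefetch_blocks(p, w, seq_len):
    """Block IDs covering [p - w, p + w], clamped to the valid sequence.

    Exploits RoPE's bias toward nearby relative positions: tokens close
    to the current position p are the likeliest next accesses.
    """
    lo = max(0, p - w)
    hi = min(seq_len - 1, p + w)
    return list(range(lo // BLOCK_TOKENS, hi // BLOCK_TOKENS + 1))

print(rope_prefetch_blocks(1000, 64, 4096))  # blocks 58..66
```

Issuing these block reads ahead of the decode step hides CXL access latency behind computation, instead of paying it on the critical path.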
CXL Controller Placement
Near-memory management offloaded from GPU/CPU
Industry Timeline
2022
Samsung CMM-D prototype · CXL 2.0 ratified
2023
vLLM PagedAttention · Intel Sapphire Rapids (CXL 1.1)
2024
Samsung CMM-B (16 TB pools) · CXL-SpecKV · CXL 3.1 announced
2025
XConn + MemVerge 100 TiB pools · PNM-KV, TraCT · NVIDIA Dynamo + CXL