Market Analysis

KV Cache Offloading Landscape

Commercial products, software frameworks, and recent research

🏢
Commercial Products
Available today
CMM-D / CMM-B
Samsung
CXL 2.0
16 TB pools, 60 GB/s, ~600 ns latency
CXL memory expander with large-capacity DRAM pools.
Dumb DRAM: no compute, no intelligent caching
Apollo + GISMO
XConn + MemVerge
CXL 3.0
100 TiB pools, NVIDIA Dynamo
CXL memory pooling integrated with NVIDIA's inference stack.
Memory pooling only: no attention-aware eviction
Niagara
Astera Labs
Type-3
CXL Expander, Academic Research
CXL Type-3 memory expander used in university research settings.
Expander only: no processing capability
vLLM / LMCache
Open Source
Framework
PagedAttention, CPU/SSD offload
Industry-standard inference framework with memory management.
Coarse-grained, not CXL-optimized
📚
Recent Research
Oct–Dec 2025 (arXiv)
arXiv 2025
PNM-KV
CXL-enabled processing-near-memory that offloads token page selection to a PNM accelerator.
21.9× throughput (1M tokens)
arXiv 2025
CXL-SpecKV
FPGA-based speculative KV-cache prefetching with compression.
4–8× memory expansion
arXiv 2025
TraCT
CXL shared memory as rack-scale KV cache with direct GPU load/store and DMA.
Rack-scale KV sharing
🎯
The Gap: Nobody Has Combined
Missing pieces for truly intelligent KV-cache management
📒
Per-KV-Head Tracking
Respecting GQA's 640 queues
📈
EMA Attention Scoring
Smoothed eviction priority
🧭
RoPE-Aware Prefetch
Position locality exploitation
🧠
Controller-Resident Intelligence
Logic in the CXL endpoint
💡 Closest: PNM-KV, but it does token selection, not per-head eviction with attention weighting. The opportunity is finer-grained and model-architecture-aware.
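A minimal sketch of how per-KV-head tracking and EMA attention scoring could fit together. It is illustrative only and does not describe any product or paper listed here. It assumes the 640 queues correspond to 80 layers × 8 KV heads (a Llama-70B-style GQA layout; the slide does not say how the figure is derived). The names HeadQueue, observe, and evict, and the smoothing factor alpha, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumption: 640 queues = 80 layers x 8 KV heads (Llama-70B-style GQA).
NUM_LAYERS = 80
NUM_KV_HEADS = 8


@dataclass
class HeadQueue:
    """Eviction state for a single (layer, KV head) pair."""

    alpha: float = 0.2  # EMA smoothing factor (hypothetical default)
    scores: dict[int, float] = field(default_factory=dict)  # token position -> EMA score

    def observe(self, attn: dict[int, float]) -> None:
        """Fold one decode step's attention weights into the running EMA.

        Tokens seen for the first time start at their observed weight;
        tracked tokens that receive no attention this step decay toward 0.
        """
        for pos in self.scores.keys() | attn.keys():
            x = attn.get(pos, 0.0)
            if pos in self.scores:
                self.scores[pos] = self.alpha * x + (1 - self.alpha) * self.scores[pos]
            else:
                self.scores[pos] = x

    def evict(self, n: int) -> list[int]:
        """Drop and return the n token positions with the lowest EMA score."""
        victims = sorted(self.scores, key=self.scores.get)[:n]
        for pos in victims:
            del self.scores[pos]
        return victims


# One independent queue per (layer, KV head).
queues = {
    (layer, head): HeadQueue()
    for layer in range(NUM_LAYERS)
    for head in range(NUM_KV_HEADS)
}

# Usage: token 0 gets no attention on the second step, but its smoothed
# score (0.4) still beats token 1 (0.2), so token 1 is evicted instead.
q = HeadQueue(alpha=0.5)
q.observe({0: 0.8, 1: 0.2})
q.observe({0: 0.0, 1: 0.2, 2: 0.8})
evicted = q.evict(1)
```

The smoothing is what prevents thrashing in this sketch: a single step of zero attention does not immediately push a previously important token to the front of the eviction order.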
White Space: CXL Controller-Resident Intelligence
✓
Per-KV-Head Eviction
Track 640 GQA queues independently, evict at head granularity
✓
EMA-Based Scoring
Smooth attention scores over time, prevent thrashing
✓
RoPE Locality Prefetch
Exploit position encoding structure for predictive fetch
✓
Model-Architecture Aware
Understands transformer structure, not just memory pages
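The slide does not specify how a locality-based prefetch would work. One plausible reading, sketched here under stated assumptions, is a position-window prefetch: when a token position receives high attention, the KV pages covering neighboring positions are fetched ahead of use. This sketch does not model RoPE itself. The page size, the radius parameter, and the function name prefetch_candidates are all hypothetical.

```python
PAGE_TOKENS = 16  # tokens per KV page (hypothetical page size)


def prefetch_candidates(hot_positions, resident_pages, num_pages, radius=1):
    """Return page IDs worth prefetching, in ascending order.

    hot_positions:  token positions that received high attention this step
    resident_pages: page IDs already held in fast memory
    num_pages:      total number of KV pages for this sequence
    radius:         how many neighboring pages to pull in on each side
    """
    wanted = set()
    for pos in hot_positions:
        page = pos // PAGE_TOKENS
        lo = max(0, page - radius)
        hi = min(num_pages - 1, page + radius)
        wanted.update(range(lo, hi + 1))
    # Skip pages that are already resident; only the misses need fetching.
    return sorted(wanted - set(resident_pages))


# Usage: positions 35 and 100 fall in pages 2 and 6; with radius 1 the
# windows are pages 1-3 and 5-7, and page 2 is already resident.
to_fetch = prefetch_candidates([35, 100], resident_pages={2}, num_pages=8)
```

A controller-resident version would presumably run this selection per KV head and issue the fetches over CXL; that part is outside the scope of the sketch.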