KV Cache Offloading: Market Landscape

🏢

Commercial Products

Available today

CMM-D / CMM-B

Samsung

CXL 2.0

16 TB pools, 60 GB/s, ~600 ns latency

CXL memory expander with large capacity DRAM pools.

Dumb DRAM: no compute, no intelligent caching

Apollo + GISMO

XConn + MemVerge

CXL 3.0

100 TiB pools, NVIDIA Dynamo

CXL memory pooling integrated with NVIDIA's inference stack.

Memory pooling only: no attention-aware eviction

Niagara

Astera Labs

Type-3

CXL Expander, Academic Research

CXL Type-3 memory expander used in university research settings.

Expander only: no processing capability

vLLM / LMCache

Open Source

Framework

PagedAttention, CPU/SSD offload

Industry-standard inference framework with memory management.

Coarse-grained, not CXL-optimized

📚

Recent Research

Oct–Dec 2025 (arXiv)

arXiv 2025

PNM-KV

CXL-enabled processing-near-memory that offloads token page selection to a PNM accelerator.

21.9× throughput (1M tokens)

arXiv 2025

CXL-SpecKV

FPGA-based speculative KV-cache prefetching with compression.

4–8× memory expansion

arXiv 2025

TraCT

CXL shared memory as rack-scale KV cache with direct GPU load/store and DMA.

Rack-scale KV sharing

🎯

The Gap: What Nobody Has Combined

Missing pieces for truly intelligent KV-cache management

📢

Per-KV-Head Tracking

Respecting GQA's 640 queues

📈

EMA Attention Scoring

Smoothed eviction priority

🧭

RoPE-Aware Prefetch

Position locality exploitation

🧠

Controller-Resident Intelligence

Logic in the CXL endpoint

💡 Closest: PNM-KV, but it performs token page selection, not per-head eviction with attention weighting. The opportunity is finer-grained and model-architecture-aware.
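
To make the contrast with page-level token selection concrete, here is a minimal Python sketch of per-KV-head eviction driven by EMA-smoothed attention scores. It is an illustration under stated assumptions, not the method of any product or paper listed above: the class name `HeadQueue`, the smoothing factor `alpha`, the per-head `capacity`, and the reading of "640 queues" as 80 layers × 8 KV heads are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HeadQueue:
    """EMA attention scores for the tokens cached by one KV head (hypothetical)."""
    capacity: int                # max tokens this head keeps resident (assumed)
    alpha: float = 0.2           # EMA smoothing factor (assumed value)
    scores: dict = field(default_factory=dict)  # token position -> EMA score

    def observe(self, attn: dict) -> None:
        """Fold one decode step's attention weights into the running EMA.

        Tokens that received no attention this step decay toward zero;
        tokens seen for the first time start at their observed weight.
        """
        for pos in self.scores.keys() | attn.keys():
            weight = attn.get(pos, 0.0)
            prev = self.scores.get(pos)
            if prev is None:
                self.scores[pos] = weight
            else:
                self.scores[pos] = (1 - self.alpha) * prev + self.alpha * weight

    def evict(self) -> list:
        """Drop the lowest-scoring tokens until the head is within capacity."""
        evicted = []
        while len(self.scores) > self.capacity:
            victim = min(self.scores, key=self.scores.get)
            del self.scores[victim]
            evicted.append(victim)
        return evicted


# One independent queue per (layer, KV head). 80 x 8 = 640 is an assumed
# decomposition of the "640 queues" figure, not something the source states.
NUM_LAYERS, NUM_KV_HEADS = 80, 8
queues = {
    (layer, head): HeadQueue(capacity=2)
    for layer in range(NUM_LAYERS)
    for head in range(NUM_KV_HEADS)
}

# Usage: two decode steps on one head, then evict down to capacity.
q = HeadQueue(capacity=2, alpha=0.5)
q.observe({0: 0.6, 1: 0.3, 2: 0.1})
q.observe({0: 0.1, 1: 0.1, 2: 0.8})
evicted = q.evict()
```

Because the score is smoothed, token 0 survives its single low-attention step while token 1, which scored low in both steps, is evicted. That is the anti-thrashing behavior the EMA is meant to provide; a real controller would also need a policy for ties and for tokens that must never be evicted.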

White Space: CXL Controller-Resident Intelligence

✓

Per-KV-Head Eviction

Track 640 GQA queues independently, evict at head granularity

✓

EMA-Based Scoring

Smooth attention scores over time, prevent thrashing

✓

RoPE Locality Prefetch

Exploit position encoding structure for predictive fetch

✓

Model-Architecture Aware

Understands transformer structure, not just memory pages
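
One possible reading of "RoPE locality prefetch" is sketched below in Python. It assumes that a head's attention tends to fall near positions it attended recently, so the controller can fetch the KV pages surrounding those positions ahead of demand. The function name `pages_to_prefetch`, the fixed `window`, and the page layout (contiguous token positions per page) are hypothetical choices for illustration, not details from the source.

```python
def pages_to_prefetch(hot_positions, page_size, window, num_tokens, resident):
    """Return the sorted page IDs worth fetching for one KV head.

    hot_positions: token positions that recently received high attention
    page_size:     tokens per KV page (assumed contiguous by position)
    window:        how many neighboring positions on each side to cover
    num_tokens:    current sequence length, used to clamp the window
    resident:      set of page IDs already in fast memory
    """
    wanted = set()
    for pos in hot_positions:
        lo = max(0, pos - window)
        hi = min(num_tokens - 1, pos + window)
        # Every page that overlaps the [lo, hi] position range.
        wanted.update(range(lo // page_size, hi // page_size + 1))
    return sorted(wanted - set(resident))


# Usage: position 100 is hot, pages hold 16 tokens, page 6 is already resident.
to_fetch = pages_to_prefetch(
    hot_positions=[100], page_size=16, window=20, num_tokens=1000, resident={6}
)
```

Here the window covers positions 80 through 120, which span pages 5 to 7, so only pages 5 and 7 need to be fetched. A fixed symmetric window is the simplest choice; a model-architecture-aware controller could instead size the window per head from observed attention distances.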
