KV Cache Offloading: Market Landscape

What exists today vs. what my research addresses

Available Today: Commercial Hardware
CMM-D / CMM-B
Samsung
Hardware
512 GB-2 TB per module · 60 GB/s · 596 ns latency
CXL 2.0 Type-3 memory expander. Passive DRAM: no compute, no intelligent caching.
Gap: Passive memory only; the host must manage all eviction and prefetch
Apollo CXL Switch
XConn + MemVerge
Hardware
100 TiB pools · CXL 3.1 · Multi-host
Memory pooling with GISMO software. Integrated with NVIDIA Dynamo for KV cache.
Gap: No per-head tracking, no attention-aware eviction
Niagara 2.0
Astera Labs
Hardware
CXL Type-3 · 640 ns latency · 10.1 GB/s
Used in recent TraCT research for rack-scale KV cache sharing.
Gap: Basic memory tier, no intelligent prefetch
Emerging Software Frameworks
vLLM + LMCache
UC Berkeley / Open Source
Software
PagedAttention · CPU offload · Prefix caching
Block-level KV management. Offloads to CPU/SSD. No CXL-specific optimization.
Gap: Coarse-grained eviction, no per-head policies
Mooncake
Moonshot AI
Software
Disaggregated prefill/decode · Global KV store
Distributed KV cache across CPU + SSD. Network-based transfers.
Gap: Network latency, not CXL-native
FlexGen
Stanford
Software
GPU/CPU/SSD offload · Throughput-oriented
Linear-programming scheduler for offloading decisions. Bottlenecked by the PCIe path (8-12 GB/s).
Gap: Not attention-aware, slow PCIe path
Research (2024-2025): Academic Papers
PNM-KV / PnG-KV
Park et al., Oct 2025
Paper
CXL + Processing-Near-Memory · 1M tokens · 21.9× throughput
Offloads token selection to a processing-near-memory accelerator inside the CXL device. Uses a steady-token mechanism.
Gap: Custom silicon, no per-head EMA tracking
CXL-SpecKV
Dec 2024
Paper
FPGA + CXL · Speculative prefetch · Compression
Predicts future token accesses and prefetches them speculatively. FPGA-based prototype.
Gap: Speculative (not attention-aware), FPGA prototype only
TraCT
Yoon et al., Dec 2025
Paper
CXL shared memory · Rack-scale · GPU-CXL DMA
Uses CXL shared memory as the KV transfer substrate, bypassing the NIC. Prefix-aware caching.
Gap: Focus on sharing, not intelligent eviction
🎯 What Nobody Has Yet
Per-KV-Head Tracking
All existing work treats the KV cache uniformly. None maintains independent per-KV-head LRU queues (640 for Llama-70B) that respect GQA structure.
Attention-Aware EMA Eviction
No system uses smoothed attention scores as the eviction priority. Current approaches rely on LRU, random, or static policies.
RoPE-Aware Prefetch
Speculative prefetch exists, but none exploits RoPE's locality bias for predictive loading.
CXL Controller Intelligence
All CXL products are passive memory. No commercial device has on-controller cache management logic.
🔬 My Research Contribution
GQA-Aware Architecture
640 queues matching Llama-70B's KV-head × layer structure
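A minimal sketch of the queue layout (my own illustrative code, not an existing API; the class and method names are hypothetical). Llama-70B uses grouped-query attention with 8 KV heads in each of its 80 layers, so 80 × 8 = 640 independent queues:

```python
from collections import OrderedDict

# Illustrative sketch: one LRU queue per (layer, kv_head) pair.
# Llama-70B: 80 layers x 8 KV heads (GQA) = 640 queues.
NUM_LAYERS = 80
NUM_KV_HEADS = 8

class PerHeadLRU:
    def __init__(self):
        # Each queue maps token position -> dummy value, oldest touch first.
        self.queues = {(l, h): OrderedDict()
                       for l in range(NUM_LAYERS)
                       for h in range(NUM_KV_HEADS)}

    def touch(self, layer, head, pos):
        """Mark a cached position as most recently used for one KV head."""
        q = self.queues[(layer, head)]
        q.pop(pos, None)
        q[pos] = True

    def evict(self, layer, head):
        """Pop the least recently used position for one KV head."""
        pos, _ = self.queues[(layer, head)].popitem(last=False)
        return pos

cache = PerHeadLRU()
cache.touch(0, 0, 10)
cache.touch(0, 0, 11)
cache.touch(0, 0, 10)        # re-touch: position 10 becomes most recent
print(len(cache.queues))     # 640
print(cache.evict(0, 0))     # 11 is now the least recently used
```

Because each (layer, head) pair has its own queue, a head that rarely attends to old positions can evict them without disturbing heads that still need those positions.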
EMA Score Tracking
Per-position attention smoothing for stable eviction signals
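The smoothing itself is a one-line exponential moving average. A sketch under assumed notation (ALPHA and the dict-based layout are my illustrative choices, not fixed design parameters):

```python
# Illustrative EMA eviction sketch; ALPHA and all names are hypothetical.
ALPHA = 0.1  # smaller alpha = heavier smoothing, more stable eviction signal

def update_ema(ema, attn):
    """Fold one decode step's attention weights into per-position EMA scores."""
    for pos, w in attn.items():
        prev = ema.get(pos, w)           # initialize to the first observation
        ema[pos] = ALPHA * w + (1 - ALPHA) * prev
    return ema

def eviction_candidate(ema):
    """Lowest smoothed score = first candidate for offload/eviction."""
    return min(ema, key=ema.get)

scores = {}
update_ema(scores, {0: 0.9, 1: 0.05, 2: 0.05})   # early sink token dominates
update_ema(scores, {0: 0.8, 1: 0.15, 2: 0.05})
print(eviction_candidate(scores))                # position 2
```

The EMA damps single-step spikes: position 1 briefly receives more attention in the second step, but its smoothed score stays close to position 2's, so one noisy step does not flip the eviction order dramatically.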
RoPE Prefetch Window
Locality-aware [P-W, P+W] prefetch exploiting position encoding
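The window computation can be sketched as follows (the block size, function name, and clamping behavior are my assumptions for illustration): for current decode position P and radius W, prefetch the KV blocks covering positions [P - W, P + W].

```python
BLOCK_TOKENS = 16  # hypothetical paging granularity (tokens per KV block)

def rope_prefetch_blocks(p, w, seq_len):
    """Block IDs covering [p - w, p + w], clamped to the valid sequence.

    Exploits RoPE's bias toward nearby relative positions: tokens close
    to the current position p are the likeliest next accesses.
    """
    lo = max(0, p - w)
    hi = min(seq_len - 1, p + w)
    return list(range(lo // BLOCK_TOKENS, hi // BLOCK_TOKENS + 1))

print(rope_prefetch_blocks(1000, 64, 4096))  # blocks 58..66
```

Issuing these block reads ahead of the decode step hides CXL access latency behind computation, instead of paying it on the critical path.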
CXL Controller Placement
Near-memory management offloaded from GPU/CPU
Industry Timeline
2022
Samsung CMM-D prototype · CXL 2.0 ratified
2023
vLLM PagedAttention · Intel Sapphire Rapids (CXL 1.1)
2024
Samsung CMM-B (16 TB pools) · CXL-SpecKV · CXL 3.1 announced
2025
XConn + MemVerge 100 TiB pools · PNM-KV, TraCT · NVIDIA Dynamo + CXL