The industry is investing billions in CXL memory expansion, but no one has solved intelligent cache management for it.

Long-context LLM inference requires roughly 41 GB of KV-cache per user at 128K tokens. Scale to 8 concurrent users and you need more than 320 GB, far exceeding any single GPU's HBM capacity.
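
The per-user figure follows directly from Llama-70B's GQA geometry (80 layers, 8 KV heads, head dimension 128, fp16 values); a quick sanity check, landing at 40 GiB per user, the same ballpark as the ~41 GB quoted:

```python
# KV-cache sizing for Llama-70B with GQA, fp16 precision.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
BYTES_FP16 = 2
CONTEXT = 128 * 1024      # 128K tokens
USERS = 8

# K and V each store HEAD_DIM values per KV head per layer.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16
per_user_gib = bytes_per_token * CONTEXT / 2**30
total_gib = per_user_gib * USERS

print(f"{bytes_per_token} B/token")        # 327680 B/token (320 KiB)
print(f"{per_user_gib:.0f} GiB per user")  # 40 GiB per user
print(f"{total_gib:.0f} GiB total")        # 320 GiB for 8 users
```
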

| Product | Vendor | Capability | The Gap |
|---|---|---|---|
| CMM-D / CMM-B | Samsung | 512GB–2TB, 60 GB/s, ~600ns | Passive DRAM—host manages all eviction |
| Apollo CXL Switch | XConn + MemVerge | 100 TiB pools, CXL 3.1 | No per-head tracking, no attention-aware eviction |
| Niagara 2.0 | Astera Labs | CXL Type-3, 640ns | Basic memory tier—no intelligent prefetch |

| Framework | Origin | Capability | The Gap |
|---|---|---|---|
| vLLM + LMCache | UC Berkeley | PagedAttention, CPU offload | Coarse-grained, no per-head policies |
| Mooncake | Moonshot AI | Distributed KV store | Network-based, not CXL-native |
| FlexGen | Stanford | GPU/CPU/SSD offload | Not attention-aware, slow PCIe |

| Paper | Innovation | The Gap |
|---|---|---|
| PNM-KV (Oct 2025) | Processing-near-memory, 21.9× throughput | Custom silicon, token-level only |
| CXL-SpecKV | FPGA speculative prefetch | Speculative (not attention-aware) |
| TraCT (Dec 2025) | CXL shared memory, rack-scale | Focus on sharing, not eviction |

All existing work treats the KV-cache uniformly. None tracks 640 independent queues matching GQA's 8 KV heads × 80 layers structure.

No system uses smoothed attention scores for eviction priority; current approaches are LRU, random, or static policies.

Speculative prefetch exists, but none exploits RoPE's locality bias for predictive loading in the [P-W, P+W] window around the current position P.

All CXL products ship as passive memory. No commercial device puts cache-management logic on the controller.
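
The per-head bookkeeping this implies can be sketched in a few lines. `HeadQueue`, `observe`, and `eviction_candidate` are hypothetical names for illustration; the 0.2/0.8 weighting matches the EMA this document proposes:

```python
from dataclasses import dataclass, field

ALPHA = 0.2                 # EMA weight: score[t] = 0.2*attention[t] + 0.8*score[t-1]
LAYERS, KV_HEADS = 80, 8    # Llama-70B GQA geometry -> 640 independent queues

@dataclass
class HeadQueue:
    """Tracking for one (layer, KV-head) pair: an EMA score per cached KV block."""
    scores: dict = field(default_factory=dict)   # block_id -> smoothed score

    def observe(self, block_id: int, attention: float) -> None:
        prev = self.scores.get(block_id, attention)   # seed EMA with first sample
        self.scores[block_id] = ALPHA * attention + (1 - ALPHA) * prev

    def eviction_candidate(self) -> int:
        # Lowest smoothed attention = best candidate to demote to CXL memory.
        return min(self.scores, key=self.scores.get)

# One queue per (layer, kv_head) pair, respecting the model's topology.
queues = {(l, h): HeadQueue() for l in range(LAYERS) for h in range(KV_HEADS)}
assert len(queues) == 640

q = queues[(0, 0)]
for step_scores in ([0.9, 0.1], [0.8, 0.05]):    # two decode steps, two KV blocks
    for block_id, attn in enumerate(step_scores):
        q.observe(block_id, attn)
print(q.eviction_candidate())   # prints 1: block 1 has consistently low attention
```

Because the EMA smooths over decode steps, a block is not evicted just because one step happened to ignore it, which is the stability property the table below refers to.
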

| Innovation | Implementation | Impact |
|---|---|---|
| GQA-Aware Architecture | 640 queues matching Llama-70B structure | Respects model topology |
| EMA Score Tracking | score[t] = 0.2 × attention[t] + 0.8 × score[t-1] | Stable eviction signals |
| RoPE Prefetch Window | Locality-aware [P-W, P+W] prefetch | 93%+ hit rate |
| CXL Controller Placement | Cache logic in endpoint firmware | Offloads GPU/CPU |
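
The RoPE prefetch row can be made concrete with a small sketch. The helper name, block size, and window width below are illustrative assumptions, not part of the design:

```python
def rope_prefetch_window(pos: int, window: int, block_tokens: int, n_blocks: int):
    """Return the KV-block ids covering token positions [pos - window, pos + window].

    RoPE biases attention toward nearby positions, so blocks inside this
    window are the ones worth pulling from CXL memory ahead of time.
    """
    lo = max(0, pos - window)                       # clamp at sequence start
    hi = min(n_blocks * block_tokens - 1, pos + window)  # clamp at sequence end
    return list(range(lo // block_tokens, hi // block_tokens + 1))

# Decoding at position 70,000 with W = 2048 and 1K-token blocks:
print(rope_prefetch_window(70_000, 2_048, 1_024, 128))   # [66, 67, 68, 69, 70]
```
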

PNM-KV achieves an impressive 21.9× throughput gain with processing-near-memory, but there are critical differences:

| Aspect | PNM-KV | Our Approach |
|---|---|---|
| Granularity | Token-level selection | Per-KV-head eviction |
| Eviction Signal | Steady-token heuristic | Attention score EMA |
| Hardware | Custom PNM accelerator | Standard CXL controller |
| Model Awareness | Token statistics only | GQA structure + RoPE locality |

Our approach is finer-grained and model-architecture-aware, without requiring custom silicon.

CXL hardware is shipping. Software frameworks exist. The missing piece is intelligent cache management that understands transformer architecture.

Per-head granularity • Attention-weighted eviction • RoPE-aware prefetch • Controller-resident logic

View Full Documentation on GitHub →