GPU-CXL KV-Cache Interaction
Tiered Memory Architecture for Large Context Inference
GPU (B200)
Tensor Cores
4,500 TFLOPS
FP16 (with sparsity)
HBM3e (Hot Tier)
192 GB
8 TB/s
Active KV-Cache
Current Attention Window
GPU MMU
Page Fault
→ CXL.mem Request
CXL.mem
Load/Store Semantics
No Driver Per-Access
→ Data Response
Cache Line (64B)
↓ Read Request
Physical Address
CXL 3.0 Switch
Address Routing | Coherency Management | Multi-Endpoint Aggregation
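The switch's address routing can be pictured as way interleaving: consecutive chunks of the host physical address range are striped across the endpoints. A minimal sketch, assuming 4 endpoints, a 256-byte interleave granularity, and an illustrative base address (the real CXL interleave ways/granularity are configured per region):

```python
# Hypothetical sketch: route a host physical address to a CXL endpoint via
# way interleaving, as a CXL 3.0 switch might. The ways, granularity, and
# base address below are illustrative assumptions, not spec values.

INTERLEAVE_WAYS = 4            # 4 endpoints behind the switch
INTERLEAVE_GRANULARITY = 256   # bytes per interleave chunk (assumed)
CXL_BASE = 0x100_0000_0000     # start of the CXL.mem range (assumed)

def route(paddr: int) -> tuple:
    """Return (endpoint_id, device_offset) for a host physical address."""
    offset = paddr - CXL_BASE
    chunk = offset // INTERLEAVE_GRANULARITY
    endpoint = chunk % INTERLEAVE_WAYS
    # Device-local offset: collapse the chunks owned by this endpoint.
    dev_offset = (chunk // INTERLEAVE_WAYS) * INTERLEAVE_GRANULARITY \
                 + offset % INTERLEAVE_GRANULARITY
    return endpoint, dev_offset
```

Because routing is pure address arithmetic, the switch can forward each 64B cache-line request without any software in the path, which is what makes per-access load/store semantics viable.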
Endpoint 0
DRAM (Warm Tier)
256 GB
~250 ns latency
KV-Cache Entries
High attention score
Flash (Cold Tier)
4 TB
~50 μs latency
Evicted KV Entries
Low attention score
Firmware Controller
Per-head LRU | Prefetch | Score Tracking
Endpoint 1
DRAM (Warm Tier)
256 GB
~250 ns latency
KV-Cache Entries
High attention score
Flash (Cold Tier)
4 TB
~50 μs latency
Evicted KV Entries
Low attention score
Firmware Controller
Per-head LRU | Prefetch | Score Tracking
Endpoint 2
DRAM (Warm Tier)
256 GB
~250 ns latency
KV-Cache Entries
High attention score
Flash (Cold Tier)
4 TB
~50 μs latency
Evicted KV Entries
Low attention score
Firmware Controller
Per-head LRU | Prefetch | Score Tracking
KV-Cache Access Flow
1. Request
GPU attention kernel needs K/V for token position P
2. HBM Check
Hit → Return (12ms for full scan)
Miss → CXL.mem fault
3. Endpoint Lookup
DRAM hit → 250 ns
Flash hit → 50 μs
Miss → Recompute
4. Prefetch
Position P accessed →
Prefetch [P-W, P+W]
(RoPE locality)
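The four-step flow above can be modeled as a simple tier walk. This is a sketch, not firmware: the tiers are plain dicts, the latencies come from the diagram, and the prefetch window `W` is an assumed parameter.

```python
# Illustrative model of the KV-cache access flow. Tier latencies are the
# diagram's approximate figures; data structures and W are assumptions.

HBM_NS, DRAM_NS, FLASH_NS = 0, 250, 50_000  # per-access latency, ns (approx.)
W = 64  # prefetch window around the faulting position (assumed)

def fetch_kv(pos, hbm, dram, flash):
    """Resolve K/V for token position `pos`; returns (kv, latency_ns)."""
    if pos in hbm:                      # 2. HBM check: hit
        return hbm[pos], HBM_NS
    if pos in dram:                     # 3. endpoint DRAM hit
        prefetch(pos, dram, flash)      # 4. prefetch [pos-W, pos+W]
        return dram[pos], DRAM_NS
    if pos in flash:                    # 3. endpoint flash hit
        dram[pos] = flash[pos]          # promote to the warm tier
        return dram[pos], FLASH_NS
    return None, 0                      # miss: caller recomputes K/V

def prefetch(pos, dram, flash):
    # RoPE locality: neighbors of pos are likely to be needed next.
    for p in range(pos - W, pos + W + 1):
        if p in flash and p not in dram:
            dram[p] = flash[p]
```

Note the asymmetry: a DRAM hit triggers neighborhood prefetch, while a flash hit promotes only the faulting entry, keeping the slow tier off the critical path as much as possible.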
Endpoint Cache Policy (Per KV-Head)
Access Tracking
• 640 LRU queues (8 heads × 80 layers)
• Entry: position, count, score_ema
• Metadata: 640 MB for 128K context
Score Integration
• Attention scores via CXL.io mailbox
• score_ema = α × new + (1-α) × old
• High score → Retain in DRAM
Eviction Priority
• f(recency_rank, score_ema)
• Low score + old → Flash
• High score → Keep in DRAM
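The three policy pieces above compose into one per-head structure: an LRU-ordered map whose values are the score EMA, with eviction picking the entry that is both old and low-scoring. A minimal sketch, where the blend weight `ALPHA` and the exact keep-priority weighting are assumptions (the diagram only specifies `f(recency_rank, score_ema)`):

```python
# Sketch of one of the 640 per-head LRU queues: EMA of attention scores
# plus recency decides DRAM retention vs. demotion to flash.
from collections import OrderedDict

ALPHA = 0.3  # EMA weight for the newest attention score (assumed)

class HeadCache:
    def __init__(self, dram_capacity):
        self.entries = OrderedDict()   # position -> score_ema, LRU-ordered
        self.capacity = dram_capacity

    def access(self, pos, attn_score):
        """Record an access: update recency and the score EMA."""
        old = self.entries.pop(pos, attn_score)  # new entries seed EMA
        # score_ema = alpha * new + (1 - alpha) * old
        self.entries[pos] = ALPHA * attn_score + (1 - ALPHA) * old

    def maybe_evict(self):
        """Demote one entry to flash if over capacity; return its position."""
        if len(self.entries) <= self.capacity:
            return None
        n = len(self.entries)
        # Keep-priority f(recency_rank, score_ema): rank 0 is the LRU
        # entry, so old AND low-score entries minimize the sum and lose.
        victim, _ = min(
            ((pos, rank / n + ema)
             for rank, (pos, ema) in enumerate(self.entries.items())),
            key=lambda t: t[1],
        )
        del self.entries[victim]
        return victim
```

The key behavior is that recency alone does not decide: an old entry with a high score EMA can outlive a newer entry that attention rarely touches.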
Tiered Capacity (4 Endpoints)
Hot (HBM)
192 GB
8 TB/s
Warm (CXL DRAM)
1 TB
4 × 256 GB @ ~250 ns
Cold (Flash)
16 TB
4 × 4 TB @ ~50 μs
Total
~17 TB
vs 192 GB HBM-only
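The capacity figures above are a straightforward sum; a quick sanity check of the arithmetic:

```python
# Sanity check on the capacity table: four endpoints, each with 256 GB of
# CXL DRAM and 4 TB of flash, plus 192 GB of on-package HBM.

GB = 1
TB = 1024 * GB

hbm = 192 * GB                 # hot tier
warm = 4 * 256 * GB            # 4 endpoints x 256 GB = 1 TB CXL DRAM
cold = 4 * 4 * TB              # 4 endpoints x 4 TB = 16 TB flash
total = hbm + warm + cold

print(f"total: {total / TB:.1f} TB")       # ~17.2 TB
print(f"vs HBM-only: {total / hbm:.0f}x")  # ~92x the 192 GB baseline
```

So the tiered design offers roughly 90× the addressable KV-cache of HBM alone, at the cost of the latency gradient shown in the table.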
KEY INSIGHT
GPU sees unified address space. Endpoint manages tier placement transparently.
CXL.mem provides load/store semantics — no explicit I/O commands.
Endpoint firmware handles caching, prefetching, and promotion/demotion.