GPU-CXL KV-Cache Interaction
Tiered Memory Architecture for Large Context Inference
GPU (B200)
Tensor Cores
4,500 TFLOPS
FP16 (with sparsity)
HBM3e (Hot Tier)
192 GB
8 TB/s
Active KV-Cache
Current Attention Window
GPU MMU
Page Fault
→ CXL.mem Request
CXL.mem
Load/Store Semantics
No Driver Per-Access
→ Data Response
Cache Line (64B)
↓ Read Request
Physical Address
CXL 3.0 Switch
Address Routing | Coherency Management | Multi-Endpoint Aggregation
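The switch's address routing can be pictured as way interleaving: consecutive chunks of the host physical address range are striped across the endpoints. A minimal sketch, assuming 4 endpoints, a 256-byte interleave granularity, and an illustrative base address (the real CXL interleave ways/granularity are configured per region):

```python
# Hypothetical sketch: route a host physical address to a CXL endpoint via
# way interleaving, as a CXL 3.0 switch might. The ways, granularity, and
# base address below are illustrative assumptions, not spec values.

INTERLEAVE_WAYS = 4            # 4 endpoints behind the switch
INTERLEAVE_GRANULARITY = 256   # bytes per interleave chunk (assumed)
CXL_BASE = 0x100_0000_0000     # start of the CXL.mem range (assumed)

def route(paddr: int) -> tuple:
    """Return (endpoint_id, device_offset) for a host physical address."""
    offset = paddr - CXL_BASE
    chunk = offset // INTERLEAVE_GRANULARITY
    endpoint = chunk % INTERLEAVE_WAYS
    # Device-local offset: collapse the chunks owned by this endpoint.
    dev_offset = (chunk // INTERLEAVE_WAYS) * INTERLEAVE_GRANULARITY \
                 + offset % INTERLEAVE_GRANULARITY
    return endpoint, dev_offset
```

Because routing is pure address arithmetic, the switch can forward each 64B cache-line request without any software in the path, which is what makes per-access load/store semantics viable.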
Endpoint 0
DRAM (Warm Tier)
256 GB
~250 ns latency
KV-Cache Entries
High attention score
Flash (Cold Tier)
4 TB
~50 μs latency
Evicted KV Entries
Low attention score
Firmware Controller
Per-head LRU | Prefetch | Score Tracking
Endpoint 1
DRAM (Warm Tier)
256 GB
~250 ns latency
KV-Cache Entries
High attention score
Flash (Cold Tier)
4 TB
~50 μs latency
Evicted KV Entries
Low attention score
Firmware Controller
Per-head LRU | Prefetch | Score Tracking
Endpoint 2
DRAM (Warm Tier)
256 GB
~250 ns latency
KV-Cache Entries
High attention score
Flash (Cold Tier)
4 TB
~50 μs latency
Evicted KV Entries
Low attention score
Firmware Controller
Per-head LRU | Prefetch | Score Tracking
KV-Cache Access Flow
1. Request
GPU attention kernel needs K/V for token position P
2. HBM Check
Hit → Return (12ms for full scan)
Miss → CXL.mem fault
3. Endpoint Lookup
DRAM hit → 250 ns
Flash hit → 50 μs
Miss → Recompute
4. Prefetch
Position P accessed →
Prefetch [P-W, P+W]
(RoPE locality)
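The four-step flow above can be modeled as a simple tier walk. This is a sketch, not firmware: the tiers are plain dicts, the latencies come from the diagram, and the prefetch window `W` is an assumed parameter.

```python
# Illustrative model of the KV-cache access flow. Tier latencies are the
# diagram's approximate figures; data structures and W are assumptions.

HBM_NS, DRAM_NS, FLASH_NS = 0, 250, 50_000  # per-access latency, ns (approx.)
W = 64  # prefetch window around the faulting position (assumed)

def fetch_kv(pos, hbm, dram, flash):
    """Resolve K/V for token position `pos`; returns (kv, latency_ns)."""
    if pos in hbm:                      # 2. HBM check: hit
        return hbm[pos], HBM_NS
    if pos in dram:                     # 3. endpoint DRAM hit
        prefetch(pos, dram, flash)      # 4. prefetch [pos-W, pos+W]
        return dram[pos], DRAM_NS
    if pos in flash:                    # 3. endpoint flash hit
        dram[pos] = flash[pos]          # promote to the warm tier
        return dram[pos], FLASH_NS
    return None, 0                      # miss: caller recomputes K/V

def prefetch(pos, dram, flash):
    # RoPE locality: neighbors of pos are likely to be needed next.
    for p in range(pos - W, pos + W + 1):
        if p in flash and p not in dram:
            dram[p] = flash[p]
```

Note the asymmetry: a DRAM hit triggers neighborhood prefetch, while a flash hit promotes only the faulting entry, keeping the slow tier off the critical path as much as possible.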
Endpoint Cache Policy (Per KV-Head)
Access Tracking
• 640 LRU queues (8 heads × 80 layers)
• Entry: position, count, score_ema
• Metadata: 640 MB for 128K context
Score Integration
• Attention scores via CXL.io mailbox
• score_ema = α × new + (1-α) × old
• High score → Retain in DRAM
Eviction Priority
• f(recency_rank, score_ema)
• Low score + old → Flash
• High score → Keep in DRAM
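The three policy pieces above compose into one per-head structure: an LRU-ordered map whose values are the score EMA, with eviction picking the entry that is both old and low-scoring. A minimal sketch, where the blend weight `ALPHA` and the exact keep-priority weighting are assumptions (the diagram only specifies `f(recency_rank, score_ema)`):

```python
# Sketch of one of the 640 per-head LRU queues: EMA of attention scores
# plus recency decides DRAM retention vs. demotion to flash.
from collections import OrderedDict

ALPHA = 0.3  # EMA weight for the newest attention score (assumed)

class HeadCache:
    def __init__(self, dram_capacity):
        self.entries = OrderedDict()   # position -> score_ema, LRU-ordered
        self.capacity = dram_capacity

    def access(self, pos, attn_score):
        """Record an access: update recency and the score EMA."""
        old = self.entries.pop(pos, attn_score)  # new entries seed EMA
        # score_ema = alpha * new + (1 - alpha) * old
        self.entries[pos] = ALPHA * attn_score + (1 - ALPHA) * old

    def maybe_evict(self):
        """Demote one entry to flash if over capacity; return its position."""
        if len(self.entries) <= self.capacity:
            return None
        n = len(self.entries)
        # Keep-priority f(recency_rank, score_ema): rank 0 is the LRU
        # entry, so old AND low-score entries minimize the sum and lose.
        victim, _ = min(
            ((pos, rank / n + ema)
             for rank, (pos, ema) in enumerate(self.entries.items())),
            key=lambda t: t[1],
        )
        del self.entries[victim]
        return victim
```

The key behavior is that recency alone does not decide: an old entry with a high score EMA can outlive a newer entry that attention rarely touches.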
Tiered Capacity (4 Endpoints)
Hot (HBM)
192 GB
8 TB/s
Warm (CXL DRAM)
1 TB
4 × 256 GB @ ~250 ns
Cold (Flash)
16 TB
4 × 4 TB @ ~50 μs
Total
~17 TB
vs 192 GB HBM-only
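The capacity figures above are a straightforward sum; a quick sanity check of the arithmetic:

```python
# Sanity check on the capacity table: four endpoints, each with 256 GB of
# CXL DRAM and 4 TB of flash, plus 192 GB of on-package HBM.

GB = 1
TB = 1024 * GB

hbm = 192 * GB                 # hot tier
warm = 4 * 256 * GB            # 4 endpoints x 256 GB = 1 TB CXL DRAM
cold = 4 * 4 * TB              # 4 endpoints x 4 TB = 16 TB flash
total = hbm + warm + cold

print(f"total: {total / TB:.1f} TB")       # ~17.2 TB
print(f"vs HBM-only: {total / hbm:.0f}x")  # ~92x the 192 GB baseline
```

So the tiered design offers roughly 90× the addressable KV-cache of HBM alone, at the cost of the latency gradient shown in the table.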
KEY INSIGHT
GPU sees unified address space. Endpoint manages tier placement transparently.
CXL.mem provides load/store semantics — no explicit I/O commands.
Endpoint firmware handles caching, prefetching, and promotion/demotion.