CXL 3.0 · UCIe · UEC

KV Cache Offloading for LLM Inference

Distributed endpoint architecture with intelligent caching, attention-aware eviction, and CXL.mem acceleration

Per-Head Eviction
EMA Attention Scoring
RoPE-Aware Prefetch
Sam Pooni
UC Santa Cruz · zett.ai · San Jose, CA
Version 3.0 · December 2025
Figure 1

The Memory Wall Problem

Large language model inference faces a fundamental bottleneck: the memory required to serve long-context requests vastly exceeds what fits in GPU high-bandwidth memory (HBM).

192 GB
B200 HBM Capacity
140 GB
Llama-70B Weights
41 GB
KV-Cache @ 128K
468 GB
Total for 8 Users
CAPACITY WALL
8 users @ 128K exceeds memory by
2.4×
BANDWIDTH WALL
CXL vs PCIe penalty
65×
NVIDIA B200 Specifications
4,500
TFLOPS (FP16)
8
TB/s HBM3e
192
GB Capacity
64
GB/s PCIe 5.0
Figure 2

KV-Cache Size vs Context Length

KV-cache = 2 × L × H × D × S × bytes
Llama-70B: 2 × 80 × 8 × 128 × S × 2 = 320 KB/token
Context Length   KV-Cache Size   % of B200 HBM   Status
4K tokens        1.3 GB          0.7%            ✓ Fits easily
32K tokens       10 GB           5.2%            ✓ Comfortable
128K tokens      41 GB           21%             ⚠ Tight
512K tokens      164 GB          85%             ✗ Impossible
1M tokens        328 GB          171%            ✗ Impossible
The Scaling Crisis: Context lengths are expanding rapidly (GPT-4: 128K, Claude: 200K, Gemini: 1M+). KV-cache requirements grow linearly with context, but GPU memory remains fixed.
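The Figure 2 formula can be sanity-checked in a few lines. This is a minimal sketch using the Llama-70B shape given in the text (80 layers, 8 KV heads under GQA, head dimension 128, FP16); the function names are illustrative.

```python
# Per-token KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes.
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV-cache appended per token (Llama-70B defaults from the text)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_cache_gib(context_tokens, **shape):
    """Total KV-cache footprint in GiB for a single sequence."""
    return kv_bytes_per_token(**shape) * context_tokens / 2**30

per_token = kv_bytes_per_token()      # 327,680 B = 320 KB/token
size_128k = kv_cache_gib(128 * 1024)  # ~40 GiB for a 128K-token sequence
```

The linear growth in context length is the whole story: doubling the context doubles the cache, while HBM capacity stays fixed.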
Figure 3

CXL Endpoint Architecture

A distributed endpoint is a CXL 3.0 Type-3 device combining memory, compute, and control logic into a single package.

DISTRIBUTED ENDPOINT PACKAGE (UCIe Integrated)
CXL 3.0 PROTOCOL ENGINE
CXL.mem
HDM-D/HDM-DB
CXL.io
Mailbox/Config
CXL.cache
Coherency
UCIe 1.1 — 1+ TB/s Die-to-Die Interconnect
MEMORY CONTROLLER
CH 0-3: DDR5-6400
CH 4-7: DDR5-6400
8ch × 51.2 GB/s = 409.6 GB/s
COMPUTE CHIPLET
Core 0-3: ARM A78 @ 3GHz
Core 4-7: ARM A78 @ 3GHz
L3 Cache — 8 MB Shared
CONTROL & POLICY
EMA Scoring Engine
Per-Head Access Tracker
RoPE Prefetch Queue
512 GB
DDR5 DRAM
16 TB
NVMe Flash
~250 ns
CXL Latency
80 W
Typical Power
Figure 4

Tiered Memory Architecture

Hot (HBM)
192 GB
8 TB/s
Warm (CXL)
1 TB
~250 ns
Cold (Flash)
16 TB
~25 μs
Total
~17 TB

Bandwidth Hierarchy

UCIe Internal
1+ TB/s
DDR5 Local
409.6 GB/s
CXL External
32-64 GB/s
NVMe Flash
~14 GB/s
Key Insight
GPU sees unified address space. Endpoint manages tier placement transparently.
CXL.mem provides load/store semantics—no explicit I/O commands, no DMA setup, no driver intervention.
Figure 5

CXL 3.0 Coherency Protocol

CXL 3.0 provides hardware-managed coherency through the Back-Invalidate (BI) protocol.

Invalid
GPU cache empty
Endpoint authoritative
Shared
GPU has read copy
Endpoint authoritative
Exclusive
GPU can write
Endpoint stale
Modified
GPU has dirty data
Must writeback

GPU → Endpoint Writes

During prefill, the GPU writes new KV entries. The endpoint receives each write with its data over CXL.mem (an M2S RwD message), updates local DRAM, and clears stale metadata.

Endpoint → GPU Invalidation

When the endpoint evicts entries to flash, it issues a BI-Snoop (BISnp). The GPU must write back any dirty data before acknowledging.

Concurrent SM Access: Multiple GPU SMs accessing the same KV-head are serialized at L2 cache. Endpoint sees unified coherent view—no per-SM tracking required.
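The state diagram and the two flows above can be condensed into a small sketch. The state names follow Figure 5; the class and its methods are illustrative, not part of the CXL specification.

```python
# Minimal model of the Figure 5 coherency states and the Back-Invalidate
# (BI) flow: before the endpoint evicts a line to flash, a Modified copy
# in the GPU cache must be written back.
INVALID, SHARED, MODIFIED = "I", "S", "M"

class CacheLine:
    def __init__(self):
        self.gpu_state = INVALID
        self.endpoint_authoritative = True   # endpoint copy is current

    def gpu_read(self):
        if self.gpu_state == INVALID:
            self.gpu_state = SHARED          # endpoint stays authoritative

    def gpu_write(self):
        self.gpu_state = MODIFIED            # endpoint copy is now stale
        self.endpoint_authoritative = False

    def bi_snoop(self):
        """Endpoint-issued BI-Snoop prior to evicting the line to flash."""
        if self.gpu_state == MODIFIED:
            self.endpoint_authoritative = True  # GPU writes back dirty data
        self.gpu_state = INVALID             # GPU drops its copy, then acks

line = CacheLine()
line.gpu_write()     # a decode step updates a KV entry in GPU cache
line.bi_snoop()      # endpoint reclaims the line before flash eviction
```

The Exclusive state is omitted for brevity; in this sketch a write moves the line straight to Modified.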
Figure 6

Attention Mechanisms: MHA vs GQA vs MQA

MHA
Multi-Head Attention
Q heads = K heads = V heads
64 KV heads
Full memory cost
GQA
Grouped Query Attention
Multiple Q share K/V
8 KV heads
8× memory savings
MQA
Multi-Query Attention
All Q share single K/V
1 KV head
Quality tradeoff
8 KV-heads × 80 layers = 640 independent eviction policies
640
LRU Queues
131K
Entries/Queue
8 B
Bytes/Entry
640 MB
Total Metadata
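The metadata budget in Figure 6 follows directly from the queue count. A back-of-envelope check, using only numbers stated in the text:

```python
# One eviction queue per KV-head per layer; one 8-byte entry per cached token.
kv_heads, layers = 8, 80
queues = kv_heads * layers            # 640 independent eviction queues
entries_per_queue = 128 * 1024        # one entry per token at 128K context
entry_bytes = 8

total_mib = queues * entries_per_queue * entry_bytes / 2**20   # 640 MiB
overhead_pct = 100 * (total_mib / 1024) / 40.0   # ~1.6% of a ~40 GiB KV-cache
```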
Figure 7

EMA-Based Eviction Algorithm

Why LRU fails: LRU assumes recent access predicts future access. Attention violates this—a token at position 1,000 may not be accessed until position 100,000, but remains critically important.

score_ema = α × new_score + (1 − α) × score_ema
α → 1.0 (Reactive)
Trust recent scores. Good for bursty access patterns.
α → 0.1 (Stable)
Trust history. Good for persistent anchors.
priority = (1 − score_ema) × recency_decay
Higher priority → evict sooner
Token A: Important Anchor
Position: 1,024
Last access: 50 steps ago
score_ema: 0.211
recency_decay: 0.049
priority: 0.039
✓ KEEP IN CACHE
Token B: Low Attention
Position: 45,678
Last access: 2,000 steps ago
score_ema: 0.08
recency_decay: 0.865
priority: 0.796
🗑 EVICT TO FLASH
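The two worked examples above can be reproduced with a short sketch. The EMA update and the priority formula come from the text; the exponential form of recency_decay, with a time constant of 1,000 steps, is an assumption chosen because it matches both Token A and Token B.

```python
import math

def ema_update(score_ema, new_score, alpha=0.3):
    """EMA of attention scores; larger alpha trusts recent scores more."""
    return alpha * new_score + (1 - alpha) * score_ema

def recency_decay(steps_since_access, tau=1000.0):
    """Assumed form: grows from 0 (just touched) toward 1 (long untouched)."""
    return 1.0 - math.exp(-steps_since_access / tau)

def evict_priority(score_ema, steps_since_access):
    """Higher priority -> evict sooner."""
    return (1.0 - score_ema) * recency_decay(steps_since_access)

p_anchor = evict_priority(0.211, 50)    # Token A: ~0.039 -> keep in cache
p_stale = evict_priority(0.08, 2000)    # Token B: ~0.796 -> evict to flash
```

Note how the anchor survives despite 50 steps without access: its non-trivial score_ema suppresses the priority, which is exactly where plain LRU would go wrong.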
Figure 8

RoPE-Aware Prefetch Strategy

🔄 Rotary Encoding
RoPE encodes position by rotating Q/K vectors. Nearby positions have similar rotations → higher dot product → higher attention.
📍 Locality Bias
On average, ~70% of attention mass falls within ±W positions of the query token.
🎯 Predictable Access
If GPU requests position P, it will likely need P±W soon. Prefetch proactively.
Prefetch Rule: GPU accesses position P → Prefetch [P − W, P + W]
85%
Cache Hit Rate
72%
Attention Captured
3.2×
Latency Reduction
1.4×
BW Overhead
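The prefetch rule itself is a one-liner. A minimal sketch, where the window size and the resident-set dedup are illustrative choices rather than values from the text:

```python
# Figure 8 rule: on a demand access to position P, enqueue [P - W, P + W],
# skipping positions already resident in the warm tier.
def prefetch_window(p, w, seq_len, already_resident):
    lo, hi = max(0, p - w), min(seq_len - 1, p + w)
    return [pos for pos in range(lo, hi + 1) if pos not in already_resident]

resident = {1000}  # the demand-fetched position itself
to_fetch = prefetch_window(1000, w=4, seq_len=131072,
                           already_resident=resident)
# fetches positions 996..1004, except the already-resident 1000
```

Deduplicating against resident entries is what keeps the bandwidth overhead bounded (the 1.4× figure above) even when consecutive queries hit overlapping windows.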
Figure 9

Prefill vs Decode Phase Characteristics

⚡ Prefill Phase
Bottleneck: Compute-bound
Access Pattern: Sequential writes
KV Operations: Write-only (populate cache)
Arithmetic Intensity: High (~100 FLOP/byte)
Batching: Full sequence parallel
Strategy: Stream writes directly to CXL. No eviction needed—all entries are new.
🔄 Decode Phase
Bottleneck: Memory-bound
Access Pattern: Random reads + 1 write
KV Operations: Read all + append one
Arithmetic Intensity: Low (~0.5 FLOP/byte)
Batching: Token-by-token
Strategy: Active EMA eviction + RoPE prefetch. This is where caching matters.
Phase Detection
Endpoint monitors write/read ratio. When reads exceed writes by 10×, switch to decode-optimized policy.
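The phase detector described above is simple enough to sketch. The 10× read/write threshold comes from the text; the sliding-window length and class shape are illustrative.

```python
from collections import deque

class PhaseDetector:
    """Flips to the decode policy when reads outnumber writes 10x."""
    def __init__(self, threshold=10.0, window=1024):
        self.threshold = threshold
        self.ops = deque(maxlen=window)   # True = read, False = write

    def record(self, is_read):
        self.ops.append(is_read)

    def phase(self):
        if not self.ops:
            return "prefill"
        reads = sum(self.ops)
        writes = len(self.ops) - reads
        if writes == 0 or reads / writes >= self.threshold:
            return "decode"
        return "prefill"

det = PhaseDetector()
for _ in range(200):
    det.record(False)          # prefill: streaming sequential KV writes
prefill_phase = det.phase()
for _ in range(1000):
    det.record(True)           # decode: reads of cached KV dominate
decode_phase = det.phase()
```

A sliding window, rather than lifetime counters, lets the detector flip back to the prefill policy when a new long prompt arrives mid-stream.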
Figure 10

KV-Cache Quantization Support

Modern inference increasingly uses quantized KV-caches. The endpoint supports transparent compression:

FP16
Baseline
40 GB
FP8
<0.1% loss
20 GB
INT8
<0.5% loss
20 GB
INT4
~1% loss
10 GB
48
GB/s FP16→INT8
32
GB/s FP16→INT4
64
GB/s INT8→FP16
Transparent compression: GPU writes FP16 → Endpoint stores INT8 → GPU reads FP16. Compression invisible to inference stack.
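One plausible shape for the FP16→INT8 path is symmetric quantization with one scale per KV block. The text specifies only the formats and the transparency property, so the per-block scaling scheme below is an assumption:

```python
# Sketch: GPU writes FP16, endpoint stores INT8 plus one scale per block,
# and dequantizes on reads. Symmetric per-block scaling is an assumption.
def quantize_int8(values):
    """FP16 block -> (int8 codes, scale)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """INT8 block -> FP16-like floats served back to the GPU."""
    return [c * scale for c in codes]

kv_block = [0.5, -1.25, 0.031, 2.0]
codes, scale = quantize_int8(kv_block)
restored = dequantize_int8(codes, scale)   # each value within one scale step
```

Keeping the scale per block (rather than per tensor) bounds the error contribution of outlier keys, which is consistent with the sub-1% loss figures in the table above.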
Figure 11

Latency: CXL.mem vs PCIe Baseline

PCIe DMA Transfer
~13 μs
CPU in critical path
Driver + DMA setup overhead
CXL.mem Direct
~250 ns
Load/store semantics
Zero software overhead
65× Latency Improvement
By eliminating the software stack
Stage                        CXL.mem Latency
GPU MMU handling             ~50 ns
CXL protocol processing      ~30 ns
PCIe transmission            ~70 ns
Endpoint memory access       ~50 ns
Total                        ~250 ns
Figure 12

Layer Prefetch Pipeline

Single Endpoint Limitation: One CXL x16 Gen5 link provides 64 GB/s. For a 1.75 GB layer, the transfer takes 27.3 ms while the layer compute takes only 5.5 ms, so the GPU spends ~5× longer waiting than computing.
Endpoints   Aggregate BW   Layer Transfer   Result
1           64 GB/s        27.3 ms          ✗ GPU stalls (5× slower)
3           192 GB/s       9.1 ms           ⚠ Borderline
5           320 GB/s       5.5 ms           ✓ Prefetch beats compute
🖥 GPU Compute
Layer N
Layer N+1
Layer N+2
Layer N+3
📡 CXL Prefetch
N+1
N+2
N+3
N+4
Pipeline Efficiency
With 5 endpoints, prefetch completes before compute finishes. Zero GPU stalls.
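The endpoint-count table reduces to one division. A back-of-envelope check using only the numbers in Figure 12:

```python
# Layer transfer time over N aggregated CXL x16 Gen5 links (64 GB/s each)
# versus the 5.5 ms layer compute time, for the 1.75 GB layer from the text.
LAYER_GB = 1.75
COMPUTE_MS = 5.5
LINK_GBPS = 64.0

def transfer_ms(endpoints):
    return LAYER_GB / (endpoints * LINK_GBPS) * 1000.0

def gpu_stalls(endpoints):
    """True if the prefetch cannot hide behind layer compute."""
    return transfer_ms(endpoints) > COMPUTE_MS

t1 = transfer_ms(1)   # ~27.3 ms -> GPU stalls ~5x
t5 = transfer_ms(5)   # ~5.5 ms  -> prefetch keeps pace with compute
```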
Figure 13

Software Integration Stack

Application Layer
vLLM
PagedAttention
TensorRT-LLM
Plugin API
SGLang
RadixAttention
↓
Runtime Layer
libcxl_kv
KV Allocation API
CUDA UVM
Unified Memory
Hint Interface
Policy Params
↓
Driver Layer
CXL Driver
Linux 6.8+
NVIDIA Driver
CXL.mem Support
↓
Hardware Layer
B200 GPU
CXL 3.0 Host
CXL Switch
Multi-Endpoint
Endpoints
Type-3 Devices
No kernel changes required. CXL memory appears as normal GPU-accessible memory. Framework changes limited to allocator layer.
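Figure 13 names libcxl_kv but gives no API, so everything below is hypothetical: a sketch of what the allocator-layer surface could look like to a serving framework. None of these names are a real API.

```python
# Hypothetical libcxl_kv-style allocator surface. The framework asks for
# KV pages and passes policy hints; tier placement stays inside the endpoint.
class CxlKvAllocator:
    def __init__(self, page_bytes=2 * 1024 * 1024):
        self.page_bytes = page_bytes
        self.pages = {}                      # handle -> (layer, kv_head)

    def alloc_kv_page(self, layer, head):
        """Return an opaque handle; the backing tier is the endpoint's choice."""
        handle = len(self.pages)
        self.pages[handle] = (layer, head)
        return handle

    def hint(self, handle, ema_alpha=None, prefetch_window=None):
        """Advisory per-page policy hints (EMA alpha, RoPE window)."""
        return {"ema_alpha": ema_alpha, "prefetch_window": prefetch_window}

alloc = CxlKvAllocator()
h = alloc.alloc_kv_page(layer=0, head=3)
alloc.hint(h, ema_alpha=0.3, prefetch_window=256)
```

The key design point the sketch illustrates is the claim above: the framework never names a tier, so the integration stays confined to the allocator layer.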
Figure 14

Performance Sensitivity to Cache Hit Rate

Hit Rate   Latency
95%        28 ms
90%        45 ms
85%        72 ms
80%        98 ms
70%        185 ms
60%        340 ms
Critical Threshold: Below 75% hit rate, flash access latency dominates. EMA + RoPE prefetch maintains 85%+ hit rate for typical workloads.
Workload                  P50     P95      P99      P99.9
Conversational (4K avg)   8 ms    15 ms    28 ms    85 ms
Document QA (32K avg)     25 ms   45 ms    95 ms    250 ms
Long-context (128K)       45 ms   120 ms   350 ms   1.2 s
Figure 15

Power & Thermal Analysis

🔥
2,800 W
4× B200 GPUs
Liquid cooling required
❄
1,100 W
1× B200 + 5× Endpoints
Air cooling sufficient
Endpoint Component     Typical Power   Peak Power
DDR5 (8 channels)      40 W            60 W
ARM A78 cores (8×)     15 W            25 W
CXL PHY + controller   12 W            18 W
UCIe interface         8 W             12 W
NVMe controller        5 W             8 W
Total per Endpoint     80 W            123 W
2.5×
Power Efficiency
Air
Cooling Type
45°C
Junction Temp
1U
Form Factor
Figure 16

Total Cost of Ownership (3-Year)

GPU-Only (8×B200)
Hardware$240,000
Power (3yr)$147,000
Cooling$50,000
Rack Space$36,000
Total$473,000
1× GPU + 5× Endpoints
Hardware$42,500
Power (3yr)$58,000
Cooling$10,000
Rack Space$18,000
Total$128,500
3.7×
TCO Reduction
73%
TCO Savings
8 mo
Payback Period
$0.0004
Cost/Token
Break-Even Analysis
Endpoint architecture becomes cost-effective when: context_length > 16K tokens AND request_rate > 10 req/min
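The headline ratios follow from the two cost tables. A quick check using only the dollar figures stated above:

```python
# 3-year totals for the two builds from Figure 16, and the derived ratios.
gpu_only = {"hardware": 240_000, "power": 147_000,
            "cooling": 50_000, "rack": 36_000}
endpoint = {"hardware": 42_500, "power": 58_000,
            "cooling": 10_000, "rack": 18_000}

total_gpu = sum(gpu_only.values())               # $473,000
total_ep = sum(endpoint.values())                # $128,500
tco_reduction = total_gpu / total_ep             # ~3.7x
savings_pct = 100 * (1 - total_ep / total_gpu)   # ~73%
```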
Figure 17

The Innovation Gap

Existing solutions address pieces of the problem. Nobody combines all four innovations:

📢
Per-Head Tracking
640 independent queues
📈
EMA Scoring
Attention-aware eviction
🧭
RoPE Prefetch
Position locality
🧠
Endpoint AI
Controller-resident logic
Existing Solution      Type                   What's Missing
Samsung CMM-D/CMM-B    CXL 2.0 Type-3         No compute, no intelligence
XConn Apollo + GISMO   CXL 3.0 Switch         Pooling only, no eviction policy
vLLM PagedAttention    Software               Still GPU-memory limited
FlexGen                CPU/Disk offload       High latency (~10 s)
InfiniGen              Speculative prefetch   CPU-based, limited bandwidth
CXL-SpecKV             CXL + speculation      No per-head tracking
Figure 18

Summary: Key Results

65×
Latency vs PCIe
85%
Cache Hit Rate
3.7×
TCO Reduction
17 TB
Effective Capacity
Component              Specification
Interface              CXL 3.0 Type-3 (memory expander)
Internal bandwidth     UCIe: 1+ TB/s
External bandwidth     CXL: 32-64 GB/s per link
Memory capacity        DDR5: 256-512 GB per endpoint
Compute                ARM/RISC-V cores for policy execution
Tracking granularity   Per KV-head per layer (640 queues)
Eviction policy        EMA attention score + recency decay
Prefetch strategy      RoPE-aware window [P−W, P+W]
Metadata overhead      ~640 MB for 128K context (~1.6%)
The Core Insight
GPU handles parallel arithmetic. Endpoint handles memory management.
The division matches each architecture to its strengths—enabling long-context LLM inference at scale.