Figure 1
The Memory Wall Problem
Large language model inference faces a fundamental bottleneck: the memory required to serve long-context requests vastly exceeds what fits in GPU high-bandwidth memory (HBM).
CAPACITY WALL
8 users @ 128K exceeds memory by
2.4×
NVIDIA B200 Specifications
Figure 2
KV-Cache Size vs Context Length
KV-cache size = 2 × L × H × D × S × bytes (K and V, per layer, per KV-head)
Llama-70B, per token: 2 × 80 × 8 × 128 × 2 bytes = 320 KB/token
| Context Length | KV-Cache Size | % of B200 HBM | Status |
|---|---|---|---|
| 4K tokens | 1.3 GB | 0.7% | ✓ Fits easily |
| 32K tokens | 10 GB | 5.2% | ✓ Comfortable |
| 128K tokens | 41 GB | 21% | ⚠ Tight |
| 512K tokens | 164 GB | 85% | ✗ No room left for weights |
| 1M tokens | 328 GB | 171% | ✗ Impossible |
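The sizing arithmetic above can be checked in a few lines. This is a sketch of the figure's formula; the 192 GB HBM capacity is taken from Figure 4.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """KV-cache = 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

per_token = kv_cache_bytes(1)            # 327,680 bytes = 320 KB/token
cache_128k = kv_cache_bytes(128 * 1024)  # ~41 GB at 128K context
hbm_bytes = 192 * 1024**3                # B200 HBM (Figure 4)

print(per_token // 1024, "KB/token")
print(round(cache_128k / hbm_bytes * 100), "% of B200 HBM at 128K")
```

Running the same function for each row reproduces the table's percentages.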
The Scaling Crisis: Context lengths are expanding rapidly (GPT-4: 128K, Claude: 200K, Gemini: 1M+). KV-cache requirements grow linearly with context, but GPU memory remains fixed.
Figure 3
CXL Endpoint Architecture
A distributed endpoint is a CXL 3.0 Type-3 device combining memory, compute, and control logic into a single package.
DISTRIBUTED ENDPOINT PACKAGE (UCIe Integrated)
UCIe 1.1 — 1+ TB/s Die-to-Die Interconnect
MEMORY CONTROLLER
CH 0-3: DDR5-6400
CH 4-7: DDR5-6400
8ch × 51.2 GB/s = 409.6 GB/s
COMPUTE CHIPLET
Core 0-3: ARM A78 @ 3GHz
Core 4-7: ARM A78 @ 3GHz
L3 Cache — 8 MB Shared
CONTROL & POLICY
EMA Scoring Engine
Per-Head Access Tracker
RoPE Prefetch Queue
Figure 4
Tiered Memory Architecture
Hot (HBM)
192 GB
8 TB/s
Warm (CXL)
1 TB
~250 ns
Cold (Flash)
16 TB
~25 μs
Total
~17 TB
Bandwidth Hierarchy
Key Insight
GPU sees unified address space. Endpoint manages tier placement transparently.
CXL.mem provides load/store semantics—no explicit I/O commands, no DMA setup, no driver intervention.
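The "transparent tiering" idea above can be sketched as follows. The tier names come from Figure 4; the promote-on-access rule and the block-ID scheme are illustrative assumptions, not part of the design as stated.

```python
# Sketch: the GPU issues plain loads into one address space; the endpoint's
# policy decides which tier backs each KV block. Promote-on-access is an
# assumed policy detail for illustration.
class TieredStore:
    def __init__(self):
        self.tier = {}                      # block id -> "HBM"/"CXL"/"Flash"

    def place(self, block, tier):
        self.tier[block] = tier             # policy decision, invisible to GPU

    def load(self, block):
        t = self.tier.get(block, "CXL")     # default: warm tier
        if t == "Flash":                    # cold miss: promote before serving
            t = "CXL"
            self.tier[block] = t
        return t

store = TieredStore()
store.place("kv:layer12:head3", "Flash")
print(store.load("kv:layer12:head3"))       # served from CXL after promotion
```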
Figure 5
CXL 3.0 Coherency Protocol
CXL 3.0 provides hardware-managed coherency through the Back-Invalidate (BI) protocol.
Invalid
GPU cache empty
Endpoint authoritative
Shared
GPU has read copy
Endpoint authoritative
Exclusive
GPU can write
Endpoint stale
Modified
GPU has dirty data
Must writeback
GPU → Endpoint Writes
During prefill, GPU writes new KV entries. Endpoint receives D2H Write with data, updates local DRAM and clears stale metadata.
Endpoint → GPU Invalidation
When endpoint evicts entries to flash, it issues BI-Snoop. GPU must writeback dirty data before acknowledging.
Concurrent SM Access: Multiple GPU SMs accessing the same KV-head are serialized at L2 cache. Endpoint sees unified coherent view—no per-SM tracking required.
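The state flow in Figure 5 can be modeled as a small per-line state machine. This is a simplification: a real CXL 3.0 agent tracks this per cache line in hardware, and the transition set here covers only the read, write, and BI-snoop paths described above.

```python
# Minimal model of the Figure 5 coherency states and the Back-Invalidate flow.
class KVLine:
    def __init__(self):
        self.state = "Invalid"   # Invalid / Shared / Exclusive / Modified
        self.writebacks = 0

    def gpu_read(self):
        if self.state == "Invalid":
            self.state = "Shared"          # endpoint supplies a read copy

    def gpu_write(self):
        self.state = "Modified"            # GPU now holds dirty data

    def bi_snoop(self):
        """Endpoint evicts the backing entry to flash: force writeback."""
        if self.state == "Modified":
            self.writebacks += 1           # dirty data returns to endpoint
        self.state = "Invalid"             # GPU copy is invalidated

line = KVLine()
line.gpu_write()     # prefill: D2H write -> Modified
line.bi_snoop()      # endpoint evicts -> writeback, then Invalid
print(line.state, line.writebacks)
```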
Figure 6
Attention Mechanisms: MHA vs GQA vs MQA
MHA
Multi-Head Attention
Q heads = K heads = V heads
64 KV heads
Full memory cost
GQA
Grouped Query Attention
Multiple Q share K/V
8 KV heads
8× memory savings
MQA
Multi-Query Attention
All Q share single K/V
1 KV head
Quality tradeoff
8 KV-heads × 80 layers = 640 independent eviction policies
Figure 7
EMA-Based Eviction Algorithm
Why LRU fails: LRU assumes recent access predicts future access. Attention violates this—a token at position 1,000 may not be accessed until position 100,000, but remains critically important.
score_ema ← α × new_score + (1 − α) × score_ema
α → 1.0 (Reactive)
Trust recent scores. Good for bursty access patterns.
α → 0.1 (Stable)
Trust history. Good for persistent anchors.
priority = (1 − score_ema) × recency_decay
Higher priority → evict sooner
Token A: Important Anchor
| Position | 1,024 |
| Last access | 50 steps ago |
| score_ema | 0.211 |
| recency_decay | 0.049 |
| priority | 0.039 |
✓ KEEP IN CACHE
Token B: Low Attention
| Position | 45,678 |
| Last access | 2,000 steps ago |
| score_ema | 0.08 |
| recency_decay | 0.865 |
| priority | 0.796 |
🗑 EVICT TO FLASH
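A minimal implementation of the scoring above. The figure gives sampled recency_decay values but not the formula; 1 − exp(−Δt/τ) with τ = 1000 reproduces both worked examples, so that form is assumed here.

```python
import math

def ema_update(score_ema, new_score, alpha):
    """score_ema <- alpha * new_score + (1 - alpha) * score_ema."""
    return alpha * new_score + (1 - alpha) * score_ema

def recency_decay(steps_since_access, tau=1000.0):
    # Assumed form: matches the figure's 0.049 (50 steps) and 0.865 (2,000 steps).
    return 1.0 - math.exp(-steps_since_access / tau)

def eviction_priority(score_ema, steps_since_access):
    """priority = (1 - score_ema) * recency_decay; higher -> evict sooner."""
    return (1.0 - score_ema) * recency_decay(steps_since_access)

# Token A (important anchor) vs Token B (low attention), per the figure:
print(round(eviction_priority(0.211, 50), 3))    # ~0.039: keep in cache
print(round(eviction_priority(0.08, 2000), 3))   # ~0.796: evict to flash
```

With 8 KV-heads × 80 layers, one such priority queue would run independently per head per layer (640 in total).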
Figure 8
RoPE-Aware Prefetch Strategy
🔄 Rotary Encoding
RoPE encodes position by rotating Q/K vectors. Nearby positions have similar rotations → higher dot product → higher attention.
📍 Locality Bias
On average, ~70% of attention mass falls within ±W positions of the query token.
🎯 Predictable Access
If GPU requests position P, it will likely need P±W soon. Prefetch proactively.
Prefetch Rule: GPU accesses position P → Prefetch [P − W, P + W]
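The prefetch rule is a one-liner; the only subtlety is clamping the window to valid positions. A sketch, with W as a tunable parameter:

```python
def prefetch_window(p, w, seq_len):
    """Figure 8 rule: an access at position p triggers prefetch of
    [p - w, p + w], clamped to the valid position range [0, seq_len)."""
    lo = max(0, p - w)
    hi = min(seq_len - 1, p + w)
    return list(range(lo, hi + 1))

print(prefetch_window(100, 3, 100_000))   # positions 97..103
print(prefetch_window(2, 4, 100_000))     # clamped at the left edge: 0..6
```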
Figure 9
Prefill vs Decode Phase Characteristics
⚡ Prefill Phase
| Bottleneck | Compute-bound |
| Access Pattern | Sequential writes |
| KV Operations | Write-only (populate cache) |
| Arithmetic Intensity | High (~100 FLOP/byte) |
| Batching | Full sequence parallel |
Strategy: Stream writes directly to CXL. No eviction needed—all entries are new.
🔄 Decode Phase
| Bottleneck | Memory-bound |
| Access Pattern | Random reads + 1 write |
| KV Operations | Read all + append one |
| Arithmetic Intensity | Low (~0.5 FLOP/byte) |
| Batching | Token-by-token |
Strategy: Active EMA eviction + RoPE prefetch. This is where caching matters.
Phase Detection
Endpoint monitors write/read ratio. When reads exceed writes by 10×, switch to decode-optimized policy.
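The phase-detection heuristic above is easy to state in code. This sketch assumes the counters are sampled over a sliding window; the handling of a write-free window is an assumption, not specified in the figure.

```python
def detect_phase(reads, writes, threshold=10.0):
    """Figure 9 heuristic: switch to decode policy once reads outnumber
    writes by 10x in the observed window."""
    if writes == 0:
        # Assumed: any read activity with zero writes counts as decode.
        return "decode" if reads > 0 else "prefill"
    return "decode" if reads / writes >= threshold else "prefill"

print(detect_phase(reads=5, writes=4096))    # prefill: streaming KV writes
print(detect_phase(reads=40960, writes=1))   # decode: read-all, append-one
```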
Figure 10
KV-Cache Quantization Support
Modern inference increasingly uses quantized KV-caches. The endpoint supports transparent compression:
Transparent compression: GPU writes FP16 → Endpoint stores INT8 → GPU reads FP16. Compression invisible to inference stack.
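The figure does not name a quantization scheme, so the sketch below uses simple symmetric per-block INT8 (scale = max|x| / 127) purely to illustrate the FP16 → INT8 → FP16 round trip; plain Python floats stand in for FP16.

```python
def quantize_int8(values):
    """Symmetric per-block INT8 quantization (illustrative scheme)."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # avoid 0 for all-zero blocks
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

kv = [0.5, -1.25, 0.03125, 2.0]   # values written by the GPU
q, s = quantize_int8(kv)          # endpoint stores INT8 plus one scale
restored = dequantize_int8(q, s)  # GPU reads back dequantized values
print(max(abs(a - b) for a, b in zip(kv, restored)))   # error bounded by scale/2
```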
Figure 11
Latency: CXL.mem vs PCIe Baseline
PCIe DMA Transfer
~13 μs
CPU in critical path
Driver + DMA setup overhead
CXL.mem Direct
~250 ns
Load/store semantics
Zero software overhead
~52× Latency Improvement
By eliminating the software stack
| Stage | CXL.mem Latency |
|---|---|
| GPU MMU handling | ~50 ns |
| CXL protocol processing | ~30 ns |
| PCIe transmission | ~70 ns |
| Endpoint memory access | ~50 ns |
| Total (incl. queuing overhead) | ~250 ns |
Figure 12
Layer Prefetch Pipeline
Single Endpoint Limitation: CXL x16 Gen5 = 64 GB/s. For 1.75 GB layer, transfer = 27.3 ms. Layer compute = 5.5 ms. GPU stalls 5× waiting.
| Endpoints | Aggregate BW | Layer Transfer | Result |
|---|---|---|---|
| 1 | 64 GB/s | 27.3 ms | ✗ GPU stalls (5× slower) |
| 3 | 192 GB/s | 9.1 ms | ⚠ Borderline |
| 5 | 320 GB/s | 5.5 ms | ✓ Prefetch beats compute |
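The table's transfer times follow directly from dividing layer size by aggregate link bandwidth; a sketch using the figure's numbers:

```python
LAYER_BYTES = 1.75e9          # Figure 12: 1.75 GB per layer
COMPUTE_MS = 5.5              # per-layer compute time
PER_ENDPOINT_GBPS = 64        # CXL x16 Gen5 per endpoint

def transfer_ms(n_endpoints):
    """Time to stage one layer across n aggregated CXL links."""
    return LAYER_BYTES / (n_endpoints * PER_ENDPOINT_GBPS * 1e9) * 1e3

for n in (1, 3, 5):
    t = transfer_ms(n)
    status = "prefetch hides transfer" if t <= COMPUTE_MS else "GPU stalls"
    print(f"{n} endpoints: {t:.1f} ms -> {status}")
```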
🖥 GPU Compute
Layer N
Layer N+1
Layer N+2
Layer N+3
Pipeline Efficiency
With 5 endpoints, prefetch completes before compute finishes.
Zero GPU stalls.
Figure 13
Software Integration Stack
↓
Runtime Layer
libcxl_kv
KV Allocation API
Hint Interface
Policy Params
↓
Driver Layer
NVIDIA Driver
CXL.mem Support
↓
Hardware Layer
CXL Switch
Multi-Endpoint
No kernel changes required. CXL memory appears as normal GPU-accessible memory. Framework changes limited to allocator layer.
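A hypothetical shape for the libcxl_kv runtime layer. None of these method names are specified beyond the figure's labels ("KV Allocation API", "Hint Interface", "Policy Params"); this only illustrates how an allocator-layer integration might look.

```python
# Hypothetical libcxl_kv-style runtime API (names are illustrative).
class CxlKvRuntime:
    def __init__(self):
        self.policy = {"alpha": 0.3, "prefetch_window": 512}   # assumed defaults
        self.allocations = {}

    def kv_alloc(self, request_id, layers, kv_heads, tokens, head_dim=128):
        """KV Allocation API: reserve endpoint-backed cache space (bytes)."""
        size = 2 * layers * kv_heads * head_dim * tokens * 2   # FP16 K and V
        self.allocations[request_id] = size
        return size

    def hint(self, request_id, phase):
        """Hint Interface: e.g. announce prefill vs decode to the endpoint."""
        return (request_id, phase)

    def set_policy(self, **params):
        """Policy Params: tune EMA alpha, prefetch window, etc."""
        self.policy.update(params)

rt = CxlKvRuntime()
print(rt.kv_alloc("req-1", layers=80, kv_heads=8, tokens=4096))
```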
Figure 14
Performance Sensitivity to Cache Hit Rate
Critical Threshold: Below 75% hit rate, flash access latency dominates. EMA + RoPE prefetch maintains 85%+ hit rate for typical workloads.
| Workload | P50 | P95 | P99 | P99.9 |
|---|---|---|---|---|
| Conversational (4K avg) | 8 ms | 15 ms | 28 ms | 85 ms |
| Document QA (32K avg) | 25 ms | 45 ms | 95 ms | 250 ms |
| Long-context (128K) | 45 ms | 120 ms | 350 ms | 1.2 s |
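The sensitivity to hit rate can be illustrated with a simple average-latency model using the Figure 4 tier latencies (~250 ns CXL hit, ~25 μs flash miss). This average-only model is an approximation; the figure's 75% threshold also reflects tail behavior.

```python
CXL_NS = 250        # warm-tier hit (Figure 4)
FLASH_NS = 25_000   # cold-tier miss (Figure 4)

def effective_latency_ns(hit_rate):
    """Expected per-access latency under a two-level hit/miss model."""
    return hit_rate * CXL_NS + (1 - hit_rate) * FLASH_NS

for h in (0.95, 0.85, 0.75, 0.50):
    print(f"hit rate {h:.0%}: {effective_latency_ns(h) / 1000:.2f} us")
```

At 75% the expected latency is already ~26× the pure-CXL case, which is why flash access dominates below that point.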
Figure 15
Power & Thermal Analysis
🔥
2,800 W
4× B200 GPUs
Liquid cooling required
❄
1,100 W
1× B200 + 5× Endpoints
Air cooling sufficient
| Endpoint Component | Typical Power | Peak Power |
|---|---|---|
| DDR5 (8 channels) | 40 W | 60 W |
| ARM A78 cores (8×) | 15 W | 25 W |
| CXL PHY + controller | 12 W | 18 W |
| UCIe interface | 8 W | 12 W |
| NVMe controller | 5 W | 8 W |
| Total per Endpoint | 80 W | 123 W |
Figure 16
Total Cost of Ownership (3-Year)
4× B200 GPUs
| Hardware | $240,000 |
| Power (3yr) | $147,000 |
| Cooling | $50,000 |
| Rack Space | $36,000 |
| Total | $473,000 |
1× B200 + 5× Endpoints
| Hardware | $42,500 |
| Power (3yr) | $58,000 |
| Cooling | $10,000 |
| Rack Space | $18,000 |
| Total | $128,500 |
Break-Even Analysis
Endpoint architecture becomes cost-effective when:
context_length > 16K tokens AND
request_rate > 10 req/min
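The TCO totals and the break-even condition above, restated as code (the cost lines come straight from the Figure 16 tables):

```python
def tco(hardware, power_3yr, cooling, rack):
    return hardware + power_3yr + cooling + rack

baseline = tco(240_000, 147_000, 50_000, 36_000)   # 4x B200 GPUs
endpoint = tco(42_500, 58_000, 10_000, 18_000)     # 1x B200 + 5x endpoints

def endpoint_wins(context_tokens, req_per_min):
    """Figure 16 break-even condition."""
    return context_tokens > 16_000 and req_per_min > 10

print(baseline, endpoint, f"{baseline / endpoint:.1f}x cheaper")
print(endpoint_wins(128_000, 30))
```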
Figure 17
The Innovation Gap
Existing solutions each address a piece of the problem; none combines all four innovations:
📢
Per-Head Tracking
640 independent queues
📈
EMA Scoring
Attention-aware eviction
🧭
RoPE Prefetch
Position locality
🧠
Endpoint AI
Controller-resident logic
| Existing Solution | Type | What's Missing |
|---|---|---|
| Samsung CMM-D/CMM-B | CXL 2.0 Type-3 | No compute, no intelligence |
| XConn Apollo + GISMO | CXL 3.0 Switch | Pooling only, no eviction policy |
| vLLM PagedAttention | Software | Still GPU-memory limited |
| FlexGen | CPU/Disk offload | High latency (~10s) |
| InfiniGen | Speculative prefetch | CPU-based, limited bandwidth |
| CXL-SpecKV | CXL + speculation | No per-head tracking |
Figure 18
Summary: Key Results
| Component | Specification |
|---|---|
| Interface | CXL 3.0 Type-3 (memory expander) |
| Internal bandwidth | UCIe: 1+ TB/s |
| External bandwidth | CXL: 32-64 GB/s per link |
| Memory capacity | DDR5: 256-512 GB per endpoint |
| Compute | ARM/RISC-V cores for policy execution |
| Tracking granularity | Per KV-head per layer (640 queues) |
| Eviction policy | EMA attention score + recency decay |
| Prefetch strategy | RoPE-aware window [P−W, P+W] |
| Metadata overhead | ~640 MB for 128K context (~1.6%) |
The Core Insight
GPU handles parallel arithmetic. Endpoint handles memory management.
The division matches each architecture to its strengths—enabling long-context LLM inference at scale.