CXL 3.0 · UCIe · UEC

KV Cache Offloading for LLM Inference

Distributed endpoint architecture with intelligent caching, attention-aware eviction, and CXL.mem acceleration

Per-Head Eviction
EMA Attention Scoring
RoPE-Aware Prefetch
Sam Pooni
UC Santa Cruz · zett.ai · San Jose, CA
Version 3.0 · December 2025
Figure 1

The Memory Wall Problem

Large language model inference faces a fundamental bottleneck: the memory required to serve long-context requests vastly exceeds what fits in GPU high-bandwidth memory (HBM).

192 GB
B200 HBM Capacity
140 GB
Llama-70B Weights
41 GB
KV-Cache @ 128K
468 GB
Total for 8 Users
CAPACITY WALL
8 users @ 128K exceeds memory by
2.4×
BANDWIDTH WALL
CXL vs PCIe penalty
65×
NVIDIA B200 Specifications
4,500
TFLOPS (FP16)
8
TB/s HBM3e
192
GB Capacity
64
GB/s PCIe 5.0
Figure 2

KV-Cache Size vs Context Length

KV-cache = 2 × L × H × D × S × bytes
Llama-70B: 2 × 80 × 8 × 128 × S × 2 = 320 KB/token
Context Length   KV-Cache Size   % of B200 HBM   Status
4K tokens        1.3 GB          0.7%            ✓ Fits easily
32K tokens       10 GB           5.2%            ✓ Comfortable
128K tokens      41 GB           21%             ⚠ Tight
512K tokens      164 GB          85%             ✗ Impossible
1M tokens        328 GB          171%            ✗ Impossible
The Scaling Crisis: Context lengths are expanding rapidly (GPT-4: 128K, Claude: 200K, Gemini: 1M+). KV-cache requirements grow linearly with context, but GPU memory remains fixed.
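The Figure 2 formula can be sanity-checked in a few lines. This is a minimal sketch using the Llama-70B shape given in the text (80 layers, 8 KV heads under GQA, head dimension 128, FP16); the function names are illustrative.

```python
# Per-token KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes.
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV-cache appended per token (Llama-70B defaults from the text)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_cache_gib(context_tokens, **shape):
    """Total KV-cache footprint in GiB for a single sequence."""
    return kv_bytes_per_token(**shape) * context_tokens / 2**30

per_token = kv_bytes_per_token()      # 327,680 B = 320 KB/token
size_128k = kv_cache_gib(128 * 1024)  # ~40 GiB for a 128K-token sequence
```

The linear growth in context length is the whole story: doubling the context doubles the cache, while HBM capacity stays fixed.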
Figure 3

CXL Endpoint Architecture

A distributed endpoint is a CXL 3.0 Type-3 device combining memory, compute, and control logic into a single package.

DISTRIBUTED ENDPOINT PACKAGE (UCIe Integrated)
CXL 3.0 PROTOCOL ENGINE
CXL.mem
HDM-D/HDM-DB
CXL.io
Mailbox/Config
CXL.cache
Coherency
UCIe 1.1 — 1+ TB/s Die-to-Die Interconnect
MEMORY CONTROLLER
CH 0-3: DDR5-6400
CH 4-7: DDR5-6400
8ch × 51.2 GB/s = 409.6 GB/s
COMPUTE CHIPLET
Core 0-3: ARM A78 @ 3GHz
Core 4-7: ARM A78 @ 3GHz
L3 Cache — 8 MB Shared
CONTROL & POLICY
EMA Scoring Engine
Per-Head Access Tracker
RoPE Prefetch Queue
512 GB
DDR5 DRAM
16 TB
NVMe Flash
~250 ns
CXL Latency
80 W
Typical Power
Figure 4

Tiered Memory Architecture

Hot (HBM)
192 GB
8 TB/s
Warm (CXL)
1 TB
~250 ns
Cold (Flash)
16 TB
~25 μs
Total
~17 TB

Bandwidth Hierarchy

UCIe Internal
1+ TB/s
DDR5 Local
409.6 GB/s
CXL External
32-64 GB/s
NVMe Flash
~14 GB/s
Key Insight
GPU sees unified address space. Endpoint manages tier placement transparently.
CXL.mem provides load/store semantics—no explicit I/O commands, no DMA setup, no driver intervention.
Figure 5

CXL 3.0 Coherency Protocol

CXL 3.0 provides hardware-managed coherency through the Back-Invalidate (BI) protocol.

Invalid
GPU cache empty
Endpoint authoritative
Shared
GPU has read copy
Endpoint authoritative
Exclusive
GPU can write
Endpoint stale
Modified
GPU has dirty data
Must writeback

GPU → Endpoint Writes

During prefill, the GPU writes new KV entries. The endpoint receives each write with its data over CXL.mem (an M2S RwD message), updates local DRAM, and clears stale metadata.

Endpoint → GPU Invalidation

When the endpoint evicts entries to flash, it issues a BI-Snoop (BISnp). The GPU must write back any dirty data before acknowledging.

Concurrent SM Access: Multiple GPU SMs accessing the same KV-head are serialized at L2 cache. Endpoint sees unified coherent view—no per-SM tracking required.
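The state diagram and the two flows above can be condensed into a small sketch. The state names follow Figure 5; the class and its methods are illustrative, not part of the CXL specification.

```python
# Minimal model of the Figure 5 coherency states and the Back-Invalidate
# (BI) flow: before the endpoint evicts a line to flash, a Modified copy
# in the GPU cache must be written back.
INVALID, SHARED, MODIFIED = "I", "S", "M"

class CacheLine:
    def __init__(self):
        self.gpu_state = INVALID
        self.endpoint_authoritative = True   # endpoint copy is current

    def gpu_read(self):
        if self.gpu_state == INVALID:
            self.gpu_state = SHARED          # endpoint stays authoritative

    def gpu_write(self):
        self.gpu_state = MODIFIED            # endpoint copy is now stale
        self.endpoint_authoritative = False

    def bi_snoop(self):
        """Endpoint-issued BI-Snoop prior to evicting the line to flash."""
        if self.gpu_state == MODIFIED:
            self.endpoint_authoritative = True  # GPU writes back dirty data
        self.gpu_state = INVALID             # GPU drops its copy, then acks

line = CacheLine()
line.gpu_write()     # a decode step updates a KV entry in GPU cache
line.bi_snoop()      # endpoint reclaims the line before flash eviction
```

The Exclusive state is omitted for brevity; in this sketch a write moves the line straight to Modified.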
Figure 6

Attention Mechanisms: MHA vs GQA vs MQA

MHA
Multi-Head Attention
Q heads = K heads = V heads
64 KV heads
Full memory cost
GQA
Grouped Query Attention
Multiple Q share K/V
8 KV heads
8× memory savings
MQA
Multi-Query Attention
All Q share single K/V
1 KV head
Quality tradeoff
8 KV-heads × 80 layers = 640 independent eviction policies
640
LRU Queues
131K
Entries/Queue
8 B
Bytes/Entry
640 MB
Total Metadata
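The metadata budget in Figure 6 follows directly from the queue count. A back-of-envelope check, using only numbers stated in the text:

```python
# One eviction queue per KV-head per layer; one 8-byte entry per cached token.
kv_heads, layers = 8, 80
queues = kv_heads * layers            # 640 independent eviction queues
entries_per_queue = 128 * 1024        # one entry per token at 128K context
entry_bytes = 8

total_mib = queues * entries_per_queue * entry_bytes / 2**20   # 640 MiB
overhead_pct = 100 * (total_mib / 1024) / 40.0   # ~1.6% of a ~40 GiB KV-cache
```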
Figure 7

EMA-Based Eviction Algorithm

Why LRU fails: LRU assumes recent access predicts future access. Attention violates this—a token at position 1,000 may not be accessed until position 100,000, but remains critically important.

score_ema = α × new_score + (1 − α) × score_ema
α → 1.0 (Reactive)
Trust recent scores. Good for bursty access patterns.
α → 0.1 (Stable)
Trust history. Good for persistent anchors.
priority = (1 − score_ema) × recency_decay
Higher priority → evict sooner
Token A: Important Anchor
Position: 1,024
Last access: 50 steps ago
score_ema: 0.211
recency_decay: 0.049
priority: 0.039
✓ KEEP IN CACHE
Token B: Low Attention
Position: 45,678
Last access: 2,000 steps ago
score_ema: 0.08
recency_decay: 0.865
priority: 0.796
🗑 EVICT TO FLASH
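The two worked examples above can be reproduced with a short sketch. The EMA update and the priority formula come from the text; the exponential form of recency_decay, with a time constant of 1,000 steps, is an assumption chosen because it matches both Token A and Token B.

```python
import math

def ema_update(score_ema, new_score, alpha=0.3):
    """EMA of attention scores; larger alpha trusts recent scores more."""
    return alpha * new_score + (1 - alpha) * score_ema

def recency_decay(steps_since_access, tau=1000.0):
    """Assumed form: grows from 0 (just touched) toward 1 (long untouched)."""
    return 1.0 - math.exp(-steps_since_access / tau)

def evict_priority(score_ema, steps_since_access):
    """Higher priority -> evict sooner."""
    return (1.0 - score_ema) * recency_decay(steps_since_access)

p_anchor = evict_priority(0.211, 50)    # Token A: ~0.039 -> keep in cache
p_stale = evict_priority(0.08, 2000)    # Token B: ~0.796 -> evict to flash
```

Note how the anchor survives despite 50 steps without access: its non-trivial score_ema suppresses the priority, which is exactly where plain LRU would go wrong.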
Figure 8

RoPE-Aware Prefetch Strategy

🔄 Rotary Encoding
RoPE encodes position by rotating Q/K vectors. Nearby positions have similar rotations → higher dot product → higher attention.
📍 Locality Bias
On average, ~70% of attention mass falls within ±W positions of the query token.
🎯 Predictable Access
If GPU requests position P, it will likely need P±W soon. Prefetch proactively.
Prefetch Rule: GPU accesses position P → Prefetch [P − W, P + W]
85%
Cache Hit Rate
72%
Attention Captured
3.2×
Latency Reduction
1.4×
BW Overhead
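The prefetch rule itself is a one-liner. A minimal sketch, where the window size and the resident-set dedup are illustrative choices rather than values from the text:

```python
# Figure 8 rule: on a demand access to position P, enqueue [P - W, P + W],
# skipping positions already resident in the warm tier.
def prefetch_window(p, w, seq_len, already_resident):
    lo, hi = max(0, p - w), min(seq_len - 1, p + w)
    return [pos for pos in range(lo, hi + 1) if pos not in already_resident]

resident = {1000}  # the demand-fetched position itself
to_fetch = prefetch_window(1000, w=4, seq_len=131072,
                           already_resident=resident)
# fetches positions 996..1004, except the already-resident 1000
```

Deduplicating against resident entries is what keeps the bandwidth overhead bounded (the 1.4× figure above) even when consecutive queries hit overlapping windows.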
Figure 9

Prefill vs Decode Phase Characteristics

⚡ Prefill Phase
Bottleneck: Compute-bound
Access Pattern: Sequential writes
KV Operations: Write-only (populate cache)
Arithmetic Intensity: High (~100 FLOP/byte)
Batching: Full sequence parallel
Strategy: Stream writes directly to CXL. No eviction needed—all entries are new.
🔄 Decode Phase
Bottleneck: Memory-bound
Access Pattern: Random reads + 1 write
KV Operations: Read all + append one
Arithmetic Intensity: Low (~0.5 FLOP/byte)
Batching: Token-by-token
Strategy: Active EMA eviction + RoPE prefetch. This is where caching matters.
Phase Detection
Endpoint monitors write/read ratio. When reads exceed writes by 10×, switch to decode-optimized policy.
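The phase detector described above is simple enough to sketch. The 10× read/write threshold comes from the text; the sliding-window length and class shape are illustrative.

```python
from collections import deque

class PhaseDetector:
    """Flips to the decode policy when reads outnumber writes 10x."""
    def __init__(self, threshold=10.0, window=1024):
        self.threshold = threshold
        self.ops = deque(maxlen=window)   # True = read, False = write

    def record(self, is_read):
        self.ops.append(is_read)

    def phase(self):
        if not self.ops:
            return "prefill"
        reads = sum(self.ops)
        writes = len(self.ops) - reads
        if writes == 0 or reads / writes >= self.threshold:
            return "decode"
        return "prefill"

det = PhaseDetector()
for _ in range(200):
    det.record(False)          # prefill: streaming sequential KV writes
prefill_phase = det.phase()
for _ in range(1000):
    det.record(True)           # decode: reads of cached KV dominate
decode_phase = det.phase()
```

A sliding window, rather than lifetime counters, lets the detector flip back to the prefill policy when a new long prompt arrives mid-stream.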
Figure 10

KV-Cache Quantization Support

Modern inference increasingly uses quantized KV-caches. The endpoint supports transparent compression:

FP16
Baseline
40 GB
FP8
<0.1% loss
20 GB
INT8
<0.5% loss
20 GB
INT4
~1% loss
10 GB
48
GB/s FP16→INT8
32
GB/s FP16→INT4
64
GB/s INT8→FP16
Transparent compression: GPU writes FP16 → Endpoint stores INT8 → GPU reads FP16. Compression invisible to inference stack.
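One plausible shape for the FP16→INT8 path is symmetric quantization with one scale per KV block. The text specifies only the formats and the transparency property, so the per-block scaling scheme below is an assumption:

```python
# Sketch: GPU writes FP16, endpoint stores INT8 plus one scale per block,
# and dequantizes on reads. Symmetric per-block scaling is an assumption.
def quantize_int8(values):
    """FP16 block -> (int8 codes, scale)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """INT8 block -> FP16-like floats served back to the GPU."""
    return [c * scale for c in codes]

kv_block = [0.5, -1.25, 0.031, 2.0]
codes, scale = quantize_int8(kv_block)
restored = dequantize_int8(codes, scale)   # each value within one scale step
```

Keeping the scale per block (rather than per tensor) bounds the error contribution of outlier keys, which is consistent with the sub-1% loss figures in the table above.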
Figure 11

Latency: CXL.mem vs PCIe Baseline

PCIe DMA Transfer
~13 μs
CPU in critical path
Driver + DMA setup overhead
CXL.mem Direct
~250 ns
Load/store semantics
Zero software overhead
65× Latency Improvement
By eliminating the software stack
Stage                        CXL.mem Latency
GPU MMU handling             ~50 ns
CXL protocol processing      ~30 ns
PCIe transmission            ~70 ns
Endpoint memory access       ~50 ns
Total                        ~250 ns
Figure 12

Layer Prefetch Pipeline

Single Endpoint Limitation: One CXL x16 Gen5 link provides 64 GB/s. For a 1.75 GB layer, the transfer takes 27.3 ms while the layer compute takes only 5.5 ms, so the GPU spends ~5× longer waiting than computing.
Endpoints   Aggregate BW   Layer Transfer   Result
1           64 GB/s        27.3 ms          ✗ GPU stalls (5× slower)
3           192 GB/s       9.1 ms           ⚠ Borderline
5           320 GB/s       5.5 ms           ✓ Prefetch beats compute
🖥 GPU Compute
Layer N
Layer N+1
Layer N+2
Layer N+3
📡 CXL Prefetch
N+1
N+2
N+3
N+4
Pipeline Efficiency
With 5 endpoints, prefetch completes before compute finishes. Zero GPU stalls.
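The endpoint-count table reduces to one division. A back-of-envelope check using only the numbers in Figure 12:

```python
# Layer transfer time over N aggregated CXL x16 Gen5 links (64 GB/s each)
# versus the 5.5 ms layer compute time, for the 1.75 GB layer from the text.
LAYER_GB = 1.75
COMPUTE_MS = 5.5
LINK_GBPS = 64.0

def transfer_ms(endpoints):
    return LAYER_GB / (endpoints * LINK_GBPS) * 1000.0

def gpu_stalls(endpoints):
    """True if the prefetch cannot hide behind layer compute."""
    return transfer_ms(endpoints) > COMPUTE_MS

t1 = transfer_ms(1)   # ~27.3 ms -> GPU stalls ~5x
t5 = transfer_ms(5)   # ~5.5 ms  -> prefetch keeps pace with compute
```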
Figure 13

Software Integration Stack

Application Layer
vLLM
PagedAttention
TensorRT-LLM
Plugin API
SGLang
RadixAttention
↓
Runtime Layer
libcxl_kv
KV Allocation API
CUDA UVM
Unified Memory
Hint Interface
Policy Params
↓
Driver Layer
CXL Driver
Linux 6.8+
NVIDIA Driver
CXL.mem Support
↓
Hardware Layer
B200 GPU
CXL 3.0 Host
CXL Switch
Multi-Endpoint
Endpoints
Type-3 Devices
No kernel changes required. CXL memory appears as normal GPU-accessible memory. Framework changes limited to allocator layer.
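Figure 13 names libcxl_kv but gives no API, so everything below is hypothetical: a sketch of what the allocator-layer surface could look like to a serving framework. None of these names are a real API.

```python
# Hypothetical libcxl_kv-style allocator surface. The framework asks for
# KV pages and passes policy hints; tier placement stays inside the endpoint.
class CxlKvAllocator:
    def __init__(self, page_bytes=2 * 1024 * 1024):
        self.page_bytes = page_bytes
        self.pages = {}                      # handle -> (layer, kv_head)

    def alloc_kv_page(self, layer, head):
        """Return an opaque handle; the backing tier is the endpoint's choice."""
        handle = len(self.pages)
        self.pages[handle] = (layer, head)
        return handle

    def hint(self, handle, ema_alpha=None, prefetch_window=None):
        """Advisory per-page policy hints (EMA alpha, RoPE window)."""
        return {"ema_alpha": ema_alpha, "prefetch_window": prefetch_window}

alloc = CxlKvAllocator()
h = alloc.alloc_kv_page(layer=0, head=3)
alloc.hint(h, ema_alpha=0.3, prefetch_window=256)
```

The key design point the sketch illustrates is the claim above: the framework never names a tier, so the integration stays confined to the allocator layer.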
Figure 14

Performance Sensitivity to Cache Hit Rate

Hit Rate   Latency
95%        28 ms
90%        45 ms
85%        72 ms
80%        98 ms
70%        185 ms
60%        340 ms
Critical Threshold: Below 75% hit rate, flash access latency dominates. EMA + RoPE prefetch maintains 85%+ hit rate for typical workloads.
Workload                  P50     P95      P99      P99.9
Conversational (4K avg)   8 ms    15 ms    28 ms    85 ms
Document QA (32K avg)     25 ms   45 ms    95 ms    250 ms
Long-context (128K)       45 ms   120 ms   350 ms   1.2 s
Figure 15

Power & Thermal Analysis

🔥
2,800 W
4× B200 GPUs
Liquid cooling required
❄
1,100 W
1× B200 + 5× Endpoints
Air cooling sufficient
Endpoint Component     Typical Power   Peak Power
DDR5 (8 channels)      40 W            60 W
ARM A78 cores (8×)     15 W            25 W
CXL PHY + controller   12 W            18 W
UCIe interface         8 W             12 W
NVMe controller        5 W             8 W
Total per Endpoint     80 W            123 W
2.5×
Power Efficiency
Air
Cooling Type
45°C
Junction Temp
1U
Form Factor
Figure 16

Total Cost of Ownership (3-Year)

GPU-Only (8×B200)
Hardware$240,000
Power (3yr)$147,000
Cooling$50,000
Rack Space$36,000
Total$473,000
1× GPU + 5× Endpoints
Hardware$42,500
Power (3yr)$58,000
Cooling$10,000
Rack Space$18,000
Total$128,500
3.7×
TCO Reduction
73%
TCO Savings
8 mo
Payback Period
$0.0004
Cost/Token
Break-Even Analysis
Endpoint architecture becomes cost-effective when: context_length > 16K tokens AND request_rate > 10 req/min
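The headline ratios follow from the two cost tables. A quick check using only the dollar figures stated above:

```python
# 3-year totals for the two builds from Figure 16, and the derived ratios.
gpu_only = {"hardware": 240_000, "power": 147_000,
            "cooling": 50_000, "rack": 36_000}
endpoint = {"hardware": 42_500, "power": 58_000,
            "cooling": 10_000, "rack": 18_000}

total_gpu = sum(gpu_only.values())               # $473,000
total_ep = sum(endpoint.values())                # $128,500
tco_reduction = total_gpu / total_ep             # ~3.7x
savings_pct = 100 * (1 - total_ep / total_gpu)   # ~73%
```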
Figure 17

The Innovation Gap

Existing solutions address pieces of the problem. Nobody combines all four innovations:

📢
Per-Head Tracking
640 independent queues
📈
EMA Scoring
Attention-aware eviction
🧭
RoPE Prefetch
Position locality
🧠
Endpoint AI
Controller-resident logic
Existing Solution      Type                   What's Missing
Samsung CMM-D/CMM-B    CXL 2.0 Type-3         No compute, no intelligence
XConn Apollo + GISMO   CXL 3.0 Switch         Pooling only, no eviction policy
vLLM PagedAttention    Software               Still GPU-memory limited
FlexGen                CPU/Disk offload       High latency (~10 s)
InfiniGen              Speculative prefetch   CPU-based, limited bandwidth
CXL-SpecKV             CXL + speculation      No per-head tracking
Figure 18

Summary: Key Results

65×
Latency vs PCIe
85%
Cache Hit Rate
3.7×
TCO Reduction
17 TB
Effective Capacity
Component              Specification
Interface              CXL 3.0 Type-3 (memory expander)
Internal bandwidth     UCIe: 1+ TB/s
External bandwidth     CXL: 32-64 GB/s per link
Memory capacity        DDR5: 256-512 GB per endpoint
Compute                ARM/RISC-V cores for policy execution
Tracking granularity   Per KV-head per layer (640 queues)
Eviction policy        EMA attention score + recency decay
Prefetch strategy      RoPE-aware window [P−W, P+W]
Metadata overhead      ~640 MB for 128K context (~1.6%)
The Core Insight
GPU handles parallel arithmetic. Endpoint handles memory management.
The division matches each architecture to its strengths—enabling long-context LLM inference at scale.