Technical Appendix

KV-Cache Offloading for LLM Inference — Visual Reference

1. Transformer Architecture

Llama-70B consists of 80 identical layers. Each layer performs attention followed by a feed-forward transformation.

Diagram 1.1 — Layer Structure
Input Embedding → [ Self-Attention → Feed-Forward ] × 80 layers → Output Logits

Architecture Parameters
  Layers        80
  Hidden dim    8,192
  Query heads   64
  KV heads      8
  Head dim      128
  FFN dim       28,672
  Parameters    70B

2. Attention Mechanism

Each token computes Query, Key, and Value vectors. Attention scores determine how much each previous token contributes to the output.

Diagram 2.1 — Q, K, V Projections
Input x (8,192 dims)
  → W_Q → Q (128d per head)
  → W_K → K (128d per head)
  → W_V → V (128d per head)

What each vector represents:
  Q — "What am I looking for?"
  K — "What do I contain?"
  V — "What do I contribute?"
Diagram 2.2 — Attention Score Computation
Token "France" attending to previous tokens:

  Position  Dot product                 After softmax normalization
  0         Q_france · K_the     = 0.12    6%  ("The")
  1         Q_france · K_capital = 0.87   52%  ("capital")
  2         Q_france · K_of      = 0.23   14%  ("of")
  3         Q_france · K_france  = 0.45   28%  ("France")

Output = weighted sum of V vectors
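The score-then-softmax-then-weighted-sum pipeline can be sketched in a few lines of Python. This is a toy single-query example: the V vectors are made up, and the raw scores are softmaxed directly, so the resulting percentages differ from the stylized values in Diagram 2.2.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    # scores: raw Q·K dot products, one per cached token
    # values: V vectors, one per cached token
    weights = softmax(scores)
    dim = len(values[0])
    # Output = weighted sum of V vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Dot products from Diagram 2.2: "France" attending to "The capital of France"
scores = [0.12, 0.87, 0.23, 0.45]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # toy 2-d V vectors
weights, output = attend(scores, values)
```

As in the diagram, "capital" (score 0.87) receives the largest softmax weight.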

3. KV-Cache Structure

The KV-cache stores Key and Value vectors for all processed tokens, eliminating redundant computation during generation.

Diagram 3.1 — KV-Cache Organization
Layer 1:  K — heads h0 … h7 | V — heads h0 … h7
  ⋮ repeated for all 80 layers ⋮
Size per token = L × H_kv × d_head × 2 (K and V) × bytes = 80 × 8 × 128 × 2 × 2 = 327,680 bytes ≈ 320 KB
Diagram 3.2 — KV-Cache Size Scaling
  Context       Cache size
  4K tokens       1.3 GB
  32K tokens       10 GB
  128K tokens      41 GB
  512K tokens     164 GB
  1M tokens       328 GB
B200 HBM capacity: 192 GB
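The per-token size and the scaling table follow from straightforward arithmetic. A minimal sketch, assuming FP16 KV entries and decimal gigabytes (1 GB = 1e9 bytes), which matches the larger figures in Diagram 3.2:

```python
# Llama-70B KV-cache geometry from Section 1.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
K_AND_V = 2        # one K and one V entry per head per token
BYTES_FP16 = 2

def bytes_per_token():
    return LAYERS * KV_HEADS * HEAD_DIM * K_AND_V * BYTES_FP16

def cache_gb(tokens):
    # Decimal gigabytes (1 GB = 1e9 bytes).
    return bytes_per_token() * tokens / 1e9

B200_HBM_GB = 192
```

At 1M tokens the cache alone (≈328 GB) exceeds the 192 GB of B200 HBM.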

4. Prefill vs Decode

The two phases of inference have fundamentally different computational characteristics.

Diagram 4.1 — Phase Comparison

Prefill Phase — processing the prompt ("The capital of France is"):
  ✓ All tokens processed in parallel
  ✓ High arithmetic intensity
  ✓ Compute-bound

Decode Phase — generating the response ("The capital ... → Paris"):
  ✗ One token generated at a time
  ✗ Must read the entire KV-cache per token
  ✗ Memory-bandwidth-bound
Diagram 4.2 — Decode Memory Access Pattern

To generate 1 token:
  Model Weights   140 GB
  KV-Cache         41 GB (at 128K context)
  Total reads     181 GB

Bandwidth requirement at 20 tokens/sec:
  181 GB × 20 = 3.62 TB/s — B200 provides 8 TB/s ✓
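The bandwidth requirement is a simple product. A sketch using the document's numbers, with decimal units assumed:

```python
WEIGHTS_GB = 140.0        # model weights read once per decode step
KV_CACHE_GB = 41.0        # KV-cache at 128K context
TOKENS_PER_SEC = 20.0     # target decode rate
B200_BW_TBPS = 8.0        # B200 HBM bandwidth

reads_per_token_gb = WEIGHTS_GB + KV_CACHE_GB          # 181 GB per token
required_tbps = reads_per_token_gb * TOKENS_PER_SEC / 1000.0
fits = required_tbps < B200_BW_TBPS
```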

5. The Memory Wall

GPU memory capacity, not compute or bandwidth, becomes the limiting factor with multiple users.

Diagram 5.1 — Single-User Memory Layout

B200 HBM — 192 GB:
  Model Weights — 140 GB
  KV-Cache — 41 GB
  Free — ~10 GB
✓ A single user at 128K context fits
Diagram 5.2 — Multi-User Memory Explosion

  Users   Needed (weights + KV per user)
  2       222 GB
  4       304 GB
  8       468 GB

B200 capacity: 192 GB
8 users need 468 GB — 2.4× over capacity
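The multi-user totals follow from one shared copy of the weights plus one KV-cache per user; a sketch of the arithmetic behind Diagram 5.2:

```python
WEIGHTS_GB = 140       # shared across all users
KV_PER_USER_GB = 41    # one 128K-context cache per user
HBM_GB = 192           # B200 capacity

def memory_needed_gb(users):
    # Weights are loaded once; each user brings a full KV-cache.
    return WEIGHTS_GB + users * KV_PER_USER_GB
```

memory_needed_gb(8) gives 468 GB, about 2.4× the 192 GB of HBM.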

6. Attention Locality

Empirical measurement reveals that attention concentrates heavily on recent tokens.

Diagram 6.1 — Attention Distribution (10K Context)

  Position range   Share of attention
  0 – 1K            5%
  1K – 9K          15%
  9K – 10K         80%

Key insight: ~80% of attention goes to the ~10% most recent tokens.
Diagram 6.2 — Attention Heatmap (Simplified)
[Heatmap of the current token attending over the context: attention is low at early positions (position 0), medium in the middle, and high at the most recent positions (position N).]

7. RoPE: Why Locality Emerges

Rotary Position Embedding creates locality as a geometric property of how positions are encoded.

Diagram 7.1 — RoPE Rotation Concept

Each dimension pair i rotates at its own frequency; position m rotates pair i by m · θᵢ.

Frequency formula: θᵢ = 10000^(−2i/d)
  θ₀  = 1.0     (fast)   → local patterns
  θ₃₂ = 0.01    (medium)
  θ₆₃ = 0.0001  (slow)   → global patterns
Diagram 7.2 — Distance-Dependent Decay

Average cosine factor by distance:
  Distance 1        0.99
  Distance 10       0.95
  Distance 100      0.71
  Distance 1,000    0.32
  Distance 10,000   0.11

score(m, n) ∝ Σᵢ cos((m − n) · θᵢ)
  Small distance → cos ≈ 1 → high attention
  Large distance → cos terms oscillate and cancel → lower attention
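The position-only part of this decay can be sketched by averaging cos((m − n) · θᵢ) over the 64 dimension pairs. This is illustrative only: the measured averages in Diagram 7.2 also reflect learned Q/K content, so exact values will differ.

```python
import math

D_HEAD = 128            # head dimension → 64 rotating pairs
PAIRS = D_HEAD // 2

def theta(i):
    # RoPE frequency for dimension pair i: 10000^(-2i/d)
    return 10000.0 ** (-2.0 * i / D_HEAD)

def decay_factor(distance):
    # Average cosine factor at a given token distance (m - n).
    return sum(math.cos(distance * theta(i)) for i in range(PAIRS)) / PAIRS
```

Nearby positions average close to 1; distant positions average much lower because the fast-rotating pairs land at effectively random phases.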

8. Attention Head Types

Different attention heads specialize for different functions, creating varied access patterns.

Diagram 8.1 — Head Specialization

  Recency heads     ~40% of heads — focus on the last 50–200 tokens
  Anchor heads      ~15% of heads — always check positions 0–100
  Retrieval heads   ~25% of heads — content-based, position-independent
  Syntactic heads   ~20% of heads — follow grammatical dependencies

Implication: a single caching policy cannot satisfy all heads; per-head tracking is required.

9. CXL Architecture

CXL provides memory expansion at far lower cost than HBM, but also at far lower bandwidth.

Diagram 9.1 — System Topology

NVIDIA B200:
  HBM capacity   192 GB
  Bandwidth      8 TB/s
  Latency        100 ns
    │
    │  CXL 3.0 ×16 — 64 GB/s per link
    ▼
CXL Switch / Fabric
  EP 0   256 GB
  EP 1   256 GB
  EP 2   256 GB
  EP 3   256 GB
Total CXL: 1 TB @ 256 GB/s aggregate
Diagram 9.2 — HBM vs CXL Comparison

  Metric        GPU HBM   CXL DRAM   Ratio
  Bandwidth     8 TB/s    256 GB/s   31× less
  Latency       100 ns    250 ns     2.5× more
  Capacity      192 GB    1 TB       5× more
  Cost per GB   ~$50      ~$5        10× less

CXL tradeoff: 10× cheaper per GB, but 31× lower bandwidth. Viable only if most accesses hit HBM.

10. Tiered Memory Hierarchy

The caching system places data in tiers based on access patterns.

Diagram 10.1 — Three-Tier Architecture

  Tier 0 — HBM pinned      anchor zone + critical tokens      5 GB   100 ns
  Tier 1 — HBM evictable   recent + high-attention tokens    37 GB   100 ns
  Tier 2 — CXL DRAM        cold tokens, low attention       280 GB   250 ns

Diagram 10.2 — Memory Layout (8 Users × 128K)

HBM — 192 GB:
  Model Weights — 140 GB
  Pinned KV + Hot KV + Activations — remainder
CXL — 1 TB:
  Cold KV — 280 GB
  Available — 720 GB

11. EMA Scoring Algorithm

Exponential Moving Average tracks which tokens actually receive attention over time.

Diagram 11.1 — EMA Update Rule

score_t = α · attention_t + (1 − α) · score_{t−1}

  α = 0.2       decay factor
  3.1 steps     half-life
  ~155 ms       half-life at 20 tok/s
Diagram 11.2 — EMA Evolution Example

System instruction token — position 50, "helpful"
Consistent attention from anchor heads:
  Step 0:   attn = 0.04  → score = 0.008
  Step 1:   attn = 0.03  → score = 0.012
  Step 2:   attn = 0.05  → score = 0.020
  ...
  Step 100:              → score = 0.040
→ Stays HOT

Generic middle token — position 45,000, "the"
Rarely attended:
  Step 0:   attn = 0.001 → score = 0.0002
  Step 1:   attn = 0.000 → score = 0.0002
  Step 2:   attn = 0.002 → score = 0.0005
  ...
  Step 100:              → score = 0.001
→ Evict to CXL
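The update rule and the system-instruction trace above can be replayed directly; a minimal sketch, rounding to three decimals as the diagram does:

```python
import math

ALPHA = 0.2   # decay factor from Diagram 11.1

def ema_update(prev_score, attention, alpha=ALPHA):
    # score_t = alpha * attention_t + (1 - alpha) * score_{t-1}
    return alpha * attention + (1 - alpha) * prev_score

# Half-life: steps for an old score to decay to 50% with no new attention
half_life_steps = math.log(0.5) / math.log(1.0 - ALPHA)

# Replay the system-instruction-token trace from Diagram 11.2
scores, s = [], 0.0
for attn in (0.04, 0.03, 0.05):
    s = ema_update(s, attn)
    scores.append(round(s, 3))
```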

12. Priority Scoring Formula

Final placement decisions combine multiple signals into a single priority score.

Diagram 12.1 — Scoring Components

P(p) = 0.25 · R(p) + 0.55 · E(p) + 0.20 · N(p)

  E(p)  EMA score     55%
  R(p)  recency       25%
  N(p)  anchor zone   20%

Diagram 12.2 — Tier Assignment Thresholds

  P < 0.3         → Tier 2 (CXL)
  0.3 ≤ P < 0.6   → Tier 1 (HBM)
  P ≥ 0.6         → Tier 0 (pinned)
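The weighted score and the tier cut-offs can be sketched as follows, assuming (as Diagram 12.1 implies) that R, E, and N are each normalized to [0, 1]:

```python
W_RECENCY, W_EMA, W_ANCHOR = 0.25, 0.55, 0.20   # weights from Diagram 12.1

def priority(recency, ema, anchor):
    # P(p) = 0.25 * R(p) + 0.55 * E(p) + 0.20 * N(p)
    return W_RECENCY * recency + W_EMA * ema + W_ANCHOR * anchor

def tier(p):
    # Thresholds from Diagram 12.2: higher priority → hotter tier.
    if p >= 0.6:
        return 0   # HBM pinned
    if p >= 0.3:
        return 1   # HBM evictable
    return 2       # CXL
```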

13. Per-Head Tracking

Scores are maintained separately for each KV-head to handle head specialization.

Diagram 13.1 — Per-Head Score Matrix

  Position     Head 0     Head 1    Head 2       Heads   Aggregate  Decision
               (recency)  (anchor)  (retrieval)  3–7
  0 (system)   0.001      0.089     0.012        ...     0.089      HBM
  45,000       0.000      0.002     0.003        ...     0.003      CXL
  99,950       0.082      0.004     0.031        ...     0.082      HBM

P_aggregate(p) = max over all heads h of P_h(p)

A position stays in HBM if ANY head needs it; it is evicted only when NO head has recent access.
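Max-aggregation across heads is a one-liner. Note the HBM/CXL cut-off of 0.01 below is a hypothetical value chosen for illustration, not one given in the text:

```python
EVICT_BELOW = 0.01   # hypothetical eviction threshold, not from the text

def aggregate(head_scores):
    # A position is kept as hot as its MOST interested head.
    return max(head_scores)

def decision(head_scores):
    return "HBM" if aggregate(head_scores) >= EVICT_BELOW else "CXL"
```

On the rows of Diagram 13.1, position 0 aggregates to 0.089 (kept in HBM) while position 45,000 aggregates to 0.003 (evicted to CXL).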

14. Prefetching Strategy

Predictive prefetch loads anticipated tokens from CXL before they're needed.

Diagram 14.1 — Prefetch Targets (current position m)

  1. Anchor zone [0, 100]
  2. Recent window [m − 200, m − 1]
  3. High-EMA positions

Diagram 14.2 — Prefetch Timing Budget

  Token generation   50 ms
  − Compute          20 ms
  = Prefetch window  30 ms

Prefetch capacity @ 256 GB/s: 30 ms × 256 GB/s = 7.68 GB ≈ 24,000 positions
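The budget arithmetic can be checked directly; a sketch with decimal units assumed:

```python
TOKEN_INTERVAL_MS = 50.0       # per-token generation time at 20 tok/s
COMPUTE_MS = 20.0              # portion spent computing
CXL_BW_GBPS = 256.0            # aggregate CXL bandwidth
KV_BYTES_PER_POSITION = 320e3  # ~320 KB of K/V per token position

window_s = (TOKEN_INTERVAL_MS - COMPUTE_MS) / 1000.0
budget_gb = CXL_BW_GBPS * window_s                   # GB movable per window
positions = budget_gb * 1e9 / KV_BYTES_PER_POSITION  # token positions
```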

15. Hit Rate Progression

Each algorithmic improvement increases the HBM hit rate.

Diagram 15.1 — Algorithm Contribution (HBM hit rate)

  LRU baseline          70%
  + Anchor pinning      78%
  + EMA scoring         85%
  + Per-head tracking   91%
  + Prefetching         95%
Diagram 15.2 — Effective Latency

L_eff = hit_rate × L_HBM + (1 − hit_rate) × L_CXL

At 95% hit rate: 0.95 × 100 ns + 0.05 × 250 ns = 107.5 ns
Overhead vs pure HBM: (107.5 − 100) / 100 = +7.5%
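The effective-latency blend is directly computable; a sketch using the latencies from Diagram 9.2:

```python
L_HBM_NS = 100.0   # HBM access latency
L_CXL_NS = 250.0   # CXL access latency

def effective_latency_ns(hit_rate):
    # L_eff = hit_rate * L_HBM + (1 - hit_rate) * L_CXL
    return hit_rate * L_HBM_NS + (1.0 - hit_rate) * L_CXL_NS

overhead_pct = (effective_latency_ns(0.95) - L_HBM_NS) / L_HBM_NS * 100.0
```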

16. Final Results

Diagram 16.1 — System Comparison

                   Without CXL   With CXL + Tiering
  Memory           192 GB        1.2 TB
  Users @ 128K     1             8+
  Cost (8 users)   $70K          $45K
  Hardware         2× B200       1× B200 + CXL

Diagram 16.2 — Key Metrics

  6×     memory expansion
  8×     user capacity
  36%    cost reduction
  7.5%   latency overhead
  95%    HBM hit rate
  33     tokens/sec/user
Slug Architecture Research — December 2025