Technical Appendix
KV-Cache Offloading for LLM Inference — Visual Reference
1. Transformer Architecture
Llama-70B consists of 80 identical layers. Each layer performs attention followed by a feed-forward transformation.
Diagram 1.1 — Layer Structure
Input Embedding → [ Self-Attention → Feed-Forward ] × 80 layers → Output Logits
Architecture Parameters
| Parameter | Value |
| --- | --- |
| Layers | 80 |
| Hidden dim | 8,192 |
| Query heads | 64 |
| KV heads | 8 |
| Head dim | 128 |
| FFN dim | 28,672 |
| Parameters | 70B |
2. Attention Mechanism
Each token computes Query, Key, and Value vectors. Attention scores determine how much each previous token contributes to the output.
Diagram 2.1 — Q, K, V Projections
What each vector represents:
Q — "What am I looking for?"
K — "What do I contain?"
V — "What do I contribute?"
Diagram 2.2 — Attention Score Computation
Token "France" attending to previous tokens:
Position 1: Q_france · K_capital = 0.87
Position 3: Q_france · K_france = 0.45
After softmax normalization:
Output = weighted sum of V vectors
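The score-then-softmax step above can be sketched in a few lines of plain Python; the dot-product scores are the illustrative values from Diagram 2.2, and the 2-d V vectors are invented for the example.

```python
import math

# Toy attention step: raw scores are Q·K dot products (values from
# Diagram 2.2), softmax turns them into weights, and the output is the
# weighted sum of the V vectors.
def attend(scores, values):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]   # softmax normalization
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out

scores = [0.87, 0.45]              # Q_france·K_capital, Q_france·K_france
values = [[1.0, 0.0], [0.0, 1.0]]  # illustrative 2-d V vectors
weights, out = attend(scores, values)
print([round(w, 3) for w in weights])   # weights sum to 1
```

Note that softmax only cares about score differences, which is why the 0.87 token ends up with a larger but not overwhelming share of the weight.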
3. KV-Cache Structure
The KV-cache stores Key and Value vectors for all processed tokens, eliminating redundant computation during generation.
Diagram 3.1 — KV-Cache Organization
Layer 1: K blocks for heads h0–h7 | V blocks for heads h0–h7
⋮ repeated × 80 layers ⋮
Size = L × H_kv × seq_len × d_head × 2 (K and V) × bytes per element
     = 80 × 8 × seq_len × 128 × 2 × 2
     = 320 KB × seq_len  (i.e., 320 KB per token)
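The size formula above as a quick calculation:

```python
# Per-token KV-cache size for Llama-70B, from the formula above:
# layers × KV heads × head dim × 2 (K and V) × bytes per element (FP16).
layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2
per_token = layers * kv_heads * head_dim * 2 * bytes_fp16
print(per_token // 1024, "KB per token")          # 320 KB

seq_len = 128_000                                 # 128K-token context
total_gb = per_token * seq_len / 1e9
print(round(total_gb, 1), "GB at 128K context")   # ≈ 42 GB, the ~41 GB figure used later
```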
Diagram 3.2 — KV-Cache Size Scaling
KV-cache size grows linearly at 320 KB per token: ≈ 41 GB at a 128K context.
B200 HBM capacity: 192 GB
4. Prefill vs Decode
The two phases of inference have fundamentally different computational characteristics.
Diagram 4.1 — Phase Comparison
Prefill Phase
Processing the prompt
✓ All tokens processed in parallel
✓ High arithmetic intensity
✓ Compute-bound
Decode Phase
Generating response
"The" → "capital" → … → "Paris" (one token per step)
✗ One token at a time
✗ Must read entire KV-cache
✗ Memory-bandwidth-bound
Diagram 4.2 — Decode Memory Access Pattern
Bandwidth requirement:
Each decode step must read the model weights (140 GB) plus the full KV-cache (41 GB at 128K) ≈ 181 GB.
At 20 tokens/sec: 181 GB × 20 = 3.62 TB/s
B200 provides: 8 TB/s ✓
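The decode bandwidth arithmetic above, as a sketch using the figures from this section:

```python
# Decode is memory-bound: every generated token streams the model weights
# plus the full KV-cache out of HBM (figures from Sections 4-5).
weights_gb, kv_gb, tokens_per_sec = 140, 41, 20
required_tbs = (weights_gb + kv_gb) * tokens_per_sec / 1000
print(required_tbs, "TB/s required")   # 3.62 TB/s
print("fits" if required_tbs < 8 else "exceeds", "B200's 8 TB/s")
```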
5. The Memory Wall
GPU memory capacity, not compute or bandwidth, becomes the limiting factor with multiple users.
Diagram 5.1 — Single User Memory Layout
B200 HBM — 192 GB: Model Weights — 140 GB | KV-cache — 41 GB | Free — ~10 GB
✓ Single user at 128K context fits
Diagram 5.2 — Multi-User Memory Explosion
B200 capacity: 192 GB
8 users need: 468 GB (2.4× over)
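The multi-user explosion follows directly from the per-user figures: weights are shared once, but each user carries a private KV-cache.

```python
# Multi-user memory requirement (values from Section 5): shared weights
# plus one 41 GB KV-cache per concurrent 128K-context user.
weights_gb, kv_per_user_gb, hbm_gb = 140, 41, 192

def total_needed(users):
    return weights_gb + users * kv_per_user_gb

for users in (1, 8):
    need = total_needed(users)
    status = "fits" if need <= hbm_gb else f"{need / hbm_gb:.1f}x over"
    print(f"{users} users: {need} GB ({status})")
```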
6. Attention Locality
Empirical measurement reveals that attention concentrates heavily on recent tokens.
Diagram 6.1 — Attention Distribution (10K Context)
Key insight: ~80% of attention goes to the ~10% most recent tokens.
Diagram 6.2 — Attention Heatmap (Simplified)
Current token attending to context (positions 0 … N).
7. RoPE: Why Locality Emerges
Rotary Position Embedding creates locality as a geometric property of how positions are encoded.
Diagram 7.1 — RoPE Rotation Concept
Each dimension pair i rotates at a different frequency: position m rotates pair i by angle m × θᵢ.
Frequency formula:
θᵢ = 10000^(−2i/d)
θ₀ = 1.0 (fast)
θ₃₂ = 0.01 (medium)
θ₆₃ = 0.0001 (slow)
Fast dims → local patterns
Slow dims → global patterns
Diagram 7.2 — Distance-Dependent Decay
Average cosine factor by distance:
score(m, n) ∝ Σᵢ cos((m − n) · θᵢ)
Small distance → cos ≈ 1 → high attention
Large distance → cos oscillates → lower attention
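The frequency schedule and the resulting decay can be checked numerically; this sketch uses the d = 128 head dimension from Section 1:

```python
import math

# RoPE frequencies for head dim d = 128: theta_i = 10000 ** (-2 * i / d).
# Averaging cos((m - n) * theta_i) over dimension pairs shows the
# distance-dependent decay described in Diagram 7.2.
d = 128
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]
print(round(thetas[0], 4), round(thetas[32], 4), round(thetas[63], 6))

def avg_cos(distance):
    return sum(math.cos(distance * t) for t in thetas) / len(thetas)

for dist in (1, 100, 10000):
    print(dist, round(avg_cos(dist), 3))  # shrinks on average as distance grows
```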
8. Attention Head Types
Different attention heads specialize for different functions, creating varied access patterns.
Diagram 8.1 — Head Specialization
Recency Heads
~40% of heads
Focus on last 50-200 tokens
Anchor Heads
~15% of heads
Always check position 0-100
Retrieval Heads
~25% of heads
Content-based, position-independent
Syntactic Heads
~20% of heads
Follow grammatical dependencies
Implication: A single caching policy cannot satisfy all heads. Per-head tracking required.
9. CXL Architecture
CXL provides memory expansion at far lower cost than HBM, at the price of much lower bandwidth.
Diagram 9.1 — System Topology
CXL 3.0 ×16 links — 64 GB/s each
Total CXL: 1 TB @ 256 GB/s aggregate
Diagram 9.2 — HBM vs CXL Comparison
| | GPU HBM | CXL DRAM | Ratio |
| --- | --- | --- | --- |
| Bandwidth | 8 TB/s | 256 GB/s | 31× less |
| Latency | 100 ns | 250 ns | 2.5× more |
| Capacity | 192 GB | 1 TB | 5× more |
| Cost per GB | ~$50 | ~$5 | 10× less |
CXL tradeoff: 10× cheaper per GB, but 31× lower bandwidth. Viable only if most accesses hit HBM.
10. Tiered Memory Hierarchy
The caching system places data in tiers based on access patterns.
Diagram 10.1 — Three-Tier Architecture
Tier 0 — HBM Pinned
Anchor zone + critical tokens
Tier 1 — HBM Evictable
Recent + high-attention tokens
Tier 2 — CXL DRAM
Cold tokens, low attention
Diagram 10.2 — Memory Layout (8 Users × 128K)
HBM — 192 GB: Model Weights — 140 GB | Hot KV
CXL — 1 TB: Cold KV — 280 GB | Available — 720 GB
11. EMA Scoring Algorithm
Exponential Moving Average tracks which tokens actually receive attention over time.
Diagram 11.1 — EMA Update Rule
score_t = α · attention_t + (1 − α) · score_{t−1}
Diagram 11.2 — EMA Evolution Example
System Instruction Token
Position 50 — "helpful"
Consistent attention from anchor heads:
Step 0: attn=0.04 → score=0.008
Step 1: attn=0.03 → score=0.012
Step 2: attn=0.05 → score=0.020
...
Step 100: → score=0.040
→ Stays HOT
Generic Middle Token
Position 45,000 — "the"
Rarely attended:
Step 0: attn=0.001 → score=0.0002
Step 1: attn=0.000 → score=0.0002
Step 2: attn=0.002 → score=0.0005
...
Step 100: → score=0.001
→ Evict to CXL
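The two trajectories above can be reproduced with the update rule from Diagram 11.1. The document does not state α, so α = 0.2 is an assumption here, chosen because it matches the worked numbers.

```python
# EMA score tracking (Diagram 11.1). alpha = 0.2 is an assumption, but it
# reproduces the worked numbers in Diagram 11.2.
def ema_update(score, attention, alpha=0.2):
    return alpha * attention + (1 - alpha) * score

score = 0.0                        # system-instruction token, steadily attended
for attn in (0.04, 0.03, 0.05):
    score = ema_update(score, attn)
print(round(score, 3))             # ≈ 0.020, matching step 2 above

cold = 0.0                         # generic middle token, rarely attended
for attn in (0.001, 0.000, 0.002):
    cold = ema_update(cold, attn)
print(round(cold, 4))              # stays near zero -> evict to CXL
```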
12. Priority Scoring Formula
Final placement decisions combine multiple signals into a single priority score.
Diagram 12.1 — Scoring Components
P(p) = 0.25 · R(p) + 0.55 · E(p) + 0.20 · N(p)
Diagram 12.2 — Tier Assignment Thresholds
P ∈ [0, 0.3) → Tier 2 (CXL)
P ∈ [0.3, 0.6) → Tier 1 (HBM)
P ∈ [0.6, 1.0] → Tier 0 (Pinned)
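A minimal sketch of the scoring-plus-thresholding pipeline. R, E, and N are not defined in this excerpt, so they are treated here as generic signals already normalized to [0, 1], and the boundary handling at P = 0.3 and P = 0.6 is an assumption.

```python
# Priority scoring and tier assignment (Diagrams 12.1-12.2). The inputs
# r, e, n stand in for the undefined R(p), E(p), N(p) signals.
def priority(r, e, n):
    return 0.25 * r + 0.55 * e + 0.20 * n

def tier(p):
    if p >= 0.6:
        return 0   # pinned in HBM
    if p >= 0.3:
        return 1   # HBM, evictable
    return 2       # CXL

p = priority(r=0.9, e=0.8, n=0.5)   # an illustrative hot token
print(round(p, 3), "-> tier", tier(p))   # 0.765 -> tier 0
```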
13. Per-Head Tracking
Scores are maintained separately for each KV-head to handle head specialization.
Diagram 13.1 — Per-Head Score Matrix
| Position | Head 0 (recency) | Head 1 (anchor) | Head 2 (retrieval) | Heads 3–7 | Aggregate | Decision |
| --- | --- | --- | --- | --- | --- | --- |
| 0 (system) | 0.001 | 0.089 | 0.012 | ... | 0.089 | HBM |
| 45,000 | 0.000 | 0.002 | 0.003 | ... | 0.003 | CXL |
| 99,950 | 0.082 | 0.004 | 0.031 | ... | 0.082 | HBM |
P_aggregate(p) = max over all heads h of P_h(p)
Position stays in HBM if ANY head needs it. Evict only when NO head has recent access.
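The max-aggregation rule on the table's rows can be sketched directly; the 0.01 decision threshold is illustrative, not from the document.

```python
# Max-aggregation across KV heads (Diagram 13.1): a position stays hot if
# ANY head scores it highly. Rows reuse the table's values above.
def aggregate(head_scores):
    return max(head_scores)

def decision(head_scores, threshold=0.01):   # threshold is illustrative
    return "HBM" if aggregate(head_scores) >= threshold else "CXL"

rows = {
    0:      [0.001, 0.089, 0.012],   # system token: the anchor head keeps it hot
    45_000: [0.000, 0.002, 0.003],   # generic middle token: no head cares
    99_950: [0.082, 0.004, 0.031],   # recent token: the recency head keeps it hot
}
for pos, scores in rows.items():
    print(pos, aggregate(scores), decision(scores))
```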
14. Prefetching Strategy
Predictive prefetch loads anticipated tokens from CXL before they're needed.
Diagram 14.1 — Prefetch Targets
1. Anchor zone [0, 100]
2. Recent window [m−200, m−1]
3. High-EMA positions
Diagram 14.2 — Prefetch Timing Budget
Prefetch capacity @ 256 GB/s:
7.68 GB ≈ 24,000 positions (at 320 KB per position)
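The capacity figure above checks out against the per-position size from Section 3:

```python
# Prefetch capacity (Diagram 14.2): how many 320 KB positions fit in the
# stated 7.68 GB transfer budget over the 256 GB/s CXL link.
kv_bytes_per_position = 320 * 1024
budget_bytes = 7.68e9
positions = int(budget_bytes / kv_bytes_per_position)
print(positions)   # 23,437 -- the ~24,000 positions quoted above
```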
15. Hit Rate Progression
Each algorithmic improvement increases the HBM hit rate.
Diagram 15.1 — Algorithm Contribution
Diagram 15.2 — Effective Latency
L_eff = hit_rate × L_HBM + (1 − hit_rate) × L_CXL
At 95% hit rate: 0.95 × 100 + 0.05 × 250 = 107.5 ns
Overhead vs pure HBM: (107.5 − 100) / 100 = +7.5%
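The effective-latency formula as a small sketch, using the 100 ns HBM and 250 ns CXL figures from Section 9:

```python
# Effective access latency (Diagram 15.2) as a hit-rate-weighted average
# of HBM and CXL latencies.
def effective_latency(hit_rate, l_hbm_ns=100, l_cxl_ns=250):
    return hit_rate * l_hbm_ns + (1 - hit_rate) * l_cxl_ns

lat = effective_latency(0.95)
print(lat, "ns")                                # 107.5 ns
print(f"+{(lat - 100) / 100:.1%} vs pure HBM")  # +7.5%
```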
16. Final Results
Diagram 16.1 — System Comparison
| | Without CXL | With CXL + Tiering |
| --- | --- | --- |
| Memory | 192 GB | 1.2 TB |
| Users @ 128K | 1 | 8+ |
| Cost (8 users) | $70K | $45K |
| Hardware | 2× B200 | 1× B200 + CXL |
Diagram 16.2 — Key Metrics
Slug Architecture Research — December 2025