Per-head tracking, EMA-based scoring, and RoPE-aware prefetching: the core algorithmic innovations that achieve a 95% HBM hit rate.
The KV-cache grows linearly with context length. At 128K tokens, Llama-70B requires 41 GB per user, exceeding single-GPU capacity in multi-user scenarios.
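The 41 GB figure can be sanity-checked from Llama-70B's published architecture (80 layers, 8 KV heads under GQA, head dimension 128, fp16). A rough calculation, with all parameters stated explicitly:

```python
# Rough KV-cache sizing for Llama-70B in fp16, assuming its published
# architecture: 80 layers, 8 KV heads (GQA), head dimension 128.
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Final factor of 2: one K and one V vector per layer, per KV head, per token.
    return tokens * layers * kv_heads * head_dim * dtype_bytes * 2

gib = kv_cache_bytes(128 * 1024) / 2**30
print(f"{gib:.0f} GiB at 128K tokens")  # ~40 GiB, in line with the ~41 GB figure
```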
Research reveals that attention heads specialize into distinct functional roles. Treating them uniformly wastes optimization potential.
| Head Type | % of Heads | Attention Pattern | Cache Strategy |
|---|---|---|---|
| Recency | ~40% | Last 50-200 tokens | Keep recent context hot |
| Anchor | ~15% | Positions 0-100 (system prompt) | Pin permanently |
| Retrieval | ~25% | Content-based lookup | Use EMA scoring |
| Syntactic | ~20% | Grammar patterns | Sparse, pattern-based |
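The head taxonomy above maps naturally to a per-type strategy table. A minimal sketch, where the enum names, percentages, and `classify` threshold logic are illustrative assumptions rather than the system's actual API:

```python
# Illustrative encoding of the head-type table; names and strategies are
# taken from the table above, but this is not the system's real interface.
from enum import Enum, auto

class HeadType(Enum):
    RECENCY = auto()    # ~40%: attends to the last 50-200 tokens
    ANCHOR = auto()     # ~15%: pinned to positions 0-100 (system prompt)
    RETRIEVAL = auto()  # ~25%: content-based lookup, scored by EMA
    SYNTACTIC = auto()  # ~20%: sparse, grammar-pattern attention

CACHE_STRATEGY = {
    HeadType.RECENCY:   "keep recent window hot",
    HeadType.ANCHOR:    "pin permanently",
    HeadType.RETRIEVAL: "EMA scoring",
    HeadType.SYNTACTIC: "sparse, pattern-based",
}

print(CACHE_STRATEGY[HeadType.ANCHOR])  # pin permanently
```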
Modern models like Llama use grouped-query attention (GQA), where multiple query heads share each KV head. Llama-70B has 64 query heads sharing 8 KV heads, an 8× reduction in KV-cache size.
A token might be cold for recency heads (position 5000) but hot for retrieval heads (contains key information). Token-level eviction would incorrectly evict this token. Per-head tracking preserves it.
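The eviction rule this implies can be sketched in a few lines: a token is evicted only when it is cold under *every* KV head's policy. The scores and threshold below are illustrative, not the system's tuned values:

```python
# Per-head eviction sketch: evict a token only if it is cold for all heads.
# Scores and the 0.01 threshold are illustrative assumptions.
def evictable(per_head_scores, threshold=0.01):
    # per_head_scores: this token's score under each KV head's policy.
    return all(score < threshold for score in per_head_scores)

# Token at position 5000: cold for a recency head (0.001) but hot for a
# retrieval head (0.05), so per-head tracking keeps it. A token-level
# policy, seeing only one aggregate score, might have evicted it.
print(evictable([0.001, 0.05]))   # False -> kept
print(evictable([0.001, 0.002]))  # True  -> evicted
```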
Simple LRU fails because critical tokens, such as the system prompt, may not have been accessed recently. We instead use an Exponential Moving Average (EMA) to capture sustained importance:
Consider a system-prompt token at position 5 receiving 4% of attention every step: LRU evicts it after ~100 steps because it hasn't been "accessed recently", while EMA maintains a stable score of 0.04 and never evicts it.
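The EMA behavior in that example is easy to verify numerically. A minimal sketch, where the decay constant `alpha = 0.1` is an assumption (the text does not specify the real value):

```python
# EMA importance update (sketch). alpha is an assumed decay constant.
def ema_update(score, attention, alpha=0.1):
    return alpha * attention + (1 - alpha) * score

# A token receiving a steady 4% of attention converges to a score of 0.04
# and stays there, so it is never evicted, regardless of recency.
score = 0.0
for _ in range(100):
    score = ema_update(score, 0.04)
print(round(score, 3))  # 0.04
```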
Rotary Position Embeddings (RoPE) create distance-dependent attention decay. Attention naturally concentrates on nearby positions:
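One way a prefetcher can exploit this decay is to prioritize positions by an assumed distance-decay profile, pulling the near window into HBM first. The exponential form and scale below are illustrative stand-ins for RoPE's actual, more complex long-range decay:

```python
import math

# Illustrative prefetch prioritization under distance-dependent decay.
# The exp(-d/scale) form and scale=512 are assumptions, not RoPE's exact
# decay; they capture only the "nearby positions dominate" property.
def prefetch_priority(distance, scale=512.0):
    return math.exp(-distance / scale)

# Nearby positions vastly outrank distant ones in prefetch order.
print(prefetch_priority(10) > prefetch_priority(5000))  # True
```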
Each algorithmic improvement contributes to the final 95% HBM hit rate: