© 2025 Subramaniyam (Sam) Pooni
All Rights Reserved
Proprietary & Confidential
Chapter 7

Intelligent KV-Cache Management

Per-head tracking, EMA-based scoring, and RoPE-aware prefetching: the core algorithmic innovations that achieve a 95% HBM hit rate.

Chapter at a glance: 27 figures · 95% HBM hit rate · +25% vs LRU baseline · 7.5% latency overhead

7.1 The Memory Wall Problem

KV-cache size grows linearly with context length. At a 128K-token context, Llama-70B requires roughly 41 GB of KV-cache per user, exceeding single-GPU HBM capacity in multi-user serving.
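The 41 GB figure can be sanity-checked from the model's published dimensions. A back-of-envelope sketch (the layer count, KV-head count, and head dimension below are Llama-70B's public configuration; fp16 storage is assumed):

```python
# Back-of-envelope KV-cache size for Llama-70B at a 128K-token context.
# Assumed config: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes).
def kv_cache_bytes(n_tokens, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # The factor of 2 covers both keys and values.
    return n_tokens * n_layers * n_kv_heads * head_dim * dtype_bytes * 2

per_user = kv_cache_bytes(128_000)
print(f"{per_user / 1e9:.1f} GB per user")  # ~41.9 GB, matching the ~41 GB figure
```

The exact number shifts by a GB or two depending on whether "128K" means 128,000 or 131,072 tokens, but the conclusion is unchanged: one long-context user fills most of an 80 GB GPU.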

Figure 7.1: Memory Wall Visualization

7.2 Attention Head Specialization

Research reveals that attention heads specialize into distinct functional roles. Treating them uniformly wastes optimization potential.

Head Type    % of Heads   Attention Pattern                 Cache Strategy
Recency      ~40%         Last 50-200 tokens                Keep recent context hot
Anchor       ~15%         Positions 0-100 (system prompt)   Pin permanently
Retrieval    ~25%         Content-based lookup              Use EMA scoring
Syntactic    ~20%         Grammar patterns                  Sparse, pattern-based
Figure 7.2: Attention Locality Patterns
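One way to assign heads to the roles in the table is to classify them offline from their observed attention maps. A minimal sketch; the thresholds and the classification heuristic are illustrative, not taken from the text:

```python
def classify_head(attn_row, query_pos, anchor_span=100, recency_window=200):
    """Classify one head from a single attention row (probabilities over positions).

    attn_row[p] is the attention mass this head puts on position p.
    All thresholds are illustrative.
    """
    recent = sum(attn_row[max(0, query_pos - recency_window):query_pos + 1])
    anchor = sum(attn_row[:anchor_span])
    if recent > 0.7:
        return "recency"      # mass on the last ~200 tokens -> keep recent context hot
    if anchor > 0.5:
        return "anchor"       # mass on the system prompt -> pin permanently
    # Diffuse, content-driven mass suggests retrieval; sharp isolated peaks
    # suggest sparse syntactic patterns.
    return "retrieval" if max(attn_row) < 0.2 else "syntactic"

# A head attending almost entirely to the last 100 tokens classifies as recency:
row = [0.0] * 900 + [0.01] * 100
print(classify_head(row, query_pos=999))  # recency
```

In practice one would average over many query positions and prompts before committing a head to a cache strategy; a single row is shown here only to keep the sketch short.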

7.3 Grouped Query Attention (GQA)

Modern models like Llama use GQA, where multiple query heads share KV heads. Llama-70B has 64 query heads sharing 8 KV heads, an 8× reduction in KV-cache size.

Figure 7.3: GQA Structure Explained
Figure 7.4: GQA Tracking Diagram
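The sharing pattern is easy to state in code. A sketch assuming the usual contiguous grouping convention (query heads 0-7 share KV head 0, and so on):

```python
# Query-head -> KV-head mapping under GQA (Llama-70B: 64 query heads, 8 KV heads).
# Contiguous grouping is assumed; it is the common convention.
def kv_head_for(q_head, n_q_heads=64, n_kv_heads=8):
    group_size = n_q_heads // n_kv_heads  # 8 query heads per KV head
    return q_head // group_size

# Query heads 0-7 all read KV head 0; heads 56-63 all read KV head 7.
print([kv_head_for(q) for q in range(0, 64, 8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

For cache management this mapping matters because eviction decisions are made per KV head: a KV head's cache line serves all eight of its query heads, so its importance score must reflect the strongest demand among them.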

7.4 Per-Head Importance Tracking

A token might be cold for recency heads (position 5000) but hot for retrieval heads (contains key information). Token-level eviction would incorrectly evict this token. Per-head tracking preserves it.

P_aggregate(p) = max_{h ∈ heads} P_h(p)
A position survives if ANY head needs it.
Figure 7.5: Per-Head Tracking Visualization
Figure 7.6: Per-Head Score Matrix
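The max-aggregation rule above can be sketched directly (the head names and score values are illustrative):

```python
# Per-head importance with max-aggregation: a position is an eviction
# candidate only if EVERY head scores it low.
def aggregate_scores(per_head_scores):
    # per_head_scores: dict of head name -> list of per-position scores
    n_pos = len(next(iter(per_head_scores.values())))
    return [max(scores[p] for scores in per_head_scores.values())
            for p in range(n_pos)]

scores = {
    "recency":   [0.00, 0.01, 0.90],  # position 2 is recent
    "retrieval": [0.80, 0.02, 0.10],  # position 0 holds key content
}
print(aggregate_scores(scores))  # [0.8, 0.02, 0.9]
```

This reproduces the scenario in the text: position 0 is cold for the recency head but hot for the retrieval head, so the max keeps it resident; only position 1, low for every head, is a safe eviction candidate.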

7.5 EMA-Based Attention Scoring

Simple LRU fails because important tokens (e.g., system prompts) may not have been accessed recently yet remain critical. We use an Exponential Moving Average (EMA) to capture sustained importance:

score_t(p) = α × attention_t(p) + (1 − α) × score_{t−1}(p)
α = 0.1 recommended (half-life ≈ 7 decode steps)
💡 Why EMA Beats LRU

Consider a system prompt token at position 5 receiving a steady 4% of attention every step: LRU evicts it after ~100 steps because it hasn't been "accessed recently," while EMA maintains a stable score of 0.04 and never evicts it.

Figure 7.7: EMA Scoring Algorithm
Figure 7.8: EMA Step-by-Step Calculation
Figure 7.9: EMA Eviction Policy
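The update rule and the system-prompt example are small enough to run directly. A sketch using the recommended α = 0.1:

```python
# EMA attention scoring with the recommended alpha = 0.1.
# Half-life check: 0.9**7 ~= 0.478, so old evidence halves every ~7 decode steps.
def ema_update(prev_score, attention, alpha=0.1):
    return alpha * attention + (1 - alpha) * prev_score

# System prompt token receiving a steady 4% of attention every decode step:
score = 0.0
for _ in range(100):
    score = ema_update(score, 0.04)
print(round(score, 4))  # 0.04
```

With constant input the EMA converges to that input (here 0.04) regardless of α, which is exactly why the score stays stable and the token is never evicted, while LRU sees only "no recent access."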

7.6 RoPE-Aware Prefetching

Rotary Position Embeddings (RoPE) create distance-dependent attention decay. Attention naturally concentrates on nearby positions:

Attention(q_m, k_n) ∝ cos((m − n)θ)
Attention decays with position distance |m − n|.
Figure 7.10: RoPE Distance Decay
Figure 7.11: RoPE Prefetch Example
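The decay makes prefetching predictable: before a position is needed, its expected attention can be estimated from its distance to the current decode position, and only cold positions whose predicted weight clears a threshold are pulled back into HBM. A sketch; the single frequency θ and the threshold are illustrative (real RoPE sums many frequencies, which is what produces the net decay):

```python
import math

# RoPE-aware prefetch sketch: predict attention from position distance and
# prefetch only cold positions whose predicted weight clears a threshold.
# theta and threshold are illustrative values, not from the text.
def prefetch_candidates(current_pos, cold_positions, theta=0.01, threshold=0.5):
    def predicted_weight(p):
        # Single-frequency stand-in for the cos((m - n) * theta) decay.
        return math.cos((current_pos - p) * theta)
    return [p for p in cold_positions if predicted_weight(p) >= threshold]

# Nearby cold positions get prefetched; distant ones stay in host memory.
print(prefetch_candidates(5000, [4950, 4800, 3000]))  # [4950]
```

Pinned anchor positions bypass this filter entirely: they are kept resident by policy, so the prefetcher only has to predict demand for the middle of the context.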

7.7 Combined Hit Rate Results

Each algorithmic improvement contributes to the final 95% HBM hit rate:

LRU baseline          70%
+ Anchor pinning      78%  (+8%)
+ EMA scoring         85%  (+7%)
+ Per-head tracking   91%  (+6%)
+ RoPE prefetch       95%  (+4%)