Per-head tracking, EMA-based scoring, and RoPE-aware prefetching: the core algorithmic innovations that achieve a 95% HBM hit rate.
The KV-cache grows linearly with context length. At 128K tokens, Llama-70B requires 41 GB per user, exceeding single-GPU capacity in multi-user scenarios.
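The 41 GB figure can be sanity-checked from Llama-70B's published architecture (80 layers, 8 KV heads under GQA, head dimension 128, fp16). A rough calculation, with all parameters stated explicitly:

```python
# Rough KV-cache sizing for Llama-70B in fp16, assuming its published
# architecture: 80 layers, 8 KV heads (GQA), head dimension 128.
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Final factor of 2: one K and one V vector per layer, per KV head, per token.
    return tokens * layers * kv_heads * head_dim * dtype_bytes * 2

gib = kv_cache_bytes(128 * 1024) / 2**30
print(f"{gib:.0f} GiB at 128K tokens")  # ~40 GiB, in line with the ~41 GB figure
```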
Research reveals that attention heads specialize into distinct functional roles. Treating them uniformly wastes optimization potential.
| Head Type | % of Heads | Attention Pattern | Cache Strategy |
|---|---|---|---|
| Recency | ~40% | Last 50-200 tokens | Keep recent context hot |
| Anchor | ~15% | Positions 0-100 (system prompt) | Pin permanently |
| Retrieval | ~25% | Content-based lookup | Use EMA scoring |
| Syntactic | ~20% | Grammar patterns | Sparse, pattern-based |
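The head taxonomy above maps naturally to a per-type strategy table. A minimal sketch, where the enum names, percentages, and `classify` threshold logic are illustrative assumptions rather than the system's actual API:

```python
# Illustrative encoding of the head-type table; names and strategies are
# taken from the table above, but this is not the system's real interface.
from enum import Enum, auto

class HeadType(Enum):
    RECENCY = auto()    # ~40%: attends to the last 50-200 tokens
    ANCHOR = auto()     # ~15%: pinned to positions 0-100 (system prompt)
    RETRIEVAL = auto()  # ~25%: content-based lookup, scored by EMA
    SYNTACTIC = auto()  # ~20%: sparse, grammar-pattern attention

CACHE_STRATEGY = {
    HeadType.RECENCY:   "keep recent window hot",
    HeadType.ANCHOR:    "pin permanently",
    HeadType.RETRIEVAL: "EMA scoring",
    HeadType.SYNTACTIC: "sparse, pattern-based",
}

print(CACHE_STRATEGY[HeadType.ANCHOR])  # pin permanently
```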
Modern models like Llama use grouped-query attention (GQA), where multiple query heads share each KV head. Llama-70B has 64 query heads sharing 8 KV heads, an 8× reduction in KV-cache size.
A token might be cold for recency heads (position 5000) but hot for retrieval heads (contains key information). Token-level eviction would incorrectly evict this token. Per-head tracking preserves it.
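The eviction rule this implies can be sketched in a few lines: a token is evicted only when it is cold under *every* KV head's policy. The scores and threshold below are illustrative, not the system's tuned values:

```python
# Per-head eviction sketch: evict a token only if it is cold for all heads.
# Scores and the 0.01 threshold are illustrative assumptions.
def evictable(per_head_scores, threshold=0.01):
    # per_head_scores: this token's score under each KV head's policy.
    return all(score < threshold for score in per_head_scores)

# Token at position 5000: cold for a recency head (0.001) but hot for a
# retrieval head (0.05), so per-head tracking keeps it. A token-level
# policy, seeing only one aggregate score, might have evicted it.
print(evictable([0.001, 0.05]))   # False -> kept
print(evictable([0.001, 0.002]))  # True  -> evicted
```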
Simple LRU fails because critical tokens, such as the system prompt, may not have been accessed recently. We instead use an Exponential Moving Average (EMA) to capture sustained importance:
Consider a system-prompt token at position 5 receiving 4% of attention every step: LRU evicts it after ~100 steps because it hasn't been "accessed recently", while EMA maintains a stable score of 0.04 and never evicts it.
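The EMA behavior in that example is easy to verify numerically. A minimal sketch, where the decay constant `alpha = 0.1` is an assumption (the text does not specify the real value):

```python
# EMA importance update (sketch). alpha is an assumed decay constant.
def ema_update(score, attention, alpha=0.1):
    return alpha * attention + (1 - alpha) * score

# A token receiving a steady 4% of attention converges to a score of 0.04
# and stays there, so it is never evicted, regardless of recency.
score = 0.0
for _ in range(100):
    score = ema_update(score, 0.04)
print(round(score, 3))  # 0.04
```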
Rotary Position Embeddings (RoPE) create distance-dependent attention decay. Attention naturally concentrates on nearby positions:
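One way a prefetcher can exploit this decay is to prioritize positions by an assumed distance-decay profile, pulling the near window into HBM first. The exponential form and scale below are illustrative stand-ins for RoPE's actual, more complex long-range decay:

```python
import math

# Illustrative prefetch prioritization under distance-dependent decay.
# The exp(-d/scale) form and scale=512 are assumptions, not RoPE's exact
# decay; they capture only the "nearby positions dominate" property.
def prefetch_priority(distance, scale=512.0):
    return math.exp(-distance / scale)

# Nearby positions vastly outrank distant ones in prefetch order.
print(prefetch_priority(10) > prefetch_priority(5000))  # True
```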
Each algorithmic improvement contributes to the final 95% HBM hit rate: