Recency heads, anchor heads, retrieval heads, and syntactic heads.
Research indicates that attention heads specialize into distinct functional roles during training:
| Type | % of Heads | Attention Pattern | Cache Implication |
|---|---|---|---|
| Recency | ~40% | Last 50-200 tokens | Keep recent context hot |
| Anchor | ~15% | Positions 0-100 (system prompt) | Pin anchor zone permanently |
| Retrieval | ~25% | Content-based lookup | Use EMA scoring |
| Syntactic | ~20% | Grammar patterns | Sparse, pattern-based |
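
As a rough illustration of the "Cache Implication" column, the sketch below shows one way the recency, anchor, and retrieval rules might be expressed. The function names (`ema_update`, `recency_keep`, `anchor_keep`), the smoothing factor `alpha`, and the default window and zone sizes are assumptions chosen for illustration, not values taken from a specific implementation. Syntactic heads are left out because the table does not specify a concrete rule for them.

```python
from typing import List

# Assumed defaults, loosely based on the ranges in the table above.
RECENCY_WINDOW = 200  # upper end of "last 50-200 tokens"
ANCHOR_ZONE = 100     # approximates "positions 0-100"


def ema_update(scores: List[float], attn: List[float], alpha: float = 0.1) -> List[float]:
    """Blend the newest attention weights into running per-position scores."""
    return [(1 - alpha) * s + alpha * a for s, a in zip(scores, attn)]


def recency_keep(seq_len: int, window: int = RECENCY_WINDOW) -> List[bool]:
    """Recency heads: keep only the most recent `window` positions."""
    start = max(0, seq_len - window)
    return [pos >= start for pos in range(seq_len)]


def anchor_keep(seq_len: int, zone: int = ANCHOR_ZONE) -> List[bool]:
    """Anchor heads: pin every position inside the anchor zone."""
    return [pos < zone for pos in range(seq_len)]
```

A retrieval head would call `ema_update` once per decoding step with that head's attention weights, so positions it keeps looking up accumulate a high score even when they are far from the end of the sequence.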
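A token might be unimportant to most heads yet essential to one of them: for example, too old for the recency heads but still a lookup target for a retrieval head.
Token-level eviction, which keeps a single aggregate score per token, would incorrectly evict this token. Per-head tracking preserves it.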
A position survives if any head needs it.
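
The difference between the two policies can be sketched in a few lines. This is a minimal illustration, assuming each head exposes one importance score per position and that a single `threshold` decides what counts as "needed"; the function names and the example scores are hypothetical.

```python
from typing import List


def token_level_keep(scores: List[List[float]], threshold: float) -> List[bool]:
    """Keep a position only if its score averaged over all heads passes the threshold."""
    n_heads = len(scores)
    n_pos = len(scores[0])
    return [
        sum(head[pos] for head in scores) / n_heads >= threshold
        for pos in range(n_pos)
    ]


def per_head_keep(scores: List[List[float]], threshold: float) -> List[bool]:
    """Keep a position if at least one head scores it at or above the threshold."""
    n_pos = len(scores[0])
    return [any(head[pos] >= threshold for head in scores) for pos in range(n_pos)]


# scores[head][position]: position 0 matters only to head 2 (say, a retrieval head).
example_scores = [
    [0.05, 0.8],
    [0.02, 0.7],
    [0.90, 0.1],
    [0.01, 0.8],
]

token_level = token_level_keep(example_scores, threshold=0.5)
per_head = per_head_keep(example_scores, threshold=0.5)
```

With these made-up scores, averaging across heads drops position 0 because three of the four heads ignore it, while the per-head rule keeps it because one head scores it at 0.90.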