Per-Head Attention Tracking

Different attention heads serve different purposes — one cache policy doesn't fit all

The Four Types of Attention Heads
Recency Head (Head 0, Layer 40)
Focuses on the last 50-100 tokens, looking for immediate context.
Pattern: attention mass concentrated at the recent end of the sequence.

Anchor Head (Head 1, Layer 40)
Always attends to the first ~50 tokens, looking for system instructions.
Pattern: attention mass concentrated at the start of the sequence.

Retrieval Head (Head 2, Layer 40)
Attends based on semantic similarity; finds relevant information anywhere in the context.
Pattern: variable and spiky.

Syntactic Head (Head 3, Layer 40)
Attends to grammatically related tokens (subject-verb agreement, etc.).
Pattern: structured, position-dependent.
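These categories can be detected from a head's averaged attention weights. The sketch below is a minimal heuristic classifier, not from any particular system: the function name, window sizes, and thresholds are all illustrative assumptions.

```python
import numpy as np

def classify_head(attn, recent_window=100, anchor_window=50,
                  mass_threshold=0.5):
    """Heuristic label for one head, given its (averaged) attention
    weights over the context. Thresholds are illustrative, not tuned."""
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum()  # normalize to a distribution
    if attn[-recent_window:].sum() > mass_threshold:
        return "recency"      # most mass on the last tokens
    if attn[:anchor_window].sum() > mass_threshold:
        return "anchor"       # most mass on the first tokens
    # Low entropy means spiky attention somewhere in the middle,
    # which suggests semantic retrieval rather than syntax.
    entropy = -(attn * np.log(attn + 1e-12)).sum()
    if entropy < 0.5 * np.log(len(attn)):
        return "retrieval"
    return "syntactic"

# Toy example: a head that puts most of its mass on the first 50 tokens.
weights = np.ones(1000)
weights[:50] += 100
print(classify_head(weights))  # → anchor
```

In practice such statistics would be collected over many queries before trusting a label, since a single forward pass can be noisy.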
❌ Problem with Global Caching
A single "keep recent tokens" policy works for recency heads but fails for:
  • Anchor heads (need start of prompt)
  • Retrieval heads (need specific distant tokens)
✓ Solution: Per-Head Tracking
Track which tokens each head needs separately. A token stays cached if ANY head still needs it. This ensures anchor tokens stay cached even though recency heads don't need them.
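The "keep if ANY head needs it" rule is just a union over per-head keep sets. A minimal sketch (the head names and positions below are made up for illustration):

```python
def tokens_to_keep(per_head_needs):
    """A token stays cached if ANY head still needs it:
    the combined keep set is the union of the per-head sets."""
    keep = set()
    for head, needed in per_head_needs.items():
        keep |= needed
    return keep

# Hypothetical per-head keep sets for a 1000-token context:
needs = {
    "recency":   set(range(900, 1000)),  # last 100 tokens
    "anchor":    set(range(0, 50)),      # start of prompt
    "retrieval": {512},                  # one distant key token
}
print(len(tokens_to_keep(needs)))  # → 151
```

Because the union is monotone in each head's set, adding a new head can only grow the cache footprint, never silently evict a token another head depends on.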
Example: 100K Context Document QA
Context structure:
  • System prompt: Pos 0-100
  • Document (50K tokens): Pos 100-50K
  • "France" (key fact): Pos ~50K
  • Conversation (40K tokens): Pos 50K-100K

Query: "What is the capital of France?" — Now generating answer...

  • Recency Head needs: last 100 tokens (recent conversation)
  • Anchor Head needs: Pos 0-100 (system prompt)
  • Retrieval Head needs: Pos ~50K (where "France" is mentioned)
  • Syntactic Head needs: question words (the recent "What is")
Combined Cache Decision
Keep in HBM: Pos 0-100 (anchor) + Pos ~50K (France mention) + Last 1000 tokens (recent)
Everything else → CXL DRAM (accessed less frequently)
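The combined decision for this example can be sketched as plain set arithmetic. The exact window sizes (100 anchor tokens, a ~100-token window around the "France" mention, the last 1000 tokens) are taken from the example above; treat them as illustrative, not prescribed.

```python
# 100K-token context from the document QA example above.
CONTEXT_LEN = 100_000

keep_hbm = (
    set(range(0, 100))                              # anchor: system prompt
    | set(range(49_950, 50_050))                    # retrieval: ~Pos 50K ("France")
    | set(range(CONTEXT_LEN - 1000, CONTEXT_LEN))   # recency: last 1000 tokens
)
# Everything not needed by any head is demoted to CXL DRAM.
evict_to_cxl = set(range(CONTEXT_LEN)) - keep_hbm

print(f"HBM: {len(keep_hbm)} tokens, "
      f"CXL DRAM: {len(evict_to_cxl)} tokens")
# → HBM: 1200 tokens, CXL DRAM: 98800 tokens
```

Under these assumptions only ~1.2% of the KV cache needs to stay in HBM; the rest can live in the slower tier and be fetched on the rare occasions a head's needs change.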