Per-Head Attention Tracking

Different attention heads serve different purposes — one cache policy doesn't fit all

The Four Types of Attention Heads
Recency Head (Head 0, Layer 40)
Focuses on the last 50-100 tokens, looking for immediate context.
Pattern: attention mass concentrated at the recent end of the sequence.

Anchor Head (Head 1, Layer 40)
Always attends to the first ~50 tokens, looking for system instructions.
Pattern: attention mass concentrated at the start of the sequence.

Retrieval Head (Head 2, Layer 40)
Attends based on semantic similarity; finds relevant information anywhere in the context.
Pattern: variable and spiky.

Syntactic Head (Head 3, Layer 40)
Attends to grammatically related tokens (subject-verb agreement, etc.).
Pattern: structured, position-dependent.
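These categories can be detected from a head's averaged attention weights. The sketch below is a minimal heuristic classifier, not from any particular system: the function name, window sizes, and thresholds are all illustrative assumptions.

```python
import numpy as np

def classify_head(attn, recent_window=100, anchor_window=50,
                  mass_threshold=0.5):
    """Heuristic label for one head, given its (averaged) attention
    weights over the context. Thresholds are illustrative, not tuned."""
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum()  # normalize to a distribution
    if attn[-recent_window:].sum() > mass_threshold:
        return "recency"      # most mass on the last tokens
    if attn[:anchor_window].sum() > mass_threshold:
        return "anchor"       # most mass on the first tokens
    # Low entropy means spiky attention somewhere in the middle,
    # which suggests semantic retrieval rather than syntax.
    entropy = -(attn * np.log(attn + 1e-12)).sum()
    if entropy < 0.5 * np.log(len(attn)):
        return "retrieval"
    return "syntactic"

# Toy example: a head that puts most of its mass on the first 50 tokens.
weights = np.ones(1000)
weights[:50] += 100
print(classify_head(weights))  # → anchor
```

In practice such statistics would be collected over many queries before trusting a label, since a single forward pass can be noisy.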
❌ Problem with Global Caching
A single "keep recent tokens" policy works for recency heads but fails for:
  • Anchor heads (need start of prompt)
  • Retrieval heads (need specific distant tokens)
✓ Solution: Per-Head Tracking
Track which tokens each head needs separately. A token stays cached if ANY head still needs it. This ensures anchor tokens stay cached even though recency heads don't need them.
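The "keep if ANY head needs it" rule is just a union over per-head keep sets. A minimal sketch (the head names and positions below are made up for illustration):

```python
def tokens_to_keep(per_head_needs):
    """A token stays cached if ANY head still needs it:
    the combined keep set is the union of the per-head sets."""
    keep = set()
    for head, needed in per_head_needs.items():
        keep |= needed
    return keep

# Hypothetical per-head keep sets for a 1000-token context:
needs = {
    "recency":   set(range(900, 1000)),  # last 100 tokens
    "anchor":    set(range(0, 50)),      # start of prompt
    "retrieval": {512},                  # one distant key token
}
print(len(tokens_to_keep(needs)))  # → 151
```

Because the union is monotone in each head's set, adding a new head can only grow the cache footprint, never silently evict a token another head depends on.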
Example: 100K Context Document QA
Context structure:
  • System prompt: Pos 0-100
  • Document (50K tokens): Pos 100-50K
  • "France" (key fact): Pos ~50K
  • Conversation (40K tokens): Pos 50K-100K

Query: "What is the capital of France?" — Now generating answer...

  • Recency Head needs: last 100 tokens (recent conversation)
  • Anchor Head needs: Pos 0-100 (system prompt)
  • Retrieval Head needs: Pos ~50K (where "France" is mentioned)
  • Syntactic Head needs: question words (the recent "What is")
Combined Cache Decision
Keep in HBM: Pos 0-100 (anchor) + Pos ~50K (France mention) + Last 1000 tokens (recent)
Everything else → CXL DRAM (accessed less frequently)
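The combined decision for this example can be sketched as plain set arithmetic. The exact window sizes (100 anchor tokens, a ~100-token window around the "France" mention, the last 1000 tokens) are taken from the example above; treat them as illustrative, not prescribed.

```python
# 100K-token context from the document QA example above.
CONTEXT_LEN = 100_000

keep_hbm = (
    set(range(0, 100))                              # anchor: system prompt
    | set(range(49_950, 50_050))                    # retrieval: ~Pos 50K ("France")
    | set(range(CONTEXT_LEN - 1000, CONTEXT_LEN))   # recency: last 1000 tokens
)
# Everything not needed by any head is demoted to CXL DRAM.
evict_to_cxl = set(range(CONTEXT_LEN)) - keep_hbm

print(f"HBM: {len(keep_hbm)} tokens, "
      f"CXL DRAM: {len(evict_to_cxl)} tokens")
# → HBM: 1200 tokens, CXL DRAM: 98800 tokens
```

Under these assumptions only ~1.2% of the KV cache needs to stay in HBM; the rest can live in the slower tier and be fetched on the rare occasions a head's needs change.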