Why Attention Has Locality — The RoPE Effect

Understanding why recent tokens receive more attention than distant ones

Attention Pattern: Token at Position 1000 Looking Back

Attention weights when generating the 1000th token:

Last 10 tokens        (pos 990-999):  45%
Tokens 11-100 back    (pos 900-989):  28%
Tokens 101-500 back   (pos 500-899):  18%
Tokens 501-950 back   (pos 50-499):    7%
First 50 tokens       (pos 0-49):     ~2% (remainder)
Key Observation
73% of attention (45% + 28%) goes to the last 10% of tokens.
This is not random: it is caused by how positions are encoded (RoPE).
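The 73% figure is simply the two most-recent buckets from the chart above added together; a quick check (bucket labels taken from the chart):

```python
# Attention shares from the chart above (percent of total attention).
shares = {
    "pos 990-999": 45,  # last 10 tokens
    "pos 900-989": 28,  # tokens 11-100 back
    "pos 500-899": 18,
    "pos 50-499": 7,
}

# The two most-recent buckets cover positions 900-999: the last 100
# of the 1000 tokens, i.e. the last 10% of the context.
recent = shares["pos 990-999"] + shares["pos 900-989"]
print(f"{recent}% of attention on the last 10% of tokens")
```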
How RoPE Creates Locality
[Figure: average attention vs. distance (m − n) from the current token, sampled at d=10, d=100, d=500, d=1000]
attention ∝ cos((m - n) × θ)
m = query position
n = key position
θ = rotation frequency

When the distance (m - n) is small → cos ≈ 1 at every frequency → high attention
When the distance is large → the cosines at different frequencies oscillate out of phase → lower attention on average
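A minimal numerical sketch of this averaging effect. It assumes the standard RoPE frequency schedule θ_i = base^(−2i/d) with base 10000; the head dimension of 64 is illustrative:

```python
import math

def avg_cos(distance, head_dim=64, base=10000.0):
    """Average of cos(distance * theta_i) over RoPE's frequency bands.

    Each dimension pair i rotates at theta_i = base^(-2i/head_dim).
    This average is a rough proxy for how the query-key dot product
    decays with relative distance (m - n) when content is held fixed.
    """
    thetas = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return sum(math.cos(distance * t) for t in thetas) / len(thetas)

# Small distances: all frequencies still near cos = 1 -> strong attention.
# Large distances: fast frequencies have dephased -> weaker average.
for d in (1, 10, 100, 1000):
    print(f"d={d:4d}  avg cos = {avg_cos(d):+.3f}")
```

The decay is not strictly monotonic (individual cosines oscillate), but the trend from d=1 to d=1000 is clearly downward, which is the locality bias described above.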

💡 The Caching Implication

Since ~80% of attention goes to the most recent ~10% of tokens, we can keep recent tokens in fast HBM memory and move older tokens to slower CXL memory. Most accesses will hit the fast tier, keeping average latency low.
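A back-of-the-envelope sketch of that average-latency argument. The hit rate follows the ~80% figure above; the per-tier latencies are illustrative assumptions, not measured numbers:

```python
# Two-tier KV cache: recent tokens in HBM, older tokens in CXL memory.
# Latency figures below are hypothetical, for illustration only.
HBM_LATENCY_NS = 100   # assumed fast-tier access latency
CXL_LATENCY_NS = 600   # assumed slow-tier access latency
HIT_RATE = 0.80        # ~80% of attention mass lands on the HBM tier

# Expected access latency under this split: well below the slow tier,
# because most lookups hit the fast tier.
avg_latency = HIT_RATE * HBM_LATENCY_NS + (1 - HIT_RATE) * CXL_LATENCY_NS
print(f"average access latency ≈ {avg_latency:.0f} ns")
```

With these numbers the blended latency works out to about 200 ns, i.e. only 2x the fast tier despite 90% of the tokens living in the slow tier.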