Why Attention Has Locality — The RoPE Effect

Understanding why recent tokens receive more attention than distant ones

Attention Pattern: Token at Position 1000 Looking Back

Attention weights when generating the 1000th token:

Last 10 tokens        (pos 990-999):  45%
Tokens 11-100 back    (pos 900-989):  28%
Tokens 101-500 back   (pos 500-899):  18%
Tokens 501-950 back   (pos 50-499):    7%
First 50 tokens       (pos 0-49):     ~2% (remainder)
Key Observation
73% of attention (45% + 28%) goes to the last 10% of tokens.
This is not random: it is caused by how positions are encoded (RoPE).
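The 73% figure is simply the two most-recent buckets from the chart above added together; a quick check (bucket labels taken from the chart):

```python
# Attention shares from the chart above (percent of total attention).
shares = {
    "pos 990-999": 45,  # last 10 tokens
    "pos 900-989": 28,  # tokens 11-100 back
    "pos 500-899": 18,
    "pos 50-499": 7,
}

# The two most-recent buckets cover positions 900-999: the last 100
# of the 1000 tokens, i.e. the last 10% of the context.
recent = shares["pos 990-999"] + shares["pos 900-989"]
print(f"{recent}% of attention on the last 10% of tokens")
```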
How RoPE Creates Locality
[Figure: average attention vs. distance (m − n) from the current token, sampled at d=10, d=100, d=500, d=1000]
attention ∝ cos((m - n) × θ)
m = query position
n = key position
θ = rotation frequency

When the distance (m - n) is small → cos ≈ 1 at every frequency → high attention
When the distance is large → the cosines at different frequencies oscillate out of phase → lower attention on average
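A minimal numerical sketch of this averaging effect. It assumes the standard RoPE frequency schedule θ_i = base^(−2i/d) with base 10000; the head dimension of 64 is illustrative:

```python
import math

def avg_cos(distance, head_dim=64, base=10000.0):
    """Average of cos(distance * theta_i) over RoPE's frequency bands.

    Each dimension pair i rotates at theta_i = base^(-2i/head_dim).
    This average is a rough proxy for how the query-key dot product
    decays with relative distance (m - n) when content is held fixed.
    """
    thetas = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return sum(math.cos(distance * t) for t in thetas) / len(thetas)

# Small distances: all frequencies still near cos = 1 -> strong attention.
# Large distances: fast frequencies have dephased -> weaker average.
for d in (1, 10, 100, 1000):
    print(f"d={d:4d}  avg cos = {avg_cos(d):+.3f}")
```

The decay is not strictly monotonic (individual cosines oscillate), but the trend from d=1 to d=1000 is clearly downward, which is the locality bias described above.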

💡 The Caching Implication

Since ~80% of attention goes to the most recent ~10% of tokens, we can keep recent tokens in fast HBM memory and move older tokens to slower CXL memory. Most accesses will hit the fast tier, keeping average latency low.
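A back-of-the-envelope sketch of that average-latency argument. The hit rate follows the ~80% figure above; the per-tier latencies are illustrative assumptions, not measured numbers:

```python
# Two-tier KV cache: recent tokens in HBM, older tokens in CXL memory.
# Latency figures below are hypothetical, for illustration only.
HBM_LATENCY_NS = 100   # assumed fast-tier access latency
CXL_LATENCY_NS = 600   # assumed slow-tier access latency
HIT_RATE = 0.80        # ~80% of attention mass lands on the HBM tier

# Expected access latency under this split: well below the slow tier,
# because most lookups hit the fast tier.
avg_latency = HIT_RATE * HBM_LATENCY_NS + (1 - HIT_RATE) * CXL_LATENCY_NS
print(f"average access latency ≈ {avg_latency:.0f} ns")
```

With these numbers the blended latency works out to about 200 ns, i.e. only 2x the fast tier despite 90% of the tokens living in the slow tier.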