EMA-Based Attention Scoring

How we track which tokens are actually important (not just recent)

The EMA Update Rule
score_new = 0.2 × attention + 0.8 × score_old
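At each decode step, we update every token's score using an exponential moving average (EMA) with smoothing factor α = 0.2, where attention is the attention the token received at that step.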
Tokens that consistently receive attention accumulate high scores.
Tokens that are ignored see their scores decay toward zero.
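The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the names ema_update and update_scores, the dictionary keyed by token position, and the choice to treat a missing attention entry as 0.0 are all assumptions made for the example.

```python
from typing import Dict

ALPHA = 0.2  # smoothing factor from the update rule above


def ema_update(score_old: float, attention: float, alpha: float = ALPHA) -> float:
    """Return alpha * attention + (1 - alpha) * score_old."""
    return alpha * attention + (1.0 - alpha) * score_old


def update_scores(
    scores: Dict[int, float],
    attention: Dict[int, float],
    alpha: float = ALPHA,
) -> Dict[int, float]:
    """Apply one decode step to every tracked position.

    Assumption: a position with no entry in `attention` received 0.0
    attention this step, so its score simply decays by (1 - alpha).
    """
    return {
        pos: ema_update(old, attention.get(pos, 0.0), alpha)
        for pos, old in scores.items()
    }


# Hypothetical usage: two tracked positions, both starting at 0.
scores = {45: 0.0, 5432: 0.0}
scores = update_scores(scores, {45: 0.04, 5432: 0.002})
first_step = dict(scores)
# Second step: position 5432 receives no attention and only decays.
scores = update_scores(scores, {45: 0.05})
```

After the first step, position 45 holds 0.008 and position 5,432 holds 0.0004. After the second step, position 45 rises to about 0.0164 while position 5,432 decays to about 0.00032.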
Token: "helpful" (instruction)
Position: 45

Step   0.2 × attention + 0.8 × score_old   score_new
1      0.2 × 0.04 + 0.8 × 0                0.008
2      0.2 × 0.05 + 0.8 × 0.008            0.016
10     0.2 × 0.03 + 0.8 × 0.042            0.040
50     0.2 × 0.04 + 0.8 × 0.11             0.096
100    0.2 × 0.05 + 0.8 × 0.14             0.122

Final score after 100 steps: 0.122 → KEEP

Token: "the" (filler word)
Position: 5,432

Step   0.2 × attention + 0.8 × score_old   score_new
1      0.2 × 0.002 + 0.8 × 0               0.0004
2      0.2 × 0.001 + 0.8 × 0.0004          0.0005
10     0.2 × 0.001 + 0.8 × 0.0008          0.0008
50     0.2 × 0.000 + 0.8 × 0.0006          0.0005
100    0.2 × 0.001 + 0.8 × 0.0004          0.0005

Final score after 100 steps: 0.0005 → EVICT

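The arithmetic in the two worked examples can be rechecked row by row. The sketch below copies each listed (attention, score_old) pair and recomputes score_new at the precision shown: three decimals for "helpful", four for "the". The tuple layout and the helper name recompute are illustrative choices, not part of the original scheme.

```python
ALPHA = 0.2

# (step, attention, score_old, listed score_new), copied from the rows above
HELPFUL_ROWS = [
    (1, 0.04, 0.0, 0.008),
    (2, 0.05, 0.008, 0.016),
    (10, 0.03, 0.042, 0.040),
    (50, 0.04, 0.11, 0.096),
    (100, 0.05, 0.14, 0.122),
]
THE_ROWS = [
    (1, 0.002, 0.0, 0.0004),
    (2, 0.001, 0.0004, 0.0005),
    (10, 0.001, 0.0008, 0.0008),
    (50, 0.000, 0.0006, 0.0005),
    (100, 0.001, 0.0004, 0.0005),
]


def recompute(rows, decimals):
    """Recompute score_new for each row and compare it with the listed value."""
    results = []
    for step, attention, score_old, listed in rows:
        score_new = round(ALPHA * attention + (1 - ALPHA) * score_old, decimals)
        results.append((step, score_new, abs(score_new - listed) < 1e-9))
    return results


helpful_check = recompute(HELPFUL_ROWS, decimals=3)
the_check = recompute(THE_ROWS, decimals=4)
all_match = all(ok for _, _, ok in helpful_check + the_check)
```

Each row is checked in isolation, using the score_old value printed in that row, because the examples show only five of the 100 steps.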
Why α = 0.2? The Half-Life Calculation
(1 - α)^n = 0.5
0.8^n = 0.5
n = log(0.5) / log(0.8)
n ≈ 3.1 steps
At 20 tokens/second, each decode step takes 50 ms, so the half-life is about 3.1 × 50 ms = 155 milliseconds.

This means that if a token receives no attention for about 155 ms, its score drops by 50%. Tokens that are consistently important maintain high scores; tokens that were briefly accessed and then ignored see their scores decay quickly.
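The half-life calculation generalizes to other values of α and other decode rates. The following sketch assumes one EMA update per generated token; the function names half_life_steps and half_life_ms are hypothetical.

```python
import math


def half_life_steps(alpha: float) -> float:
    """Number of steps n such that (1 - alpha)^n = 0.5."""
    return math.log(0.5) / math.log(1.0 - alpha)


def half_life_ms(alpha: float, tokens_per_second: float) -> float:
    """Half-life in milliseconds, assuming one update per decoded token."""
    ms_per_step = 1000.0 / tokens_per_second
    return half_life_steps(alpha) * ms_per_step


steps = half_life_steps(0.2)   # about 3.1 steps
ms = half_life_ms(0.2, 20.0)   # about 155 ms
```

A smaller α lengthens the half-life, so scores forget more slowly; a larger α shortens it.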
Eviction Decision Thresholds

Score         Decision   Action
> 0.10        KEEP       Store in fast HBM
0.02 - 0.10   DEMOTE     Move to CXL DRAM
< 0.02        EVICT      Move to Flash or discard
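
The thresholds map directly onto a small classification function. In this sketch the names classify and ACTIONS are illustrative, and it assumes the 0.02 - 0.10 range is inclusive at both ends, so a score of exactly 0.10 or 0.02 is demoted.

```python
KEEP_THRESHOLD = 0.10   # scores above this stay in HBM
EVICT_THRESHOLD = 0.02  # scores below this are evicted

# Actions taken for each decision, as listed in the thresholds above.
ACTIONS = {
    "KEEP": "Store in fast HBM",
    "DEMOTE": "Move to CXL DRAM",
    "EVICT": "Move to Flash or discard",
}


def classify(score: float) -> str:
    """Map an EMA score to KEEP, DEMOTE, or EVICT.

    Assumption: the boundary values 0.10 and 0.02 fall in the DEMOTE band.
    """
    if score > KEEP_THRESHOLD:
        return "KEEP"
    if score < EVICT_THRESHOLD:
        return "EVICT"
    return "DEMOTE"


helpful_decision = classify(0.122)   # "KEEP"
filler_decision = classify(0.0005)   # "EVICT"
```

With the final scores from the two worked examples, "helpful" (0.122) is kept in HBM and "the" (0.0005) is evicted.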