Attention-Aware Cache Eviction

Using EMA to transform noisy attention into stable eviction priorities

1
Observe Attention Scores
0.45
t₁
0.12
tâ‚‚
0.30
t₃
0.08
tâ‚„
0.35
tâ‚…

Each decode step, every cached token receives an attention weight from the current query. These scores are noisy—a token might spike one step and drop the next.

📍 Tracking: Position 1024
2
Smooth with Exponential Moving Average
score_ema = α × new_score + (1 − α) × score_ema
α → 1.0
Reactive: trust recent scores (bursty patterns)
α → 0.1
Stable: trust history (important anchors)
3
Step-by-Step Calculation (α = 0.2)
Stepnew_scoreCalculationscore_ema
t₁0.450.2 × 0.45 + 0.8 × 0.000.090
tâ‚‚0.120.2 × 0.12 + 0.8 × 0.1350.096
t₃0.300.2 × 0.30 + 0.8 × 0.1310.137
tâ‚„0.080.2 × 0.08 + 0.8 × 0.1820.126
tâ‚…0.350.2 × 0.35 + 0.8 × 0.1510.171
Raw Average
0.26
→
EMA (α=0.2)
0.171
4
Compute Eviction Priority
priority = (1 − score_ema) × recency_decay
Higher priority → evict sooner
recency_decay = 1 − e−β × steps
β = 0.001, ~63% decay at 1000 steps
A pos=1024, accessed 50 steps ago
score_ema0.171
recency_decay0.049
priority(1−0.211) × 0.049 = 0.039
✓ KEEP
B pos=45678, accessed 2000 steps ago
score_ema0.08
recency_decay0.865
priority(1−0.08) × 0.865 = 0.796
🗑 EVICT
5
Decision Matrix
eviction_priority = f(recency, score_ema)
High Score + Old
âš  Watch
System Prompt Token
ema=0.72 · 3000 steps · p=0.23
Was critical, might be again
High Score + Recent
✓ Keep
Recent Context Token
ema=0.85 · 20 steps · p=0.003
Actively used, high value
Low Score + Old
🗑 Evict
Filler Word "the"
ema=0.03 · 5000 steps · p=0.91
Never important, stale
Low Score + Recent
âš  Watch
New User Input
ema=0.05 · 10 steps · p=0.009
Just arrived, give it time
← Old | Recent → → High Score | Low Score ↓