Attention-Aware Eviction

1

Observe Attention Scores

t₁: 0.45
t₂: 0.12
t₃: 0.30
t₄: 0.08
t₅: 0.35

Each decode step, every cached token receives an attention weight from the current query. These scores are noisy: a token might spike at one step and drop at the next.
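As a rough illustration of where these per-token scores come from, the sketch below computes softmax attention weights for one query against a handful of cached keys. It assumes a single head with scaled dot-product attention; `attention_weights` and the toy vectors are hypothetical names and values, not taken from any particular library.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and each cached key."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy cache: three cached keys of dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights([2.0, 0.0], keys)
print(weights)  # one weight per cached token; the weights sum to 1
```

A real model produces one such weight vector per head and per layer, so how those are combined into a single score per token is a further design choice that this sketch leaves open.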

📍 Tracking: Position 1024

2

Smooth with Exponential Moving Average

score_ema = α × new_score + (1 − α) × score_ema

α → 1.0

Reactive: trust recent scores (bursty patterns)

α → 0.1

Stable: trust history (important anchors)
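The effect of α can be seen with a single update. The sketch below applies the formula above to one sudden spike, once with a reactive α and once with a stable α. The function name `update_ema` and the sample values are illustrative assumptions.

```python
def update_ema(score_ema, new_score, alpha):
    """One EMA step: blend the newest attention score into the running value."""
    return alpha * new_score + (1.0 - alpha) * score_ema

history = 0.10  # hypothetical running score before the spike
spike = 0.90    # hypothetical attention score at the current step

reactive = update_ema(history, spike, alpha=0.9)  # about 0.82: follows the spike
stable = update_ema(history, spike, alpha=0.1)    # about 0.18: mostly keeps history
print(reactive, stable)
```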

3

Step-by-Step Calculation (α = 0.2)

Step	new_score	Calculation	score_ema
t₁	0.45	0.2 × 0.45 + 0.8 × 0.000	0.090
t₂	0.12	0.2 × 0.12 + 0.8 × 0.090	0.096
t₃	0.30	0.2 × 0.30 + 0.8 × 0.096	0.137
t₄	0.08	0.2 × 0.08 + 0.8 × 0.137	0.126
t₅	0.35	0.2 × 0.35 + 0.8 × 0.126	0.171

Raw Average 0.26 → EMA (α = 0.2) 0.171
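The score_ema column can be reproduced with a short loop. This is a minimal sketch, assuming the EMA starts at 0.0 as in the first row; `ema_trace` and its `digits` parameter are hypothetical helpers. Carrying three-decimal values from row to row gives 0.171 at t₅, while unrounded arithmetic ends slightly lower, near 0.170.

```python
def ema_trace(scores, alpha, initial=0.0, digits=None):
    """Return the EMA after each score; optionally round the carried value."""
    trace = []
    ema = initial
    for score in scores:
        ema = alpha * score + (1.0 - alpha) * ema
        if digits is not None:
            ema = round(ema, digits)  # mimic a table that carries rounded values
        trace.append(ema)
    return trace

scores = [0.45, 0.12, 0.30, 0.08, 0.35]  # t1..t5 from step 1
alpha = 0.2

exact = ema_trace(scores, alpha)
rounded = ema_trace(scores, alpha, digits=3)
raw_average = sum(scores) / len(scores)

print(rounded)      # three-decimal values, carried row to row
print(exact[-1])    # unrounded final EMA
print(raw_average)  # plain mean of the five scores
```

Starting from 0.0 biases the early EMA values downward, which is one reason the EMA sits below the raw average here.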

4

Compute Eviction Priority

priority = (1 − score_ema) × recency_decay

Higher priority → evict sooner

recency_decay = 1 − e^{−β × steps}

β = 0.001, ~63% decay at 1000 steps
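The two formulas translate directly into code. The sketch below is one possible rendering; the function and parameter names are assumptions chosen for readability, and β defaults to the 0.001 used here.

```python
import math

def recency_decay(steps_since_access, beta=0.001):
    """Grows from 0 toward 1 as a token goes longer without being accessed."""
    return 1.0 - math.exp(-beta * steps_since_access)

def eviction_priority(score_ema, steps_since_access, beta=0.001):
    """Higher values mean the token should be evicted sooner."""
    return (1.0 - score_ema) * recency_decay(steps_since_access, beta)

print(recency_decay(1000))  # about 0.632, the ~63% figure at 1000 steps
```

Because both factors lie between 0 and 1, the priority is also bounded by 0 and 1, and a token needs both a low smoothed score and a long idle period to rank near the top.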

A pos=1024, accessed 50 steps ago

score_ema = 0.171

recency_decay = 0.049

priority = (1 − 0.171) × 0.049 = 0.041

✓ KEEP

B pos=45678, accessed 2000 steps ago

score_ema = 0.08

recency_decay = 0.865

priority = (1 − 0.08) × 0.865 = 0.796

🗑 EVICT
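One way to turn priorities into a keep or evict decision is to evict the k highest-priority entries whenever the cache is over budget. The source does not specify a selection rule, so the top-k policy, the `CacheEntry` type, and `select_evictions` below are all assumptions made for illustration.

```python
import heapq
import math
from dataclasses import dataclass

@dataclass
class CacheEntry:
    position: int
    score_ema: float
    steps_since_access: int

def priority(entry, beta=0.001):
    """(1 - score_ema) * (1 - exp(-beta * steps)); higher means evict sooner."""
    decay = 1.0 - math.exp(-beta * entry.steps_since_access)
    return (1.0 - entry.score_ema) * decay

def select_evictions(entries, k, beta=0.001):
    """Assumed policy: pick the k entries with the highest eviction priority."""
    return heapq.nlargest(k, entries, key=lambda e: priority(e, beta))

entries = [
    CacheEntry(position=1024, score_ema=0.171, steps_since_access=50),    # token A
    CacheEntry(position=45678, score_ema=0.08, steps_since_access=2000),  # token B
]
victims = select_evictions(entries, k=1)
print([v.position for v in victims])  # token B is selected
```

`heapq.nlargest` avoids a full sort when k is small relative to the cache size. A fixed priority threshold would be an equally plausible policy.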

5

Decision Matrix

eviction_priority = f(recency, score_ema)

High Score + Old

⚠ Watch

System Prompt Token

ema=0.72 · 3000 steps · p=0.27

Was critical, might be again

High Score + Recent

✓ Keep

Recent Context Token

ema=0.85 · 20 steps · p=0.003

Actively used, high value

Low Score + Old

🗑 Evict

Filler Word "the"

ema=0.03 · 5000 steps · p=0.96

Never important, stale

Low Score + Recent

⚠ Watch

New User Input

ema=0.05 · 10 steps · p=0.009

Just arrived, give it time

← Old | Recent →   ↑ High Score | Low Score ↓
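The four quadrants can be expressed as a small classifier. The matrix does not state where "high" or "recent" begins, so the cutoffs below (score_ema ≥ 0.5, at most 100 steps since access) are placeholder assumptions, as are the function names. The priority uses the same formula as step 4 with β = 0.001.

```python
import math

def eviction_priority(score_ema, steps, beta=0.001):
    return (1.0 - score_ema) * (1.0 - math.exp(-beta * steps))

def classify(score_ema, steps, score_cutoff=0.5, recent_cutoff=100):
    """Map a token to a quadrant action; both cutoffs are assumed values."""
    high = score_ema >= score_cutoff
    recent = steps <= recent_cutoff
    if high and recent:
        return "keep"
    if not high and not recent:
        return "evict"
    return "watch"  # high + old, or low + recent

examples = {
    "system prompt token": (0.72, 3000),
    "recent context token": (0.85, 20),
    'filler word "the"': (0.03, 5000),
    "new user input": (0.05, 10),
}
results = {
    name: (classify(ema, steps), eviction_priority(ema, steps))
    for name, (ema, steps) in examples.items()
}
for name, (action, p) in results.items():
    print(f"{name}: {action}, p={p:.3f}")
```

The two "watch" quadrants are where priority alone can mislead: a new token has a low priority only because it is recent, and an old anchor has a moderate priority despite a high score.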

Attention-Aware Cache Eviction