RoPE Prefetch: Worked Example

Step-by-step walkthrough of locality-aware KV-cache prefetching

1. Scenario Configuration

- Sequence length: 16 tokens
- GPU cache size: 8 KV slots
- Prefetch window: W = 3 (±3 positions)
- Storage latency: 50 µs per fetch

Rule: a query at position P triggers a prefetch of positions [P−3, P+3] into the GPU cache.
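The window rule above can be sketched as a small helper (the function name is my own; the W = 3, 16-token setup and the clamping at sequence boundaries follow the configuration):

```python
def prefetch_window(p: int, w: int = 3, seq_len: int = 16) -> list[int]:
    """Positions to hold in the GPU cache for a query at position p,
    clamped to the valid range [0, seq_len - 1]."""
    lo = max(0, p - w)
    hi = min(seq_len - 1, p + w)
    return list(range(lo, hi + 1))

print(prefetch_window(8))   # the [5, 11] window for the query at P=8
print(prefetch_window(12))  # the [9, 15] window near the end of the sequence
```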
Token sequence (one token per position, 0–15): The, quick, brown, fox, jumps, over, the, lazy, dog, and, runs, into, the, dark, forest, "."
2. RoPE Attention Distribution (Query @ P=8)

[Heatmap: attention weight from the query at P=8 to key positions 0–15, shaded from low to high attention and peaking around the query position.]

RoPE causes attention to concentrate around the query position. Positions 5–11 (within ±3 of query P=8) capture ~72% of total attention mass.

3. Prefetch Execution Timeline
🔴 Cache Miss: Query P=8 ("dog")
GPU requests KV for position 8. Not in cache → fetch from storage.
Trigger prefetch for window [5, 11].
Cache: empty
⏳ Fetching P=8 + Prefetch [5, 11]
Storage returns position 8; an async prefetch loads positions 5, 6, 7, 9, 10, and 11.
Cache: [5, 6, 7, 8, 9, 10, 11]
🟢 Cache Hit: Attention needs P=7 ("lazy")
High attention weight to position 7 → already prefetched!
🟢 Cache Hits: P=6, P=9, P=10
Remaining high-attention positions all hit in prefetch window.
➡ Next Query: P=12 ("the")
Autoregressive decode moves to next token. New window [9, 15].
Positions 9,10,11 already cached → only fetch 12,13,14,15.
Evict: [5, 6, 7, 8] · Keep: [9, 10, 11] · New: [12, 13, 14, 15]
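The sliding-window update in this step can be sketched with set arithmetic: moving the query from P=8 to P=12 reuses the overlap and fetches only the new tail (set semantics stand in for real GPU slot management; the function name is my own):

```python
def advance(cache: set[int], p: int, w: int = 3, seq_len: int = 16):
    """Slide the cache to the window around the new query position p."""
    window = set(range(max(0, p - w), min(seq_len - 1, p + w) + 1))
    evict = cache - window   # resident slots outside the new window
    fetch = window - cache   # window positions not yet resident
    return (cache - evict) | fetch, evict, fetch

cache = {5, 6, 7, 8, 9, 10, 11}            # state after the query at P=8
cache, evicted, fetched = advance(cache, 12)
print(sorted(evicted))  # [5, 6, 7, 8]
print(sorted(fetched))  # [12, 13, 14, 15]
```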
4. Performance Results

- Cache hit rate: 85%
- Attention captured: 72%
- Latency reduction: 3.2×
- Bandwidth overhead: 1.4×

❌ Naive (no prefetch): 16 fetches per query, 800 µs avg latency, 16 storage round-trips
✅ RoPE prefetch (W=3): 7 fetches per query, 250 µs avg latency, 2 storage round-trips
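A quick consistency check on these numbers, under the stated 50 µs per fetch (the naive latency is derived arithmetic; the 250 µs prefetch figure is taken from the table, since async prefetching overlaps fetches and doesn't follow from simple multiplication):

```python
FETCH_US = 50

naive_latency = 16 * FETCH_US   # 16 sequential fetches at 50 µs each -> 800 µs
prefetch_latency = 250          # measured figure quoted in the comparison above

print(naive_latency)                     # 800
print(naive_latency / prefetch_latency)  # 3.2, matching the stated reduction
```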
5. Key Insight

RoPE locality = predictable access patterns

Because rotary position encoding causes attention to concentrate near the query position, we can predict which KV pairs will be needed and fetch them before the GPU stalls. This transforms random storage access into sequential prefetch streams.

Effective bandwidth: 4.8 GB/s (vs 1.5 GB/s naive)