RoPE Prefetch: Worked Example

Step-by-step walkthrough of locality-aware KV-cache prefetching

1. Scenario Configuration

- Sequence length: 16 tokens
- GPU cache size: 8 KV slots
- Prefetch window: W = 3 (±3 positions)
- Storage latency: 50 µs per fetch

Rule: a query at position P triggers a prefetch of positions [P−3, P+3] into the GPU cache.
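The window rule above can be sketched as a small helper (the function name is my own; the W = 3, 16-token setup and the clamping at sequence boundaries follow the configuration):

```python
def prefetch_window(p: int, w: int = 3, seq_len: int = 16) -> list[int]:
    """Positions to hold in the GPU cache for a query at position p,
    clamped to the valid range [0, seq_len - 1]."""
    lo = max(0, p - w)
    hi = min(seq_len - 1, p + w)
    return list(range(lo, hi + 1))

print(prefetch_window(8))   # the [5, 11] window for the query at P=8
print(prefetch_window(12))  # the [9, 15] window near the end of the sequence
```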
Token sequence (one token per position, 0–15): The, quick, brown, fox, jumps, over, the, lazy, dog, and, runs, into, the, dark, forest, "."
2. RoPE Attention Distribution (Query @ P=8)

[Heatmap: attention weight from the query at P=8 to key positions 0–15, shaded from low to high attention and peaking around the query position.]

RoPE causes attention to concentrate around the query position. Positions 5–11 (within ±3 of query P=8) capture ~72% of total attention mass.

3. Prefetch Execution Timeline
🔴 Cache Miss: Query P=8 ("dog")
GPU requests KV for position 8. Not in cache → fetch from storage.
Trigger prefetch for window [5, 11].
Cache: empty
⏳ Fetching P=8 + Prefetch [5, 11]
Storage returns position 8; an async prefetch loads positions 5, 6, 7, 9, 10, and 11.
Cache: [5, 6, 7, 8, 9, 10, 11]
🟢 Cache Hit: Attention needs P=7 ("lazy")
High attention weight to position 7 → already prefetched!
🟢 Cache Hits: P=6, P=9, P=10
Remaining high-attention positions all hit in prefetch window.
➡ Next Query: P=12 ("the")
Autoregressive decode moves to next token. New window [9, 15].
Positions 9,10,11 already cached → only fetch 12,13,14,15.
Evict: [5, 6, 7, 8] · Keep: [9, 10, 11] · New: [12, 13, 14, 15]
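The sliding-window update in this step can be sketched with set arithmetic: moving the query from P=8 to P=12 reuses the overlap and fetches only the new tail (set semantics stand in for real GPU slot management; the function name is my own):

```python
def advance(cache: set[int], p: int, w: int = 3, seq_len: int = 16):
    """Slide the cache to the window around the new query position p."""
    window = set(range(max(0, p - w), min(seq_len - 1, p + w) + 1))
    evict = cache - window   # resident slots outside the new window
    fetch = window - cache   # window positions not yet resident
    return (cache - evict) | fetch, evict, fetch

cache = {5, 6, 7, 8, 9, 10, 11}            # state after the query at P=8
cache, evicted, fetched = advance(cache, 12)
print(sorted(evicted))  # [5, 6, 7, 8]
print(sorted(fetched))  # [12, 13, 14, 15]
```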
4. Performance Results

- Cache hit rate: 85%
- Attention captured: 72%
- Latency reduction: 3.2×
- Bandwidth overhead: 1.4×

❌ Naive (no prefetch): 16 fetches per query, 800 µs avg latency, 16 storage round-trips
✅ RoPE prefetch (W=3): 7 fetches per query, 250 µs avg latency, 2 storage round-trips
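A quick consistency check on these numbers, under the stated 50 µs per fetch (the naive latency is derived arithmetic; the 250 µs prefetch figure is taken from the table, since async prefetching overlaps fetches and doesn't follow from simple multiplication):

```python
FETCH_US = 50

naive_latency = 16 * FETCH_US   # 16 sequential fetches at 50 µs each -> 800 µs
prefetch_latency = 250          # measured figure quoted in the comparison above

print(naive_latency)                     # 800
print(naive_latency / prefetch_latency)  # 3.2, matching the stated reduction
```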
5. Key Insight

RoPE locality = predictable access patterns

Because rotary position encoding causes attention to concentrate near the query position, we can predict which KV pairs will be needed and fetch them before the GPU stalls. This transforms random storage access into sequential prefetch streams.

Effective bandwidth: 4.8 GB/s (vs 1.5 GB/s naive)