RoPE-Aware Prefetching

Exploiting position encoding locality for predictive cache loading

1. RoPE Creates Distance-Dependent Attention
[Figure: attention score vs. relative position (distance from query), illustrating the rotary encoding applied at the query position]
RoPE encodes position by rotating Q/K vector pairs by position-dependent angles, so attention scores depend only on relative position. Nearby positions have similar rotations → higher dot products.
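A minimal NumPy sketch of this property, assuming the standard RoPE frequency schedule (`rope_rotate` is an illustrative helper, not a library API). The key fact it demonstrates is shift-invariance: the rotated dot product depends only on the offset between the two positions, which is what makes distance-based locality meaningful.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to vector x at sequence position pos.

    Each dimension pair (2i, 2i+1) is rotated by the angle
    pos * base**(-2i/d), the standard RoPE frequency schedule.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# Same relative offset (2) at different absolute positions ->
# identical attention score: RoPE is shift-invariant.
s_a = rope_rotate(q, 100) @ rope_rotate(k, 98)
s_b = rope_rotate(q, 1000) @ rope_rotate(k, 998)
print(abs(s_a - s_b))  # ~0 up to floating-point error
```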
Locality Bias
On average, ~70% of attention mass falls within ±W positions of the query, even for long contexts.
Predictable Access
If GPU requests position P, it will likely need P±W soon. Prefetch proactively.
2. Prefetch Window Strategy
Prefetch Rule
GPU accesses P → Prefetch [P − W, P + W]
W = window size, tuned per model based on attention distribution
[Diagram: prefetch window of width 2W centered on the current access P. Example with P = 55, W = 5: the prefetch zone covers positions 50–60; positions outside the window remain on storage.]
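The rule above, using the diagram's values (P = 55, W = 5), can be sketched as follows. `fetch_block` and the dict-backed cache are illustrative stand-ins for a real KV-cache and storage layer, and a real system would issue the neighborhood loads asynchronously.

```python
class WindowPrefetcher:
    """Sketch of the "access P -> prefetch [P - W, P + W]" rule."""

    def __init__(self, window, fetch_block):
        self.window = window            # W, tuned per model
        self.fetch_block = fetch_block  # loads one position from storage
        self.cache = {}                 # position -> KV block

    def access(self, p):
        """Serve position p, then fill its +-W neighborhood."""
        block = self.cache.get(p)
        if block is None:               # miss: synchronous fetch
            block = self.fetch_block(p)
            self.cache[p] = block
        # Prefetch [p - W, p + W], clamped at the sequence start.
        # A real implementation would issue these asynchronously.
        for q in range(max(0, p - self.window), p + self.window + 1):
            if q not in self.cache:
                self.cache[q] = self.fetch_block(q)
        return block

pf = WindowPrefetcher(window=5, fetch_block=lambda p: f"kv[{p}]")
pf.access(55)
print(sorted(pf.cache))  # positions 50..60 are now resident
```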
3. Empirical Window Size Selection
Attention mass captured vs. window size W:

W = 32 → ~50% attention captured (low prefetch bandwidth, more misses)
W = 512 → ~90% attention captured (high bandwidth overhead)
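One way to tune W is to sweep it against the model's measured attention-mass histogram. The sketch below uses a synthetic exponential decay over relative distance purely as a stand-in for that measured histogram; the decay rate and the printed percentages are illustrative, not the slide's figures.

```python
import numpy as np

def captured_mass(window, decay=0.01, max_dist=4096):
    """Fraction of attention mass within +-window positions of the
    query, assuming a synthetic exponential decay over |distance|
    (stand-in for a measured per-model attention histogram)."""
    dist = np.arange(max_dist)
    mass = np.exp(-decay * dist)
    mass /= mass.sum()
    return float(mass[:window + 1].sum())

# Sweep candidate window sizes to trade coverage vs. bandwidth.
for w in [32, 128, 512]:
    print(f"W = {w:4d}: {captured_mass(w):.0%} of attention captured")
```

In practice the histogram would come from profiling real attention maps, and W would be chosen at the knee of this curve, where extra coverage stops paying for the added prefetch bandwidth.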