RoPE-Aware Prefetching

Exploiting position encoding locality for predictive cache loading

1. RoPE Creates Distance-Dependent Attention
[Figure: attention score vs. relative position (distance from query), illustrating the rotary encoding applied at the query position]
RoPE encodes position by rotating Q/K vector pairs by position-dependent angles, so attention scores depend only on relative position. Nearby positions have similar rotations → higher dot products.
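A minimal NumPy sketch of this property, assuming the standard RoPE frequency schedule (`rope_rotate` is an illustrative helper, not a library API). The key fact it demonstrates is shift-invariance: the rotated dot product depends only on the offset between the two positions, which is what makes distance-based locality meaningful.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to vector x at sequence position pos.

    Each dimension pair (2i, 2i+1) is rotated by the angle
    pos * base**(-2i/d), the standard RoPE frequency schedule.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# Same relative offset (2) at different absolute positions ->
# identical attention score: RoPE is shift-invariant.
s_a = rope_rotate(q, 100) @ rope_rotate(k, 98)
s_b = rope_rotate(q, 1000) @ rope_rotate(k, 998)
print(abs(s_a - s_b))  # ~0 up to floating-point error
```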
Locality Bias
On average, ~70% of attention mass falls within ±W positions of the query, even for long contexts.
Predictable Access
If GPU requests position P, it will likely need P±W soon. Prefetch proactively.
2. Prefetch Window Strategy
Prefetch Rule
GPU accesses P → Prefetch [P − W, P + W]
W = window size, tuned per model based on attention distribution
[Diagram: prefetch window of width 2W centered on the current access P. Example with P = 55, W = 5: the prefetch zone covers positions 50–60; positions outside the window remain on storage.]
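The rule above, using the diagram's values (P = 55, W = 5), can be sketched as follows. `fetch_block` and the dict-backed cache are illustrative stand-ins for a real KV-cache and storage layer, and a real system would issue the neighborhood loads asynchronously.

```python
class WindowPrefetcher:
    """Sketch of the "access P -> prefetch [P - W, P + W]" rule."""

    def __init__(self, window, fetch_block):
        self.window = window            # W, tuned per model
        self.fetch_block = fetch_block  # loads one position from storage
        self.cache = {}                 # position -> KV block

    def access(self, p):
        """Serve position p, then fill its +-W neighborhood."""
        block = self.cache.get(p)
        if block is None:               # miss: synchronous fetch
            block = self.fetch_block(p)
            self.cache[p] = block
        # Prefetch [p - W, p + W], clamped at the sequence start.
        # A real implementation would issue these asynchronously.
        for q in range(max(0, p - self.window), p + self.window + 1):
            if q not in self.cache:
                self.cache[q] = self.fetch_block(q)
        return block

pf = WindowPrefetcher(window=5, fetch_block=lambda p: f"kv[{p}]")
pf.access(55)
print(sorted(pf.cache))  # positions 50..60 are now resident
```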
3. Empirical Window Size Selection
Attention mass captured vs. window size W:

W = 32 → ~50% attention captured (low prefetch bandwidth, more misses)
W = 512 → ~90% attention captured (high bandwidth overhead)
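One way to tune W is to sweep it against the model's measured attention-mass histogram. The sketch below uses a synthetic exponential decay over relative distance purely as a stand-in for that measured histogram; the decay rate and the printed percentages are illustrative, not the slide's figures.

```python
import numpy as np

def captured_mass(window, decay=0.01, max_dist=4096):
    """Fraction of attention mass within +-window positions of the
    query, assuming a synthetic exponential decay over |distance|
    (stand-in for a measured per-model attention histogram)."""
    dist = np.arange(max_dist)
    mass = np.exp(-decay * dist)
    mass /= mass.sum()
    return float(mass[:window + 1].sum())

# Sweep candidate window sizes to trade coverage vs. bandwidth.
for w in [32, 128, 512]:
    print(f"W = {w:4d}: {captured_mass(w):.0%} of attention captured")
```

In practice the histogram would come from profiling real attention maps, and W would be chosen at the knee of this curve, where extra coverage stops paying for the added prefetch bandwidth.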