Per-Head Tracking with GQA

Llama-70B: 64 Query Heads → 8 KV-Heads × 80 Layers = 640 LRU Queues

1. GQA Grouping Defines Tracking Unit
Q0-Q7 → KV-Head 0 → LRU Queue [layer, 0]
Q8-Q15 → KV-Head 1 → LRU Queue [layer, 1]
⋮
Q56-Q63 → KV-Head 7 → LRU Queue [layer, 7]
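The grouping above can be sketched in a few lines of Python. This is a minimal illustration, assuming contiguous groups of 8 query heads per KV-head as drawn in the diagram; the names `kv_head_for` and `queue_key` are hypothetical and do not come from any library.

```python
NUM_QUERY_HEADS = 64
NUM_KV_HEADS = 8
# 8 query heads share one KV-head (contiguous grouping, as in the diagram).
GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS


def kv_head_for(query_head: int) -> int:
    """Return the KV-head that serves the given query head."""
    if not 0 <= query_head < NUM_QUERY_HEADS:
        raise ValueError(f"query head {query_head} out of range")
    return query_head // GROUP_SIZE


def queue_key(layer: int, query_head: int) -> tuple:
    """Return the (layer, kv_head) pair that identifies the LRU queue."""
    return (layer, kv_head_for(query_head))
```

For example, `queue_key(42, 27)` resolves to the queue for layer 42, KV-head 3, since query heads Q24-Q31 share KV-head 3.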
2. 640 Independent LRU Queues
Layer 0: KV-Heads 0 1 2 3 4 5 6 7
Layer 1: KV-Heads 0 1 2 3 4 5 6 7
Layer 2 ... 79: KV-Heads 0 1 2 3 4 5 6 7
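One way to hold the 80 × 8 grid is a dictionary keyed by (layer, KV-head). The sketch below is an assumption about the layout, not a description of a particular implementation; `build_queues` and `flat_index` are hypothetical helper names.

```python
from collections import OrderedDict

NUM_LAYERS = 80
NUM_KV_HEADS = 8


def build_queues() -> dict:
    """Create one independent, initially empty queue per (layer, kv_head)."""
    return {
        (layer, head): OrderedDict()
        for layer in range(NUM_LAYERS)
        for head in range(NUM_KV_HEADS)
    }


def flat_index(layer: int, kv_head: int) -> int:
    """Map (layer, kv_head) to a row-major index in 0..639."""
    return layer * NUM_KV_HEADS + kv_head
```

Because each queue is a separate object, an eviction decision for one (layer, KV-head) pair never touches the ordering of any other pair.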
3. What Each Queue Tracks
Queue [Layer 42, KV-Head 3] ← One of 640 queues
  pos: 127891 | access: 847 | attn: 0.92
→ pos: 0 | access: 512 | attn: 0.88
→ pos: 1024 | access: 234 | attn: 0.45
→ pos: 89012 | access: 89 | attn: 0.31
→ pos: 45678 | access: 12 | attn: 0.08
→ EVICT CANDIDATES
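A single queue might be sketched as follows. The slide does not say how access count and attention score influence the ordering, so this sketch assumes plain recency ordering and stores the other two fields only as metadata; the class `HeadLRUQueue` and its methods are hypothetical.

```python
from collections import OrderedDict
from dataclasses import dataclass
from itertools import islice


@dataclass
class Entry:
    access_count: int = 0
    attention_score: float = 0.0


class HeadLRUQueue:
    """LRU queue for one (layer, kv_head) pair, keyed by position_id."""

    def __init__(self):
        # Least recently used entries sit at the front, most recent at the end.
        self._entries = OrderedDict()

    def touch(self, position_id: int, attention_score: float) -> None:
        """Record an access and mark the position as most recently used."""
        entry = self._entries.setdefault(position_id, Entry())
        entry.access_count += 1
        entry.attention_score = attention_score
        self._entries.move_to_end(position_id)

    def evict_candidates(self, n: int) -> list:
        """Return up to n position_ids, least recently used first."""
        return list(islice(self._entries, n))

    def evict(self) -> int:
        """Remove and return the least recently used position_id."""
        position_id, _ = self._entries.popitem(last=False)
        return position_id

    def get(self, position_id: int) -> Entry:
        return self._entries[position_id]
```

A policy that also weighs access count or attention score would replace `evict_candidates` with a scored selection; that choice is left open here.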
Per-Entry Metadata (8 bytes)
position_id — u32 (4B)
access_count — u16 (2B)
attention_score — fp16 (2B)
640 queues × 131,072 positions × 8 bytes = 671,088,640 bytes = 640 MiB metadata
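The 8-byte entry layout and the total can be checked with Python's `struct` module. This is a sketch under assumptions: little-endian packing with no padding, and an access count that saturates at the u16 maximum (the slide does not specify overflow behavior). The helper names are hypothetical.

```python
import struct

# "<" = little-endian, no padding; I = u32, H = u16, e = IEEE 754 half (fp16).
ENTRY_FORMAT = "<IHe"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 4 + 2 + 2 bytes


def pack_entry(position_id: int, access_count: int, attention_score: float) -> bytes:
    """Pack one metadata entry into 8 bytes."""
    # Assumption: clamp the counter instead of letting it overflow a u16.
    access_count = min(access_count, 0xFFFF)
    return struct.pack(ENTRY_FORMAT, position_id, access_count, attention_score)


def unpack_entry(raw: bytes) -> tuple:
    """Return (position_id, access_count, attention_score)."""
    return struct.unpack(ENTRY_FORMAT, raw)


def metadata_bytes(num_queues: int = 640, positions: int = 131_072) -> int:
    """Total metadata size if every queue tracks every position."""
    return num_queues * positions * ENTRY_SIZE
```

With the defaults, `metadata_bytes()` comes to 671,088,640 bytes, which is 640 × 2^20. Note that fp16 keeps roughly three decimal digits, so an attention score of 0.92 reads back as a nearby value rather than exactly 0.92.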