Technical Appendix

KV-Cache Offloading for LLM Inference — Visual Reference

1. Transformer Architecture

Llama-70B consists of 80 identical layers. Each layer performs attention followed by a feed-forward transformation.

Diagram 1.1 — Layer Structure
Input Embedding → [ Self-Attention → Feed-Forward ] × 80 layers → Output Logits

Architecture Parameters
  Layers        80
  Hidden dim    8,192
  Query heads   64
  KV heads      8
  Head dim      128
  FFN dim       28,672
  Parameters    70B

2. Attention Mechanism

Each token computes Query, Key, and Value vectors. Attention scores determine how much each previous token contributes to the output.

Diagram 2.1 — Q, K, V Projections
Input x (8,192 dims)
  → W_Q → Q (128d per head)
  → W_K → K (128d per head)
  → W_V → V (128d per head)

What each vector represents:
  Q — "What am I looking for?"
  K — "What do I contain?"
  V — "What do I contribute?"
Diagram 2.2 — Attention Score Computation
Token "France" attending to previous tokens:

  Position  Dot product                 After softmax normalization
  0         Q_france · K_the     = 0.12    6%  ("The")
  1         Q_france · K_capital = 0.87   52%  ("capital")
  2         Q_france · K_of      = 0.23   14%  ("of")
  3         Q_france · K_france  = 0.45   28%  ("France")

Output = weighted sum of V vectors
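The score-then-softmax-then-weighted-sum pipeline can be sketched in a few lines of Python. This is a toy single-query example: the V vectors are made up, and the raw scores are softmaxed directly, so the resulting percentages differ from the stylized values in Diagram 2.2.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    # scores: raw Q·K dot products, one per cached token
    # values: V vectors, one per cached token
    weights = softmax(scores)
    dim = len(values[0])
    # Output = weighted sum of V vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Dot products from Diagram 2.2: "France" attending to "The capital of France"
scores = [0.12, 0.87, 0.23, 0.45]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # toy 2-d V vectors
weights, output = attend(scores, values)
```

As in the diagram, "capital" (score 0.87) receives the largest softmax weight.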

3. KV-Cache Structure

The KV-cache stores Key and Value vectors for all processed tokens, eliminating redundant computation during generation.

Diagram 3.1 — KV-Cache Organization
Layer 1:  K — heads h0 … h7 | V — heads h0 … h7
  ⋮ repeated for all 80 layers ⋮
Size per token = L × H_kv × d_head × 2 (K and V) × bytes = 80 × 8 × 128 × 2 × 2 = 327,680 bytes ≈ 320 KB
Diagram 3.2 — KV-Cache Size Scaling
  Context       Cache size
  4K tokens       1.3 GB
  32K tokens       10 GB
  128K tokens      41 GB
  512K tokens     164 GB
  1M tokens       328 GB
B200 HBM capacity: 192 GB
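The per-token size and the scaling table follow from straightforward arithmetic. A minimal sketch, assuming FP16 KV entries and decimal gigabytes (1 GB = 1e9 bytes), which matches the larger figures in Diagram 3.2:

```python
# Llama-70B KV-cache geometry from Section 1.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
K_AND_V = 2        # one K and one V entry per head per token
BYTES_FP16 = 2

def bytes_per_token():
    return LAYERS * KV_HEADS * HEAD_DIM * K_AND_V * BYTES_FP16

def cache_gb(tokens):
    # Decimal gigabytes (1 GB = 1e9 bytes).
    return bytes_per_token() * tokens / 1e9

B200_HBM_GB = 192
```

At 1M tokens the cache alone (≈328 GB) exceeds the 192 GB of B200 HBM.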

4. Prefill vs Decode

The two phases of inference have fundamentally different computational characteristics.

Diagram 4.1 — Phase Comparison

Prefill Phase — processing the prompt ("The capital of France is"):
  ✓ All tokens processed in parallel
  ✓ High arithmetic intensity
  ✓ Compute-bound

Decode Phase — generating the response ("The capital ... → Paris"):
  ✗ One token generated at a time
  ✗ Must read the entire KV-cache per token
  ✗ Memory-bandwidth-bound
Diagram 4.2 — Decode Memory Access Pattern

To generate 1 token:
  Model Weights   140 GB
  KV-Cache         41 GB (at 128K context)
  Total reads     181 GB

Bandwidth requirement at 20 tokens/sec:
  181 GB × 20 = 3.62 TB/s — B200 provides 8 TB/s ✓
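The bandwidth requirement is a simple product. A sketch using the document's numbers, with decimal units assumed:

```python
WEIGHTS_GB = 140.0        # model weights read once per decode step
KV_CACHE_GB = 41.0        # KV-cache at 128K context
TOKENS_PER_SEC = 20.0     # target decode rate
B200_BW_TBPS = 8.0        # B200 HBM bandwidth

reads_per_token_gb = WEIGHTS_GB + KV_CACHE_GB          # 181 GB per token
required_tbps = reads_per_token_gb * TOKENS_PER_SEC / 1000.0
fits = required_tbps < B200_BW_TBPS
```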

5. The Memory Wall

GPU memory capacity, not compute or bandwidth, becomes the limiting factor with multiple users.

Diagram 5.1 — Single-User Memory Layout

B200 HBM — 192 GB:
  Model Weights — 140 GB
  KV-Cache — 41 GB
  Free — ~10 GB
✓ A single user at 128K context fits
Diagram 5.2 — Multi-User Memory Explosion

  Users   Needed (weights + KV per user)
  2       222 GB
  4       304 GB
  8       468 GB

B200 capacity: 192 GB
8 users need 468 GB — 2.4× over capacity
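The multi-user totals follow from one shared copy of the weights plus one KV-cache per user; a sketch of the arithmetic behind Diagram 5.2:

```python
WEIGHTS_GB = 140       # shared across all users
KV_PER_USER_GB = 41    # one 128K-context cache per user
HBM_GB = 192           # B200 capacity

def memory_needed_gb(users):
    # Weights are loaded once; each user brings a full KV-cache.
    return WEIGHTS_GB + users * KV_PER_USER_GB
```

memory_needed_gb(8) gives 468 GB, about 2.4× the 192 GB of HBM.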

6. Attention Locality

Empirical measurement reveals that attention concentrates heavily on recent tokens.

Diagram 6.1 — Attention Distribution (10K Context)

  Position range   Share of attention
  0 – 1K            5%
  1K – 9K          15%
  9K – 10K         80%

Key insight: ~80% of attention goes to the ~10% most recent tokens.
Diagram 6.2 — Attention Heatmap (Simplified)
[Heatmap of the current token attending over the context: attention is low at early positions (position 0), medium in the middle, and high at the most recent positions (position N).]

7. RoPE: Why Locality Emerges

Rotary Position Embedding creates locality as a geometric property of how positions are encoded.

Diagram 7.1 — RoPE Rotation Concept

Each dimension pair i rotates at its own frequency; position m rotates pair i by m · θᵢ.

Frequency formula: θᵢ = 10000^(−2i/d)
  θ₀  = 1.0     (fast)   → local patterns
  θ₃₂ = 0.01    (medium)
  θ₆₃ = 0.0001  (slow)   → global patterns
Diagram 7.2 — Distance-Dependent Decay

Average cosine factor by distance:
  Distance 1        0.99
  Distance 10       0.95
  Distance 100      0.71
  Distance 1,000    0.32
  Distance 10,000   0.11

score(m, n) ∝ Σᵢ cos((m − n) · θᵢ)
  Small distance → cos ≈ 1 → high attention
  Large distance → cos terms oscillate and cancel → lower attention
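The position-only part of this decay can be sketched by averaging cos((m − n) · θᵢ) over the 64 dimension pairs. This is illustrative only: the measured averages in Diagram 7.2 also reflect learned Q/K content, so exact values will differ.

```python
import math

D_HEAD = 128            # head dimension → 64 rotating pairs
PAIRS = D_HEAD // 2

def theta(i):
    # RoPE frequency for dimension pair i: 10000^(-2i/d)
    return 10000.0 ** (-2.0 * i / D_HEAD)

def decay_factor(distance):
    # Average cosine factor at a given token distance (m - n).
    return sum(math.cos(distance * theta(i)) for i in range(PAIRS)) / PAIRS
```

Nearby positions average close to 1; distant positions average much lower because the fast-rotating pairs land at effectively random phases.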

8. Attention Head Types

Different attention heads specialize for different functions, creating varied access patterns.

Diagram 8.1 — Head Specialization

  Recency heads     ~40% of heads — focus on the last 50–200 tokens
  Anchor heads      ~15% of heads — always check positions 0–100
  Retrieval heads   ~25% of heads — content-based, position-independent
  Syntactic heads   ~20% of heads — follow grammatical dependencies

Implication: a single caching policy cannot satisfy all heads; per-head tracking is required.

9. CXL Architecture

CXL provides memory expansion at far lower cost than HBM, but also at far lower bandwidth.

Diagram 9.1 — System Topology

NVIDIA B200:
  HBM capacity   192 GB
  Bandwidth      8 TB/s
  Latency        100 ns
    │
    │  CXL 3.0 ×16 — 64 GB/s per link
    ▼
CXL Switch / Fabric
  EP 0   256 GB
  EP 1   256 GB
  EP 2   256 GB
  EP 3   256 GB
Total CXL: 1 TB @ 256 GB/s aggregate
Diagram 9.2 — HBM vs CXL Comparison

  Metric        GPU HBM   CXL DRAM   Ratio
  Bandwidth     8 TB/s    256 GB/s   31× less
  Latency       100 ns    250 ns     2.5× more
  Capacity      192 GB    1 TB       5× more
  Cost per GB   ~$50      ~$5        10× less

CXL tradeoff: 10× cheaper per GB, but 31× lower bandwidth. Viable only if most accesses hit HBM.

10. Tiered Memory Hierarchy

The caching system places data in tiers based on access patterns.

Diagram 10.1 — Three-Tier Architecture

  Tier 0 — HBM pinned      anchor zone + critical tokens      5 GB   100 ns
  Tier 1 — HBM evictable   recent + high-attention tokens    37 GB   100 ns
  Tier 2 — CXL DRAM        cold tokens, low attention       280 GB   250 ns

Diagram 10.2 — Memory Layout (8 Users × 128K)

HBM — 192 GB:
  Model Weights — 140 GB
  Pinned KV + Hot KV + Activations — remainder
CXL — 1 TB:
  Cold KV — 280 GB
  Available — 720 GB

11. EMA Scoring Algorithm

Exponential Moving Average tracks which tokens actually receive attention over time.

Diagram 11.1 — EMA Update Rule

score_t = α · attention_t + (1 − α) · score_{t−1}

  α = 0.2       decay factor
  3.1 steps     half-life
  ~155 ms       half-life at 20 tok/s
Diagram 11.2 — EMA Evolution Example

System instruction token — position 50, "helpful"
Consistent attention from anchor heads:
  Step 0:   attn = 0.04  → score = 0.008
  Step 1:   attn = 0.03  → score = 0.012
  Step 2:   attn = 0.05  → score = 0.020
  ...
  Step 100:              → score = 0.040
→ Stays HOT

Generic middle token — position 45,000, "the"
Rarely attended:
  Step 0:   attn = 0.001 → score = 0.0002
  Step 1:   attn = 0.000 → score = 0.0002
  Step 2:   attn = 0.002 → score = 0.0005
  ...
  Step 100:              → score = 0.001
→ Evict to CXL
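The update rule and the system-instruction trace above can be replayed directly; a minimal sketch, rounding to three decimals as the diagram does:

```python
import math

ALPHA = 0.2   # decay factor from Diagram 11.1

def ema_update(prev_score, attention, alpha=ALPHA):
    # score_t = alpha * attention_t + (1 - alpha) * score_{t-1}
    return alpha * attention + (1 - alpha) * prev_score

# Half-life: steps for an old score to decay to 50% with no new attention
half_life_steps = math.log(0.5) / math.log(1.0 - ALPHA)

# Replay the system-instruction-token trace from Diagram 11.2
scores, s = [], 0.0
for attn in (0.04, 0.03, 0.05):
    s = ema_update(s, attn)
    scores.append(round(s, 3))
```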

12. Priority Scoring Formula

Final placement decisions combine multiple signals into a single priority score.

Diagram 12.1 — Scoring Components

P(p) = 0.25 · R(p) + 0.55 · E(p) + 0.20 · N(p)

  E(p)  EMA score     55%
  R(p)  recency       25%
  N(p)  anchor zone   20%

Diagram 12.2 — Tier Assignment Thresholds

  P < 0.3         → Tier 2 (CXL)
  0.3 ≤ P < 0.6   → Tier 1 (HBM)
  P ≥ 0.6         → Tier 0 (pinned)
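The weighted score and the tier cut-offs can be sketched as follows, assuming (as Diagram 12.1 implies) that R, E, and N are each normalized to [0, 1]:

```python
W_RECENCY, W_EMA, W_ANCHOR = 0.25, 0.55, 0.20   # weights from Diagram 12.1

def priority(recency, ema, anchor):
    # P(p) = 0.25 * R(p) + 0.55 * E(p) + 0.20 * N(p)
    return W_RECENCY * recency + W_EMA * ema + W_ANCHOR * anchor

def tier(p):
    # Thresholds from Diagram 12.2: higher priority → hotter tier.
    if p >= 0.6:
        return 0   # HBM pinned
    if p >= 0.3:
        return 1   # HBM evictable
    return 2       # CXL
```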

13. Per-Head Tracking

Scores are maintained separately for each KV-head to handle head specialization.

Diagram 13.1 — Per-Head Score Matrix

  Position     Head 0     Head 1    Head 2       Heads   Aggregate  Decision
               (recency)  (anchor)  (retrieval)  3–7
  0 (system)   0.001      0.089     0.012        ...     0.089      HBM
  45,000       0.000      0.002     0.003        ...     0.003      CXL
  99,950       0.082      0.004     0.031        ...     0.082      HBM

P_aggregate(p) = max over all heads h of P_h(p)

A position stays in HBM if ANY head needs it; it is evicted only when NO head has recent access.
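Max-aggregation across heads is a one-liner. Note the HBM/CXL cut-off of 0.01 below is a hypothetical value chosen for illustration, not one given in the text:

```python
EVICT_BELOW = 0.01   # hypothetical eviction threshold, not from the text

def aggregate(head_scores):
    # A position is kept as hot as its MOST interested head.
    return max(head_scores)

def decision(head_scores):
    return "HBM" if aggregate(head_scores) >= EVICT_BELOW else "CXL"
```

On the rows of Diagram 13.1, position 0 aggregates to 0.089 (kept in HBM) while position 45,000 aggregates to 0.003 (evicted to CXL).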

14. Prefetching Strategy

Predictive prefetch loads anticipated tokens from CXL before they're needed.

Diagram 14.1 — Prefetch Targets (current position m)

  1. Anchor zone [0, 100]
  2. Recent window [m − 200, m − 1]
  3. High-EMA positions

Diagram 14.2 — Prefetch Timing Budget

  Token generation   50 ms
  − Compute          20 ms
  = Prefetch window  30 ms

Prefetch capacity @ 256 GB/s: 30 ms × 256 GB/s = 7.68 GB ≈ 24,000 positions
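The budget arithmetic can be checked directly; a sketch with decimal units assumed:

```python
TOKEN_INTERVAL_MS = 50.0       # per-token generation time at 20 tok/s
COMPUTE_MS = 20.0              # portion spent computing
CXL_BW_GBPS = 256.0            # aggregate CXL bandwidth
KV_BYTES_PER_POSITION = 320e3  # ~320 KB of K/V per token position

window_s = (TOKEN_INTERVAL_MS - COMPUTE_MS) / 1000.0
budget_gb = CXL_BW_GBPS * window_s                   # GB movable per window
positions = budget_gb * 1e9 / KV_BYTES_PER_POSITION  # token positions
```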

15. Hit Rate Progression

Each algorithmic improvement increases the HBM hit rate.

Diagram 15.1 — Algorithm Contribution (HBM hit rate)

  LRU baseline          70%
  + Anchor pinning      78%
  + EMA scoring         85%
  + Per-head tracking   91%
  + Prefetching         95%
Diagram 15.2 — Effective Latency

L_eff = hit_rate × L_HBM + (1 − hit_rate) × L_CXL

At 95% hit rate: 0.95 × 100 ns + 0.05 × 250 ns = 107.5 ns
Overhead vs pure HBM: (107.5 − 100) / 100 = +7.5%
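The effective-latency blend is directly computable; a sketch using the latencies from Diagram 9.2:

```python
L_HBM_NS = 100.0   # HBM access latency
L_CXL_NS = 250.0   # CXL access latency

def effective_latency_ns(hit_rate):
    # L_eff = hit_rate * L_HBM + (1 - hit_rate) * L_CXL
    return hit_rate * L_HBM_NS + (1.0 - hit_rate) * L_CXL_NS

overhead_pct = (effective_latency_ns(0.95) - L_HBM_NS) / L_HBM_NS * 100.0
```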

16. Final Results

Diagram 16.1 — System Comparison

                   Without CXL   With CXL + Tiering
  Memory           192 GB        1.2 TB
  Users @ 128K     1             8+
  Cost (8 users)   $70K          $45K
  Hardware         2× B200       1× B200 + CXL

Diagram 16.2 — Key Metrics

  6×     memory expansion
  8×     user capacity
  36%    cost reduction
  7.5%   latency overhead
  95%    HBM hit rate
  33     tokens/sec/user
Slug Architecture Research — December 2025