Section 5

Bandwidth Aggregation

Linear scaling through CXL 3.0 switch topology

⚠

Single Endpoint Limitation

CXL bandwidth per endpoint is constrained by PCIe lane count: a single x16 Gen5 endpoint tops out at ~64 GB/s regardless of the device's internal bandwidth.

64 GB/s
x16 PCIe 5.0 limit
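Where the 64 GB/s figure comes from: PCIe 5.0 signals at 32 GT/s per lane with 128b/130b line encoding, so a x16 link carries about 63 GB/s of payload per direction, commonly rounded to the 64 GB/s raw rate. A quick check:

```python
# Derive the per-endpoint ceiling from PCIe 5.0 link parameters.
LANES = 16
GT_PER_S = 32            # PCIe 5.0 signaling rate per lane (GT/s)
ENCODING = 128 / 130     # 128b/130b line code overhead
payload_gbps = LANES * GT_PER_S * ENCODING / 8  # Gb/s -> GB/s
print(round(payload_gbps, 1))  # 63.0
```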
🔀
CXL 3.0 Switch Aggregation
GPU (B200) ──CXL 3.0──► CXL 3.0 Switch (multi-port fabric) ──► EP 1–EP 5 (64 GB/s each)
Aggregated Bandwidth
5 endpoints × 64 GB/s = 320 GB/s
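Aggregation only helps if a single layer's read is actually split across the endpoints. A minimal sketch of that striping, where `ep.read` is a hypothetical stand-in for a real CXL.mem access path:

```python
# Sketch: stripe one layer's weight read across N CXL endpoints so the
# transfers proceed in parallel at the aggregate rate. The endpoint
# object and its read() method are assumptions, not a real API.
from concurrent.futures import ThreadPoolExecutor

PER_ENDPOINT_GBPS = 64  # x16 PCIe 5.0 limit per endpoint

def aggregate_gbps(num_endpoints):
    return num_endpoints * PER_ENDPOINT_GBPS

def striped_read(endpoints, layer_bytes):
    """Read equal stripes from each endpoint concurrently, then reassemble."""
    stripe = layer_bytes // len(endpoints)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        chunks = pool.map(lambda ep: ep.read(stripe), endpoints)
    return b"".join(chunks)

print(aggregate_gbps(5))  # 320
```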
⏱
Layer Prefetch: Overlap Execution & Transfer
🔮
Execution Trace Predictor
Observes GPU execution → predicts next layer → triggers prefetch
Timeline (5 ms per layer):

  Window     🖥 GPU Compute   📡 Prefetch (k=1)   💾 HBM State
  0–5 ms     Layer N          N+1                 N, N+1
  5–10 ms    Layer N+1        N+2                 N+1, N+2
  10–15 ms   Layer N+2        N+3                 N+2, N+3
  15–20 ms   Layer N+3        N+4                 N+3, N+4
✓
Perfect Overlap: Prefetch N+1 completes before GPU finishes N
🔑
Key Insight: T_prefetch < T_compute → zero stall
⏳
Layer Execution Time
Compute budget available for overlap
📦
Layer Size
Transfer requirement per layer
📈
Aggregate Bandwidth
endpoints × per-endpoint BW
The Math: Can Prefetch Keep Up?
Llama-70B Layer Size
~1.75 GB
140 GB ÷ 80 layers
Layer Compute Time
~5-10 ms
B200 @ FP16
Required BW
175-350 GB/s
1.75 GB ÷ 5-10 ms
❌ 1 Endpoint (64 GB/s)
1.75 GB ÷ 64 GB/s = 27.3 ms → roughly 3–5× the compute window → GPU stalls
⚠ 3 Endpoints (192 GB/s)
1.75 GB ÷ 192 GB/s = 9.1 ms → borderline; overlaps fully only when compute sits near the 10 ms end of the range
✓ 5 Endpoints (320 GB/s)
1.75 GB ÷ 320 GB/s = 5.5 ms → prefetch fits within the compute window → no stall
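The sizing rule generalizes: the minimum endpoint count is the ceiling of required bandwidth over per-endpoint bandwidth. Note that at the fast (5 ms) end of the compute range the ceiling lands at 6, so the 5-endpoint figure above assumes compute nearer the middle of the 5-10 ms range:

```python
# Minimum CXL endpoints so one layer's prefetch fits inside one
# layer's compute window: ceil(layer_size / (t_compute * per_EP_BW)).
import math

def endpoints_needed(layer_gb, compute_ms, per_ep_gbps=64.0):
    required_gbps = layer_gb / (compute_ms / 1000.0)
    return math.ceil(required_gbps / per_ep_gbps)

print(endpoints_needed(1.75, 10))  # 3  (175 GB/s required)
print(endpoints_needed(1.75, 5))   # 6  (350 GB/s required)
```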
🦙
Llama-70B Example
140 GB FP16 weights total
Endpoint DRAM
All Model Weights
140 GB
All layers resident → no prefetch needed
Endpoint Flash
KV-Cache Overflow
1+ TB
Long context / multi-session storage
✓
With sufficient endpoint DRAM, all layers stay resident. No prefetch latency, no layer swapping. Flash handles KV-cache overflow for 128K+ context windows and multi-tenant session storage.
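A back-of-envelope check on why the flash tier matters at 128K context. The constants below assume Llama-2-70B's published GQA shape (80 layers, 8 KV heads, head_dim 128) with FP16 K/V entries; treat them as illustrative, not exact:

```python
# Rough KV-cache footprint per session; architecture constants are
# assumptions based on Llama-2-70B's GQA config, not measured values.
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
    return tokens * per_token / 1e9

print(round(kv_cache_gb(128 * 1024)))  # ~43 GB for one 128K-token session
```

At ~43 GB per long-context session, a handful of concurrent tenants already exceeds endpoint DRAM left over after the 140 GB of weights, which is what pushes overflow to the 1+ TB flash tier.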