Section 5

Bandwidth Aggregation

Linear scaling through CXL 3.0 switch topology

⚠

Single Endpoint Limitation

CXL bandwidth per endpoint is constrained by PCIe lane count: a single x16 Gen5 endpoint tops out at ~64 GB/s regardless of the device's internal bandwidth.

64 GB/s
x16 PCIe 5.0 limit
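Where the 64 GB/s figure comes from: PCIe 5.0 signals at 32 GT/s per lane with 128b/130b line encoding, so a x16 link carries about 63 GB/s of payload per direction, commonly rounded to the 64 GB/s raw rate. A quick check:

```python
# Derive the per-endpoint ceiling from PCIe 5.0 link parameters.
LANES = 16
GT_PER_S = 32            # PCIe 5.0 signaling rate per lane (GT/s)
ENCODING = 128 / 130     # 128b/130b line code overhead
payload_gbps = LANES * GT_PER_S * ENCODING / 8  # Gb/s -> GB/s
print(round(payload_gbps, 1))  # 63.0
```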
🔀
CXL 3.0 Switch Aggregation
GPU (B200) ──CXL 3.0──► CXL 3.0 Switch (multi-port fabric) ──► EP 1–EP 5 (64 GB/s each)
Aggregated Bandwidth
5 endpoints × 64 GB/s = 320 GB/s
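Aggregation only helps if a single layer's read is actually split across the endpoints. A minimal sketch of that striping, where `ep.read` is a hypothetical stand-in for a real CXL.mem access path:

```python
# Sketch: stripe one layer's weight read across N CXL endpoints so the
# transfers proceed in parallel at the aggregate rate. The endpoint
# object and its read() method are assumptions, not a real API.
from concurrent.futures import ThreadPoolExecutor

PER_ENDPOINT_GBPS = 64  # x16 PCIe 5.0 limit per endpoint

def aggregate_gbps(num_endpoints):
    return num_endpoints * PER_ENDPOINT_GBPS

def striped_read(endpoints, layer_bytes):
    """Read equal stripes from each endpoint concurrently, then reassemble."""
    stripe = layer_bytes // len(endpoints)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        chunks = pool.map(lambda ep: ep.read(stripe), endpoints)
    return b"".join(chunks)

print(aggregate_gbps(5))  # 320
```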
⏱
Layer Prefetch: Overlap Execution & Transfer
🔮
Execution Trace Predictor
Observes GPU execution → predicts next layer → triggers prefetch
Timeline (5 ms per layer):

  Window     🖥 GPU Compute   📡 Prefetch (k=1)   💾 HBM State
  0–5 ms     Layer N          N+1                 N, N+1
  5–10 ms    Layer N+1        N+2                 N+1, N+2
  10–15 ms   Layer N+2        N+3                 N+2, N+3
  15–20 ms   Layer N+3        N+4                 N+3, N+4
✓
Perfect Overlap: Prefetch N+1 completes before GPU finishes N
🔑
Key Insight: T_prefetch < T_compute → zero stall
⏳
Layer Execution Time
Compute budget available for overlap
📦
Layer Size
Transfer requirement per layer
📈
Aggregate Bandwidth
endpoints × per-endpoint BW
The Math: Can Prefetch Keep Up?
Llama-70B Layer Size
~1.75 GB
140 GB ÷ 80 layers
Layer Compute Time
~5-10 ms
B200 @ FP16
Required BW
175-350 GB/s
1.75 GB ÷ 5-10 ms
❌ 1 Endpoint (64 GB/s)
1.75 GB ÷ 64 GB/s = 27.3 ms → roughly 3–5× the compute window → GPU stalls
⚠ 3 Endpoints (192 GB/s)
1.75 GB ÷ 192 GB/s = 9.1 ms → borderline; overlaps fully only when compute sits near the 10 ms end of the range
✓ 5 Endpoints (320 GB/s)
1.75 GB ÷ 320 GB/s = 5.5 ms → prefetch fits within the compute window → no stall
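The sizing rule generalizes: the minimum endpoint count is the ceiling of required bandwidth over per-endpoint bandwidth. Note that at the fast (5 ms) end of the compute range the ceiling lands at 6, so the 5-endpoint figure above assumes compute nearer the middle of the 5-10 ms range:

```python
# Minimum CXL endpoints so one layer's prefetch fits inside one
# layer's compute window: ceil(layer_size / (t_compute * per_EP_BW)).
import math

def endpoints_needed(layer_gb, compute_ms, per_ep_gbps=64.0):
    required_gbps = layer_gb / (compute_ms / 1000.0)
    return math.ceil(required_gbps / per_ep_gbps)

print(endpoints_needed(1.75, 10))  # 3  (175 GB/s required)
print(endpoints_needed(1.75, 5))   # 6  (350 GB/s required)
```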
🦙
Llama-70B Example
140 GB FP16 weights total
Endpoint DRAM
All Model Weights
140 GB
All layers resident → no prefetch needed
Endpoint Flash
KV-Cache Overflow
1+ TB
Long context / multi-session storage
✓
With sufficient endpoint DRAM, all layers stay resident. No prefetch latency, no layer swapping. Flash handles KV-cache overflow for 128K+ context windows and multi-tenant session storage.
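A back-of-envelope check on why the flash tier matters at 128K context. The constants below assume Llama-2-70B's published GQA shape (80 layers, 8 KV heads, head_dim 128) with FP16 K/V entries; treat them as illustrative, not exact:

```python
# Rough KV-cache footprint per session; architecture constants are
# assumptions based on Llama-2-70B's GQA config, not measured values.
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
    return tokens * per_token / 1e9

print(round(kv_cache_gb(128 * 1024)))  # ~43 GB for one 128K-token session
```

At ~43 GB per long-context session, a handful of concurrent tenants already exceeds endpoint DRAM left over after the 140 GB of weights, which is what pushes overflow to the 1+ TB flash tier.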