Decode bandwidth requirements and prefetch window sizing.
For Llama-70B generating 20 tokens/second:
| Component | Size |
|---|---|
| Attention weights | 536 MB |
| FFN weights | 1.2 GB |
| KV-cache slice | 512 MB |
| Total per layer | ~2.25 GB |
At 20 tokens/second, each decode step has a 50 ms budget; spread across Llama-70B's 80 layers, that is 625 μs of compute per layer. A 2-layer prefetch depth therefore opens a 1.25 ms window ahead of each layer.
At 256 GB/s (the 4-endpoint configuration below), that window moves ~320 MB — about 0.14× one layer's weights. That is not enough to stream full layers, but it is enough to hide most CXL latency through overlap.
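The window arithmetic can be checked with a short calculation. The layer count (80 for Llama-70B) is standard; the 256 GB/s bandwidth figure is an assumption taken from the 4-endpoint configuration in the table below, since it is what makes the ~0.14× coverage figure work out:

```python
# Prefetch window sizing for Llama-70B decode.
TOKENS_PER_SEC = 20
NUM_LAYERS = 80            # Llama-70B decoder layers
LAYER_BYTES = 2.25e9       # ~2.25 GB per layer (attention + FFN + KV slice)
PREFETCH_DEPTH = 2         # layers prefetched ahead
CXL_BW = 256e9             # assumed: 4-endpoint configuration, bytes/s

per_token_s = 1 / TOKENS_PER_SEC          # 50 ms per decode step
per_layer_s = per_token_s / NUM_LAYERS    # 625 us compute budget per layer
window_s = PREFETCH_DEPTH * per_layer_s   # 1.25 ms prefetch window
bytes_in_window = CXL_BW * window_s       # data movable inside the window
coverage = bytes_in_window / LAYER_BYTES  # fraction of one layer covered

print(f"per-layer budget: {per_layer_s * 1e6:.0f} us")
print(f"window: {window_s * 1e3:.2f} ms moves {bytes_in_window / 1e6:.0f} MB")
print(f"coverage: {coverage:.2f}x of one layer")
```

Running this reproduces the 625 μs per-layer budget and the ~0.14× coverage figure.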
| Endpoints | Aggregate BW | Capacity | Est. cost (USD) |
|---|---|---|---|
| 2 | 128 GB/s | 512 GB | $2,500 |
| 4 | 256 GB/s | 1 TB | $5,000 |
| 8 | 512 GB/s | 2 TB | $10,000 |
Of these, the 4-endpoint configuration offers the best balance of cost and performance.