Multi-endpoint striping, layer prefetch pipelining, and achieving 256 GB/s aggregate bandwidth.
A single endpoint provides 64 GB/s. By striping the KV-cache across 4 endpoints and prefetching layer by layer, we achieve 4 × 64 = 256 GB/s aggregate bandwidth.
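
To make the idea concrete, here is a minimal sketch of striping plus layer prefetching, not the actual implementation. It assumes round-robin chunk placement, an in-memory dict per endpoint standing in for the real transport, and a tiny chunk size chosen for readability. All names (stripe, gather, build_store, fetch_layer, run_layers) are hypothetical. Only the endpoint count and the per-endpoint bandwidth come from the text.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_ENDPOINTS = 4    # from the text
ENDPOINT_GBPS = 64   # per-endpoint bandwidth from the text, in GB/s
CHUNK_SIZE = 4       # assumption: tiny chunk size, for illustration only


def stripe(blob, num_endpoints=NUM_ENDPOINTS, chunk_size=CHUNK_SIZE):
    """Split one layer's KV block into chunks assigned round-robin to endpoints."""
    shards = [[] for _ in range(num_endpoints)]
    for offset in range(0, len(blob), chunk_size):
        chunk_index = offset // chunk_size
        shards[chunk_index % num_endpoints].append(blob[offset:offset + chunk_size])
    return shards


def gather(shards):
    """Inverse of stripe: interleave the shards back into the original order."""
    out = []
    longest = max(len(shard) for shard in shards)
    for row in range(longest):
        for shard in shards:
            if row < len(shard):
                out.append(shard[row])
    return b"".join(out)


def build_store(layers):
    """Stand-in for remote endpoints: store[endpoint][layer] is a list of chunks."""
    store = [dict() for _ in range(NUM_ENDPOINTS)]
    for layer, blob in enumerate(layers):
        for endpoint, shard in enumerate(stripe(blob)):
            store[endpoint][layer] = shard
    return store


def fetch_layer(store, layer, io_pool):
    """Read every endpoint's shard of one layer in parallel, then reassemble."""
    shards = list(io_pool.map(lambda ep: store[ep][layer], range(NUM_ENDPOINTS)))
    return gather(shards)


def run_layers(store, num_layers, compute):
    """Overlap the fetch of layer i+1 with the compute of layer i."""
    if num_layers == 0:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=NUM_ENDPOINTS) as io_pool, \
         ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(fetch_layer, store, 0, io_pool)
        for layer in range(num_layers):
            kv = pending.result()  # blocks only if the prefetch is not done yet
            if layer + 1 < num_layers:
                pending = prefetcher.submit(fetch_layer, store, layer + 1, io_pool)
            results.append(compute(layer, kv))
    return results


# Ideal aggregate bandwidth when all endpoints are kept busy.
AGGREGATE_GBPS = NUM_ENDPOINTS * ENDPOINT_GBPS
```

In this sketch the separate single-worker prefetcher keeps one layer in flight while the current layer is being processed, and the per-endpoint reads within a layer run concurrently, which is what lets the per-endpoint bandwidths add up in the ideal case.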