CXL 3.0 computational storage endpoints place controller-resident intelligence inside the device itself for autonomous KV-cache management. Instead of passive memory managed by the CPU, we deploy intelligent computational storage endpoints with their own ARM processors that autonomously manage cache placement, eviction, and prefetching.
| Component | Specification | Purpose |
|---|---|---|
| Controller | ARM Cortex-A78 (4-8 cores) | Cache management intelligence |
| DRAM | 256 GB DDR5-5600 | Hot KV-cache tier |
| Flash | 4 TB NVMe Gen5 | Cold tier + overflow |
| CXL Controller | Type-3 Device | GPU memory access |
| Uplink | ×16 PCIe Gen5 (64 GB/s) | Host connection |
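The controller's cache-management logic can be illustrated with a minimal sketch. This is an assumed model, not the actual firmware: a single endpoint tracking its DRAM tier with LRU order, spilling evicted entries toward the NVMe cold tier (the class and method names are hypothetical).

```python
from collections import OrderedDict


class KVCacheEndpoint:
    """Hypothetical sketch of controller-resident cache logic on one endpoint."""

    def __init__(self, dram_capacity_bytes: int):
        self.capacity = dram_capacity_bytes
        self.used = 0
        self.entries: OrderedDict[str, int] = OrderedDict()  # key -> size, LRU order

    def put(self, key: str, size: int) -> list:
        """Insert an entry, evicting LRU entries until it fits.

        Returns the keys evicted to flash (the controller would write
        them to the NVMe cold tier asynchronously).
        """
        evicted = []
        while self.used + size > self.capacity and self.entries:
            old_key, old_size = self.entries.popitem(last=False)  # LRU victim
            self.used -= old_size
            evicted.append(old_key)
        self.entries[key] = size
        self.used += size
        return evicted

    def get(self, key: str) -> bool:
        """Hit check; refreshes recency on hit. A miss would trigger a flash fetch."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        return False
```

Because this loop runs on the endpoint's ARM cores, eviction and prefetch decisions never consume host CPU cycles.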
CXL (Compute Express Link) provides cache-coherent memory access with load/store semantics: no explicit DMA is required.
| Protocol | Direction | Purpose |
|---|---|---|
| CXL.io | Bidirectional | PCIe-equivalent I/O, config, interrupts |
| CXL.cache | Device → Host | Device caches host memory |
| CXL.mem | Host → Device | Host accesses device memory (our primary) |
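The load/store semantics are the key point: once the device's memory is mapped into the host address space, access is a plain memory read or write with no DMA descriptors or doorbells. A sketch of that access model (on Linux, CXL Type-3 memory typically appears as a DAX device or a CPU-less NUMA node; the `/dev/dax0.0` path below is an assumption, and an anonymous mapping stands in for it here):

```python
import mmap
import struct

# In production this would map the CXL device's memory window, e.g.:
#   fd = os.open("/dev/dax0.0", os.O_RDWR)        # path is an assumption
#   region = mmap.mmap(fd, length)
# An anonymous mapping stands in to show the load/store model itself.
region = mmap.mmap(-1, 4096)

# Store: an ordinary memory write -- no DMA descriptor, no doorbell ring.
struct.pack_into("<Q", region, 0, 0xDEADBEEF)

# Load: an ordinary memory read.
value = struct.unpack_from("<Q", region, 0)[0]
```

This is what lets the GPU (or host) treat endpoint DRAM as just another memory tier rather than a block device.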
| Tier | Media | Capacity | Latency | Contents |
|---|---|---|---|---|
| Tier 0 | GPU HBM (Pinned) | ~5 GB | 100 ns | Model weights, active layer |
| Tier 1 | GPU HBM (Evictable) | ~37 GB | 100 ns | Hot KV-cache entries |
| Tier 2 | CXL DRAM | 1 TB | 250 ns | Warm KV-cache entries |
| Tier 3 | NVMe Flash | 16 TB | 25 μs | Cold entries, overflow |
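The latency column drives the whole design: as long as most accesses land in the upper tiers, the 250× latency gap to flash is amortized away. A small worked example using the table's latencies (the hit-rate distribution is illustrative, not measured):

```python
# Per-tier access latencies from the tier table above, in nanoseconds.
TIER_LATENCY_NS = {0: 100, 1: 100, 2: 250, 3: 25_000}


def expected_latency_ns(hit_rates: dict) -> float:
    """Expected access latency for a hit-rate distribution over tiers.

    hit_rates maps tier -> fraction of accesses served there (must sum to 1).
    """
    assert abs(sum(hit_rates.values()) - 1.0) < 1e-9
    return sum(frac * TIER_LATENCY_NS[t] for t, frac in hit_rates.items())


# Illustrative mix: 90% HBM hits, 9% CXL DRAM, 1% flash:
#   0.90*100 + 0.09*250 + 0.01*25000 = 90 + 22.5 + 250 = 362.5 ns
mix = {1: 0.90, 2: 0.09, 3: 0.01}
```

Even a 1% flash hit rate contributes the majority of the expected latency, which is why the controller's prefetching exists: to keep that fraction as close to zero as possible.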
Four endpoints strike the best cost/performance balance: 1 TB of aggregate DRAM, 256 GB/s of aggregate uplink bandwidth, at roughly $5K. Adding more endpoints yields diminishing returns because CXL switch overhead grows with fan-out.
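The sizing arithmetic behind those aggregate figures follows directly from the per-endpoint spec table; a sketch that assumes linear scaling (i.e., it deliberately ignores the switch overhead that causes the diminishing returns):

```python
def aggregate_capacity(n_endpoints: int, dram_gb: int = 256, uplink_gbps: int = 64):
    """Ideal aggregate DRAM (GB) and uplink bandwidth (GB/s) for n endpoints.

    Per-endpoint defaults come from the spec table above (256 GB DDR5,
    x16 PCIe Gen5 at 64 GB/s). Real scaling falls below this once CXL
    switch overhead kicks in.
    """
    return n_endpoints * dram_gb, n_endpoints * uplink_gbps


# Four endpoints: 1024 GB (1 TB) of DRAM, 256 GB/s aggregate bandwidth.
dram, bw = aggregate_capacity(4)
```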