Β© 2025 Subramaniyam (Sam) Pooni
All Rights Reserved
Proprietary & Confidential
Chapter 3

Distributed Endpoint Architecture

CXL 3.0 computational storage endpoints with controller-resident intelligence for autonomous KV-cache management.

Chapter at a glance (11 figures): 250 ns CXL latency · 65× vs PCIe DMA · 1 TB DRAM capacity

3.1 Controller Offloading Concept

Instead of passive memory managed by the CPU, we deploy intelligent computational storage endpoints whose on-board ARM processors autonomously manage cache placement, eviction, and prefetching.

❌ Traditional: CPU-Managed

  • CPU in critical path
  • 5-10 ΞΌs latency
  • No predictive prefetch
  • System RAM contention

βœ“ Our Approach: Endpoint-Managed

  • Direct GPU ↔ Endpoint
  • 250 ns latency (up to 40× faster)
  • Autonomous prefetch
  • Dedicated memory pool
Figure 3.1 — Complete System Architecture
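The endpoint-managed flow above moves cache-management decisions onto the endpoint's own ARM controller, keeping the host CPU off the critical path. A minimal sketch in C of one such decision, victim selection for eviction, using an LRU policy as an illustrative stand-in (the entry layout and function names are hypothetical, not from any real firmware):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical KV-cache entry as tracked by the endpoint's controller. */
typedef struct {
    uint64_t key;          /* e.g. hash of (sequence id, layer) */
    uint64_t last_access;  /* controller-local timestamp */
    int      resident;     /* 1 if currently held in endpoint DRAM */
} kv_entry;

/* Pick the least-recently-used resident entry as the eviction victim.
 * This runs on the endpoint itself, so no host round-trip is needed. */
static int pick_victim(const kv_entry *e, size_t n) {
    int victim = -1;
    uint64_t oldest = UINT64_MAX;
    for (size_t i = 0; i < n; i++) {
        if (e[i].resident && e[i].last_access < oldest) {
            oldest = e[i].last_access;
            victim = (int)i;
        }
    }
    return victim;  /* -1 if nothing is resident */
}
```

The same loop structure extends naturally to prefetch scoring: instead of the oldest resident entry, the controller ranks non-resident entries by predicted next access.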

3.2 Endpoint Internal Architecture

| Component      | Specification              | Purpose                       |
|----------------|----------------------------|-------------------------------|
| Controller     | ARM Cortex-A78 (4-8 cores) | Cache management intelligence |
| DRAM           | 256 GB DDR5-5600           | Hot KV-cache tier             |
| Flash          | 4 TB NVMe Gen5             | Cold tier + overflow          |
| CXL Controller | Type-3 Device              | GPU memory access             |
| Uplink         | ×16 PCIe Gen5 (64 GB/s)    | Host connection               |
Figure 3.2 — Endpoint Internal Components
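The table above fixes the per-endpoint resource budget. As a sketch, controller firmware might carry it as a constant descriptor; the struct and field names below are illustrative, not taken from any real firmware API:

```c
#include <stdint.h>

/* Per-endpoint resource descriptor mirroring the component table.
 * Field names are hypothetical. */
typedef struct {
    int      arm_cores;    /* ARM Cortex-A78, 4-8 cores */
    uint64_t dram_bytes;   /* DDR5-5600 hot KV-cache tier */
    uint64_t flash_bytes;  /* NVMe Gen5 cold tier + overflow */
    uint32_t uplink_gbs;   /* x16 PCIe Gen5 uplink, GB/s */
} endpoint_spec;

static const endpoint_spec EP = {
    .arm_cores   = 8,
    .dram_bytes  = 256ULL << 30,  /* 256 GB */
    .flash_bytes = 4ULL << 40,    /* 4 TB */
    .uplink_gbs  = 64,
};
```

One consequence worth noting: flash capacity is 16× DRAM capacity, so the cold tier can absorb substantial overflow before anything must be dropped.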

3.3 CXL 3.0 Architecture

CXL (Compute Express Link) provides cache-coherent memory access with load/store semanticsβ€”no explicit DMA required.

| Protocol  | Direction     | Purpose                                   |
|-----------|---------------|-------------------------------------------|
| CXL.io    | Bidirectional | PCIe-equivalent I/O, config, interrupts   |
| CXL.cache | Device → Host | Device caches host memory                 |
| CXL.mem   | Host → Device | Host accesses device memory (our primary) |
Figure 3.3 — CXL Architecture Diagram
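Because CXL.mem maps endpoint DRAM into the host physical address space, a "transfer" is just an ordinary CPU store or load, with no DMA descriptors or doorbells. The sketch below illustrates this; `hdm` stands in for a pointer into the mapped host-managed device memory window, which on real hardware would come from mmap'ing the device (the tests drive it with ordinary memory, since the real mapping needs hardware):

```c
#include <stdint.h>
#include <stddef.h>

/* `hdm` is a stand-in for a pointer into the CXL.mem (HDM) window.
 * Once mapped, endpoint DRAM behaves like any other memory. */
static void kv_store(volatile uint64_t *hdm, size_t slot, uint64_t word) {
    hdm[slot] = word;    /* plain store lands in endpoint DRAM */
}

static uint64_t kv_load(const volatile uint64_t *hdm, size_t slot) {
    return hdm[slot];    /* plain load, ~250 ns against the endpoint */
}
```

Contrast this with the PCIe DMA path, where each transfer requires building a descriptor, ringing a doorbell, and taking a completion interrupt.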

3.4 Three-Tier Memory Hierarchy

| Tier   | Media               | Capacity | Latency | Contents                    |
|--------|---------------------|----------|---------|-----------------------------|
| Tier 0 | GPU HBM (pinned)    | ~5 GB    | 100 ns  | Model weights, active layer |
| Tier 1 | GPU HBM (evictable) | ~37 GB   | 100 ns  | Hot KV-cache entries        |
| Tier 2 | CXL DRAM            | 1 TB     | 250 ns  | Warm KV-cache entries       |
| Tier 3 | NVMe Flash          | 16 TB    | 25 μs   | Cold entries, overflow      |
Figure 3.4 — Memory Hierarchy Visualization
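A hierarchy like the one above implies a placement policy: hotter entries go to faster tiers. A minimal recency-based sketch in C; the age thresholds are illustrative stand-ins, not measured policy constants from the system:

```c
#include <stdint.h>

/* Tiers from the hierarchy table, fastest to slowest. */
typedef enum { TIER_HBM = 1, TIER_CXL_DRAM = 2, TIER_FLASH = 3 } tier_t;

/* Hypothetical placement by access recency (thresholds are illustrative). */
static tier_t place_by_age(uint64_t now, uint64_t last_access) {
    uint64_t age = now - last_access;
    if (age < 1000)    return TIER_HBM;       /* hot: GPU HBM, 100 ns  */
    if (age < 1000000) return TIER_CXL_DRAM;  /* warm: CXL DRAM, 250 ns */
    return TIER_FLASH;                        /* cold: NVMe flash, 25 us */
}
```

In practice the endpoint controller would combine recency with predicted reuse (e.g. sequence position) rather than age alone, but the tier boundaries it chooses among are exactly these three media.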

3.5 System Topology (4 Endpoints)

πŸ’‘ Why 4 Endpoints?

4 endpoints provide an optimal cost/performance balance: 1 TB of aggregate DRAM, 256 GB/s of aggregate bandwidth, at ~$5K per endpoint. Adding more endpoints yields diminishing returns due to switch overhead.

Totals: 1.2 TB capacity · 256 GB/s aggregate bandwidth · 250 ns DRAM latency · ~$5K per endpoint
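The 4-endpoint aggregates follow directly from the per-endpoint figures in Section 3.2; a quick arithmetic check (helper names are illustrative):

```c
#include <stdint.h>

/* Per-endpoint figures from the Section 3.2 table. */
enum { DRAM_GB_PER_EP = 256, BW_GBS_PER_EP = 64, FLASH_TB_PER_EP = 4 };

static uint64_t agg_dram_gb(int n)  { return (uint64_t)n * DRAM_GB_PER_EP; }   /* GB   */
static uint64_t agg_bw_gbs(int n)   { return (uint64_t)n * BW_GBS_PER_EP; }    /* GB/s */
static uint64_t agg_flash_tb(int n) { return (uint64_t)n * FLASH_TB_PER_EP; }  /* TB   */
```

With n = 4 these reproduce the chapter's headline numbers: ~1 TB of CXL DRAM, 256 GB/s of aggregate bandwidth, and 16 TB of flash.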