Β© 2025 Subramaniyam (Sam) Pooni
All Rights Reserved
Proprietary & Confidential
Chapter 3

Distributed Endpoint Architecture

CXL 3.0 computational storage endpoints with controller-resident intelligence for autonomous KV-cache management.

Chapter at a glance (11 figures): 250 ns CXL latency · 65× vs PCIe DMA · 1 TB DRAM capacity

3.1 Controller Offloading Concept

Instead of passive memory managed by the CPU, we deploy intelligent computational storage endpoints whose on-board ARM processors autonomously manage cache placement, eviction, and prefetching.

❌ Traditional: CPU-Managed

  • CPU in critical path
  • 5-10 ΞΌs latency
  • No predictive prefetch
  • System RAM contention

βœ“ Our Approach: Endpoint-Managed

  • Direct GPU ↔ Endpoint
  • 250 ns latency (up to 40× faster)
  • Autonomous prefetch
  • Dedicated memory pool
Figure 3.1 — Complete System Architecture
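The endpoint-managed flow above moves cache-management decisions onto the endpoint's own ARM controller, keeping the host CPU off the critical path. A minimal sketch in C of one such decision, victim selection for eviction, using an LRU policy as an illustrative stand-in (the entry layout and function names are hypothetical, not from any real firmware):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical KV-cache entry as tracked by the endpoint's controller. */
typedef struct {
    uint64_t key;          /* e.g. hash of (sequence id, layer) */
    uint64_t last_access;  /* controller-local timestamp */
    int      resident;     /* 1 if currently held in endpoint DRAM */
} kv_entry;

/* Pick the least-recently-used resident entry as the eviction victim.
 * This runs on the endpoint itself, so no host round-trip is needed. */
static int pick_victim(const kv_entry *e, size_t n) {
    int victim = -1;
    uint64_t oldest = UINT64_MAX;
    for (size_t i = 0; i < n; i++) {
        if (e[i].resident && e[i].last_access < oldest) {
            oldest = e[i].last_access;
            victim = (int)i;
        }
    }
    return victim;  /* -1 if nothing is resident */
}
```

The same loop structure extends naturally to prefetch scoring: instead of the oldest resident entry, the controller ranks non-resident entries by predicted next access.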

3.2 Endpoint Internal Architecture

| Component      | Specification              | Purpose                       |
|----------------|----------------------------|-------------------------------|
| Controller     | ARM Cortex-A78 (4-8 cores) | Cache management intelligence |
| DRAM           | 256 GB DDR5-5600           | Hot KV-cache tier             |
| Flash          | 4 TB NVMe Gen5             | Cold tier + overflow          |
| CXL Controller | Type-3 Device              | GPU memory access             |
| Uplink         | ×16 PCIe Gen5 (64 GB/s)    | Host connection               |
Figure 3.2 — Endpoint Internal Components
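The table above fixes the per-endpoint resource budget. As a sketch, controller firmware might carry it as a constant descriptor; the struct and field names below are illustrative, not taken from any real firmware API:

```c
#include <stdint.h>

/* Per-endpoint resource descriptor mirroring the component table.
 * Field names are hypothetical. */
typedef struct {
    int      arm_cores;    /* ARM Cortex-A78, 4-8 cores */
    uint64_t dram_bytes;   /* DDR5-5600 hot KV-cache tier */
    uint64_t flash_bytes;  /* NVMe Gen5 cold tier + overflow */
    uint32_t uplink_gbs;   /* x16 PCIe Gen5 uplink, GB/s */
} endpoint_spec;

static const endpoint_spec EP = {
    .arm_cores   = 8,
    .dram_bytes  = 256ULL << 30,  /* 256 GB */
    .flash_bytes = 4ULL << 40,    /* 4 TB */
    .uplink_gbs  = 64,
};
```

One consequence worth noting: flash capacity is 16× DRAM capacity, so the cold tier can absorb substantial overflow before anything must be dropped.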

3.3 CXL 3.0 Architecture

CXL (Compute Express Link) provides cache-coherent memory access with load/store semanticsβ€”no explicit DMA required.

| Protocol  | Direction     | Purpose                                   |
|-----------|---------------|-------------------------------------------|
| CXL.io    | Bidirectional | PCIe-equivalent I/O, config, interrupts   |
| CXL.cache | Device → Host | Device caches host memory                 |
| CXL.mem   | Host → Device | Host accesses device memory (our primary) |
Figure 3.3 — CXL Architecture Diagram
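Because CXL.mem maps endpoint DRAM into the host physical address space, a "transfer" is just an ordinary CPU store or load, with no DMA descriptors or doorbells. The sketch below illustrates this; `hdm` stands in for a pointer into the mapped host-managed device memory window, which on real hardware would come from mmap'ing the device (the tests drive it with ordinary memory, since the real mapping needs hardware):

```c
#include <stdint.h>
#include <stddef.h>

/* `hdm` is a stand-in for a pointer into the CXL.mem (HDM) window.
 * Once mapped, endpoint DRAM behaves like any other memory. */
static void kv_store(volatile uint64_t *hdm, size_t slot, uint64_t word) {
    hdm[slot] = word;    /* plain store lands in endpoint DRAM */
}

static uint64_t kv_load(const volatile uint64_t *hdm, size_t slot) {
    return hdm[slot];    /* plain load, ~250 ns against the endpoint */
}
```

Contrast this with the PCIe DMA path, where each transfer requires building a descriptor, ringing a doorbell, and taking a completion interrupt.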

3.4 Three-Tier Memory Hierarchy

| Tier   | Media               | Capacity | Latency | Contents                    |
|--------|---------------------|----------|---------|-----------------------------|
| Tier 0 | GPU HBM (pinned)    | ~5 GB    | 100 ns  | Model weights, active layer |
| Tier 1 | GPU HBM (evictable) | ~37 GB   | 100 ns  | Hot KV-cache entries        |
| Tier 2 | CXL DRAM            | 1 TB     | 250 ns  | Warm KV-cache entries       |
| Tier 3 | NVMe Flash          | 16 TB    | 25 μs   | Cold entries, overflow      |
Figure 3.4 — Memory Hierarchy Visualization
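A hierarchy like the one above implies a placement policy: hotter entries go to faster tiers. A minimal recency-based sketch in C; the age thresholds are illustrative stand-ins, not measured policy constants from the system:

```c
#include <stdint.h>

/* Tiers from the hierarchy table, fastest to slowest. */
typedef enum { TIER_HBM = 1, TIER_CXL_DRAM = 2, TIER_FLASH = 3 } tier_t;

/* Hypothetical placement by access recency (thresholds are illustrative). */
static tier_t place_by_age(uint64_t now, uint64_t last_access) {
    uint64_t age = now - last_access;
    if (age < 1000)    return TIER_HBM;       /* hot: GPU HBM, 100 ns  */
    if (age < 1000000) return TIER_CXL_DRAM;  /* warm: CXL DRAM, 250 ns */
    return TIER_FLASH;                        /* cold: NVMe flash, 25 us */
}
```

In practice the endpoint controller would combine recency with predicted reuse (e.g. sequence position) rather than age alone, but the tier boundaries it chooses among are exactly these three media.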

3.5 System Topology (4 Endpoints)

πŸ’‘ Why 4 Endpoints?

4 endpoints provide an optimal cost/performance balance: 1 TB of aggregate DRAM, 256 GB/s of aggregate bandwidth, at ~$5K per endpoint. Adding more endpoints yields diminishing returns due to switch overhead.

Totals: 1.2 TB capacity · 256 GB/s aggregate bandwidth · 250 ns DRAM latency · ~$5K per endpoint
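The 4-endpoint aggregates follow directly from the per-endpoint figures in Section 3.2; a quick arithmetic check (helper names are illustrative):

```c
#include <stdint.h>

/* Per-endpoint figures from the Section 3.2 table. */
enum { DRAM_GB_PER_EP = 256, BW_GBS_PER_EP = 64, FLASH_TB_PER_EP = 4 };

static uint64_t agg_dram_gb(int n)  { return (uint64_t)n * DRAM_GB_PER_EP; }   /* GB   */
static uint64_t agg_bw_gbs(int n)   { return (uint64_t)n * BW_GBS_PER_EP; }    /* GB/s */
static uint64_t agg_flash_tb(int n) { return (uint64_t)n * FLASH_TB_PER_EP; }  /* TB   */
```

With n = 4 these reproduce the chapter's headline numbers: ~1 TB of CXL DRAM, 256 GB/s of aggregate bandwidth, and 16 TB of flash.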