Section 8

GPU Integration

Memory mapping, hint interfaces, and fault handling for CXL.mem

🗺 8.1 Memory Mapping

CXL.mem regions appear in the GPU's unified virtual address space via PCIe BAR mapping with CXL.mem bridging. The GPU accesses remote memory using standard load/store semantics.

GPU Virtual Address Space
  HBM (Local)        0x0000 — 0x4FFF   192 GB
  CXL.mem Region 0   0x5000 — 0x8FFF    64 GB
  CXL.mem Region 1   0x9000 — 0xCFFF    64 GB

        |  CXL.mem bridge (load/store semantics)
        v

Endpoint Physical Memory
  KV-Cache Pool       EP0: 0x0000    48 GB
  Model Weights       EP0: 0x3000    64 GB
  KV-Cache Overflow   EP1: 0x0000   128 GB
💻 GPU Kernel Access Pattern
// GPU kernel accessing the CXL.mem-backed KV-cache
__global__ void attention_kernel(const float* kv_cache) {
  // Per-thread offset in floats, aligned to a 16-byte float4 boundary
  size_t offset = 4 * ((size_t)blockIdx.x * blockDim.x + threadIdx.x);

  // Direct 128-bit load from a CXL.mem address:
  //  - the GPU MMU translates the VA to a CXL.mem physical address
  //  - the CXL bridge handles coherent access
  //  - no explicit DMA, no driver intervention
  float4 kv = *(const float4*)(kv_cache + offset);
  (void)kv; // attention math elided
}
📋 8.2 Hint Interface

An extended allocation API communicates memory characteristics to the endpoint, enabling intelligent caching and prefetch decisions.

struct cxl_alloc_hints
  enum data_type        KV_CACHE, WEIGHTS, ACTIVATIONS, SCRATCH
  enum access_pattern   SEQUENTIAL, RANDOM, STRIDED, BROADCAST
  enum consistency      RELAXED, ACQUIRE_RELEASE, SEQ_CST
  u32 prefetch_window   Lookahead distance in cache lines
  u32 priority          Eviction priority (0 = evict first)
  u64 expected_size     Allocation size hint for placement

struct prefetch_params
  u32 window_size       Number of entries to prefetch ahead
  u32 stride            Access stride for strided patterns
  bool rope_aware       Enable RoPE position locality
  u32 rope_window       Position window [P−W, P+W]
  float ema_alpha       EMA decay for attention scoring
  u32 head_count        GQA head count for per-head tracking
DATA_TYPE_KV_CACHE     Per-head eviction, EMA scoring, RoPE prefetch enabled
DATA_TYPE_WEIGHTS      Read-only, layer prefetch, broadcast-optimized
DATA_TYPE_ACTIVATIONS  High-bandwidth, sequential, short-lived
DATA_TYPE_SCRATCH      Temporary workspace, lowest eviction priority
⚙ 8.3 Driver → Firmware Translation

The GPU driver translates allocation hints into endpoint firmware configuration via CXL.io mailbox commands.

1. 📍 Application Allocation Request
   Runtime calls the extended allocation API with hints.
   cxl_malloc(size, &hints)
2. 🔄 Driver Hint Processing
   Driver validates the hints and selects a target endpoint based on capacity and locality.
   select_endpoint(hints) → EP0
3. 📦 Mailbox Command Construction
   Hints are packed into a CXL.io mailbox payload with vendor-specific extensions.
   build_mailbox_cmd(VENDOR_ALLOC, hints)
4. 📡 CXL.io Mailbox Transfer
   The command is sent over PCIe to the endpoint controller.
   cxl_mailbox_send(ep, cmd)
5. 🧠 Firmware Configuration
   Endpoint firmware configures the caching policy, prefetcher, and eviction strategy.
   configure_region(addr, policy)
🔬 CXL.io Mailbox Protocol
GPU Driver (Host CPU)  →  PCIe / CXL.io Transport  →  Endpoint FW (ARM Cores)

Mailbox Command Payload
  opcode: 0xC0 (VENDOR_ALLOC)
  size: 4096
  data_type: KV_CACHE
  pattern: RANDOM
  prefetch_window: 64
  ema_alpha: 0.2
  head_count: 8
  rope_aware: true
⚡ 8.4 Fault Handling & Latency Breakdown

When the GPU accesses an uncached CXL.mem address, a fault triggers the full access path. Total latency is the sum of the stages below.

GPU MMU Handling (~50 ns)
  TLB miss, page-table walk, virtual → physical translation. The GPU MMU identifies the CXL.mem region from PTE flags.
CXL Protocol Processing (~30 ns)
  CXL.mem request formation, coherence state check (if applicable), request queuing at the CXL bridge.
PCIe Transmission (~70 ns)
  Request traverses the PCIe 5.0 link (or the CXL 3.0 fabric if routed through a switch). Round-trip wire delay plus PHY processing.
Endpoint Memory Access (~50 ns)
  Endpoint controller processes the request, accesses local DRAM (or triggers a flash fetch if uncached), and returns data.

Total fault latency: MMU + CXL + PCIe + Endpoint ≈ 200 ns
Access Type              Latency      Bandwidth        CPU Involved
HBM (Local)              ~30 ns       8 TB/s           No
CXL.mem (Direct)         ~200 ns      64 GB/s per EP   No
PCIe DMA (Traditional)   ~13 μs       64 GB/s          Yes (driver)
NVMe Read                ~10–20 μs    14 GB/s          Yes (filesystem)