Section 8

GPU Integration

Memory mapping, hint interfaces, and fault handling for CXL.mem

🗺 8.1 Memory Mapping

CXL.mem regions appear in the GPU's unified virtual address space via PCIe BAR mapping with CXL.mem bridging. The GPU accesses remote memory using standard load/store semantics.

GPU Virtual Address Space
  HBM (Local)        0x0000 — 0x4FFF   192 GB
  CXL.mem Region 0   0x5000 — 0x8FFF    64 GB
  CXL.mem Region 1   0x9000 — 0xCFFF    64 GB

        |  CXL.mem bridge (load/store semantics)
        v

Endpoint Physical Memory
  KV-Cache Pool       EP0: 0x0000    48 GB
  Model Weights       EP0: 0x3000    64 GB
  KV-Cache Overflow   EP1: 0x0000   128 GB
💻 GPU Kernel Access Pattern
// GPU kernel accessing the CXL.mem-backed KV-cache
__global__ void attention_kernel(const float* kv_cache) {
  // Per-thread offset in floats, aligned to a 16-byte float4 boundary
  size_t offset = 4 * ((size_t)blockIdx.x * blockDim.x + threadIdx.x);

  // Direct 128-bit load from a CXL.mem address:
  //  - the GPU MMU translates the VA to a CXL.mem physical address
  //  - the CXL bridge handles coherent access
  //  - no explicit DMA, no driver intervention
  float4 kv = *(const float4*)(kv_cache + offset);
  (void)kv; // attention math elided
}
📋 8.2 Hint Interface

An extended allocation API communicates memory characteristics to the endpoint, enabling intelligent caching and prefetch decisions.

struct cxl_alloc_hints
  enum data_type        KV_CACHE, WEIGHTS, ACTIVATIONS, SCRATCH
  enum access_pattern   SEQUENTIAL, RANDOM, STRIDED, BROADCAST
  enum consistency      RELAXED, ACQUIRE_RELEASE, SEQ_CST
  u32 prefetch_window   Lookahead distance in cache lines
  u32 priority          Eviction priority (0 = evict first)
  u64 expected_size     Allocation size hint for placement

struct prefetch_params
  u32 window_size       Number of entries to prefetch ahead
  u32 stride            Access stride for strided patterns
  bool rope_aware       Enable RoPE position locality
  u32 rope_window       Position window [P−W, P+W]
  float ema_alpha       EMA decay for attention scoring
  u32 head_count        GQA head count for per-head tracking
DATA_TYPE_KV_CACHE     Per-head eviction, EMA scoring, RoPE prefetch enabled
DATA_TYPE_WEIGHTS      Read-only, layer prefetch, broadcast-optimized
DATA_TYPE_ACTIVATIONS  High-bandwidth, sequential, short-lived
DATA_TYPE_SCRATCH      Temporary workspace, lowest eviction priority
⚙ 8.3 Driver → Firmware Translation

The GPU driver translates allocation hints into endpoint firmware configuration via CXL.io mailbox commands.

1. 📍 Application Allocation Request
   Runtime calls the extended allocation API with hints.
   cxl_malloc(size, &hints)
2. 🔄 Driver Hint Processing
   Driver validates the hints and selects a target endpoint based on capacity and locality.
   select_endpoint(hints) → EP0
3. 📦 Mailbox Command Construction
   Hints are packed into a CXL.io mailbox payload with vendor-specific extensions.
   build_mailbox_cmd(VENDOR_ALLOC, hints)
4. 📡 CXL.io Mailbox Transfer
   The command is sent over PCIe to the endpoint controller.
   cxl_mailbox_send(ep, cmd)
5. 🧠 Firmware Configuration
   Endpoint firmware configures the caching policy, prefetcher, and eviction strategy.
   configure_region(addr, policy)
🔬 CXL.io Mailbox Protocol
GPU Driver (Host CPU)  →  PCIe / CXL.io Transport  →  Endpoint FW (ARM Cores)

Mailbox Command Payload
  opcode: 0xC0 (VENDOR_ALLOC)
  size: 4096
  data_type: KV_CACHE
  pattern: RANDOM
  prefetch_window: 64
  ema_alpha: 0.2
  head_count: 8
  rope_aware: true
⚡ 8.4 Fault Handling & Latency Breakdown

When the GPU accesses an uncached CXL.mem address, a fault triggers the full access path. Total latency is the sum of the stages below.

GPU MMU Handling (~50 ns)
  TLB miss, page-table walk, virtual → physical translation. The GPU MMU identifies the CXL.mem region from PTE flags.
CXL Protocol Processing (~30 ns)
  CXL.mem request formation, coherence state check (if applicable), request queuing at the CXL bridge.
PCIe Transmission (~70 ns)
  Request traverses the PCIe 5.0 link (or the CXL 3.0 fabric if routed through a switch). Round-trip wire delay plus PHY processing.
Endpoint Memory Access (~50 ns)
  Endpoint controller processes the request, accesses local DRAM (or triggers a flash fetch if uncached), and returns data.

Total fault latency: MMU + CXL + PCIe + Endpoint ≈ 200 ns
Access Type              Latency      Bandwidth        CPU Involved
HBM (Local)              ~30 ns       8 TB/s           No
CXL.mem (Direct)         ~200 ns      64 GB/s per EP   No
PCIe DMA (Traditional)   ~13 μs       64 GB/s          Yes (driver)
NVMe Read                ~10–20 μs    14 GB/s          Yes (filesystem)