Move data preparation to the endpoint, bypass the CPU entirely
🐌Traditional Path
CPU Bound
Storage (NVMe)
→
CPU DRAM (DDR5)
→
CPU Preprocessing (x86 cores)
→
GPU DRAM (via PCIe)
⚡Endpoint Path
Direct to GPU
Endpoint Flash (NVMe)
→
Endpoint DRAM (DDR5)
→
ARM Preprocessing (Embedded cores)
→
CXL.mem (Direct)
→
GPU (HBM)
🚫Zero CPU involvement — data flows directly to GPU-accessible memory
⚙
Preprocessing Tasks for Embedded ARM
📍
Tokenization
SentencePiece, BPE
🖼
Image Decode
JPEG/PNG + normalize
🔄
Format Conversion
FP32 → FP16/BF16
📦
Batching & Padding
Sequence alignment
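The batching/padding and format-conversion stages above can be sketched in a few lines. This is a minimal illustration, not endpoint firmware: the whitespace tokenizer and tiny vocabulary stand in for a real SentencePiece/BPE model, and numpy stands in for whatever tensor library the embedded cores would actually use.

```python
import numpy as np

# Toy vocabulary standing in for a real SentencePiece/BPE model (assumption:
# real endpoint firmware would link an actual tokenizer library).
VOCAB = {"<pad>": 0, "the": 1, "gpu": 2, "loves": 3, "tensors": 4}

def tokenize(text):
    # Whitespace split as a stand-in for BPE segmentation.
    return [VOCAB.get(tok, 0) for tok in text.lower().split()]

def batch_and_pad(texts, pad_id=0):
    # Batching & padding: align variable-length sequences into one tensor.
    seqs = [tokenize(t) for t in texts]
    max_len = max(len(s) for s in seqs)
    batch = np.full((len(seqs), max_len), pad_id, dtype=np.int32)
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = s
    return batch

def to_fp16(x):
    # Format conversion: FP32 -> FP16 halves the bytes moved toward the GPU.
    return x.astype(np.float16)

batch = batch_and_pad(["the gpu loves tensors", "the gpu"])
embeddings = to_fp16(np.random.rand(*batch.shape).astype(np.float32))
print(batch.shape, embeddings.dtype)  # (2, 4) float16
```

The output of this stage is exactly what lands in the CXL.mem region: fixed-shape, GPU-ready tensors in reduced precision.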
ARM
Embedded ARM Cores
Throughput depends on core count and clock speed
📢
Core Count
4–16 Cortex-A cores
⏱
Clock Speed
1.5–2.5 GHz
📊
Tokenization Rate
~500K tokens/sec
🖼
Image Throughput
~1000 img/sec (224×224)
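A quick back-of-envelope check ties these figures together. The numbers below come straight from the table above; the 8-core midpoint and the 2M tokens/sec GPU ingest rate are assumptions for illustration, not measurements.

```python
# Figures from the spec table above.
cores = 8                 # midpoint of the 4-16 Cortex-A range (assumption)
tokens_per_sec = 500_000  # per-endpoint tokenization rate
imgs_per_sec = 1_000      # per-endpoint image throughput at 224x224

# Per-core rates, assuming ideal linear scaling across cores.
print(tokens_per_sec / cores)  # 62500.0 tokens/sec/core
print(imgs_per_sec / cores)    # 125.0 img/sec/core

# Endpoints needed to feed one GPU at a hypothetical 2M tokens/sec
# training ingest rate (ceiling division).
gpu_demand = 2_000_000
print(-(-gpu_demand // tokens_per_sec))  # 4 endpoints
```

Because each endpoint preprocesses only the data on its own flash, this capacity scales with the number of drives rather than with host CPU cores.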
💡Key Insight: Direct Data Path
ARM
→
CXL.mem
→
GPU
Data flows directly into the GPU-accessible CXL.mem region without CPU involvement.
No CPU copies, no PCIe DMA setup, no interrupt overhead.
The endpoint handles everything from raw storage to GPU-ready tensors.
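The shared-window handoff can be sketched with a memory-mapped file. This is a hedged host-side analogy only: a real system would map a CXL.mem region (e.g. via a devdax node, which is an assumption here), with the endpoint's ARM cores as producer and the GPU as consumer; a temp file stands in so the sketch runs anywhere.

```python
import mmap
import os
import tempfile

def open_shared_window(path, length):
    # Both parties map the same backing pages (MAP_SHARED is the POSIX
    # default for mmap.mmap), so a write by one is visible to the other
    # with no intermediate copy.
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, length)
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile() as f:
    f.truncate(4096)  # one page standing in for the CXL.mem window
    producer = open_shared_window(f.name, 4096)  # "ARM preprocessing" view
    consumer = open_shared_window(f.name, 4096)  # "GPU" view
    producer[:4] = b"\x01\x02\x03\x04"           # endpoint writes GPU-ready bytes
    print(consumer[:4])                          # consumer sees them without a host-CPU copy
```

The point of the analogy: once both sides share one coherent memory window, the "transfer" is just a pointer handoff, which is why the CPU-copy and DMA-setup steps disappear from the endpoint path.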