Section 6

Preprocessing Offload

Move data preparation to the endpoint and bypass the host CPU entirely

🐌 Traditional Path (CPU Bound)
Storage (NVMe) → CPU DRAM (DDR5) → CPU Preprocessing (x86 cores) → GPU DRAM (via PCIe)
⚡ Endpoint Path (Direct to GPU)
Endpoint Flash (NVMe) → Endpoint DRAM (DDR5) → ARM Preprocessing (embedded cores) → CXL.mem Direct → GPU HBM
🚫 Zero CPU involvement: data flows directly to GPU-accessible memory
⚙
Preprocessing Tasks for Embedded ARM
📍 Tokenization: SentencePiece, BPE
🖼 Image Decode: JPEG/PNG + normalize
🔄 Format Conversion: FP32 → FP16/BF16
📦 Batching & Padding: sequence alignment
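Two of the tasks above, format conversion and batching with padding, can be sketched in a few lines. This is an illustrative NumPy sketch of what the embedded cores would run, not vendor firmware; the function names `to_fp16` and `pad_and_batch` are ours.

```python
import numpy as np

def to_fp16(batch: np.ndarray) -> np.ndarray:
    """Format conversion: down-convert FP32 tensors before they reach GPU HBM."""
    return batch.astype(np.float16)

def pad_and_batch(sequences: list[list[int]], pad_id: int = 0) -> np.ndarray:
    """Batching & padding: right-pad variable-length token sequences
    into one rectangular batch the GPU can consume directly."""
    max_len = max(len(s) for s in sequences)
    out = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    for i, seq in enumerate(sequences):
        out[i, : len(seq)] = seq
    return out

# Two token sequences of different lengths become one padded 2x3 batch.
batch = pad_and_batch([[101, 2023, 102], [101, 102]])
print(batch.shape)  # (2, 3)
half = to_fp16(np.ones((2, 3), dtype=np.float32))
```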
Embedded ARM Cores
Throughput depends on core count and clock speed
📢 Core Count: 4–16 Cortex-A cores
⏱ Clock Speed: 1.5–2.5 GHz
📊 Tokenization Rate: ~500K tokens/sec
🖼 Image Throughput: ~1000 img/sec (224×224)
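A quick back-of-envelope check ties these figures together. Taking the mid-range values (8 cores at 2.0 GHz, our assumption) against the ~500K tokens/sec aggregate gives a per-core budget; real rates depend on the tokenizer, vocabulary size, and memory bandwidth.

```python
# Illustrative arithmetic only; mid-range figures assumed from the table above.
cores = 8              # mid-range of the 4-16 Cortex-A span
clock_hz = 2.0e9       # mid-range of 1.5-2.5 GHz
token_rate = 500_000   # aggregate tokens/sec from the table

per_core = token_rate / cores          # tokens/sec each core must sustain
cycles_per_token = clock_hz / per_core # CPU-cycle budget per token
print(f"{per_core:,.0f} tokens/sec/core, ~{cycles_per_token:,.0f} cycles/token")
# → 62,500 tokens/sec/core, ~32,000 cycles/token
```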
💡 Key Insight: Direct Data Path
ARM → CXL.mem → GPU
Data flows directly to the GPU-accessible CXL.mem region without CPU involvement. No CPU copies, no PCIe DMA setup, no interrupt overhead. The endpoint handles everything from raw storage to GPU-ready tensors.
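From software's point of view, the direct path amounts to the ARM cores writing finished tensors into a memory-mapped CXL.mem window that the GPU can also address. The sketch below is conceptual: a CXL.mem region is typically exposed to Linux as a DAX character device (e.g. `/dev/dax0.0`), but here an ordinary temp file stands in so the example runs anywhere, and the names are ours.

```python
import mmap
import tempfile
import numpy as np

REGION_BYTES = 1 << 20  # stand-in for a 1 MiB slice of the CXL.mem window

with tempfile.NamedTemporaryFile() as f:
    f.truncate(REGION_BYTES)
    with mmap.mmap(f.fileno(), REGION_BYTES) as region:
        # Tensor "preprocessed" on the ARM cores, already in GPU-ready FP16.
        tensor = np.arange(1024, dtype=np.float16)
        # Store it straight into the mapped region; in the real path the GPU
        # reads the same physical memory, so there is no extra CPU copy and
        # no separate PCIe DMA descriptor setup.
        region[: tensor.nbytes] = tensor.tobytes()
        # Read back to show the consumer sees the identical bytes.
        back = np.frombuffer(region[: tensor.nbytes], dtype=np.float16)
        assert np.array_equal(back, tensor)
print("tensor staged in mapped region")
```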