Section 6

Preprocessing Offload

Move data preparation to the endpoint and bypass the host CPU entirely

🐌 Traditional Path (CPU Bound)
Storage (NVMe) → CPU DRAM (DDR5) → CPU Preprocessing (x86 cores) → GPU DRAM (via PCIe)
⚡ Endpoint Path (Direct to GPU)
Endpoint Flash (NVMe) → Endpoint DRAM (DDR5) → ARM Preprocessing (embedded cores) → CXL.mem Direct → GPU HBM
🚫 Zero CPU involvement: data flows directly to GPU-accessible memory
⚙
Preprocessing Tasks for Embedded ARM
📍 Tokenization: SentencePiece, BPE
🖼 Image Decode: JPEG/PNG + normalize
🔄 Format Conversion: FP32 → FP16/BF16
📦 Batching & Padding: sequence alignment
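Two of the tasks above, format conversion and batching with padding, can be sketched in a few lines. This is an illustrative NumPy sketch of what the embedded cores would run, not vendor firmware; the function names `to_fp16` and `pad_and_batch` are ours.

```python
import numpy as np

def to_fp16(batch: np.ndarray) -> np.ndarray:
    """Format conversion: down-convert FP32 tensors before they reach GPU HBM."""
    return batch.astype(np.float16)

def pad_and_batch(sequences: list[list[int]], pad_id: int = 0) -> np.ndarray:
    """Batching & padding: right-pad variable-length token sequences
    into one rectangular batch the GPU can consume directly."""
    max_len = max(len(s) for s in sequences)
    out = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    for i, seq in enumerate(sequences):
        out[i, : len(seq)] = seq
    return out

# Two token sequences of different lengths become one padded 2x3 batch.
batch = pad_and_batch([[101, 2023, 102], [101, 102]])
print(batch.shape)  # (2, 3)
half = to_fp16(np.ones((2, 3), dtype=np.float32))
```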
Embedded ARM Cores
Throughput depends on core count and clock speed
📢 Core Count: 4–16 Cortex-A cores
⏱ Clock Speed: 1.5–2.5 GHz
📊 Tokenization Rate: ~500K tokens/sec
🖼 Image Throughput: ~1000 img/sec (224×224)
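A quick back-of-envelope check ties these figures together. Taking the mid-range values (8 cores at 2.0 GHz, our assumption) against the ~500K tokens/sec aggregate gives a per-core budget; real rates depend on the tokenizer, vocabulary size, and memory bandwidth.

```python
# Illustrative arithmetic only; mid-range figures assumed from the table above.
cores = 8              # mid-range of the 4-16 Cortex-A span
clock_hz = 2.0e9       # mid-range of 1.5-2.5 GHz
token_rate = 500_000   # aggregate tokens/sec from the table

per_core = token_rate / cores          # tokens/sec each core must sustain
cycles_per_token = clock_hz / per_core # CPU-cycle budget per token
print(f"{per_core:,.0f} tokens/sec/core, ~{cycles_per_token:,.0f} cycles/token")
# → 62,500 tokens/sec/core, ~32,000 cycles/token
```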
💡 Key Insight: Direct Data Path
ARM → CXL.mem → GPU
Data flows directly to the GPU-accessible CXL.mem region without CPU involvement. No CPU copies, no PCIe DMA setup, no interrupt overhead. The endpoint handles everything from raw storage to GPU-ready tensors.
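From software's point of view, the direct path amounts to the ARM cores writing finished tensors into a memory-mapped CXL.mem window that the GPU can also address. The sketch below is conceptual: a CXL.mem region is typically exposed to Linux as a DAX character device (e.g. `/dev/dax0.0`), but here an ordinary temp file stands in so the example runs anywhere, and the names are ours.

```python
import mmap
import tempfile
import numpy as np

REGION_BYTES = 1 << 20  # stand-in for a 1 MiB slice of the CXL.mem window

with tempfile.NamedTemporaryFile() as f:
    f.truncate(REGION_BYTES)
    with mmap.mmap(f.fileno(), REGION_BYTES) as region:
        # Tensor "preprocessed" on the ARM cores, already in GPU-ready FP16.
        tensor = np.arange(1024, dtype=np.float16)
        # Store it straight into the mapped region; in the real path the GPU
        # reads the same physical memory, so there is no extra CPU copy and
        # no separate PCIe DMA descriptor setup.
        region[: tensor.nbytes] = tensor.tobytes()
        # Read back to show the consumer sees the identical bytes.
        back = np.frombuffer(region[: tensor.nbytes], dtype=np.float16)
        assert np.array_equal(back, tensor)
print("tensor staged in mapped region")
```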