TENSOR CORE ARCHITECTURE

◆ MEMORY HIERARCHY ANALYSIS ◆
94% Tensor Core Usage
256KB TMEM per SM
512 AMD Registers (256 VGPR + 256 AGPR)
8TB/s HBM Bandwidth
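As a rough sanity check on the headline figures above, the arithmetic can be sketched in a few lines. The TMEM organisation used here (128 lanes by 512 columns of 32-bit cells) is an assumption that happens to reproduce the 256KB figure. The 4.5 PFLOPS peak is taken from the Blackwell panel, which does not state a precision, so the ridge point is only indicative.

```python
# Back-of-the-envelope checks on the headline figures (a sketch, not a spec).
KIB = 1024

# Assumption: TMEM is organised as 128 lanes x 512 columns of 32-bit cells.
TMEM_LANES = 128
TMEM_COLUMNS = 512
CELL_BYTES = 4
tmem_bytes = TMEM_LANES * TMEM_COLUMNS * CELL_BYTES

# From the AMD panel: 256 VGPRs plus 256 AGPRs.
amd_registers = 256 + 256

# Ridge point: FLOPs needed per byte of HBM traffic to stay compute-bound.
PEAK_FLOPS = 4.5e15       # 4.5 PFLOPS (precision not stated in the graphic)
HBM_BYTES_PER_S = 8e12    # 8 TB/s
ridge_flops_per_byte = PEAK_FLOPS / HBM_BYTES_PER_S

print(tmem_bytes // KIB, "KiB of TMEM per SM")
print(amd_registers, "AMD registers")
print(ridge_flops_per_byte, "FLOP per HBM byte at the ridge point")
```

A kernel whose arithmetic intensity falls below the ridge point is limited by HBM bandwidth rather than by the tensor cores, which is why the on-chip levels of the hierarchy matter.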
◈ NVIDIA BLACKWELL
🌐 GLOBAL MEMORY (HBM3e)
▼ TMA ASYNC
📦 SHARED MEMORY (228KB)
▼ tcgen05.cp
🔥 TMEM (256KB DEDICATED)
▼ tcgen05.mma
🧮 TENSOR CORE (4.5 PFLOPS)
✓ Registers FREE for computation
✓ Async pipeline - no stalls
✓ Universal 16×16 layout
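The "async pipeline, no stalls" claim can be illustrated with a toy timing model. This is a minimal sketch that assumes a two-stage pipeline with enough buffering for the next tile's load to overlap the current tile's compute. The function names and the per-tile costs are hypothetical and are not measured Blackwell numbers.

```python
def serial_time(tiles: int, load: float, compute: float) -> float:
    """Each tile is loaded, then computed, with no overlap."""
    return tiles * (load + compute)


def pipelined_time(tiles: int, load: float, compute: float) -> float:
    """Loads for tile i+1 overlap compute for tile i (double buffering).

    Only the first load and the last compute are exposed; every step in
    between costs the slower of the two stages.
    """
    if tiles == 0:
        return 0.0
    return load + (tiles - 1) * max(load, compute) + compute


# Hypothetical per-tile costs in arbitrary time units.
TILES, LOAD, COMPUTE = 8, 3.0, 4.0
t_serial = serial_time(TILES, LOAD, COMPUTE)
t_pipelined = pipelined_time(TILES, LOAD, COMPUTE)
print(f"serial: {t_serial}, pipelined: {t_pipelined}")
```

In this model the pipelined schedule approaches one tile per `max(load, compute)` as the tile count grows, so the compute units stay busy as long as loads are no slower than compute.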
◈ AMD CDNA3/4
🌐 GLOBAL MEMORY (HBM3 / HBM3E)
▼ BUFFER LOAD
📦 LDS (SHARED MEMORY)
▼ VGPR LOAD
📝 VGPR (256) + AGPR (256)
▼ MFMA (SYNC)
🧮 MATRIX CORE (MFMA)
⚠ Registers hold EVERYTHING
⚠ Synchronous - blocks execution
⚠ Shape-specific layouts
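To see why "registers hold everything" becomes a constraint, the register cost of keeping an accumulator tile resident can be estimated. This is a simplified sketch that assumes a 64-lane wavefront, 32-bit registers, and an accumulator spread evenly across the lanes. The tile shapes and the helper name are hypothetical, and real MFMA layouts are shape-specific.

```python
WAVEFRONT_LANES = 64      # assumption: CDNA wavefront width
REGISTER_BYTES = 4        # each register holds 32 bits per lane
AGPR_BUDGET = 256         # from the panel above: 256 AGPRs


def accumulator_regs_per_lane(tile_m: int, tile_n: int, elem_bytes: int = 4) -> int:
    """Registers per lane needed to hold a tile_m x tile_n accumulator."""
    total_bytes = tile_m * tile_n * elem_bytes
    bytes_per_register_row = WAVEFRONT_LANES * REGISTER_BYTES
    # Ceiling division so partially filled registers still count.
    return -(-total_bytes // bytes_per_register_row)


# Hypothetical FP32 accumulator tiles owned by a single wavefront.
for m, n in [(32, 32), (64, 64), (128, 128)]:
    regs = accumulator_regs_per_lane(m, n)
    print(f"{m}x{n} FP32 tile: {regs} registers per lane "
          f"({regs / AGPR_BUDGET:.0%} of the AGPR budget)")
```

Under these assumptions a 128x128 FP32 accumulator alone consumes the entire AGPR budget, leaving only the VGPRs for operands and addressing. The Blackwell panel avoids this pressure by keeping accumulators in TMEM.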
© 2026 SUBRAMANIYAM POONI