◈ NVIDIA BLACKWELL
🌐 GLOBAL MEMORY (HBM3e)
▼ TMA ASYNC
📦 SHARED MEMORY (228KB PER SM)
▼ tcgen05.cp
🔥 TMEM (256KB PER SM, DEDICATED)
▼ tcgen05.mma
🧮 TENSOR CORE (~4.5 PFLOPS DENSE FP8 PER GPU)
✓ Registers FREE for computation
✓ Async pipeline - no stalls
✓ Universal 16×16 layout
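
Why an async pipeline matters can be sketched with a toy timing model. This is an illustration only, not a measurement of either GPU: the function names, the tile count, and the per-tile costs are hypothetical, and the model assumes fixed per-tile load and compute times with the next tile's load fully overlapping the current tile's compute.

```python
# Toy model: time to process n_tiles when loads and math are serialized
# versus when the next load overlaps the current compute (double buffering).
# All numbers are hypothetical; nothing here is measured on real hardware.


def serialized_time(n_tiles: int, load: float, compute: float) -> float:
    """Each tile waits for its own load, then computes. Nothing overlaps."""
    return n_tiles * (load + compute)


def pipelined_time(n_tiles: int, load: float, compute: float) -> float:
    """Load of tile i+1 runs while tile i computes."""
    if n_tiles == 0:
        return 0.0
    # First load is exposed, last compute is exposed, and every step in
    # between costs whichever of the two stages is slower.
    return load + (n_tiles - 1) * max(load, compute) + compute


if __name__ == "__main__":
    n, load, compute = 8, 3.0, 5.0  # hypothetical tile count and costs
    print("serialized:", serialized_time(n, load, compute))
    print("pipelined: ", pipelined_time(n, load, compute))
```

With these made-up inputs the serialized schedule takes 64.0 time units and the overlapped one takes 43.0, because only the first load and the last compute remain exposed.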
◈ AMD CDNA3/4
🌐 GLOBAL MEMORY (HBM3 / HBM3E)
▼ BUFFER LOAD
📦 LDS (SHARED MEMORY)
▼ VGPR LOAD
📝 VGPR (256) + AGPR (256) PER LANE
▼ MFMA (SYNC)
🧮 MATRIX CORE (MFMA)
⚠ Registers hold EVERYTHING
⚠ Synchronous - blocks execution
⚠ Shape-specific layouts
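
What "registers hold everything" costs can be estimated with simple arithmetic. The sketch below assumes a 64-lane wavefront, 32-bit registers, and FP32 accumulators spread evenly across lanes; the helper name and the tile sizes are hypothetical choices for illustration, not values taken from any particular kernel.

```python
# Rough register budget for MFMA accumulators held in VGPR/AGPR.
# Assumptions: 64 lanes per wavefront, 4 bytes per register per lane,
# accumulator elements distributed evenly over all lanes.

LANES_PER_WAVE = 64
REGS_PER_LANE = 256 + 256  # VGPR + AGPR budget from the diagram
BYTES_PER_REG = 4


def accumulator_regs_per_lane(tile_m: int, tile_n: int, n_waves: int,
                              bytes_per_elem: int = 4) -> int:
    """Registers each lane needs to hold a tile_m x tile_n accumulator."""
    total_bytes = tile_m * tile_n * bytes_per_elem
    lanes = n_waves * LANES_PER_WAVE
    return total_bytes // (lanes * BYTES_PER_REG)


if __name__ == "__main__":
    # One 32x32 FP32 accumulator block owned by a single wavefront.
    print(accumulator_regs_per_lane(32, 32, n_waves=1))
    # Hypothetical 256x256 FP32 output tile shared by four wavefronts.
    big = accumulator_regs_per_lane(256, 256, n_waves=4)
    print(big, "of", REGS_PER_LANE, "registers per lane")
```

Under these assumptions a single 32×32 FP32 block needs 16 registers per lane, while the hypothetical 256×256 tile needs 256, which is half of the 512-register budget before any operands or addresses are counted. A dedicated accumulator memory such as TMEM removes that pressure from the register file.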