🔴 LIVE DATA FLOW ANIMATION

NVIDIA vs AMD

Watch how data flows through each architecture — completely different paths!

NVIDIA Blackwell: Async Pipeline with TMA + TMEM

🌐 Global Memory (HBM3e • 8 TB/s)
   ↓ TMA async ⚡
⚡ TMA Unit (hardware address generation, asynchronous)
   ↓ async copy
📦 Shared Memory (228 KB SMEM)
   ↓ tcgen05.cp 🔥
🔥 TMEM (256 KB dedicated • bypasses registers!)
   ↓ tcgen05.mma
🧮 Tensor Core (4.5 PFLOPS FP8)

Pipeline status:
Async: data flows continuously
Registers: FREE for compute
Overlap: compute + memory
~3s cycle
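The overlap in this pipeline can be sketched with a toy timing model (the per-tile costs and units below are illustrative assumptions, not measurements):

```python
# Toy timing model of a double-buffered async pipeline: while the
# tensor core runs the MMA for tile k, the TMA-style copy for
# tile k+1 is already in flight. `mem` and `mma` are per-tile
# costs in arbitrary units (assumed, not measured).

def async_pipeline_time(n_tiles: int, mem: float, mma: float) -> float:
    # One fetch fills the first buffer; after that, each
    # steady-state step costs only the slower of the two
    # overlapped stages, not their sum.
    return mem + n_tiles * max(mem, mma)

print(async_pipeline_time(6, 3, 3))  # 3 + 6*3 = 21 units
```

With equal memory and compute costs, the pipeline pays the memory latency once and then hides it behind compute for every remaining tile.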
AMD CDNA3/4: Synchronous Pipeline with VGPR + AGPR

🌐 Global Memory (HBM3 • 5.3 TB/s)
   ↓ buffer load 🐢
📥 Buffer Load (uses the ALU for address generation)
   ↓ synchronous wait ⏳
📦 LDS (Local Data Share)
   ↓ VGPR load ⚠️
📝 VGPR (256) (holds A, B tiles, addresses, everything!)
   ↓ v_accvgpr_write
🎯 AGPR (256) (accumulators only • restricted)
   ↓ MFMA (blocks!) 🛑
🧮 MFMA (synchronous • blocks until done)

Pipeline status:
Sync: must wait at each stage
Registers: BUSY holding tiles
No overlap: sequential execution
~6s cycle (2x slower)
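The same toy timing model, applied to this synchronous pipeline (again with illustrative unit costs, not measured figures), shows why the cycle doubles:

```python
# Toy timing model of a synchronous pipeline: buffer_load, wait,
# MFMA, repeat. Stages never overlap, so every tile pays the full
# memory cost plus the full compute cost back to back.

def sync_pipeline_time(n_tiles: int, mem: float, mma: float) -> float:
    # No prefetch, no double buffering: each tile's load must
    # finish before its MFMA starts, and the MFMA blocks until done.
    return n_tiles * (mem + mma)

print(sync_pipeline_time(6, 3, 3))  # 6 * (3 + 3) = 36 units
```

Compare 36 units here against 21 for the overlapped pipeline on the same six tiles: the hardware stages are the same speed, but serializing them nearly doubles the total.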

🔍 What You're Seeing

💚 NVIDIA: Continuous Flow

Multiple data packets flow simultaneously. The TMA unit generates addresses in hardware, and TMEM holds tiles separately from the register file, so compute and memory traffic overlap.

❤️ AMD: Stop-and-Wait

Packets pause at each stage (watch the animation!). MFMA blocks until it completes, and the registers hold everything at once, competing for space. There is no compute/memory overlap.

💡 The Key Difference

NVIDIA sends 6 packets in the time it takes AMD to send 2.
Not because AMD's hardware is inherently slower, but because an asynchronous architecture fundamentally changes sustained throughput compared to a synchronous one.

Both can hit peak FLOPS. But NVIDIA makes it easier for software to keep the tensor cores fed.
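The ~3s vs ~6s cycle times above fall out of simple arithmetic: when memory and compute cost about the same per tile, overlapping them halves the steady-state cycle. A minimal sketch, assuming equal toy unit costs for both stages:

```python
# Steady-state cost per tile under a toy model with equal memory
# and compute costs (3 units each; an assumption for illustration,
# not a measured figure from either architecture).
mem, mma = 3, 3

async_cycle = max(mem, mma)  # next tile's copy overlaps the current MMA
sync_cycle = mem + mma       # load, wait, then the MFMA blocks

print(sync_cycle / async_cycle)  # 2.0, matching the ~3s vs ~6s cycles
```

The ratio only reaches 2x when the two stages are balanced; if either memory or compute dominates, the benefit of overlap shrinks toward the cost of the dominant stage.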