🔴 LIVE DATA FLOW ANIMATION

NVIDIA vs AMD

Watch how data flows through each architecture — completely different paths!

NVIDIA Blackwell: Async Pipeline with TMA + TMEM

🌐 Global Memory (HBM3e • 8 TB/s)
   ↓ TMA async ⚡
⚡ TMA Unit (hardware address generation, asynchronous)
   ↓ async copy
📦 Shared Memory (228 KB SMEM)
   ↓ tcgen05.cp 🔥
🔥 TMEM (256 KB dedicated • bypasses registers!)
   ↓ tcgen05.mma
🧮 Tensor Core (4.5 PFLOPS FP8)

Pipeline status:
Async: data flows continuously
Registers: FREE for compute
Overlap: compute + memory
~3s cycle
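The overlap in this pipeline can be sketched with a toy timing model (the per-tile costs and units below are illustrative assumptions, not measurements):

```python
# Toy timing model of a double-buffered async pipeline: while the
# tensor core runs the MMA for tile k, the TMA-style copy for
# tile k+1 is already in flight. `mem` and `mma` are per-tile
# costs in arbitrary units (assumed, not measured).

def async_pipeline_time(n_tiles: int, mem: float, mma: float) -> float:
    # One fetch fills the first buffer; after that, each
    # steady-state step costs only the slower of the two
    # overlapped stages, not their sum.
    return mem + n_tiles * max(mem, mma)

print(async_pipeline_time(6, 3, 3))  # 3 + 6*3 = 21 units
```

With equal memory and compute costs, the pipeline pays the memory latency once and then hides it behind compute for every remaining tile.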
AMD CDNA3/4: Synchronous Pipeline with VGPR + AGPR

🌐 Global Memory (HBM3 • 5.3 TB/s)
   ↓ buffer load 🐢
📥 Buffer Load (uses the ALU for address generation)
   ↓ synchronous wait ⏳
📦 LDS (Local Data Share)
   ↓ VGPR load ⚠️
📝 VGPR (256) (holds A, B tiles, addresses, everything!)
   ↓ v_accvgpr_write
🎯 AGPR (256) (accumulators only • restricted)
   ↓ MFMA (blocks!) 🛑
🧮 MFMA (synchronous • blocks until done)

Pipeline status:
Sync: must wait at each stage
Registers: BUSY holding tiles
No overlap: sequential execution
~6s cycle (2x slower)
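The same toy timing model, applied to this synchronous pipeline (again with illustrative unit costs, not measured figures), shows why the cycle doubles:

```python
# Toy timing model of a synchronous pipeline: buffer_load, wait,
# MFMA, repeat. Stages never overlap, so every tile pays the full
# memory cost plus the full compute cost back to back.

def sync_pipeline_time(n_tiles: int, mem: float, mma: float) -> float:
    # No prefetch, no double buffering: each tile's load must
    # finish before its MFMA starts, and the MFMA blocks until done.
    return n_tiles * (mem + mma)

print(sync_pipeline_time(6, 3, 3))  # 6 * (3 + 3) = 36 units
```

Compare 36 units here against 21 for the overlapped pipeline on the same six tiles: the hardware stages are the same speed, but serializing them nearly doubles the total.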

🔍 What You're Seeing

💚 NVIDIA: Continuous Flow

Multiple data packets flow simultaneously. The TMA unit generates addresses in hardware, and TMEM holds tiles separately from the register file, so compute and memory traffic overlap.

❤️ AMD: Stop-and-Wait

Packets pause at each stage (watch the animation!). MFMA blocks until it completes, and the registers hold everything at once, competing for space. There is no compute/memory overlap.

💡 The Key Difference

NVIDIA sends 6 packets in the time it takes AMD to send 2.
Not because AMD's hardware is inherently slower, but because an asynchronous architecture fundamentally changes sustained throughput compared to a synchronous one.

Both can hit peak FLOPS. But NVIDIA makes it easier for software to keep the tensor cores fed.
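The ~3s vs ~6s cycle times above fall out of simple arithmetic: when memory and compute cost about the same per tile, overlapping them halves the steady-state cycle. A minimal sketch, assuming equal toy unit costs for both stages:

```python
# Steady-state cost per tile under a toy model with equal memory
# and compute costs (3 units each; an assumption for illustration,
# not a measured figure from either architecture).
mem, mma = 3, 3

async_cycle = max(mem, mma)  # next tile's copy overlaps the current MMA
sync_cycle = mem + mma       # load, wait, then the MFMA blocks

print(sync_cycle / async_cycle)  # 2.0, matching the ~3s vs ~6s cycles
```

The ratio only reaches 2x when the two stages are balanced; if either memory or compute dominates, the benefit of overlap shrinks toward the cost of the dominant stage.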