Watch how data flows through each architecture — completely different paths!
NVIDIA: multiple data packets are in flight simultaneously. TMA handles address generation in hardware. TMEM stores tiles separately from registers. Compute and memory transfers overlap.
AMD: packets pause at each stage (watch the animation!). MFMA blocks until complete. Registers hold everything, so all data competes for register space. No compute/memory overlap.
In the animation, NVIDIA sends 6 packets in the time AMD sends 2.
That is not because AMD hardware is slower, but because asynchronous versus synchronous execution fundamentally changes throughput.
Both can hit peak FLOPS. But NVIDIA makes it easier for software to keep the tensor cores fed.
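
To see why overlap alone can produce a 6-to-2 gap, here is a toy model, not a hardware simulation. It assumes every packet passes through the same number of equal-length stages, and it picks 3 stages and an 8-tick window purely because those values reproduce the packet counts in the animation. The function names, the stage count, and the tick window are all illustrative assumptions, not measurements from either GPU.

```python
def packets_completed_sync(stages: int, ticks: int) -> int:
    """Synchronous model: only one packet is in flight at a time.

    Each packet occupies the whole pipeline for `stages` ticks before
    the next one may start, so throughput is one packet per `stages` ticks.
    """
    return ticks // stages


def packets_completed_async(stages: int, ticks: int) -> int:
    """Asynchronous model: a new packet enters the pipeline every tick.

    The first packet finishes after `stages` ticks (pipeline fill);
    after that, one packet finishes on every tick.
    """
    return max(0, ticks - stages + 1)


if __name__ == "__main__":
    STAGES = 3  # assumed: e.g. load, stage, compute (illustrative only)
    TICKS = 8   # assumed observation window, in arbitrary time units

    sync_count = packets_completed_sync(STAGES, TICKS)
    async_count = packets_completed_async(STAGES, TICKS)
    print(f"sync:  {sync_count} packets in {TICKS} ticks")   # sync:  2 packets in 8 ticks
    print(f"async: {async_count} packets in {TICKS} ticks")  # async: 6 packets in 8 ticks
```

Each stage takes exactly as long in both models; the only difference is whether a new packet may start before the previous one has finished. That is the sense in which the gap comes from the execution model rather than from slower hardware.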