Wave Specialization: ✓ NVIDIA vs ✗ AMD

In wave specialization (warp specialization on NVIDIA), producer waves handle memory operations while consumer waves handle compute. The pattern reaches near-peak performance on NVIDIA but does not carry over to AMD.

NVIDIA: ✓ works great
Wave specialization diagram:
  Producer waves (memory operations)
    Warps 0-1: TMA async load → SMEM
    Warps 2-3: TMA async load → SMEM
    Issue and move on (non-blocking)
  Consumer waves (tensor operations)
    Warpgroup A: WGMMA async execute
    Warpgroup B: WGMMA async execute
    Overlap: compute while loading
  mbarrier: hardware producer-consumer sync

TMA Async Loads

Producer waves issue loads and immediately pivot to other work. Memory movement doesn't stall execution.

WGMMA Async Execute

Consumer waves issue WGMMA and move on. They do not stall at issue time and wait only when they need the result.

mbarrier Sync

Fast, hardware-supported producer-consumer synchronization that keeps the pipeline full without excessive overhead.
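The producer-consumer handshake described above can be sketched on the CPU. This is only an analogy, not GPU code: `threading.Semaphore` stands in for mbarrier, a Python list stands in for a shared-memory stage, and `run_pipeline`, `NUM_STAGES`, and `NUM_TILES` are hypothetical names chosen for illustration.

```python
import threading

NUM_STAGES = 2   # ring-buffer depth (assumed; real kernels tune this)
NUM_TILES = 8    # number of tiles streamed through the pipeline (assumed)

def run_pipeline(num_tiles=NUM_TILES, num_stages=NUM_STAGES):
    buffers = [None] * num_stages
    # "full" means a stage holds data; "empty" means it may be refilled.
    full = [threading.Semaphore(0) for _ in range(num_stages)]
    empty = [threading.Semaphore(1) for _ in range(num_stages)]
    results = []

    def producer():
        for tile in range(num_tiles):
            stage = tile % num_stages
            empty[stage].acquire()        # wait until the consumer freed the slot
            buffers[stage] = [tile] * 4   # stand-in for an async load into a stage
            full[stage].release()         # signal that the data has arrived

    def consumer():
        for tile in range(num_tiles):
            stage = tile % num_stages
            full[stage].acquire()         # wait for the data
            results.append(sum(buffers[stage]))  # stand-in for the matrix multiply
            empty[stage].release()        # hand the slot back to the producer

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With more than one stage, the producer can fill the next slot while the consumer is still working on the current one, which is the overlap the diagram describes.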

AMD: ✗ fails
Wave specialization diagram:
  Producer waves (memory operations)
    Waves 0-1: buffer load (uses ALU), slow
    Waves 2-3: consume registers
    Blocked: sitting on registers
  Consumer waves (tensor operations)
    Wavefront A: MFMA, blocking
    Wavefront B: waiting
    No overlap: must wait for MFMA
  No hardware producer-consumer sync

No TMA Equivalent

Producer waves rely on buffer loads that consume registers AND ALU cycles. Can't "issue and move on."

MFMA Blocks

No async MMA path. MFMA blocks until completion. You can't issue and move to other work.
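A toy cost model makes the price of losing "issue and move on" concrete. It is a sketch under stated assumptions: every tile costs a fixed `load` and `compute` time in arbitrary units, buffering is sufficient for full overlap, and the function names are hypothetical. It is not a measurement of any real GPU.

```python
def sequential_time(num_tiles, load, compute):
    """No overlap: each tile is loaded, then computed, one after the other."""
    return num_tiles * (load + compute)

def overlapped_time(num_tiles, load, compute):
    """Full overlap: the first load and the last compute are exposed;
    every step in between costs the slower of the two operations."""
    if num_tiles == 0:
        return 0
    return load + (num_tiles - 1) * max(load, compute) + compute

# Illustrative numbers only (arbitrary time units).
seq = sequential_time(8, load=3, compute=4)   # 8 * (3 + 4) = 56
ovl = overlapped_time(8, load=3, compute=4)   # 3 + 7 * 4 + 4 = 35
```

In this model the overlapped schedule approaches the cost of the slower operation alone as the tile count grows, while the sequential schedule always pays for both.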

Static Register Allocation

Registers are divided statically among waves, so producer waves hold registers that could otherwise be used for compute. This reduces overall occupancy and efficiency.
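The register cost can be illustrated with simple arithmetic. The sketch below assumes an equal static split of a register budget across all waves, and compares it with a dynamic scheme in which producers keep only a small allocation. The budget of 512, the wave counts, and the producer allocation of 24 are illustrative assumptions, and both function names are hypothetical.

```python
def consumer_registers_static(total_regs, num_waves, num_producers):
    """Static split: every wave gets an equal share, producers included."""
    per_wave = total_regs // num_waves
    return per_wave * (num_waves - num_producers)

def consumer_registers_dynamic(total_regs, num_producers, producer_regs):
    """Dynamic split: producers keep a small allocation, consumers get the rest."""
    return total_regs - num_producers * producer_regs

# Illustrative numbers only: a budget of 512 registers, 4 waves, 1 producer.
static_share = consumer_registers_static(512, num_waves=4, num_producers=1)       # 384
dynamic_share = consumer_registers_dynamic(512, num_producers=1, producer_regs=24)  # 488
```

Under these assumptions a single producer wave takes a quarter of the budget in the static case, which is register space the consumers cannot use for larger output tiles.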

"Wave specialization, as used on NVIDIA, does not translate well to AMD.
Their solution is to abandon wave specialization entirely and adopt a different execution strategy."
— HipKittens Paper
Aspect                 | NVIDIA                             | AMD
-----------------------+------------------------------------+--------------------------------------
Memory Load            | TMA Async: issue and move on       | Buffer Load: consumes ALU + registers
Tensor Execute         | WGMMA Async: non-blocking          | MFMA Sync: blocks until complete
Producer-Consumer Sync | mbarrier: hardware accelerated     | No equivalent: manual barriers
Register Allocation    | Dynamic: TMEM holds tiles          | Static: waves sit on registers
Overlap Capability     | Full compute/memory overlap        | No overlap: sequential execution
Wave Specialization    | ✓ Proven pattern, near-peak perf   | ✗ Must abandon entirely