Wave specialization splits the work: producer waves handle memory operations while consumer waves handle compute. The pattern reaches near-peak performance on NVIDIA but completely fails on AMD.
On NVIDIA, producer waves issue TMA loads and immediately pivot to other work, so memory movement doesn't stall execution.
Consumer waves issue WGMMA and move on without waiting for completion.
mbarrier provides fast, hardware-supported producer-consumer synchronization that keeps the pipeline full without excessive overhead.
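
The producer-consumer handshake described above can be sketched on the CPU. This is only an analogy in Python, not GPU code: two threads stand in for a producer wave and a consumer wave, and a pair of semaphores (`empty` and `full`, names chosen here for illustration) stands in for the hardware barrier that signals "slot free" and "tile ready". The function name `run_pipeline` and the two-slot ring buffer are assumptions, not part of any vendor API.

```python
import threading


def run_pipeline(num_tiles: int, num_slots: int = 2) -> int:
    """Stream tiles from a producer thread to a consumer thread via a ring buffer."""
    slots = [0] * num_slots
    empty = threading.Semaphore(num_slots)  # slots the producer may fill
    full = threading.Semaphore(0)           # slots the consumer may read
    total = 0

    def producer() -> None:
        for i in range(num_tiles):
            empty.acquire()              # wait until a slot is free
            slots[i % num_slots] = i     # stand-in for loading a tile
            full.release()               # signal "tile ready"

    def consumer() -> None:
        nonlocal total
        for i in range(num_tiles):
            full.acquire()               # wait until a tile is ready
            total += slots[i % num_slots]  # stand-in for compute on the tile
            empty.release()              # hand the slot back to the producer

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(run_pipeline(8))
```

With more than one slot, the producer can run ahead of the consumer, which is the behavior that keeps the pipeline full. The cheaper the signal and wait operations, the less the handshake itself costs per tile.
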
On AMD, producer waves rely on buffer loads that consume registers and ALU cycles, so they can't "issue and move on."
There is no async MMA path: MFMA blocks until completion, so a wave can't issue it and move to other work.
Registers are statically allocated, so producer waves sit on registers that could be used for compute, which reduces overall occupancy and efficiency.
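
The register cost of idle producer waves can be illustrated with simple arithmetic. The sketch below is a toy model under stated assumptions: the register budget (512), the wave counts, and the per-producer need (24) are hypothetical numbers chosen for illustration, not hardware specifications, and `static_split` and `dynamic_split` are made-up helper names. It compares an equal, fixed split across all waves with a split where producers keep only what they need.

```python
from dataclasses import dataclass


@dataclass
class RegisterSplit:
    per_consumer: int       # registers each consumer wave can use
    idle_in_producers: int  # registers held by producers beyond their need


def static_split(total_regs: int, producers: int, consumers: int,
                 producer_need: int) -> RegisterSplit:
    """Every wave gets the same fixed share, whether it needs it or not."""
    share = total_regs // (producers + consumers)
    idle = producers * max(share - producer_need, 0)
    return RegisterSplit(per_consumer=share, idle_in_producers=idle)


def dynamic_split(total_regs: int, producers: int, consumers: int,
                  producer_need: int) -> RegisterSplit:
    """Producers keep only what they need; consumers share the rest."""
    remaining = total_regs - producers * producer_need
    return RegisterSplit(per_consumer=remaining // consumers, idle_in_producers=0)


if __name__ == "__main__":
    # Hypothetical budget: 512 registers, 2 producer and 2 consumer waves.
    print(static_split(512, 2, 2, 24))   # per_consumer=128, idle_in_producers=208
    print(dynamic_split(512, 2, 2, 24))  # per_consumer=232, idle_in_producers=0
```

In this toy setup the fixed split leaves 208 registers parked in producer waves, while the flexible split gives each consumer wave 232 registers instead of 128.
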
| Aspect | NVIDIA | AMD |
|---|---|---|
| Memory Load | TMA Async — issue and move on | Buffer Load — consumes ALU + registers |
| Tensor Execute | WGMMA Async — non-blocking | MFMA Sync — blocks until complete |
| Producer-Consumer Sync | mbarrier — hardware accelerated | No equivalent — manual barriers |
| Register Allocation | Dynamic — TMEM holds tiles | Static — waves sit on registers |
| Overlap Capability | Full compute/memory overlap | No overlap — sequential execution |
| Wave Specialization | ✓ Proven pattern, near-peak perf | ✗ Must abandon entirely |
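
The Overlap Capability row can be made concrete with a toy timing model. This is a sketch under simplifying assumptions: every tile has the same load and compute cost, there are enough buffer slots that the producer never waits, and the function names and time units are hypothetical. It compares running load and compute back to back for each tile with running them as a two-stage pipeline.

```python
def serialized_time(num_tiles: int, load: float, compute: float) -> float:
    """Each tile is loaded, then computed, with no overlap between tiles."""
    return num_tiles * (load + compute)


def overlapped_time(num_tiles: int, load: float, compute: float) -> float:
    """Two-stage pipeline: the next tile loads while the current one computes."""
    if num_tiles == 0:
        return 0
    # Fill the pipeline with one load, run at the pace of the slower stage,
    # then drain with one final compute.
    return load + (num_tiles - 1) * max(load, compute) + compute


if __name__ == "__main__":
    # Hypothetical costs: 8 tiles, load = 2 units, compute = 3 units.
    print(serialized_time(8, 2, 3))  # 40
    print(overlapped_time(8, 2, 3))  # 26
```

Under these assumptions the pipelined schedule approaches the cost of the slower stage alone as the tile count grows, which is why full overlap matters for reaching peak throughput.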