How Stanford's Hazy Research team achieved NVIDIA-competitive performance on AMD hardware through novel wave scheduling patterns
NVIDIA's wave specialization strategy doesn't translate to AMD hardware due to fundamental architectural differences in register allocation and synchronization.
On NVIDIA, wave specialization works beautifully: producer warps handle memory loads while consumer warps handle compute. Registers can be dynamically reallocated between warps, and mbarrier provides efficient synchronization.
AMD uses static register allocation: registers are divided equally across all waves in a workgroup at launch time. Specialized waves can't claim a larger share than others, so dedicating waves to memory alone leaves their registers idle and costs roughly 20% of peak performance.
AMD's CDNA architecture divides the 256 VGPRs + 256 AGPRs equally across all waves in a workgroup. If you launch 4 waves, each gets ¼ of the registers — no reallocation possible.
NVIDIA's mbarrier enables efficient producer-consumer handoff. AMD has no direct equivalent — you must use LDS barriers and conditional synchronization patterns.
AMD's MFMA (Matrix Fused Multiply-Add) instructions block the issuing wavefront until they complete. Unlike NVIDIA's async WGMMA, a wave gets no hardware-level overlap between its own matrix math and memory operations.
Best suited to balanced workloads like GEMM, where compute and memory costs are roughly equal. Each wave alternates between producer and consumer roles every iteration.
```cpp
// Pseudocode: 8-Wave Ping-Pong Pattern
for (iteration = 0; iteration < num_tiles; iteration++) {
    bool is_producer = (wave_id + iteration) % 2 == 0;

    if (is_producer) {
        global_load(next_tile);   // Async global → LDS
        lds_store(tile_data);     // Write to shared memory
    }

    barrier();                    // Sync point — all waves wait

    if (!is_producer) {
        lds_load(operands);       // Read from LDS
        mfma(C, A, B);            // Matrix multiply-accumulate
    }

    barrier();                    // Sync before role swap
}
```
Instead of fixed producer/consumer roles, every wave alternates. On even iterations, waves 0,2,4,6 are producers. On odd iterations, they become consumers. This ensures all waves contribute equally.
With 8 waves sharing the register file, each gets ⅛ of total registers. But since every wave does both roles, all registers are actively used — no "idle" producer registers sitting unused.
The barrier() calls ensure producers finish writing to LDS before consumers read. The alternating pattern means the barrier naturally separates the phases.
With 2 waves per SIMD, while one wave is blocked on MFMA, the other can issue memory operations. The hardware scheduler keeps both pipelines busy.
General Matrix Multiply has balanced compute and memory requirements. The ping-pong pattern keeps both the MFMA units and memory subsystem saturated throughout.
Achieves 98% of peak theoretical TFLOPS on MI300X — matching carefully hand-tuned assembly kernels but written in high-level HIP C++.
Best suited to imbalanced workloads like attention, where compute dominates. Each wave handles both compute and memory, interleaved at the instruction level.
```cpp
// Pseudocode: 4-Wave Interleave Pattern
// Only 1 wave per SIMD — gets ALL the registers
for (iteration = 0; iteration < num_tiles; iteration++) {
    // Issue memory loads (non-blocking)
    async_global_load(next_tile_A);
    async_global_load(next_tile_B);

    // Interleave: compute already-loaded fragments while fetching later ones
    mfma(C, current_A[0], current_B[0]);   // Compute fragment 0
    lds_load(current_A[2]);                // Memory: prefetch fragment 2
    mfma(C, current_A[1], current_B[1]);   // Compute fragment 1
    lds_load(current_B[2]);                // Memory: prefetch fragment 2
    mfma(C, current_A[2], current_B[2]);   // Compute fragment 2
    // ... pattern continues

    wait_for_loads();   // Ensure next tile is ready
    swap_buffers();     // Double buffering
}
```
Each wave interleaves compute and memory at the instruction level: MFMA → LDS → MFMA → LDS. While MFMA blocks the wavefront, the memory system can still process outstanding loads.
With only 1 wave per SIMD, each wave gets 100% of the register file — all 256 VGPRs + 256 AGPRs. This enables larger tiles and more data reuse.
Careful instruction scheduling keeps both the MFMA units and LDS pipelines busy. No pipeline sits idle waiting for the other.
Unlike 8-wave ping-pong where half the waves are "idle" each phase, here every wave is always productive — doing both compute and memory.
FlashAttention and similar kernels are compute-heavy with large accumulator state. The extra registers from 4-wave mode enable larger tiles and better data reuse.
Achieves 2.3× speedup on GQA (Grouped Query Attention) backwards pass compared to naive implementations.
AMD's multi-chiplet architecture introduces new challenges: 8 XCDs with private L2 caches that don't automatically share data.
Each XCD processes different rows → No L2 cache sharing
Each XCD has its own private L2 cache. When XCD0 loads matrix A row 0 and XCD1 loads row 1, there's no cache sharing — both must go to LLC/HBM.
Row-major scheduling spreads work across all 8 XCDs. Each loads different data, thrashing the shared LLC and wasting HBM bandwidth.
Group XCDs that work on adjacent tiles. XCDs in the same group share common matrix rows/columns, enabling L2 cache reuse within the group.
Instead of row-major order, use a "windowed" tile traversal that keeps adjacent XCDs working on overlapping data regions.
MI300X has 8 XCDs (Accelerator Complex Dies), each with its own L2 cache. All XCDs share a unified HBM memory pool through a shared Last Level Cache (LLC).
L1 (per CU) → L2 (per XCD, private) → LLC (shared, last level) → HBM (global memory). The L2 cache is the key optimization target for chiplet-aware scheduling.
A quick reference for choosing a HipKittens pattern based on your workload characteristics:
| Workload | Characteristic | Pattern | Result |
|---|---|---|---|
| GEMM | Balanced compute/memory | 8-Wave Ping-Pong | 98% peak TFLOPS |
| FlashAttention Fwd | Compute-heavy, large state | 4-Wave Interleave | Near-optimal |
| GQA Attention Bwd | Heavily compute-bound | 4-Wave Interleave | 2.3× speedup |
| Multi-XCD Kernels | Data locality critical | XCD Grouping | 19% improvement |
| Aspect | NVIDIA | AMD |
|---|---|---|
| Registers | Dynamic reallocation | Static division |
| Sync Primitive | mbarrier (async) | LDS barriers |
| Matrix Ops | WGMMA (async) | MFMA (sync) |
| L2 Cache | Unified | Per-XCD (private) |
| Wave Spec. | Native support | Ping-pong/Interleave |
- Wave specialization that works on NVIDIA fails on AMD. Understand the architectural differences.
- 8-wave ping-pong for balanced workloads, 4-wave interleave for compute-heavy kernels.
- On MI300X/MI355X, L2 cache is per-XCD. Use XCD grouping for data locality.