AMD GPU Optimization Patterns

How Stanford's Hazy Research team achieved NVIDIA-competitive performance on AMD hardware through novel wave scheduling patterns

98%
Peak TFLOPS (GEMM)
2.3×
Attention Speedup
19%
Chiplet-Aware Bandwidth Gain
Section 1

The Core Problem: Wave Specialization Fails on AMD

NVIDIA's wave specialization strategy doesn't translate to AMD hardware due to fundamental architectural differences in register allocation and synchronization.

📊
NVIDIA Approach

On NVIDIA, wave specialization works beautifully: producer warps handle memory loads while consumer warps handle compute. Registers can be dynamically reallocated between warps, and mbarrier provides efficient synchronization.

⚠️
AMD Problem

AMD has static register allocation — registers are divided equally among the waves sharing a SIMD at launch time. Specialized waves can't claim more registers than others, costing roughly 20% of peak performance.

NVIDIA

Wave Specialization Works

Dynamic register reallocation + mbarrier sync
4 Warps, 2 Roles — Dynamic Registers
Producers
Warp 0 — Memory
Warp 1 — Memory
Registers: Can expand
Consumers
Warp 2 — Compute
Warp 3 — Compute
Registers: Allocated as needed
~98%
Peak TFLOPS achieved
AMD Problem

Wave Specialization Fails

Static register division across all waves
4 Waves, Static ¼ Registers Each
SIMD 0
Wave 0
Producer
25% regs
SIMD 1
Wave 1
Producer
25% regs
SIMD 2
Wave 2
Consumer
25% regs
SIMD 3
Wave 3
Consumer
25% regs
~80%
Peak TFLOPS — 20% loss!
🔧
Root Cause: Static Allocation

AMD's CDNA architecture fixes every wave's register budget at launch. Each SIMD's file of 256 VGPRs + 256 AGPRs is split equally among the waves resident on it — put 4 waves on a SIMD and each gets ¼ — with no reallocation between waves possible.

⏱️
No mbarrier Equivalent

NVIDIA's mbarrier enables efficient producer-consumer handoff. AMD has no direct equivalent — you must use LDS barriers and conditional synchronization patterns.

🧮
MFMA is Synchronous

AMD's MFMA (Matrix Fused Multiply-Add) instructions block the wavefront until completion. Unlike NVIDIA's async WGMMA, there's no hardware-level overlap with memory operations.

Section 2

Solution A: 8-Wave Ping-Pong

For balanced workloads like GEMM where compute and memory are roughly equal. Each wave alternates between producer and consumer roles every iteration.

Pattern

Role Alternation with Conditional Barriers

8 waves across 4 SIMDs — every wave does both roles
98% Peak
Matches hand-tuned ASM
8 Waves: All Do Both Roles — Ping-Pong Alternation
SIMD 0
Wave 0
Wave 4
SIMD 1
Wave 1
Wave 5
SIMD 2
Wave 2
Wave 6
SIMD 3
Wave 3
Wave 7
Compute (MFMA)
Memory (LDS/Global)
↔ Roles swap every iteration
// Pseudocode: 8-Wave Ping-Pong Pattern
for (int iteration = 0; iteration < num_tiles; iteration++) {
    bool is_producer = ((wave_id + iteration) % 2) == 0;

    if (is_producer) {
        global_load(next_tile);     // Async global → LDS staging
        lds_store(tile_data);       // Write tile into shared memory
    }

    barrier();  // Producers finish LDS writes before consumers read

    if (!is_producer) {
        lds_load(operands);         // Read staged operands from LDS
        mfma(C, A, B);              // Matrix multiply-accumulate
    }

    barrier();  // Consumers finish reads before roles swap
}
🔄

Role Alternation

Instead of fixed producer/consumer roles, every wave alternates. On even iterations, waves 0,2,4,6 are producers. On odd iterations, they become consumers. This ensures all waves contribute equally.

📊

Register Efficiency

With 8 waves per compute unit (2 per SIMD), each wave gets half of its SIMD's register file — ⅛ of the CU's total. But since every wave does both roles, all registers are actively used — no "idle" producer registers sitting unused.

🔒

Conditional Barriers

The barrier() calls ensure producers finish writing to LDS before consumers read. The alternating pattern means the barrier naturally separates the phases.

Pipeline Saturation

With 2 waves per SIMD, while one wave is blocked on MFMA, the other can issue memory operations. The hardware scheduler keeps both pipelines busy.

Best For: GEMM

General Matrix Multiply has balanced compute and memory requirements. The ping-pong pattern keeps both the MFMA units and memory subsystem saturated throughout.

📈
Result

Achieves 98% of peak theoretical TFLOPS on MI300X — matching carefully hand-tuned assembly kernels but written in high-level HIP C++.

Section 3

Solution B: 4-Wave Interleave

For imbalanced workloads like Attention where compute dominates. Each wave handles both compute and memory, interleaved at the instruction level.

Pattern

Fine-Grained Compute + Memory Interleaving

4 waves, 1 per SIMD — maximum registers per wave
2.3× Speedup
GQA Attention Backwards
4 Waves: Each Does Both — Instruction-Level Interleave
SIMD 0
Wave 0
SIMD 1
Wave 1
SIMD 2
Wave 2
SIMD 3
Wave 3
MFMA (Compute)
LDS (Memory)
Interleaved at instruction level
// Pseudocode: 4-Wave Interleave Pattern
// Only 1 wave per SIMD — it gets ALL the registers

for (int iteration = 0; iteration < num_tiles; iteration++) {
    // Issue global loads for the next tile (non-blocking)
    async_global_load(next_tile_A);
    async_global_load(next_tile_B);

    // Interleave: each LDS load prefetches operands for a *later* MFMA,
    // so compute on already-loaded fragments hides memory latency
    lds_load(current_A[1]);               // Memory
    lds_load(current_B[1]);               // Memory
    mfma(C, current_A[0], current_B[0]);  // Compute
    lds_load(current_A[2]);               // Memory
    lds_load(current_B[2]);               // Memory
    mfma(C, current_A[1], current_B[1]);  // Compute
    // ... pattern continues

    wait_for_loads();  // Ensure the next tile has arrived
    swap_buffers();    // Double buffering: next becomes current
}
📐

Fine-Grained Interleaving

Each wave interleaves compute and memory at the instruction level: MFMA → LDS → MFMA → LDS. While MFMA blocks the wavefront, the memory system can still process outstanding loads.

🎛️

Maximum Registers Per Wave

With only 1 wave per SIMD, each wave gets 100% of the register file — all 256 VGPRs + 256 AGPRs. This enables larger tiles and more data reuse.

Saturate Both Pipelines

Careful instruction scheduling keeps both the MFMA units and LDS pipelines busy. No pipeline sits idle waiting for the other.

🎯

No Wasted Waves

Unlike 8-wave ping-pong where half the waves are "idle" each phase, here every wave is always productive — doing both compute and memory.

Best For: Attention

FlashAttention and similar kernels are compute-heavy with large accumulator state. The extra registers from 4-wave mode enable larger tiles and better data reuse.

📈
Result

Achieves 2.3× speedup on GQA (Grouped Query Attention) backwards pass compared to naive implementations.

Section 4

The Chiplet Problem: MI300X/MI355X

AMD's multi-chiplet architecture introduces new challenges: 8 XCDs with private L2 caches that don't automatically share data.

Problem

Naive Row-Major Scheduling

Each XCD processes different data regions
8 XCDs — No Data Locality
XCD0
XCD1
XCD2
XCD3
XCD4
XCD5
XCD6
XCD7

Each XCD processes different rows → No L2 cache sharing

36%
L2 Cache Hit Rate
⚠️

Private L2 Caches

Each XCD has its own private L2 cache, and one XCD cannot read data cached by another. When XCD0 and XCD1 work on tiles that need the same matrix rows, each must fetch an identical copy from LLC/HBM — the data is duplicated instead of shared.

🔥

Cache Thrashing

Row-major scheduling spreads work across all 8 XCDs. Each loads different data, thrashing the shared LLC and wasting HBM bandwidth.

Solution

XCD Grouping + Hierarchical Traversal

Group XCDs to maximize L2 reuse
8 XCDs — Grouped for Data Locality
XCD0
XCD1
XCD2
XCD3
XCD4
XCD5
XCD6
XCD7
Group A
Group B
15.1 → 18.3 TB/s
Effective Memory Bandwidth
🎯

XCD Grouping

Group XCDs that work on adjacent tiles. XCDs in the same group share common matrix rows/columns, enabling L2 cache reuse within the group.

📐

Windowed Traversal

Instead of row-major order, use a "windowed" tile traversal that keeps adjacent XCDs working on overlapping data regions.

19% Faster
Memory bandwidth improvement
🏗️
MI300X Architecture

MI300X has 8 XCDs (Accelerator Complex Dies), each with its own L2 cache. All XCDs share a unified HBM memory pool through a shared Last Level Cache (LLC).

📊
Cache Hierarchy

L1 (per CU) → L2 (per XCD, private) → LLC (shared, last level) → HBM (global memory). The L2 cache is the key optimization target for chiplet-aware scheduling.

Section 5

Pattern Selection Guide

Which HipKittens pattern to use based on your workload characteristics.

Workload           | Characteristic             | Pattern           | Result
GEMM               | Balanced compute/memory    | 8-Wave Ping-Pong  | 98% peak TFLOPS
FlashAttention Fwd | Compute-heavy, large state | 4-Wave Interleave | Near-optimal
GQA Attention Bwd  | Heavily compute-bound      | 4-Wave Interleave | 2.3× speedup
Multi-XCD Kernels  | Data locality critical     | XCD Grouping      | 19% improvement
📊
NVIDIA vs AMD Summary
Aspect         | NVIDIA               | AMD
Registers      | Dynamic reallocation | Static division
Sync Primitive | mbarrier (async)     | LDS barriers
Matrix Ops     | WGMMA (async)        | MFMA (synchronous)
L2 Cache       | Unified              | Per-XCD (private)
Wave Spec.     | Native support       | Ping-pong / Interleave
💡
Key Takeaways

1. Don't Copy NVIDIA Patterns Blindly

Wave specialization that works on NVIDIA fails on AMD. Understand the architectural differences.

2. Choose Pattern by Workload

8-wave ping-pong for balanced workloads, 4-wave interleave for compute-heavy kernels.

3. Mind the Chiplets

On MI300X/MI355X, L2 cache is per-XCD. Use XCD grouping for data locality.