AMD GPU Optimization Patterns

How Stanford's Hazy Research team achieved NVIDIA-competitive performance on AMD hardware through novel wave scheduling patterns

98%
Peak TFLOPS (GEMM)
2.3×
Attention Speedup
19%
Chiplet-Aware Bandwidth Gain
Section 1

The Core Problem: Wave Specialization Fails on AMD

NVIDIA's wave specialization strategy doesn't translate to AMD hardware due to fundamental architectural differences in register allocation and synchronization.

📊
NVIDIA Approach

On NVIDIA, wave specialization works beautifully: producer warps handle memory loads while consumer warps handle compute. Registers can be dynamically reallocated between warps, and mbarrier provides efficient synchronization.

⚠️
AMD Problem

AMD has static register allocation — registers are divided equally among the waves sharing a SIMD at launch time. Specialized waves can't claim more registers than others, costing roughly 20% of peak performance.

NVIDIA

Wave Specialization Works

Dynamic register reallocation + mbarrier sync
4 Warps, 2 Roles — Dynamic Registers
Producers
Warp 0 — Memory
Warp 1 — Memory
Registers: Can expand
Consumers
Warp 2 — Compute
Warp 3 — Compute
Registers: Allocated as needed
~98%
Peak TFLOPS achieved
AMD Problem

Wave Specialization Fails

Static register division across all waves
4 Waves, Static ¼ Registers Each
SIMD 0
Wave 0
Producer
25% regs
SIMD 1
Wave 1
Producer
25% regs
SIMD 2
Wave 2
Consumer
25% regs
SIMD 3
Wave 3
Consumer
25% regs
~80%
Peak TFLOPS — 20% loss!
🔧
Root Cause: Static Allocation

AMD's CDNA architecture fixes every wave's register budget at launch. Each SIMD's file of 256 VGPRs + 256 AGPRs is split equally among the waves resident on it — put 4 waves on a SIMD and each gets ¼ — with no reallocation between waves possible.

⏱️
No mbarrier Equivalent

NVIDIA's mbarrier enables efficient producer-consumer handoff. AMD has no direct equivalent — you must use LDS barriers and conditional synchronization patterns.

🧮
MFMA is Synchronous

AMD's MFMA (Matrix Fused Multiply-Add) instructions block the wavefront until completion. Unlike NVIDIA's async WGMMA, there's no hardware-level overlap with memory operations.

Section 2

Solution A: 8-Wave Ping-Pong

For balanced workloads like GEMM where compute and memory are roughly equal. Each wave alternates between producer and consumer roles every iteration.

Pattern

Role Alternation with Conditional Barriers

8 waves across 4 SIMDs — every wave does both roles
98% Peak
Matches hand-tuned ASM
8 Waves: All Do Both Roles — Ping-Pong Alternation
SIMD 0
Wave 0
Wave 4
SIMD 1
Wave 1
Wave 5
SIMD 2
Wave 2
Wave 6
SIMD 3
Wave 3
Wave 7
Compute (MFMA)
Memory (LDS/Global)
↔ Roles swap every iteration
// Pseudocode: 8-Wave Ping-Pong Pattern
for (int iteration = 0; iteration < num_tiles; iteration++) {
    bool is_producer = ((wave_id + iteration) % 2) == 0;

    if (is_producer) {
        global_load(next_tile);     // Async global → LDS staging
        lds_store(tile_data);       // Write tile into shared memory
    }

    barrier();  // Producers finish LDS writes before consumers read

    if (!is_producer) {
        lds_load(operands);         // Read staged operands from LDS
        mfma(C, A, B);              // Matrix multiply-accumulate
    }

    barrier();  // Consumers finish reads before roles swap
}
🔄

Role Alternation

Instead of fixed producer/consumer roles, every wave alternates. On even iterations, waves 0,2,4,6 are producers. On odd iterations, they become consumers. This ensures all waves contribute equally.

📊

Register Efficiency

With 8 waves per compute unit (2 per SIMD), each wave gets half of its SIMD's register file — ⅛ of the CU's total. But since every wave does both roles, all registers are actively used — no "idle" producer registers sitting unused.

🔒

Conditional Barriers

The barrier() calls ensure producers finish writing to LDS before consumers read. The alternating pattern means the barrier naturally separates the phases.

Pipeline Saturation

With 2 waves per SIMD, while one wave is blocked on MFMA, the other can issue memory operations. The hardware scheduler keeps both pipelines busy.

Best For: GEMM

General Matrix Multiply has balanced compute and memory requirements. The ping-pong pattern keeps both the MFMA units and memory subsystem saturated throughout.

📈
Result

Achieves 98% of peak theoretical TFLOPS on MI300X — matching carefully hand-tuned assembly kernels but written in high-level HIP C++.

Section 3

Solution B: 4-Wave Interleave

For imbalanced workloads like Attention where compute dominates. Each wave handles both compute and memory, interleaved at the instruction level.

Pattern

Fine-Grained Compute + Memory Interleaving

4 waves, 1 per SIMD — maximum registers per wave
2.3× Speedup
GQA Attention Backwards
4 Waves: Each Does Both — Instruction-Level Interleave
SIMD 0
Wave 0
SIMD 1
Wave 1
SIMD 2
Wave 2
SIMD 3
Wave 3
MFMA (Compute)
LDS (Memory)
Interleaved at instruction level
// Pseudocode: 4-Wave Interleave Pattern
// Only 1 wave per SIMD — it gets ALL the registers

for (int iteration = 0; iteration < num_tiles; iteration++) {
    // Issue global loads for the next tile (non-blocking)
    async_global_load(next_tile_A);
    async_global_load(next_tile_B);

    // Interleave: each LDS load prefetches operands for a *later* MFMA,
    // so compute on already-loaded fragments hides memory latency
    lds_load(current_A[1]);               // Memory
    lds_load(current_B[1]);               // Memory
    mfma(C, current_A[0], current_B[0]);  // Compute
    lds_load(current_A[2]);               // Memory
    lds_load(current_B[2]);               // Memory
    mfma(C, current_A[1], current_B[1]);  // Compute
    // ... pattern continues

    wait_for_loads();  // Ensure the next tile has arrived
    swap_buffers();    // Double buffering: next becomes current
}
📐

Fine-Grained Interleaving

Each wave interleaves compute and memory at the instruction level: MFMA → LDS → MFMA → LDS. While MFMA blocks the wavefront, the memory system can still process outstanding loads.

🎛️

Maximum Registers Per Wave

With only 1 wave per SIMD, each wave gets 100% of the register file — all 256 VGPRs + 256 AGPRs. This enables larger tiles and more data reuse.

Saturate Both Pipelines

Careful instruction scheduling keeps both the MFMA units and LDS pipelines busy. No pipeline sits idle waiting for the other.

🎯

No Wasted Waves

Unlike 8-wave ping-pong where half the waves are "idle" each phase, here every wave is always productive — doing both compute and memory.

Best For: Attention

FlashAttention and similar kernels are compute-heavy with large accumulator state. The extra registers from 4-wave mode enable larger tiles and better data reuse.

📈
Result

Achieves 2.3× speedup on GQA (Grouped Query Attention) backwards pass compared to naive implementations.

Section 4

The Chiplet Problem: MI300X/MI355X

AMD's multi-chiplet architecture introduces new challenges: 8 XCDs with private L2 caches that don't automatically share data.

Problem

Naive Row-Major Scheduling

Each XCD processes different data regions
8 XCDs — No Data Locality
XCD0
XCD1
XCD2
XCD3
XCD4
XCD5
XCD6
XCD7

Each XCD processes different rows → No L2 cache sharing

36%
L2 Cache Hit Rate
⚠️

Private L2 Caches

Each XCD has its own private L2 cache, and one XCD cannot read data cached by another. When XCD0 and XCD1 work on tiles that need the same matrix rows, each must fetch an identical copy from LLC/HBM — the data is duplicated instead of shared.

🔥

Cache Thrashing

Row-major scheduling spreads work across all 8 XCDs. Each loads different data, thrashing the shared LLC and wasting HBM bandwidth.

Solution

XCD Grouping + Hierarchical Traversal

Group XCDs to maximize L2 reuse
8 XCDs — Grouped for Data Locality
XCD0
XCD1
XCD2
XCD3
XCD4
XCD5
XCD6
XCD7
Group A
Group B
15.1 → 18.3 TB/s
Effective Memory Bandwidth
🎯

XCD Grouping

Group XCDs that work on adjacent tiles. XCDs in the same group share common matrix rows/columns, enabling L2 cache reuse within the group.

📐

Windowed Traversal

Instead of row-major order, use a "windowed" tile traversal that keeps adjacent XCDs working on overlapping data regions.

19% Faster
Memory bandwidth improvement
🏗️
MI300X Architecture

MI300X has 8 XCDs (Accelerator Complex Dies), each with its own L2 cache. All XCDs share a unified HBM memory pool through a shared Last Level Cache (LLC).

📊
Cache Hierarchy

L1 (per CU) → L2 (per XCD, private) → LLC (shared, last level) → HBM (global memory). The L2 cache is the key optimization target for chiplet-aware scheduling.

Section 5

Pattern Selection Guide

Which HipKittens pattern to use based on your workload characteristics.

Workload           | Characteristic             | Pattern           | Result
GEMM               | Balanced compute/memory    | 8-Wave Ping-Pong  | 98% peak TFLOPS
FlashAttention Fwd | Compute-heavy, large state | 4-Wave Interleave | Near-optimal
GQA Attention Bwd  | Heavily compute-bound      | 4-Wave Interleave | 2.3× speedup
Multi-XCD Kernels  | Data locality critical     | XCD Grouping      | 19% improvement
📊
NVIDIA vs AMD Summary
Aspect         | NVIDIA               | AMD
Registers      | Dynamic reallocation | Static division
Sync Primitive | mbarrier (async)     | LDS barriers
Matrix Ops     | WGMMA (async)        | MFMA (synchronous)
L2 Cache       | Unified              | Per-XCD (private)
Wave Spec.     | Native support       | Ping-pong / Interleave
💡
Key Takeaways

1. Don't Copy NVIDIA Patterns Blindly

Wave specialization that works on NVIDIA fails on AMD. Understand the architectural differences.

2. Choose Pattern by Workload

8-wave ping-pong for balanced workloads, 4-wave interleave for compute-heavy kernels.

3. Mind the Chiplets

On MI300X/MI355X, L2 cache is per-XCD. Use XCD grouping for data locality.