🧵 BLACKWELL ARCHITECTURE 🧵

Warpgroups: 128 Threads as One

How NVIDIA's Blackwell coordinates 4 warps with implicit hardware synchronization

What is a Warpgroup?

WARPGROUP = 4 WARPS = 128 THREADS

WARP 0 (32 threads) · WARP 1 (32 threads) · WARP 2 (32 threads) · WARP 3 (32 threads)

⚡ HARDWARE SYNCHRONIZED ⚡

🧵 4 Warps = 128 Threads

Blackwell tensor operations (tcgen05.mma) operate at the warpgroup level, not at the level of individual warps. All 128 threads execute as a single coordinated unit.
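The 4-warps-of-32 decomposition can be sketched as plain index arithmetic. This is a toy CPU-side model, not a CUDA API; the function name `decompose` is mine, but the numbering follows CUDA's convention (warp = thread // 32, lane = thread % 32):

```python
# Toy model: how a 128-thread warpgroup decomposes into warps and lanes.
# Follows CUDA's convention: warp = tid // 32, lane = tid % 32.

WARP_SIZE = 32
WARPS_PER_WARPGROUP = 4
WARPGROUP_SIZE = WARP_SIZE * WARPS_PER_WARPGROUP  # 128 threads

def decompose(tid: int) -> tuple[int, int]:
    """Map a thread index within a warpgroup to (warp, lane)."""
    assert 0 <= tid < WARPGROUP_SIZE
    return tid // WARP_SIZE, tid % WARP_SIZE

# Thread 0 is lane 0 of warp 0; thread 127 is lane 31 of warp 3.
print(decompose(0))    # (0, 0)
print(decompose(127))  # (3, 31)
```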

🔄 Implicit Synchronization

The hardware automatically synchronizes all 4 warps. No __syncthreads() needed. No barriers. No race conditions. Hardware enforces correctness.

📦 Simplified Programming

Think of a warpgroup as one big thread with 128-way parallelism built in. The complexity is hidden by hardware.

Lane Slice Ownership of TMEM

TMEM — 256KB per SM

Divided into 4 lane slices of 32 lanes each, one per warp

SLICE 0 (64KB): owned by Warp 0
SLICE 1 (64KB): owned by Warp 1
SLICE 2 (64KB): owned by Warp 2
SLICE 3 (64KB): owned by Warp 3
↑ Each warp loads its own slice — no cross-warp coordination ↑
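The fixed ownership rule above is just static address arithmetic. A minimal sketch, using the sizes from the text (256KB split four ways); the flat byte-offset view and the name `slice_range` are illustrative, since real TMEM is addressed by (lane, column) rather than flat bytes:

```python
# Sketch of fixed slice ownership: 256KB of TMEM split into four
# disjoint 64KB slices, one per warp. Byte offsets are illustrative.

TMEM_BYTES = 256 * 1024
NUM_SLICES = 4
SLICE_BYTES = TMEM_BYTES // NUM_SLICES  # 64KB per warp

def slice_range(warp: int) -> tuple[int, int]:
    """Half-open byte range of the TMEM slice owned by warp 0-3."""
    assert 0 <= warp < NUM_SLICES
    base = warp * SLICE_BYTES
    return base, base + SLICE_BYTES

for w in range(NUM_SLICES):
    lo, hi = slice_range(w)
    print(f"warp {w}: bytes [{lo:#x}, {hi:#x})")  # fixed, disjoint, no overlap
```

Because the mapping is a pure function of the warp index, there is nothing to allocate and nothing to contend for.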

🎯 Fixed Ownership

Each warp owns a fixed 64KB slice of TMEM. Warp 0 always writes to slice 0. No dynamic allocation. No conflicts.

🚫 No Cross-Warp Coordination

When loading matrix tiles with tcgen05.cp, each warp independently fills its slice. No waiting on other warps.

⚡ Parallel Loading

All 4 warps can load their slices simultaneously. This is why TMEM + warpgroups achieve such high bandwidth utilization.
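The parallel-loading pattern can be modeled on the CPU: four workers (standing in for warps) each fill a disjoint region of one buffer with no locks or barriers, because ownership is fixed by index. Sizes and the name `fill_slice` are illustrative:

```python
# Toy model of lock-free parallel loading: each "warp" writes only its
# own fixed slice of a shared buffer, so no synchronization is needed
# between writers.
from threading import Thread

SLICES, SLICE_LEN = 4, 16
tmem = bytearray(SLICES * SLICE_LEN)  # stand-in for the TMEM scratchpad

def fill_slice(warp: int) -> None:
    base = warp * SLICE_LEN
    for i in range(SLICE_LEN):       # touches only this warp's slice,
        tmem[base + i] = warp + 1    # so writers never conflict

threads = [Thread(target=fill_slice, args=(w,)) for w in range(SLICES)]
for t in threads: t.start()
for t in threads: t.join()

assert all(tmem[w * SLICE_LEN + i] == w + 1
           for w in range(SLICES) for i in range(SLICE_LEN))
```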

✅ Result

Hardware enforces structure so software doesn't have to. The programmer thinks in warpgroups, not individual warps.

NVIDIA Warpgroup vs AMD Wavefront

🟢 NVIDIA Warpgroup

Blackwell Architecture

128 threads (4 warps × 32)
Implicit sync — hardware enforced
TMEM access — dedicated scratchpad
Lane slices — no coordination needed
Async operations — tcgen05.mma

Programming model: Think in warpgroups, not warps

🔴 AMD Wavefront

CDNA3/4 Architecture

64 threads (1 wavefront)
Explicit sync — software barriers
LDS + registers — shared memory + VGPR/AGPR
Manual coordination — between wavefronts
Synchronous operations — MFMA instructions

Programming model: Explicit wave management required
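The explicit-sync column can be illustrated with a software barrier: when ownership is not fixed, the programmer must insert a barrier between the produce and consume phases. A CPU-side sketch only (`wave`, `shared`, and the sizes are mine):

```python
# Sketch of explicit software-barrier coordination: each "wave" produces
# a value, then all waves must reach the barrier before any wave may
# safely read the others' results.
from threading import Thread, Barrier

WAVES = 2
barrier = Barrier(WAVES)     # the sync point the programmer writes by hand
shared = [0] * WAVES
results = [0] * WAVES

def wave(i: int) -> None:
    shared[i] = i + 1        # produce
    barrier.wait()           # explicit barrier: all waves must arrive
    results[i] = sum(shared) # consume the other wave's result safely

threads = [Thread(target=wave, args=(i,)) for i in range(WAVES)]
for t in threads: t.start()
for t in threads: t.join()
print(results)  # [3, 3]: both waves see both contributions
```

In the warpgroup model described above, the `barrier.wait()` line simply disappears: the hardware provides the equivalent guarantee.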

💡 The Key Insight

Warpgroups let NVIDIA push complexity into hardware. Instead of making programmers coordinate 4 warps manually, the hardware does it automatically.

This is why Blackwell achieves near-peak tensor utilization with simpler code.

Hardware enforces structure → Software stays simple