How NVIDIA's Blackwell coordinates 4 warps with implicit hardware synchronization
Blackwell tensor operations (the tcgen05 family, e.g. tcgen05.mma) operate at the warpgroup level, not the individual-warp level. All 128 threads across 4 contiguous warps execute as a single coordinated unit.
The hardware automatically synchronizes all 4 warps. No __syncthreads() needed. No barriers. No race conditions. Hardware enforces correctness.
Think of a warpgroup as one big thread with 128-way parallelism built in. The complexity is hidden by hardware.
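From the kernel's point of view, that mostly means each thread just needs to know where it sits in the group. A minimal sketch (plain index arithmetic, no special API assumed):

```cuda
// Sketch: deriving warpgroup coordinates inside a kernel.
// A warpgroup is 4 consecutive warps = 128 threads; per-thread code mostly
// needs to know which group it is in and which warp-within-group it is.
#include <cstdio>

__global__ void warpgroup_coords() {
    const int tid           = threadIdx.x;
    const int warp_id       = tid / 32;     // warp index within the block
    const int warpgroup_id  = warp_id / 4;  // 4 warps per warpgroup
    const int warp_in_group = warp_id % 4;  // 0..3 within the warpgroup
    const int lane          = tid % 32;

    if (lane == 0) {
        printf("warpgroup %d, warp %d in group\n", warpgroup_id, warp_in_group);
    }
}

int main() {
    warpgroup_coords<<<1, 256>>>();  // 256 threads = two warpgroups
    cudaDeviceSynchronize();
    return 0;
}
```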
TMEM (Tensor Memory, 256 KB per SM) is divided into 4 lane slices, one per warp in the warpgroup.
Each warp owns a fixed 64KB slice of TMEM. Warp 0 always writes to slice 0. No dynamic allocation. No conflicts.
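That fixed ownership rule is pure index arithmetic. A conceptual sketch (the helper names are ours, and TMEM is really addressed through tcgen05 lane/column addresses rather than byte offsets; the byte math here is only to illustrate the static partitioning):

```cuda
// Sketch: the static warp-to-slice mapping. With 4 warps per warpgroup and
// a fixed 64 KB slice each, ownership is decided by index arithmetic alone;
// there is nothing to allocate and nothing to contend for.
__device__ __forceinline__ int tmem_slice_for_warp() {
    int warp_in_group = (threadIdx.x / 32) % 4;  // 0..3
    return warp_in_group;                        // warp N -> slice N, always
}

// Illustrative offset of a warp's slice within the 256 KB TMEM region
// (conceptual only: real TMEM addressing is lane/column based).
__device__ __forceinline__ int tmem_slice_byte_offset(int slice) {
    return slice * 64 * 1024;  // 64 KB per slice
}
```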
When loading matrix tiles with tcgen05.cp, each warp independently fills its slice. No waiting on other warps.
All 4 warps can load their slices simultaneously. This is why TMEM + warpgroups achieve such high bandwidth utilization.
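The no-waiting property can be sketched as follows. load_tile_to_tmem_slice is a hypothetical stand-in for the tcgen05.cp path, stubbed out here; the point is the absence of any inter-warp barrier between the four copies.

```cuda
// Hypothetical stand-in for a tcgen05.cp-based copy of one tile into one
// TMEM slice. Stubbed: real code would issue the async copy here.
__device__ void load_tile_to_tmem_slice(const float* smem_tile, int slice) {
    (void)smem_tile; (void)slice;  // stub body for illustration
}

// Each warp fills only its own slice. Because the 4 slices are disjoint by
// construction, all four copies proceed concurrently.
__device__ void load_all_slices(const float* smem_tile) {
    int warp_in_group = (threadIdx.x / 32) % 4;
    load_tile_to_tmem_slice(smem_tile, /*slice=*/warp_in_group);
    // No __syncthreads() here: slice ownership rules out cross-warp conflicts.
}
```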
Hardware enforces structure so software doesn't have to. The programmer thinks in warpgroups, not individual warps.
Blackwell architecture programming model: think in warpgroups, not warps.
AMD CDNA3/4 architecture programming model: explicit wave management required.
Warpgroups let NVIDIA push complexity into hardware.
Instead of making programmers coordinate 4 warps manually, the hardware does it automatically.
This is why Blackwell achieves near-peak tensor utilization with simpler code.