BLACKWELL INNOVATION

Why TMEM Exists

The Register Pressure Problem on Hopper — and How Blackwell Fixes It

Before TMEM, tensor operands lived in registers. On Hopper, a single warpgroup running WGMMA could consume 40–60% of the register file just to hold matrix tiles.

❌ BEFORE: Hopper (H100)

Matrix A Tile (~20%)

Matrix B Tile (~20%)

Accumulator C (~15%)

55% CONSUMED BY TILES!

Only ~45% left for everything else:

📍 Addresses

Cramped!

🔄 Loops

Cramped!

📋 Async

Cramped!

⚠️ Limited occupancy (fewer warps can run)

⚠️ Register spilling to memory (slow!)

⚠️ Complex register allocation

✅ AFTER: Blackwell (B200)

NEW: TMEM (256KB)

Matrix A ✓

Matrix B ✓

Accumulator C ✓

Dedicated tensor scratchpad

Registers: 100% FREE!

📍 Address Calculations

Plenty of room!

🔄 Loop Counters

Plenty of room!

📋 Async Bookkeeping

Plenty of room!

No tiles hogging space!

✓ Higher occupancy (more warps = better latency hiding)

✓ No register spilling needed

✓ Simpler kernel development

How TMEM Works: Two Key Instructions

tcgen05.cp

Copies data from Shared Memory → TMEM

⚡ Bypasses registers entirely!

tcgen05.mma

Matrix multiply reading A,B from TMEM

⚡ Tensor core reads directly from TMEM!

💡

Registers are expensive. TMEM exists so tensor operands don't waste them.

Result: More warps, better utilization, faster kernels.