BLACKWELL INNOVATION

Why TMEM Exists

The Register Pressure Problem on Hopper — and How Blackwell Fixes It

Before TMEM, tensor operands lived in registers. On Hopper, a single warpgroup running WGMMA could consume 40–60% of the register file just to hold matrix tiles.

❌ BEFORE: Hopper (H100)
Register File (256 registers per thread)
Matrix A Tile (~20%)
Matrix B Tile (~20%)
Accumulator C (~15%)
55% CONSUMED BY TILES!
Only ~45% left for everything else:
📍 Addresses
Cramped!
🔄 Loops
Cramped!
📋 Async
Cramped!
⚠️ Limited occupancy (fewer warps can run)
⚠️ Register spilling to memory (slow!)
⚠️ Complex register allocation
✅ AFTER: Blackwell (B200)
NEW: TMEM (256KB)
Matrix A ✓
Matrix B ✓
Accumulator C ✓
Dedicated tensor scratchpad
Registers: 100% FREE!
📍 Address Calculations
Plenty of room!
🔄 Loop Counters
Plenty of room!
📋 Async Bookkeeping
Plenty of room!
No tiles hogging space!
Higher occupancy (more warps = better latency hiding)
No register spilling needed
Simpler kernel development

How TMEM Works: Two Key Instructions

tcgen05.cp

Copies data from Shared Memory → TMEM

⚡ Bypasses registers entirely!

tcgen05.mma

Matrix multiply reading A,B from TMEM

⚡ Tensor core reads directly from TMEM!

💡

Registers are expensive. TMEM exists so tensor operands don't waste them.

Result: More warps, better utilization, faster kernels.

© 2026 Subramaniyam Pooni | CS²B Technologies | Based on "Feeding the Tensor Cores" by Emilio Andere