The Register Pressure Problem on Hopper — and How Blackwell Fixes It
Before TMEM, tensor core accumulators (and, optionally, the A operand) lived in registers. On Hopper, a single warpgroup running WGMMA could spend 40–60% of its register budget just holding matrix tiles.
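To make the register cost concrete, here is a rough back-of-the-envelope sketch. It assumes a 128-thread warpgroup, FP32 accumulators, and a 255-register per-thread limit; the function names are hypothetical and exist only for this illustration. It counts the accumulator tile alone, so real kernels will use more.

```python
# Illustrative arithmetic only; the constants are assumptions stated in the lead-in.
WARPGROUP_THREADS = 128    # one warpgroup = 4 warps x 32 threads
MAX_REGS_PER_THREAD = 255  # assumed per-thread register limit


def accumulator_regs_per_thread(tile_m: int, tile_n: int, bytes_per_elem: int = 4) -> int:
    """32-bit registers each thread holds for its share of an M x N accumulator tile."""
    total_words = tile_m * tile_n * bytes_per_elem // 4
    return total_words // WARPGROUP_THREADS


def fraction_of_thread_budget(tile_m: int, tile_n: int) -> float:
    """Share of the per-thread register limit taken by the accumulator alone."""
    return accumulator_regs_per_thread(tile_m, tile_n) / MAX_REGS_PER_THREAD


if __name__ == "__main__":
    for n in (64, 128, 256):
        regs = accumulator_regs_per_thread(64, n)
        share = fraction_of_thread_budget(64, n)
        print(f"64x{n} FP32 tile: {regs} regs/thread ({share:.0%} of the limit)")
```

Under these assumptions, a 64x256 FP32 accumulator alone takes 128 registers per thread, roughly half of the per-thread limit, before any addresses, loop counters, or epilogue values are counted.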
tcgen05.cp
Copies data from Shared Memory → TMEM
⚡ Bypasses registers entirely!
tcgen05.mma
Matrix multiply that accumulates into TMEM; A is read from TMEM or Shared Memory, B from Shared Memory
⚡ Tensor core reads and writes TMEM directly!
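As a rough sizing sketch, the snippet below estimates how many accumulator tiles fit in TMEM. It assumes TMEM is organized as 128 lanes by 512 columns of 32-bit cells per SM, and that a 128-row FP32 tile maps one row per lane and one element per cell; other tile shapes use different layouts and are not modeled. All names are hypothetical.

```python
# Illustrative sizing only; the TMEM geometry below is an assumption stated in the lead-in.
TMEM_LANES = 128     # assumed rows (lanes) per SM
TMEM_COLUMNS = 512   # assumed 32-bit columns per lane
CELL_BYTES = 4       # each cell holds one 32-bit value


def tmem_capacity_bytes() -> int:
    """Total TMEM capacity per SM under the assumed geometry."""
    return TMEM_LANES * TMEM_COLUMNS * CELL_BYTES


def columns_for_accumulator(tile_m: int, tile_n: int) -> int:
    """Columns used by an FP32 accumulator tile with one row per lane."""
    if tile_m != TMEM_LANES:
        raise ValueError("this sketch only models tiles with one row per TMEM lane")
    return tile_n  # one 32-bit cell per FP32 element


def accumulators_that_fit(tile_m: int, tile_n: int) -> int:
    """How many such tiles fit side by side in TMEM."""
    return TMEM_COLUMNS // columns_for_accumulator(tile_m, tile_n)


if __name__ == "__main__":
    print(tmem_capacity_bytes() // 1024, "KiB of TMEM per SM")
    print(accumulators_that_fit(128, 256), "accumulator tiles of 128x256 FP32 fit")
```

With these numbers, two 128x256 FP32 accumulators fit at once, which is what allows one tile to be drained while the next is still being computed, without spending any registers on the accumulators themselves.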
Registers are expensive. TMEM exists so tensor core accumulators and operands don't occupy them.
Result: registers freed for other work, room for more resident warps, better utilization, faster kernels.
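The link between registers and resident warps can be sketched with a simplified occupancy model. It assumes 65,536 32-bit registers and at most 64 resident warps per SM, and it ignores register allocation granularity, shared memory, and block-count limits, so treat the results as upper bounds. The function name is hypothetical.

```python
# Simplified occupancy model; the constants are assumptions stated in the lead-in.
REGS_PER_SM = 65536      # assumed 32-bit registers per SM
WARP_SIZE = 32           # threads per warp
MAX_WARPS_PER_SM = 64    # assumed hardware cap on resident warps


def resident_warps(regs_per_thread: int) -> int:
    """Upper bound on resident warps when registers are the only limit."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    by_registers = REGS_PER_SM // (regs_per_thread * WARP_SIZE)
    return min(by_registers, MAX_WARPS_PER_SM)


if __name__ == "__main__":
    for regs in (168, 128, 64, 40):
        print(f"{regs} regs/thread -> at most {resident_warps(regs)} warps per SM")
```

In this model, cutting a kernel from 128 to 64 registers per thread doubles the register-limited warp count from 16 to 32; whether that translates into speed depends on the other limits the model leaves out.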
© 2026 Subramaniyam Pooni | CS²B Technologies | Based on "Feeding the Tensor Cores" by Emilio Andere