Watch how data flow improved from Volta → Ampere/Hopper → Blackwell
First-generation tensor cores (Volta). A single warp (32 threads) executes each matrix operation synchronously.
One warp per matrix multiply, so parallelism within an operation is limited.
The warp must wait for each WMMA operation to complete before issuing the next one. No overlap within that warp.
All matrix tiles live in registers, where they compete with addresses and loop variables.
Threads compute memory addresses themselves, spending cycles on bookkeeping instead of math.
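To see what the lack of overlap costs, here is a toy timing model. It is only a sketch: the cycle counts are made-up illustrative numbers, not measurements, and the function names are hypothetical. It compares a fully serialized load-then-multiply loop with an idealized pipeline in which the next tile's load runs while the current tile is multiplied.

```python
def serialized_cycles(num_tiles: int, load: int, mma: int) -> int:
    """Each tile is loaded, then multiplied; nothing overlaps."""
    return num_tiles * (load + mma)


def overlapped_cycles(num_tiles: int, load: int, mma: int) -> int:
    """Idealized double-buffered pipeline: the load of tile i+1 runs
    while the multiply of tile i executes."""
    if num_tiles == 0:
        return 0
    # The first load and the last multiply are exposed; in between,
    # the slower of the two stages sets the pace.
    return load + (num_tiles - 1) * max(load, mma) + mma


# Hypothetical numbers, chosen only for illustration.
tiles, load_cycles, mma_cycles = 8, 100, 100
sync_total = serialized_cycles(tiles, load_cycles, mma_cycles)
async_total = overlapped_cycles(tiles, load_cycles, mma_cycles)
print(sync_total, async_total)  # 1600 900
```

In this model the overlapped schedule approaches half the serialized time when load and multiply take equally long. Real kernels also hide latency by switching between warps, which the model ignores.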
Hopper: 128 threads (4 warps) work together as a warpgroup. TMA enables asynchronous data movement.
4 warps cooperate as one unit: 4× more threads per matrix operation than Volta!
With TMA, hardware handles address generation. Threads are freed from bookkeeping.
WGMMA is asynchronous! Issue it and move on, overlapping compute with memory traffic.
Accumulator tiles (and optionally the A operand) still live in registers, consuming roughly 40-60% of each thread's register budget.
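A rough back-of-the-envelope check on that register pressure, as a sketch under stated assumptions: a 64 x 256 FP32 accumulator tile (one plausible WGMMA output shape, picked here for illustration) spread evenly over the 128 threads of a warpgroup, measured against the 255-register per-thread limit. The helper name is hypothetical.

```python
THREADS_PER_WARPGROUP = 128   # 4 warps x 32 threads
MAX_REGS_PER_THREAD = 255     # per-thread limit of 32-bit registers
REG_BYTES = 4                 # one register holds 32 bits


def accumulator_regs_per_thread(m: int, n: int, bytes_per_elem: int = 4) -> int:
    """32-bit registers each thread holds for an m x n accumulator tile
    spread evenly over one warpgroup."""
    total_bytes = m * n * bytes_per_elem
    return total_bytes // (THREADS_PER_WARPGROUP * REG_BYTES)


# Assumed tile shape: 64 x 256 with FP32 accumulators.
regs = accumulator_regs_per_thread(64, 256)
share = regs / MAX_REGS_PER_THREAD
print(regs, f"{share:.0%}")  # 128 50%
```

Under these assumptions the accumulators alone take about half of what a thread may use, before any addresses, loop counters, or epilogue values are counted.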
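256 KB of TMEM per SM. The multiply no longer routes its accumulators through registers!
Dedicated tensor memory. Accumulators (and optionally the A operand) live here, NOT in registers; B is read from shared memory.
Tiles no longer take 40-60% of the registers, which are left for addresses, loop state, and other actual work.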
Direct copy that completely bypasses the register file.
Async TMA + TMEM + freed registers = higher tensor core utilization.
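The 256 KB figure can be reproduced with simple arithmetic. This sketch assumes the TMEM organization of 128 lanes by 512 columns of 32-bit cells, and it assumes a 128-row tile of 32-bit values maps one row per lane and one value per column; the constant and function names are hypothetical.

```python
TMEM_LANES = 128      # rows of tensor memory
TMEM_COLUMNS = 512    # columns per lane
CELL_BYTES = 4        # each cell holds 32 bits

tmem_bytes = TMEM_LANES * TMEM_COLUMNS * CELL_BYTES
tmem_kib = tmem_bytes // 1024
print(tmem_kib, "KiB")  # 256 KiB


def columns_for_tile(n: int, bytes_per_elem: int = 4) -> int:
    """Columns used by a 128 x n tile, assuming one row per lane and
    elements packed into 32-bit cells."""
    return n * bytes_per_elem // CELL_BYTES


# How many 128 x 256 tiles of 32-bit values fit at once (assumed mapping).
tiles_that_fit = TMEM_COLUMNS // columns_for_tile(256)
print(tiles_that_fit)  # 2
```

Under this mapping two 128 x 256 FP32 tiles fit side by side, which is enough to fill one while the other is being drained.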