🔬 NVIDIA TENSOR CORE EVOLUTION

Three Generations of Innovation

How data flow improved from Volta → Ampere/Hopper → Blackwell

VOLTA

WMMA — Warp-Level, Synchronous

First-generation tensor cores. Single warp (32 threads) executes matrix ops synchronously.

2017
🌐 Global Memory
HBM2
LDG + STS (threads generate addresses; data staged through registers)
📦 Shared Memory
SMEM
Load to Registers ⏳
📝 Register File
Holds A, B, C matrices + addresses + everything
⚠️ HEAVY PRESSURE
WMMA (sync) 🛑
🧮 Tensor Core
WMMA • 32 threads (1 warp)
SYNCHRONOUS
📋 Key Characteristics
👥

Warp-Level (32 threads)

Single warp executes matrix multiply. Limited parallelism.

🛑

Synchronous Execution

Must wait for WMMA to complete before next operation. No overlap.

📝

Register Pressure

All matrix tiles live in registers, where they compete with addresses and loop variables.

🐢

Manual Address Gen

Threads compute addresses, wasting cycles on bookkeeping.

32 THREADS • SYNC EXECUTION • HIGH REG PRESSURE
Relative Throughput (animation pacing, not a measurement): ~4 s cycle with pauses
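To make the warp-level tile concrete, here is a minimal pure-Python sketch of the math one WMMA step performs (D = A × B + C) and of the register cost of the accumulator. The 16×16×16 tile shape, the helper names, and the even no-duplication split across 32 threads are assumptions for illustration; real fragment layouts are opaque and hardware-specific.

```python
# Emulates the math of one warp-level 16x16x16 WMMA step: D = A x B + C.
# Pure Python; tile shape and even data distribution are assumptions.

M = N = K = 16      # one common WMMA tile shape
WARP_SIZE = 32      # threads that cooperate on a single tile


def wmma_tile(a, b, c):
    """Return D = A x B + C for row-major nested lists."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(K)) for j in range(N)]
        for i in range(M)
    ]


def accumulator_regs_per_thread(rows, cols, threads=WARP_SIZE):
    """32-bit registers per thread for an FP32 accumulator tile,
    assuming elements are spread evenly with no duplication."""
    return (rows * cols) // threads


# Identity times B leaves B unchanged, so D = B + C.
identity = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(M)]
b_tile = [[float(i + j) for j in range(N)] for i in range(K)]
c_tile = [[1.0] * N for _ in range(M)]
d_tile = wmma_tile(identity, b_tile, c_tile)
```

Every accumulator element has to sit in some thread's registers for the whole loop over K, which is where the pressure comes from once kernels use several tiles per warp.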
⬇️
AMPERE / HOPPER

WGMMA — Warpgroup-Level, Asynchronous

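128 threads (4 warps) work together. TMA enables async data movement. Both WGMMA and TMA arrived with Hopper (2022); Ampere (2020) added asynchronous copies (cp.async) but kept warp-level, synchronous MMA.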

2020-22
🌐 Global Memory
HBM2e / HBM3
TMA Async ⚡ NEW!
⚡ TMA Unit
Hardware Address Generation
🆕 ASYNC ENGINE
Async Copy
📦 Shared Memory
SMEM
Load to Registers
📝 Register File
Still holds accumulator tiles, and optionally A (40-60%!)
⚠️ STILL BUSY
WGMMA (async) ⚡
🧮 Tensor Core
WGMMA • 128 threads (4 warps)
ASYNC!
📋 Key Improvements
👥

Warpgroup-Level (128 threads)

4 warps cooperate as one unit. 4× as many threads per MMA as Volta!

TMA Unit (NEW!)

Hardware handles address generation. Threads freed from bookkeeping.

🔄

Async Execution

WGMMA is async! Issue and move on. Overlap compute + memory.

⚠️

Register Pressure Remains

Accumulators, and optionally the A operand, still live in registers (B is read from shared memory). Roughly 40-60% of registers go to WGMMA operands.
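The register cost of warpgroup-level accumulators can be estimated with simple arithmetic. The sketch below assumes FP32 accumulators, a 64 × N output tile shared evenly by 128 threads, and the CUDA limit of 255 registers per thread; the function names are hypothetical, and the result is a rough estimate, not a measurement.

```python
# Back-of-envelope register accounting for one warpgroup-level MMA.
# Assumptions: FP32 accumulators, a 64 x N tile split evenly across
# 128 threads, and a budget of 255 registers per thread.

WARPGROUP_THREADS = 128
MAX_REGS_PER_THREAD = 255


def accumulator_regs(n_cols, m_rows=64, threads=WARPGROUP_THREADS):
    """32-bit registers each thread needs for an m_rows x n_cols FP32 tile."""
    return (m_rows * n_cols) // threads


def fraction_of_budget(regs, budget=MAX_REGS_PER_THREAD):
    """Share of the per-thread register budget taken by `regs` registers."""
    return regs / budget


for n in (128, 192, 256):
    regs = accumulator_regs(n)
    print(f"N={n}: {regs} regs/thread, {fraction_of_budget(regs):.0%} of budget")
```

Under these assumptions a 64 × 256 tile alone takes about half of each thread's registers, which is in the neighborhood of the 40-60% figure quoted above.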

128 THREADS • ASYNC EXECUTION • 40-60% REG FOR TILES
Relative Throughput (animation pacing, not a measurement): ~3 s cycle, continuous flow
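Why async matters can be shown with a toy timing model. It assumes every tile costs a fixed `load` and `compute` time (arbitrary units) and that an async pipeline overlaps the two perfectly after the first load; real kernels sit somewhere between the two results. The function names are hypothetical.

```python
# Toy timing model: synchronous vs. perfectly overlapped tile pipeline.
# Units are arbitrary; fixed per-tile costs are an assumption.

def sync_time(tiles, load, compute):
    """Each tile is loaded, then computed, with no overlap."""
    return tiles * (load + compute)


def overlapped_time(tiles, load, compute):
    """The load of tile i+1 runs while tile i is being computed."""
    if tiles == 0:
        return 0
    # First load and last compute cannot be hidden; in between, the
    # slower of the two stages sets the pace.
    return load + (tiles - 1) * max(load, compute) + compute


tiles, load, compute = 8, 3, 2
print("sync:      ", sync_time(tiles, load, compute))
print("overlapped:", overlapped_time(tiles, load, compute))
```

With a single tile the two models agree, since there is nothing to overlap; the gap grows with the number of tiles in flight.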
⬇️
BLACKWELL

tcgen05 + TMEM — Dedicated Tensor Scratchpad

256 KB of TMEM per SM. Accumulators no longer have to sit in registers during the MMA!

2024
🌐 Global Memory
HBM3e • 8 TB/s
TMA Async ⚡
⚡ TMA Unit
Hardware Address Generation
Async Copy
📦 Shared Memory
SMEM
tcgen05.cp 🔥 NEW!
🔥 TMEM
256KB Dedicated Tensor Memory
🆕 BYPASSES REGISTERS!
tcgen05.mma
📝 Register File
For addresses, loops, control — NOT tiles!
✅ NOW FREE!
Warpgroup MMA ⚡
🧮 Tensor Core
tcgen05 • MMA issued by one thread; a warpgroup (128 threads) reads results back from TMEM
ASYNC + DEDICATED MEM
📋 Revolutionary Changes
🔥

TMEM: 256KB per SM (NEW!)

Dedicated tensor memory. Accumulators (and optionally the A operand) live here, NOT in registers! B is read from shared memory.

📝

Registers Finally FREE

No more 40-60% for tiles. Registers for addresses, loops, actual work.

tcgen05.cp: SMEM → TMEM

Direct copy that completely bypasses the register file.

🚀

Maximum Throughput

Async TMA + TMEM + free registers = optimal tensor core utilization.

128 THREADS • 256KB TMEM • 0% REG FOR TILES
Relative Throughput (animation pacing, not a measurement): ~2.5 s cycle, maximum flow! 🚀
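The 256 KB figure can be checked from the TMEM geometry. This sketch assumes the layout described in NVIDIA's PTX documentation (128 lanes × 512 columns of 32-bit cells) and a 128-row FP32 accumulator that occupies one lane per row; the helper names are hypothetical.

```python
# TMEM capacity arithmetic. Assumed geometry: 128 lanes x 512 columns,
# 4 bytes per cell; one accumulator row per lane.

TMEM_LANES = 128
TMEM_COLUMNS = 512
CELL_BYTES = 4


def tmem_bytes():
    """Total TMEM capacity per SM in bytes."""
    return TMEM_LANES * TMEM_COLUMNS * CELL_BYTES


def accumulator_columns(n_cols, bytes_per_elem=4):
    """TMEM columns used by a 128 x n_cols accumulator tile."""
    return (n_cols * bytes_per_elem) // CELL_BYTES


def tiles_that_fit(n_cols):
    """How many 128 x n_cols FP32 accumulators fit in TMEM at once."""
    return TMEM_COLUMNS // accumulator_columns(n_cols)


print(tmem_bytes() // 1024, "KB of TMEM")
print(tiles_that_fit(256), "accumulator tiles of 128 x 256 fit")
```

Under these assumptions two 128 × 256 accumulators fit side by side, so one tile can be drained while the next is still being accumulated.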