NVIDIA Tensor Core Evolution — Volta → Ampere/Hopper → Blackwell

VOLTA

WMMA — Warp-Level, Synchronous

First-generation tensor cores. Single warp (32 threads) executes matrix ops synchronously.

2017

🌐 Global Memory

HBM2

LDG + STS via registers (threads generate addresses)

📦 Shared Memory

SMEM

Load to Registers ⏳

📝 Register File

Holds A, B, C matrix fragments + addresses + loop state

⚠️ HEAVY PRESSURE

WMMA (sync) 🛑

🧮 Tensor Core

WMMA • 32 threads (1 warp)

SYNCHRONOUS

📋 Key Characteristics

👥

Warp-Level (32 threads)

Single warp executes matrix multiply. Limited parallelism.

🛑

Synchronous Execution

Must wait for WMMA to complete before next operation. No overlap.

📝

Register Pressure

All matrix tiles live in registers. Competes with addresses, loop vars.

🐢

Manual Address Gen

Threads compute addresses, wasting cycles on bookkeeping.
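
To make the bookkeeping cost concrete, the sketch below models in plain Python how the 32 threads of one warp might divide up the element addresses of a single 16×16 tile. The function name, the row-major layout, and the strided thread-to-element mapping are illustrative assumptions, not the layout any particular kernel or the WMMA API uses.

```python
WARP_SIZE = 32

def tile_element_offsets(tile_row, tile_col, ld, tile=16, warp_size=WARP_SIZE):
    """Return {thread_id: [element offsets]} for one tile x tile block.

    Hypothetical mapping: thread t handles linear tile indices
    t, t + warp_size, t + 2 * warp_size, ... of a row-major matrix
    whose leading dimension is ld (in elements).
    """
    per_thread = {t: [] for t in range(warp_size)}
    for idx in range(tile * tile):
        r, c = divmod(idx, tile)  # position inside the tile
        offset = (tile_row * tile + r) * ld + (tile_col * tile + c)
        per_thread[idx % warp_size].append(offset)
    return per_thread

offsets = tile_element_offsets(tile_row=1, tile_col=2, ld=64)
per_thread_count = len(offsets[0])               # offsets computed by one thread
total = sum(len(v) for v in offsets.values())    # offsets computed by the warp
```

In this model every thread performs 8 multiply-add address computations per tile, 256 across the warp, on the same threads that must also issue the matrix instruction.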

32 THREADS • SYNC EXECUTION • HIGH REG PRESSURE

Relative Throughput: ~4s cycle with pauses

AMPERE / HOPPER

WGMMA — Warpgroup-Level, Asynchronous

128 threads (4 warps) work together. TMA enables async data movement. Both WGMMA and TMA arrived with Hopper; Ampere kept warp-level mma.sync and added cp.async copies.

2020-22

🌐 Global Memory

HBM2e / HBM3

TMA Async ⚡ NEW!

⚡ TMA Unit

Hardware Address Generation

🆕 ASYNC ENGINE

Async Copy

📦 Shared Memory

SMEM

Load to Registers

📝 Register File

Still holds matrix tiles (40-60%!)

⚠️ STILL BUSY

WGMMA (async) ⚡

🧮 Tensor Core

WGMMA • 128 threads (4 warps)

ASYNC!

📋 Key Improvements

👥

Warpgroup-Level (128 threads)

4 warps cooperate as one unit. 4× more threads than Volta!

⚡

TMA Unit (NEW!)

Hardware handles address generation. Threads freed from bookkeeping.

🔄

Async Execution

WGMMA is async! Issue and move on. Overlap compute + memory.

⚠️

Register Pressure Remains

Accumulator tiles (and optionally the A operand) still live in registers; B is read from shared memory. 40-60% of registers consumed by WGMMA tiles.
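
A back-of-the-envelope check on the register share, as a minimal Python sketch. It assumes a single 64×256 FP32 accumulator tile spread evenly over the 128 threads of a warpgroup and a 255-register per-thread limit; the tile shape and helper names are chosen for illustration, and real kernels pick their own tile sizes.

```python
MAX_REGS_PER_THREAD = 255  # assumed per-thread register limit
REGISTER_BYTES = 4         # one 32-bit register

def accumulator_regs_per_thread(m, n, threads=128, bytes_per_elem=4):
    """32-bit registers each thread needs for its share of an m x n accumulator."""
    total_bytes = m * n * bytes_per_elem
    return total_bytes // (threads * REGISTER_BYTES)

def share_of_budget(regs, budget=MAX_REGS_PER_THREAD):
    """Fraction of the per-thread register budget taken by the accumulator."""
    return regs / budget

regs = accumulator_regs_per_thread(64, 256)  # hypothetical 64 x 256 FP32 tile
share = share_of_budget(regs)
print(regs, f"{share:.0%}")
```

Under these assumptions the accumulator alone takes 128 registers per thread, about half of the budget, which sits inside the 40-60% range quoted above. Smaller tiles take proportionally less.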

128 THREADS • ASYNC EXECUTION • 40-60% REG FOR TILES

Relative Throughput: ~3s cycle, continuous flow
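
The benefit of "issue and move on" can be sketched with an idealized timing model in Python. It compares a fully serialized loop (load, then multiply, per tile) against a two-stage pipeline in which the load of the next tile overlaps the multiply of the current one. The function names and the time units are assumptions made for illustration; real kernels have more stages and extra synchronization cost.

```python
def serial_time(n_tiles, t_load, t_mma):
    """Synchronous model: each tile is loaded, then multiplied, with no overlap."""
    return n_tiles * (t_load + t_mma)

def overlapped_time(n_tiles, t_load, t_mma):
    """Two-stage pipeline: the load of tile i+1 overlaps the MMA of tile i."""
    if n_tiles == 0:
        return 0
    # The first load is exposed, each later step costs the slower stage,
    # and the final MMA drains the pipeline.
    return t_load + (n_tiles - 1) * max(t_load, t_mma) + t_mma

sync_total = serial_time(8, t_load=3, t_mma=2)
async_total = overlapped_time(8, t_load=3, t_mma=2)
```

With 8 tiles, a load cost of 3 units, and a multiply cost of 2 units, the serialized loop takes 40 units and the overlapped pipeline 26, because only the slower stage stays on the critical path.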

BLACKWELL

tcgen05 + TMEM — Dedicated Tensor Scratchpad

256KB TMEM per SM. Matrix operands bypass registers entirely!

2024

🌐 Global Memory

HBM3e • 8 TB/s

TMA Async ⚡

⚡ TMA Unit

Hardware Address Generation

Async Copy

📦 Shared Memory

SMEM

tcgen05.cp 🔥 NEW!

🔥 TMEM

256KB Dedicated Tensor Memory

🆕 BYPASSES REGISTERS!

tcgen05.mma

📝 Register File

For addresses, loops, control — NOT tiles!

✅ NOW FREE!

Warpgroup MMA ⚡

🧮 Tensor Core

tcgen05.mma • issued by one thread, results read back by a warpgroup (128 threads)

ASYNC + DEDICATED MEM

📋 Revolutionary Changes

🔥

TMEM: 256KB per SM (NEW!)

Dedicated tensor memory. The accumulator (and optionally the A operand) lives here, NOT in registers! B is read from SMEM.

📝

Registers Finally FREE

No more 40-60% for tiles. Registers for addresses, loops, actual work.

⚡

tcgen05.cp: SMEM → TMEM

Direct copy that completely bypasses the register file.

🚀

Maximum Throughput

Async TMA + TMEM + free registers = optimal tensor core utilization.
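
The 256KB figure can be checked with a little arithmetic, sketched here in Python. The sketch assumes TMEM is organized as 128 lanes by 512 columns of 32-bit cells, which is consistent with the capacity quoted above; the constant and function names are illustrative.

```python
TMEM_LANES = 128     # assumed number of rows (lanes)
TMEM_COLUMNS = 512   # assumed number of columns per lane
CELL_BYTES = 4       # each cell holds 32 bits

def tmem_bytes():
    """Total TMEM capacity per SM under the assumed organization."""
    return TMEM_LANES * TMEM_COLUMNS * CELL_BYTES

def accumulator_columns(n, bytes_per_elem=4):
    """Columns used by a 128 x n accumulator stored one row per lane."""
    return n * bytes_per_elem // CELL_BYTES

capacity_kib = tmem_bytes() // 1024
cols = accumulator_columns(256)           # hypothetical 128 x 256 FP32 tile
tiles_that_fit = TMEM_COLUMNS // cols
```

Under these assumptions a 128×256 FP32 accumulator fills 256 columns, half of TMEM, so two such tiles fit at once and one can be drained while the other is being computed.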

128 THREADS • 256KB TMEM • 0% REG FOR TILES

Relative Throughput: ~2.5s cycle, maximum flow! 🚀