🔥 TMEM Data Flow

tcgen05.cp copies data from Shared Memory directly to TMEM, bypassing registers entirely. Then tcgen05.mma reads operands straight from TMEM for matrix multiply.

Matrix A
Matrix B
📦
Shared Memory (SMEM)
Matrix tiles loaded here first
tcgen05.cp
SMEM → TMEM (bypasses registers!)
👥
Threads
Issue & move on
ASYNC
Working on
other tasks!
🔥
TMEM
256KB Dedicated Tensor Memory
Matrix A & B Operands Live Here!
A tile
B tile
📝
Registers
Now FREE!
✅ FOR COMPUTE
BYPASSED!
No tiles here
tcgen05.mma
Reads A,B from TMEM → Compute
🧮
Tensor Core
Matrix Multiply-Accumulate
D = A × B + C

⚡ tcgen05.cp

Copy directly from Shared Memory to TMEM. The key innovation: this instruction completely bypasses the register file.

🚀
Registers are no longer consumed by matrix tiles!
They're FREE for addresses, loop counters, and actual computation.

🧮 tcgen05.mma

Matrix multiply that reads A and B operands straight from TMEM. No register staging required!

Async execution: issue the op and immediately move on.
When results needed, explicit commit drains the pipeline.

⚡ The Async Advantage

Threads issue the op and immediately move on to other work. The data movement and computation happen in the background. When you need results, an explicit commit drains the pipeline. This enables overlap of compute and memory operations — the key to keeping tensor cores fed!