TMEM Data Flow — tcgen05.cp & tcgen05.mma Animation

Matrix A

Matrix B

📦

Shared Memory (SMEM)

Matrix tiles loaded here first

↓

tcgen05.cp

SMEM → TMEM (bypasses registers!)

👥

Threads

Issue & move on

ASYNC

Working on
other tasks!

🔥

TMEM

256KB Dedicated Tensor Memory

Matrix A & B Operands Live Here!

A tile

B tile

📝

Registers

Now FREE!

✅ FOR COMPUTE

BYPASSED!
No tiles here

↓

tcgen05.mma

Reads A,B from TMEM → Compute

🧮

Tensor Core

Matrix Multiply-Accumulate

D = A × B + C

⚡ tcgen05.cp

Copy directly from Shared Memory to TMEM. The key innovation: this instruction completely bypasses the register file.

🚀

Registers are no longer consumed by matrix tiles!
They're FREE for addresses, loop counters, and actual computation.

🧮 tcgen05.mma

Matrix multiply that reads A and B operands straight from TMEM. No register staging required!

⚡

Async execution: issue the op and immediately move on.
When results needed, explicit commit drains the pipeline.

⚡ The Async Advantage

Threads issue the op and immediately move on to other work. The data movement and computation happen in the background. When you need results, an explicit commit drains the pipeline. This enables overlap of compute and memory operations — the key to keeping tensor cores fed!