tcgen05.cp copies data from Shared Memory directly to TMEM, bypassing registers entirely. tcgen05.mma can then read its A operand straight from TMEM (B is read from Shared Memory through a matrix descriptor) and accumulates the result in TMEM.
Copy directly from Shared Memory to TMEM. The key innovation: this instruction completely bypasses the register file.
Matrix multiply that reads the A operand from TMEM or Shared Memory and the B operand from Shared Memory, accumulating into TMEM. No register staging required!
A thread issues the op and immediately moves on to other work. The data movement and computation happen in the background. When you need results, an explicit tcgen05.commit makes an mbarrier track completion of the operations issued so far, and waiting on that mbarrier tells you the results are ready. This enables overlap of compute and memory operations, which is the key to keeping tensor cores fed!
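The issue, commit, wait pattern can be sketched with a host-side C++ analogy. This is not tcgen05 or CUDA code: `AsyncUnit`, `issue`, `commit`, `wait`, and `run_demo` are hypothetical names invented for illustration, a single worker thread stands in for the asynchronous hardware unit, and a ticket counter stands in for the mbarrier.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an asynchronous execution unit (not a CUDA API).
// Ops run in issue order on a background thread.
class AsyncUnit {
public:
    AsyncUnit() : worker_([this] { run(); }) {}
    ~AsyncUnit() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }

    // Enqueue an op and return immediately; the caller does not block.
    void issue(std::function<void()> op) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(op)); }
        cv_.notify_all();
    }

    // Enqueue a marker that fires once all previously issued ops have run.
    // Returns a ticket the caller can wait on later.
    int commit() {
        int ticket;
        {
            std::lock_guard<std::mutex> lk(m_);
            ticket = ++issued_commits_;
            q_.push([this] {
                { std::lock_guard<std::mutex> lk2(m_); ++completed_commits_; }
                cv_.notify_all();
            });
        }
        cv_.notify_all();
        return ticket;
    }

    // Block until the marker for this ticket has fired.
    void wait(int ticket) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return completed_commits_ >= ticket; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> op;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
                if (q_.empty()) return;  // stop requested and queue drained
                op = std::move(q_.front());
                q_.pop();
            }
            op();  // run outside the lock so issue() never blocks on an op
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    int issued_commits_ = 0;
    int completed_commits_ = 0;
    bool stop_ = false;
    std::thread worker_;  // declared last so the other members exist first
};

int run_demo() {
    std::vector<int> tile(4, 0);  // stands in for the destination of a copy
    int result = 0;               // stands in for an accumulator
    AsyncUnit unit;

    unit.issue([&] { for (int i = 0; i < 4; ++i) tile[i] = i + 1; });  // "copy"
    unit.issue([&] { for (int v : tile) result += 2 * v; });           // "multiply-accumulate"
    int ticket = unit.commit();

    // Independent work on the issuing thread overlaps with the queued ops.
    int other = 0;
    for (int i = 0; i < 1000; ++i) other += i;
    assert(other == 499500);

    unit.wait(ticket);  // results are only safe to read after this returns
    return result;
}
```

Under these assumptions, `run_demo()` returns 20, and `result` is read only after `wait` returns. The point of the design is that `issue` and `commit` never block: the only place the issuing thread stalls is the explicit `wait`, which is what lets unrelated work fill the gap.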