The evolution from manual address computation to fully async tensor pipelines — A100 (Ampere), H100 (Hopper), and B200 (Blackwell) compared
Each generation fundamentally changed how data reaches the tensor cores. The progression: threads compute addresses → hardware computes addresses → dedicated tensor memory eliminates register pressure entirely.
```cuda
// A100: LDGSTS — threads compute addresses
__shared__ half smem_A[TILE_M][TILE_K];
__shared__ half smem_B[TILE_K][TILE_N];

for (int k = 0; k < K; k += TILE_K) {
    // Each thread computes its own address!
    int row = blockIdx.y * TILE_M + threadIdx.y;
    int col = k + threadIdx.x;
    int gaddr = row * lda + col;  // ALU work

    // Async copy: data bypasses L1
    __pipeline_memcpy_async(
        &smem_A[threadIdx.y][threadIdx.x],
        &A[gaddr],
        sizeof(half)
    );
    __pipeline_commit();

    // Wait for the committed copy batch to land, then sync the block
    __pipeline_wait_prior(0);
    __syncthreads();

    // Load from SMEM → registers → tensor cores
    wmma::load_matrix_sync(frag_a, smem_A, TILE_K);
    wmma::load_matrix_sync(frag_b, smem_B, TILE_N);
    wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);
}
```
```cuda
// H100: TMA — hardware computes addresses
__shared__ half smem_A[TILE_M][TILE_K];
__shared__ barrier bar;

// Tensor descriptor (set up once on the host, before kernel launch)
CUtensorMap tensorMap;
cuTensorMapEncodeTiled(&tensorMap,      // (simplified signature)
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
    2,                  // 2D tensor
    globalPtr,          // Base address
    {M, K},             // Global dimensions
    {lda, 1},           // Strides
    {TILE_M, TILE_K},   // Box (tile) dims
    ...
);

for (int k = 0; k < K; k += TILE_K) {
    // Only thread 0 issues the load!
    if (threadIdx.x == 0) {
        cp.async.bulk.tensor.2d.shared(   // PTX, shown as pseudocode
            smem_A,        // Destination
            tensorMap,     // Descriptor
            {tile_x, k},   // Coordinates
            bar            // Barrier
        );
    }
    bar.arrive_and_wait();

    // WGMMA reads operands directly from SMEM!
    wgmma.mma_async(acc, smem_A, smem_B);
    wgmma.wait_group();
}
```
```cuda
// B200: TMEM — tiles never touch registers
__shared__ half smem_A[TILE_M][TILE_K];
// TMEM is implicit: 256KB per SM, 64KB per warp

for (int k = 0; k < K; k += TILE_K) {
    // TMA loads to SMEM (same as H100)
    cp.async.bulk.tensor(smem_A, tensorMap, coords);

    // NEW: copy SMEM → TMEM, bypassing the register file (PTX pseudocode)
    tcgen05.cp.cta_group::1(
        tmem_addr_a,   // TMEM destination
        smem_A,        // SMEM source
        TILE_SIZE      // Bytes to copy
    );

    // MMA reads A, B directly from TMEM!
    tcgen05.mma.cta_group::1(
        tmem_acc,      // Accumulator in TMEM
        tmem_addr_a,   // A operand from TMEM
        tmem_addr_b    // B operand from TMEM
    );

    // Commit async ops when needed
    tcgen05.commit.cta_group::1();
    tcgen05.wait.cta_group::1();
}

// Final: TMEM → registers (only for the output)
tcgen05.ld.cta_group::1(result, tmem_acc);
```
With LDGSTS, every thread in a warp computes its own address: addr = base + (threadIdx * stride) + offset. That's 32 threads doing redundant integer math. The async copy is nice, but ALU cycles are wasted.
TMA uses tensor descriptors that encode shape, stride, and swizzle. One thread issues cp.async.bulk.tensor; the TMA unit computes all 32 addresses in dedicated hardware, handling bank conflicts automatically.
TMEM is a 256KB per-SM scratchpad dedicated to tensor operations. tcgen05.cp moves data SMEM→TMEM without touching registers, and tcgen05.mma reads its operands directly from TMEM, so the full per-thread register budget (up to 255 registers) stays free for compute logic.
256KB of on-chip scratchpad per SM — entirely separate from the register file. The final step in decoupling data movement from compute.
- tcgen05.cp — SMEM → TMEM copy
- tcgen05.mma — matrix multiply
- tcgen05.commit — async barrier
- tcgen05.wait — completion sync
- tcgen05.ld/st — TMEM ↔ registers
A single thread can issue tcgen05.cp and tcgen05.mma without blocking. Blackwell tensor ops are warpgroup operations, and each warp owns a 64KB TMEM slice, so synchronization is implicit: no explicit barriers are needed between warps.
WGMMA and tcgen05 instructions operate on entire warpgroups (4 warps). The hardware handles synchronization — no explicit barriers between warps in the same warpgroup.
Each warp "owns" a 64KB slice of TMEM. Thread 0 in Warp 0 can only access Slice 0. This eliminates conflicts and enables zero-overhead addressing.
One warpgroup loads tiles via TMA while another computes. tcgen05.cp and tcgen05.mma can overlap, so memory latency is largely hidden behind compute.
Rather than dedicated tensor memory, AMD invested in an enormous register file. Different tradeoffs, different optimization patterns required.
AMD's split: VGPRs for general computation, AGPRs exclusively for matrix accumulators. Data moves between them via v_accvgpr_write and v_accvgpr_read.
AMD has no TMA or TMEM equivalent: threads compute addresses manually, and data takes more hops (global → VGPRs → LDS → AGPRs), so latency hiding falls to software pipelining, e.g. the patterns used by HipKittens.
NVIDIA uses a universal 16×16 building block with XOR swizzle. AMD's MFMA has shape-specific layouts that differ between inputs and outputs.
- NVIDIA: universal 16×16 building block, single XOR swizzle pattern
- AMD MFMA inputs (A/B): shape-specific layout
- AMD MFMA accumulator (C/D): a different layout again, not matching A/B!
Without careful swizzling, shared memory bank conflicts can cut effective bandwidth by 4–8×.