NVIDIA Tensor Core Evolution

LDGSTS → TMA → TMEM

The evolution from manual address computation to fully async tensor pipelines — A100 (Ampere), H100 (Hopper), and B200 (Blackwell) compared

Architecture Comparison

Three Generations of Data Movement

Each generation fundamentally changed how data reaches the tensor cores. The progression: threads compute addresses → hardware computes addresses → dedicated tensor memory eliminates register pressure entirely.

A100 / Ampere

LDGSTS Intrinsic

Async copy, but threads compute addresses
SM Architecture
Tensor Core
WMMA ops
📊
Registers
40-60% for tiles!
🧵
Threads
Compute addrs
📦
SMEM
192KB
↑ Data flow ↑
HBM2e
80GB @ 2 TB/s
40-60%
Regs for tiles
2 TB/s
HBM BW
// A100: LDGSTS — threads compute addresses
__shared__ half smem_A[TILE_M][TILE_K];
__shared__ half smem_B[TILE_K][TILE_N];

for (int k = 0; k < K; k += TILE_K) {
  // Each thread computes its own address!
  int row = blockIdx.y * TILE_M + threadIdx.y;
  int col = k + threadIdx.x;
  int gaddr = row * lda + col;  // ALU work
  
  // Async copy: global → SMEM without staging
  // through registers (smem_B is filled analogously)
  __pipeline_memcpy_async(
    &smem_A[threadIdx.y][threadIdx.x],
    &A[gaddr],
    sizeof(half)
  );
  __pipeline_commit();
  
  // Wait for this thread's pending copies,
  // then sync the whole block
  __pipeline_wait_prior(0);
  __syncthreads();
  
  // Load from SMEM → Registers → TC
  wmma::load_matrix_sync(frag_a, &smem_A[0][0], TILE_K);
  wmma::load_matrix_sync(frag_b, &smem_B[0][0], TILE_N);
  wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);
}
⚠️ Problem: 32 threads × address math = wasted ALU cycles. Fragments live in registers.
H100 / Hopper

TMA Unit

Hardware computes all addresses
SM Architecture
Tensor Core
WGMMA async
🔥
TMA Unit
HW addr gen
📊
Registers
Much freer ✓
📦
SMEM
228KB
↑ TMA handles addressing ↑
HBM3
80GB @ 3.35 TB/s
~20%
Regs for tiles
3.35 TB/s
HBM BW
// H100: TMA — hardware computes addresses
__shared__ half smem_A[TILE_M][TILE_K];
__shared__ barrier bar;

// Tensor descriptor (built once on the host, before launch)
CUtensorMap tensorMap;
cuTensorMapEncodeTiled(&tensorMap,
  CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
  2,         // 2D tensor
  globalPtr, // Base address
  {M, K},    // Global dimensions
  {lda, 1},  // Strides (simplified; real API takes byte strides)
  {TILE_M, TILE_K},  // Box (tile) dims
  ...
);

for (int k = 0; k < K; k += TILE_K) {
  // Only thread 0 issues the load!
  if (threadIdx.x == 0) {
    cp.async.bulk.tensor.2d.shared(
      smem_A,      // Destination
      tensorMap,   // Descriptor
      {tile_x, k}, // Coordinates
      bar          // Barrier
    );
  }
  bar.arrive_and_wait();
  
  // WGMMA reads directly from SMEM!
  wgmma.mma_async(acc, smem_A, smem_B);
  wgmma.wait_group();
}
Better: one thread issues the copy; the TMA unit generates every address for the whole tile in hardware. WGMMA reads operands straight from SMEM.
B200 / Blackwell

TMEM + tcgen05

256KB dedicated tensor memory
SM Architecture
Tensor Core
tcgen05 ops
TMEM
256KB dedicated
📊
Registers
100% FREE!
📦
SMEM
228KB staging
↑ Full async pipeline ↑
HBM3e
192GB @ 8 TB/s
0%
Regs for tiles
8 TB/s
HBM BW
// B200: TMEM — tiles NEVER touch registers!
__shared__ half smem_A[TILE_M][TILE_K];
// TMEM is implicit: 256KB per SM, 64KB per warp

for (int k = 0; k < K; k += TILE_K) {
  // TMA loads to SMEM (same as H100)
  cp.async.bulk.tensor(smem_A, tensorMap, coords);
  
  // NEW: Copy SMEM → TMEM (bypasses regs!)
  tcgen05.cp.cta_group::1(
    tmem_addr_a,   // TMEM destination
    smem_A,        // SMEM source
    TILE_SIZE      // Bytes to copy
  );
  
  // MMA reads A,B directly from TMEM!
  tcgen05.mma.cta_group::1(
    tmem_acc,      // Accumulator in TMEM
    tmem_addr_a,   // A operand from TMEM
    tmem_addr_b    // B operand from TMEM
  );
  
  // Commit async ops when needed
  tcgen05.commit.cta_group::1();
  tcgen05.wait.cta_group::1();
}

// Final: TMEM → registers (only for output)
tcgen05.ld.cta_group::1(result, tmem_acc);
🚀 Best: HBM→SMEM→TMEM→TC. Registers 100% free. Full async overlap.
📐
LDGSTS: The Address Problem

With LDGSTS, every thread in a warp computes its own address: addr = base + (threadIdx * stride) + offset. That's 32 threads doing redundant integer math. The async copy is nice, but ALU cycles are wasted.
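The redundancy is easy to see in a host-side model (a sketch; `lane_addresses` and its parameters are illustrative, not a CUDA API): every lane evaluates the same affine expression, differing only in its lane index.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side model of per-lane LDGSTS addressing: each of the 32 lanes
// evaluates base + (row * lda + col) * elem_size independently, even
// though the results differ only by a fixed per-lane offset.
std::array<uint64_t, 32> lane_addresses(uint64_t base, int row, int k,
                                        int lda, size_t elem_size) {
  std::array<uint64_t, 32> addrs{};
  for (int lane = 0; lane < 32; ++lane) {
    int col = k + lane;  // threadIdx.x plays the role of lane
    addrs[lane] = base + uint64_t(row * lda + col) * elem_size;
  }
  return addrs;
}
```

All 32 results lie a constant stride apart — exactly the pattern a single descriptor-driven engine can generate without any per-thread ALU work.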

🎯
TMA: Hardware Addressing

TMA uses tensor descriptors that encode shape, stride, and swizzle. One thread issues cp.async.bulk.tensor; the TMA unit computes all 32 addresses in dedicated hardware, handling bank conflicts automatically.
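Functionally, a descriptor is just shape, stride, and tile-box metadata from which the hardware can derive every element address. A minimal host-side model (field and function names are illustrative, not the `CUtensorMap` layout):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy 2-D descriptor: enough metadata for hardware (or this model)
// to derive the address of any element of any tile.
struct TensorMap2D {
  uint64_t base;           // global base address
  int rows, cols;          // global tensor dims
  int row_stride;          // elements per row (lda)
  int box_rows, box_cols;  // tile (box) dims
  size_t elem_size;        // bytes per element
};

// Address of element (i, j) within the tile whose origin is (tr, tc).
// The TMA unit evaluates this for the whole box; no thread ever does.
uint64_t element_address(const TensorMap2D& d, int tr, int tc, int i, int j) {
  return d.base + uint64_t((tr + i) * d.row_stride + (tc + j)) * d.elem_size;
}
```

One `cp.async.bulk.tensor` issue walks this formula over the entire box in dedicated hardware.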

TMEM: Register Freedom

TMEM is a 256KB scratchpad dedicated to tensor operations. tcgen05.cp moves data SMEM→TMEM without touching registers. tcgen05.mma reads operands directly from TMEM. The register file stays entirely free for scalar compute logic.

Blackwell Deep Dive

What is TMEM?

256KB of on-chip scratchpad per SM — entirely separate from the register file. The final step in decoupling data movement from compute.

🚀 NEW IN BLACKWELL — TMEM
256KB
Dedicated tensor memory per SM
Key Instructions
tcgen05.cp — SMEM → TMEM copy
tcgen05.mma — Matrix multiply
tcgen05.commit — Async barrier
tcgen05.wait — Completion sync
tcgen05.ld/st — TMEM ↔ Registers
The Impact
Before (A100/H100):
40-60% of registers consumed by matrix tiles A, B, C

After (B200):
0% — TMEM handles all tile storage. Registers 100% free for scalar compute.
Async Pipeline
Operations are fully async. Issue tcgen05.cp and tcgen05.mma without blocking.

Multiple tiles in flight simultaneously. Only sync when you need results.
The complete Blackwell datapath: HBM → TMA → SMEM → tcgen05.cp → TMEM → tcgen05.mma → Tensor Core — registers never touched for tile data
Architecture Detail

Warpgroups: 128 Threads as One Unit

Blackwell tensor ops are warpgroup operations. Each warp owns a 64KB TMEM slice — implicit synchronization, no explicit barriers needed between warps.

Warp 0
Threads 0-31
Warp 1
Threads 32-63
Warp 2
Threads 64-95
Warp 3
Threads 96-127
↓ ↓ ↓ ↓
TMEM (256KB Total)
Slice 0 — 64KB
Slice 1 — 64KB
Slice 2 — 64KB
Slice 3 — 64KB
Each warp owns its TMEM slice — no cross-warp coordination needed. Lane slice addressing is implicit.
🔢
128 Threads = 1 Warpgroup

WGMMA and tcgen05 instructions operate on entire warpgroups (4 warps). The hardware handles synchronization — no explicit barriers between warps in the same warpgroup.

📦
Lane Slice Ownership

Each warp "owns" a 64KB slice of TMEM: threads in Warp 0 access only Slice 0, and so on. This eliminates conflicts and enables zero-overhead addressing.
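The implicit addressing reduces to plain arithmetic. A sketch of the slicing described above (a model, not a real API):

```cpp
#include <cassert>
#include <cstddef>

// TMEM slice model: warp w in the warpgroup owns bytes
// [w * 64KB, (w + 1) * 64KB) of the 256KB per-SM pool.
constexpr size_t kSliceBytes = 64 * 1024;
constexpr size_t kTmemBytes  = 4 * kSliceBytes;  // 256KB total

size_t slice_base(int warp) { return size_t(warp) * kSliceBytes; }

// A warp may touch a TMEM byte only inside its own slice.
bool owns(int warp, size_t tmem_byte) {
  return tmem_byte >= slice_base(warp) &&
         tmem_byte <  slice_base(warp) + kSliceBytes;
}
```

Because ownership is a pure function of the warp index, no runtime arbitration or cross-warp barrier is needed to keep accesses disjoint.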

Producer-Consumer Overlap

One warpgroup loads tiles via TMA while another computes. tcgen05.cp and tcgen05.mma can overlap, so memory latency is largely hidden behind the MMA work.
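A toy timing model (illustrative numbers, not measured costs) shows why the overlap matters: in steady state each tile costs only max(load, mma) instead of load + mma.

```cpp
#include <algorithm>
#include <cassert>

// Toy cost model: 'load' and 'mma' are per-tile costs in arbitrary units.
int sequential_time(int tiles, int load, int mma) {
  return tiles * (load + mma);  // no overlap: every load is exposed
}

int pipelined_time(int tiles, int load, int mma) {
  // Only the first load is exposed; load(k+1) then hides under mma(k).
  return load + tiles * std::max(load, mma);
}
```

Whenever the MMA work per tile exceeds the load cost, the pipeline runs at pure compute speed and the memory traffic is free.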

AMD CDNA Architecture

AMD's Different Direction

Rather than dedicated tensor memory, AMD invested in an enormous register file. Different tradeoffs, different optimization patterns required.

VGPR + AGPR Architecture
256
VGPRs
General purpose
256
AGPRs
Accumulators only
512
Total 32-bit registers per SIMD
MFMA Characteristics
⚠️ Synchronous — Blocks wavefront
⚠️ No TMA equivalent — Threads compute addresses
⚠️ Static registers — VGPR/AGPR split fixed at launch
⚠️ A/B ≠ C/D layouts — Reformatting needed
📊
AGPRs: Accumulator Registers

AMD's split: VGPRs for general computation, AGPRs exclusively for matrix accumulators. Data moves between them via v_accvgpr_write and v_accvgpr_read.
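The raw capacity implied by those counts (assuming 64-lane wavefronts and 4-byte registers, as on CDNA):

```cpp
#include <cassert>

// Register-file arithmetic from the counts above.
constexpr int kVGPRs    = 256;  // general-purpose
constexpr int kAGPRs    = 256;  // accumulator-only
constexpr int kLanes    = 64;   // wavefront width (assumption: wave64)
constexpr int kRegBytes = 4;    // 32-bit registers

constexpr int kTotalPerSimdKB =
    (kVGPRs + kAGPRs) * kLanes * kRegBytes / 1024;  // KB per SIMD
```

That capacity is in the same ballpark as an NVIDIA SM's SMEM, but it is statically partitioned at launch rather than managed as an asynchronously filled staging buffer.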

🔄
No Hardware Async

AMD lacks TMA and TMEM equivalents. Threads compute addresses manually, and data takes more hops (Global → VGPRs → LDS → AGPRs), so latency must be hidden with software pipelining, e.g. HipKittens-style patterns.

Memory Organization

Memory Layouts: The Hidden Complexity

NVIDIA uses a universal 16×16 building block with XOR swizzle. AMD's MFMA has shape-specific layouts that differ between inputs and outputs.

NVIDIA Core Matrices

Universal 16×16 building block
Single XOR swizzle pattern

AMD MFMA A/B Inputs

Input matrices A/B
Shape-specific layout

AMD MFMA C/D Accumulators

Accumulator C/D
Different from A/B!

⚠️

Without careful swizzling, shared memory bank conflicts reduce effective bandwidth by 4-8×
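A minimal model of an XOR swizzle (32 banks of 4-byte words; a sketch, not the exact hardware pattern) makes the conflict math concrete: a 32-lane column read of a row-major 32-wide tile hits one bank unswizzled, but all 32 banks once each column index is XOR-ed with its row.

```cpp
#include <cassert>
#include <set>

// 32 banks of 4-byte words; a 32-column row-major tile of 4-byte elements.
int bank(int row, int col) { return (row * 32 + col) % 32; }

// XOR swizzle: permute each row's columns by the row index.
int swizzled_col(int row, int col) { return col ^ (row % 32); }

// How many distinct banks does a 32-lane column read touch?
int banks_touched(int col, bool swizzled) {
  std::set<int> banks;
  for (int lane = 0; lane < 32; ++lane) {
    int c = swizzled ? swizzled_col(lane, col) : col;
    banks.insert(bank(lane, c));
  }
  return int(banks.size());
}
```

One bank touched 32 times means the access serializes into 32 transactions; 32 distinct banks means a single conflict-free transaction — the 4-8× bandwidth gap quoted above.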