The evolution from manual address computation to fully async tensor pipelines — A100 (Ampere), H100 (Hopper), and B200 (Blackwell) compared
Each generation fundamentally changed how data reaches the tensor cores. The progression: threads compute addresses → hardware computes addresses → dedicated tensor memory eliminates register pressure entirely.
```cuda
// A100: LDGSTS — threads compute addresses
__shared__ half smem_A[TILE_M][TILE_K];
__shared__ half smem_B[TILE_K][TILE_N];

for (int k = 0; k < K; k += TILE_K) {
    // Each thread computes its own address!
    int row = blockIdx.y * TILE_M + threadIdx.y;
    int col = k + threadIdx.x;
    int gaddr = row * lda + col;  // ALU work

    // Async copy: data bypasses L1
    __pipeline_memcpy_async(
        &smem_A[threadIdx.y][threadIdx.x],
        &A[gaddr],
        sizeof(half)
    );
    __pipeline_commit();

    // Wait for the committed copy batch to land, then sync the block
    __pipeline_wait_prior(0);
    __syncthreads();

    // Load from SMEM → registers → tensor cores
    wmma::load_matrix_sync(frag_a, smem_A, TILE_K);
    wmma::load_matrix_sync(frag_b, smem_B, TILE_N);
    wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);
}
```
```cuda
// H100: TMA — hardware computes addresses
__shared__ half smem_A[TILE_M][TILE_K];
__shared__ barrier bar;

// Tensor descriptor (set up once on the host, before kernel launch)
CUtensorMap tensorMap;
cuTensorMapEncodeTiled(&tensorMap,      // (simplified signature)
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
    2,                  // 2D tensor
    globalPtr,          // Base address
    {M, K},             // Global dimensions
    {lda, 1},           // Strides
    {TILE_M, TILE_K},   // Box (tile) dims
    ...
);

for (int k = 0; k < K; k += TILE_K) {
    // Only thread 0 issues the load!
    if (threadIdx.x == 0) {
        cp.async.bulk.tensor.2d.shared(   // PTX, shown as pseudocode
            smem_A,        // Destination
            tensorMap,     // Descriptor
            {tile_x, k},   // Coordinates
            bar            // Barrier
        );
    }
    bar.arrive_and_wait();

    // WGMMA reads operands directly from SMEM!
    wgmma.mma_async(acc, smem_A, smem_B);
    wgmma.wait_group();
}
```
```cuda
// B200: TMEM — tiles never touch registers
__shared__ half smem_A[TILE_M][TILE_K];
// TMEM is implicit: 256KB per SM, 64KB per warp

for (int k = 0; k < K; k += TILE_K) {
    // TMA loads to SMEM (same as H100)
    cp.async.bulk.tensor(smem_A, tensorMap, coords);

    // NEW: copy SMEM → TMEM, bypassing the register file (PTX pseudocode)
    tcgen05.cp.cta_group::1(
        tmem_addr_a,   // TMEM destination
        smem_A,        // SMEM source
        TILE_SIZE      // Bytes to copy
    );

    // MMA reads A, B directly from TMEM!
    tcgen05.mma.cta_group::1(
        tmem_acc,      // Accumulator in TMEM
        tmem_addr_a,   // A operand from TMEM
        tmem_addr_b    // B operand from TMEM
    );

    // Commit async ops when needed
    tcgen05.commit.cta_group::1();
    tcgen05.wait.cta_group::1();
}

// Final: TMEM → registers (only for the output)
tcgen05.ld.cta_group::1(result, tmem_acc);
```
With LDGSTS, every thread in a warp computes its own address: addr = base + (threadIdx * stride) + offset. That's 32 threads doing redundant integer math. The async copy is nice, but ALU cycles are wasted.
TMA uses tensor descriptors that encode shape, stride, and swizzle. One thread issues cp.async.bulk.tensor; the TMA unit computes all 32 addresses in dedicated hardware, handling bank conflicts automatically.
TMEM is a 256KB per-SM scratchpad dedicated to tensor operations. tcgen05.cp moves data SMEM→TMEM without touching registers, and tcgen05.mma reads its operands directly from TMEM, so the full per-thread register budget (up to 255 registers) stays free for compute logic.
256KB of on-chip scratchpad per SM — entirely separate from the register file. The final step in decoupling data movement from compute.
- tcgen05.cp — SMEM → TMEM copy
- tcgen05.mma — matrix multiply
- tcgen05.commit — async barrier
- tcgen05.wait — completion sync
- tcgen05.ld/st — TMEM ↔ registers
A single thread can issue tcgen05.cp and tcgen05.mma without blocking. Blackwell tensor ops are warpgroup operations, and each warp owns a 64KB TMEM slice, so synchronization is implicit: no explicit barriers are needed between warps.
WGMMA and tcgen05 instructions operate on entire warpgroups (4 warps). The hardware handles synchronization — no explicit barriers between warps in the same warpgroup.
Each warp "owns" a 64KB slice of TMEM. Thread 0 in Warp 0 can only access Slice 0. This eliminates conflicts and enables zero-overhead addressing.
One warpgroup loads tiles via TMA while another computes. tcgen05.cp and tcgen05.mma can overlap, so memory latency is largely hidden behind compute.
Rather than dedicated tensor memory, AMD invested in an enormous register file. Different tradeoffs, different optimization patterns required.
AMD's split: VGPRs for general computation, AGPRs exclusively for matrix accumulators. Data moves between them via v_accvgpr_write and v_accvgpr_read.
AMD has no TMA or TMEM equivalent: threads compute addresses manually, and data takes more hops (global → VGPRs → LDS → AGPRs), so latency hiding falls to software pipelining, e.g. the patterns used by HipKittens.
NVIDIA uses a universal 16×16 building block with XOR swizzle. AMD's MFMA has shape-specific layouts that differ between inputs and outputs.
- NVIDIA: universal 16×16 building block, single XOR swizzle pattern
- AMD MFMA inputs (A/B): shape-specific layout
- AMD MFMA accumulator (C/D): a different layout again, not matching A/B!
Without careful swizzling, shared memory bank conflicts can cut effective bandwidth by 4–8×.