
CUDA Execution Concepts

A Visual Guide to GPU Performance Optimization

Learning Path

1. 📊 Occupancy

How fully utilized is the GPU? The foundation of performance.

Occupancy = Active Warps per SM ÷ Maximum Warps per SM

Example: 48 active warps / 64 max warps = 75% occupancy

Figure: a Streaming Multiprocessor (SM) with 64 warp slots (0-63). 48 slots hold active warps and 16 are empty = 75% occupancy.

What Limits Occupancy?

🔢 Registers per Thread

An SM has 65,536 registers total. If a kernel uses 64 registers/thread: max threads = 65,536 ÷ 64 = 1,024 threads = 1,024 ÷ 32 = 32 warps (only 50% occupancy!)

💾 Shared Memory per Block

An SM has 48 KB of shared memory. If a block uses 24 KB of shared memory: max blocks/SM = 48 KB ÷ 24 KB = 2 blocks. With 512 threads per block, that is 2 × 512 = 1,024 threads = 32 warps (50% occupancy).
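Putting the two limits together: here is a minimal host-side sketch (plain C++, hard-coding the example SM from above rather than querying a real device; production code would read cudaDeviceProp or call cudaOccupancyMaxActiveBlocksPerMultiprocessor) that computes the warp cap imposed by registers and shared memory:

```cpp
#include <algorithm>

// Occupancy cap for the example SM above: 65,536 registers,
// 48 KB shared memory, 64 warp slots, 32 threads per warp.
int maxActiveWarps(int regsPerThread, int smemPerBlock, int threadsPerBlock) {
    const int kRegsPerSM = 65536, kSmemPerSM = 48 * 1024;
    const int kMaxWarps = 64, kWarpSize = 32;

    // Register limit: how many threads fit, rounded down to whole warps.
    int regWarps = (kRegsPerSM / regsPerThread) / kWarpSize;

    // Shared-memory limit: whole blocks that fit, times warps per block.
    int blocks = kSmemPerSM / smemPerBlock;
    int smemWarps = blocks * (threadsPerBlock / kWarpSize);

    return std::min({regWarps, smemWarps, kMaxWarps});
}
```

With 64 registers/thread and 24 KB shared memory at 512 threads/block, both limits independently cap the SM at 32 warps (50% occupancy), matching the arithmetic above.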
💡 Why Occupancy Matters

Higher occupancy means more warps available to hide memory latency. When one warp stalls waiting for data, the scheduler can switch to another ready warp. Low occupancy = fewer warps to switch to = cores sit idle waiting. But 100% isn't always optimal — sometimes using more registers for fewer warps gives better per-thread performance!

2. 🔄 Warp Scheduling

How GPUs hide memory latency by switching between warps

Figure: warp-scheduling timeline across 8 CUDA cores (t=0 to t=7). Warp A executes (COMPUTE) for two steps, then stalls waiting on memory (~400 cycles of latency). The scheduler immediately switches to Warp B, which computes while A waits; Warp C sits ready and runs when B stalls in turn; Warp A resumes as soon as its data arrives. The warp scheduler picks from a ready-warp pool (Warps A, B, C, D, and more). Switching is zero-cost: there is no context save/restore, because every warp's state lives permanently in registers.
🚀 Latency Hiding = Throughput

Memory takes ~400-800 cycles. But with enough warps, the GPU is always doing useful work. This is why occupancy matters — more resident warps = more opportunities to hide latency = higher throughput.
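How many warps is "enough"? A toy model (not how the hardware scheduler actually accounts cycles, but a useful back-of-envelope): if each warp computes for some cycles and then stalls for the memory latency, you need enough other warps to fill the stall with compute.

```cpp
// Toy latency-hiding model: each warp computes for `computeCycles`,
// then stalls `latencyCycles` on a memory request. To keep the cores
// busy, other warps must cover the stall with their own compute.
int warpsToHideLatency(int latencyCycles, int computeCycles) {
    // 1 warp currently executing, plus enough to cover the stall
    // (rounding the division up).
    return 1 + (latencyCycles + computeCycles - 1) / computeCycles;
}
```

At ~400 cycles of latency with only 10 cycles of compute between loads, the model asks for about 41 resident warps; this is the quantitative reason occupancy matters for memory-bound kernels.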

3. 🧵 Thread Coarsening

Fewer threads doing more work each — trading occupancy for efficiency

BEFORE Fine-Grained: 1 Thread = 1 Element

8 threads launched for 8 elements: T0-T7 map one-to-one onto elements [0]-[7]. Overhead per thread: ❌ register allocation and scheduling overhead ❌ redundant instruction fetch ❌ address calculation per thread

int i = blockIdx.x * blockDim.x + threadIdx.x;
out[i] = process(in[i]); // 1 element per thread

AFTER Coarsened: 1 Thread = 4 Elements

Only 2 threads launched (4× coarsening factor): Thread 0 handles [0]-[3], Thread 1 handles [4]-[7]. Amortized overhead: ✓ 4× less thread management ✓ better instruction-cache reuse ✓ more registers available per thread

int base = threadIdx.x * COARSEN_FACTOR;
for (int j = 0; j < COARSEN_FACTOR; j++)
    out[base + j] = process(in[base + j]); // COARSEN_FACTOR elements per thread
The occupancy vs. efficiency trade-off:
• 1 element/thread: high occupancy, high overhead
• 2-8 elements/thread: the sweet spot, balanced
• 100+ elements/thread: occupancy too low to hide latency
⚓️ The Trade-off

Coarsening reduces overhead but also reduces occupancy (fewer warps). You're trading latency-hiding ability for per-thread efficiency. Profile to find the optimal balance — typically 2-8 elements per thread works well.
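A quick host-side sanity check of the coarsened index math (plain C++ simulating the kernel's mapping, with `tid` standing in for threadIdx.x): each simulated thread should touch its COARSEN_FACTOR consecutive elements exactly once, with no gaps or overlaps.

```cpp
#include <vector>

// Simulate the coarsened mapping: thread `tid` handles elements
// [tid * coarsenFactor, tid * coarsenFactor + coarsenFactor).
// Returns how many times each element was "processed".
std::vector<int> coverage(int numThreads, int coarsenFactor) {
    std::vector<int> touched(numThreads * coarsenFactor, 0);
    for (int tid = 0; tid < numThreads; ++tid) {   // one iteration = one thread
        int base = tid * coarsenFactor;            // same index math as the kernel
        for (int j = 0; j < coarsenFactor; ++j)
            touched[base + j] += 1;                // "process" element base+j
    }
    return touched;
}
```

One caveat the simulation hides: on a real GPU, consecutive-elements-per-thread means adjacent threads no longer read adjacent addresses, which can hurt coalescing; a block-strided variant (thread `tid` reads `tid`, `tid + numThreads`, ...) keeps accesses coalesced while still coarsening.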

4. Warp Divergence

When threads in a warp take different execution paths

if (tid < 24) // Path A
else          // Path B
→ 2 serial passes!

Figure: all 32 threads start active. Pass 1 executes Path A with threads 0-23 active and threads 24-31 masked off. Pass 2 executes Path B with threads 24-31 active and threads 0-23 masked off. The warp then reconverges with all 32 threads active. A divergent warp executes both paths sequentially, a 50% efficiency loss in this example.
⚠️ SIMT Execution Model

All 32 threads in a warp share one instruction pointer. When they diverge on a branch, the GPU must execute both paths serially while masking inactive threads. Design kernels so threads within a warp take the same branch whenever possible.
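The serialization rule can be modeled in a few lines of host C++ (an illustrative sketch of the SIMT model above, not real hardware behavior): count how many distinct branch outcomes appear among the 32 lanes, since each outcome taken by at least one lane costs one serial pass.

```cpp
#include <set>

// Passes needed for one two-way branch in a 32-lane warp:
// 1 if all lanes agree (uniform warp), 2 if the warp diverges.
int divergentPasses(const bool lanesTakeA[32]) {
    std::set<bool> outcomes;
    for (int lane = 0; lane < 32; ++lane)
        outcomes.insert(lanesTakeA[lane]);
    return static_cast<int>(outcomes.size());
}
```

The `tid < 24` split from the figure yields 2 passes. Branching on a warp-aligned condition such as `tid / 32 == 0` makes every lane in a warp agree, so each warp runs a single pass.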

5. 💾 Memory Coalescing

How memory access patterns determine bandwidth efficiency

Row-Major vs Column-Major Access

GOOD Row-Major Access (Coalesced)

Threads T0-T7 access consecutive addresses in the row-major matrix: [0,0] [0,1] [0,2] [0,3] [0,4] [0,5] [0,6] [0,7]. Result: 1 × 128-byte transaction, 100% bandwidth utilization.

A[row][threadIdx.x] // consecutive!

BAD Column-Major Access (Strided)

Threads T0-T7 access addresses a full row-width apart: T0 → [0,0], T1 → [1,0], T2 → [2,0], and so on. Result: 8 separate transactions = 12.5% efficiency.

A[threadIdx.x][col] // stride = row_width!

Why This Happens: DRAM Chip Architecture

GPU memory = multiple DRAM chips in parallel. The memory controller drives a 512-bit memory bus organized as 8 × 64-bit channels, one per chip, with addresses interleaved in 8-byte units: Chip 0 serves addresses 0-7 (then 64-71, and so on), Chip 1 serves 8-15, Chip 2 serves 16-23, ... Chip 7 serves 56-63.

✓ Coalesced access (addresses 0-63): all 8 chips activate simultaneously, each providing 8 bytes (64 bits), for a total of 8 × 8 = 64 bytes in ONE transaction.
✗ Strided access (addresses 0, 512, 1024, ...): only Chip 0 is useful per transaction; the other 7 chips fetch wasted data, so 8 separate transactions are needed = 8× slower.
🚀 The Key Insight

GPU memory is interleaved across multiple DRAM chips. Consecutive addresses go to different chips, allowing parallel access. When threads access consecutive addresses, all chips work together → maximum bandwidth. When threads access strided addresses, you're only using one chip at a time → wasted bandwidth.
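The transaction counts quoted above can be reproduced with a small host-side model (an approximation of the coalescing rule: one transaction per distinct 128-byte segment touched by the warp; real hardware has additional subtleties like sector granularity):

```cpp
#include <set>

// Count 128-byte transactions for a warp of 32 threads, each loading
// one 4-byte word at a stride of `strideWords` elements.
int transactions(int strideWords) {
    std::set<int> segments;
    for (int lane = 0; lane < 32; ++lane) {
        int byteAddr = lane * strideWords * 4;  // 4-byte elements
        segments.insert(byteAddr / 128);        // 128-byte segment index
    }
    return static_cast<int>(segments.size());
}
```

Stride 1 (consecutive floats) touches a single 128-byte segment: one transaction. A stride of 32 words (a 128-byte row width, as in the column-major example) puts every lane in its own segment: 32 transactions, one per thread.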

💾 GPU-Storage Connection

This same principle applies to NVMe storage! Scattered 4KB reads become separate I/O commands with 10-100μs latency each. Sequential reads can be merged into large transfers, maximizing SSD throughput. GPUDirect Storage benefits most when access patterns are coalesced.
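The storage analogy can be sketched the same way (a simplified model of request merging, not the GPUDirect Storage API: assume 4 KB reads and that only byte-contiguous reads merge into one command):

```cpp
#include <algorithm>
#include <vector>

// Merge adjacent 4 KB reads into larger I/O commands: after sorting,
// contiguous offsets collapse into one command; any gap starts a new one.
int ioCommands(std::vector<long> offsets) {  // byte offsets of 4 KB reads
    if (offsets.empty()) return 0;
    std::sort(offsets.begin(), offsets.end());
    int cmds = 1;
    for (std::size_t i = 1; i < offsets.size(); ++i)
        if (offsets[i] != offsets[i - 1] + 4096)  // not contiguous
            ++cmds;
    return cmds;
}
```

Four sequential 4 KB reads merge into a single 16 KB transfer, while four reads scattered 1 MB apart cost four commands, each paying the full 10-100 μs latency.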

6. 📋 GPU Memory Scope Reference

Who can see/share each memory type (NVIDIA/CUDA-style)

Memory Type | Scope | Notes
Registers | Per-thread | Fastest. Private to each thread. Spills go to local memory.
Shared Memory | Per-block (CTA) | On-chip SRAM. Shared by threads in the same block. ~100× faster than global.
L1 Cache | Per-SM | Shared by warps on the same SM. Not coherent across SMs.
L2 Cache | Device-wide | Shared by all SMs. Caches global memory accesses.
Global Memory (HBM/GDDR) | Device-wide (all kernels) | Main GPU memory. Persistent across kernels. Coalescing critical!
Unified Memory (UVM) | System-wide (CPU + GPU) | Automatically migrates between CPU/GPU. Convenient but has overhead.
Host Pinned Memory | CPU memory, GPU DMA access | Page-locked CPU memory. Enables fast H2D/D2H transfers.
Storage (NVMe/SSD) | System-wide | Via GPUDirect Storage. 10-100 μs latency. Coalescing even more critical!