A Visual Guide to GPU Performance Optimization
Occupancy: how fully utilized is the GPU? The foundation of performance.
Example: 48 active warps / 64 max warps = 75% occupancy
Higher occupancy means more warps available to hide memory latency. When one warp stalls waiting for data, the scheduler can switch to another ready warp. Low occupancy = fewer warps to switch to = cores sit idle waiting. But 100% isn't always optimal — sometimes using more registers for fewer warps gives better per-thread performance!
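To make the 75% figure concrete, here is a small Python sketch of how occupancy falls out of per-SM resource limits. The limits below (64 max warps, 65,536 registers, and 48 KiB of shared memory per SM) are illustrative assumptions, not tied to any specific GPU.

```python
# Illustrative per-SM resource limits (assumed, not a specific GPU):
MAX_WARPS_PER_SM = 64
REGISTERS_PER_SM = 65536
SHARED_MEM_PER_SM = 48 * 1024  # bytes
THREADS_PER_WARP = 32

def occupancy(regs_per_thread, shared_per_block, threads_per_block):
    warps_per_block = threads_per_block // THREADS_PER_WARP
    # How many blocks fit on one SM, per limiting resource:
    by_regs = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    by_smem = (SHARED_MEM_PER_SM // shared_per_block
               if shared_per_block else float("inf"))
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    blocks = min(by_regs, by_smem, by_warps)
    active_warps = blocks * warps_per_block
    return active_warps / MAX_WARPS_PER_SM

# 32 registers/thread, 8 KiB shared memory, 256 threads/block:
# shared memory limits us to 6 blocks = 48 active warps = 75%.
print(occupancy(32, 8 * 1024, 256))  # -> 0.75
```

Note how the binding constraint here is shared memory, not registers or the warp cap; raising any one limit only helps if it is the bottleneck.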
Latency hiding: how GPUs hide memory latency by switching between warps.
A global memory access takes ~400-800 cycles. But with enough resident warps, the scheduler always has useful work to issue. This is why occupancy matters: more resident warps = more opportunities to hide latency = higher throughput.
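A back-of-envelope sketch of the arithmetic: if each warp issues a memory request every `compute_cycles` cycles of independent work and the request takes `mem_latency` cycles, the scheduler needs roughly `mem_latency / compute_cycles` resident warps to always have one ready. The 20-cycle compute figure is an illustrative assumption; the 400-800 cycle latencies come from the text.

```python
def warps_to_hide(mem_latency_cycles, compute_cycles_per_access):
    """Warps needed so some warp is always ready to issue (ceil division)."""
    return -(-mem_latency_cycles // compute_cycles_per_access)

# With a ~400-cycle memory latency and 20 cycles of independent math
# per access, about 20 resident warps keep the SM busy:
print(warps_to_hide(400, 20))  # -> 20
print(warps_to_hide(800, 20))  # -> 40
```

This is why the 400-800 cycle latency is survivable: a few dozen resident warps are enough to cover it, well within the 64-warp budget from the occupancy example.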
Thread coarsening: fewer threads, each doing more work, trading occupancy for efficiency.
Coarsening reduces overhead but also reduces occupancy (fewer warps). You're trading latency-hiding ability for per-thread efficiency. Profile to find the optimal balance — typically 2-8 elements per thread works well.
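A minimal sketch of the coarsening pattern, with plain Python loops standing in for a kernel: instead of one output element per "thread", each thread produces `COARSEN` elements, amortizing per-thread overhead (index math, setup) at the cost of launching fewer threads. The factor of 4 is one point inside the 2-8 range the text suggests.

```python
COARSEN = 4  # elements per thread; the text suggests 2-8 is typical

def coarsened_scale(data, scale, n_threads):
    out = [0.0] * len(data)
    for tid in range(n_threads):          # one iteration = one "thread"
        base = tid * COARSEN
        for i in range(base, min(base + COARSEN, len(data))):
            out[i] = data[i] * scale      # COARSEN elements per thread
    return out

data = list(range(8))
# 8 elements at 4 per thread: only 2 "threads" are needed.
print(coarsened_scale(data, 2.0, 2))  # -> [0.0, 2.0, 4.0, ..., 14.0]
```

In a real kernel the same shape appears as a small for-loop over `COARSEN` consecutive (or strided) indices inside each thread; the occupancy cost is that the grid shrinks by the same factor.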
Warp divergence: when threads in a warp take different execution paths.
All 32 threads in a warp share one instruction pointer. When they diverge on a branch, the GPU must execute both paths serially while masking inactive threads. Design kernels so threads within a warp take the same branch whenever possible.
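The serialization cost can be modeled in a few lines of Python. With one shared instruction pointer, a warp that splits across a branch executes each taken path in turn while masking the lanes on the other path. The 100-cycle path costs are illustrative assumptions.

```python
def warp_cycles(lane_takes_if, if_cycles=100, else_cycles=100):
    """Cycles for one 32-thread warp given each lane's branch outcome."""
    cycles = 0
    if any(lane_takes_if):        # at least one lane runs the if-path
        cycles += if_cycles
    if not all(lane_takes_if):    # at least one lane runs the else-path
        cycles += else_cycles
    return cycles

uniform  = [True] * 32                             # whole warp agrees
diverged = [lane % 2 == 0 for lane in range(32)]   # odd/even lane split
print(warp_cycles(uniform))    # -> 100
print(warp_cycles(diverged))   # -> 200: both paths run serially
```

This is why branching on warp-aligned quantities (e.g. block or warp index) is cheap, while branching on per-lane data like `tid % 2` pays for both sides of the branch.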
Memory coalescing: how memory access patterns determine bandwidth efficiency.
GPU memory is interleaved across multiple DRAM chips. Consecutive addresses go to different chips, allowing parallel access. When threads access consecutive addresses, all chips work together → maximum bandwidth. When threads access strided addresses, you're only using one chip at a time → wasted bandwidth.
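One way to see the cost of striding is to count memory transactions: a sketch below counts how many 128-byte transactions a 32-thread warp touches for a given access stride. The 4-byte elements and 128-byte transaction size are illustrative assumptions.

```python
TRANSACTION_BYTES = 128
ELEM_BYTES = 4
WARP = 32

def transactions(stride_elems):
    """Distinct 128-byte transactions touched by one warp's loads."""
    addrs = [lane * stride_elems * ELEM_BYTES for lane in range(WARP)]
    return len({addr // TRANSACTION_BYTES for addr in addrs})

print(transactions(1))   # -> 1: unit stride, one transaction per warp
print(transactions(32))  # -> 32: one transaction per lane, 32x traffic
```

Unit-stride access moves the warp's 128 bytes in a single transaction; a stride of 32 elements fetches 32 full transactions to use 4 bytes of each.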
This same principle applies to NVMe storage! Scattered 4KB reads become separate I/O commands with 10-100μs latency each. Sequential reads can be merged into large transfers, maximizing SSD throughput. GPUDirect Storage benefits most when access patterns are coalesced.
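The storage analogue can be sketched the same way: adjacent 4 KiB reads merge into one large command, paying the per-command latency (10-100 μs per the text) once instead of per block. The merge rule and the 20 μs figure in the comment are illustrative assumptions.

```python
BLOCK = 4096  # 4 KiB

def io_commands(offsets):
    """Merge sorted, byte-adjacent 4 KiB reads into contiguous commands."""
    cmds = 0
    prev_end = None
    for off in sorted(offsets):
        if off != prev_end:       # gap: a new I/O command must be issued
            cmds += 1
        prev_end = off + BLOCK
    return cmds

sequential = [i * BLOCK for i in range(64)]      # 256 KiB contiguous
scattered  = [i * 8 * BLOCK for i in range(64)]  # same data, gapped
print(io_commands(sequential))  # -> 1 merged command
print(io_commands(scattered))   # -> 64 separate commands
# At ~20 us per command, that is ~20 us vs ~1.3 ms of latency alone.
```

Because per-command latency dwarfs per-byte transfer time at this tier, the penalty for scattered access is even steeper than in DRAM.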
Memory scope: who can see and share each memory type (NVIDIA/CUDA-style).
| Memory Type | Scope | Notes |
|---|---|---|
| Registers | Per-thread | Fastest. Private to each thread. Spills go to local memory. |
| Shared Memory | Per-block (CTA) | On-chip SRAM. Shared by threads in same block. ~100× faster than global. |
| L1 Cache | Per-SM | Shared by warps on same SM. Not coherent across SMs. |
| L2 Cache | Device-wide | Shared by all SMs. Caches global memory accesses. |
| Global Memory (HBM/GDDR) | Device-wide (all kernels) | Main GPU memory. Persistent across kernels. Coalescing critical! |
| Unified Memory (UVM) | System-wide (CPU + GPU) | Automatically migrates between CPU/GPU. Convenient but has overhead. |
| Host Pinned Memory | CPU memory, GPU DMA access | Page-locked CPU memory. Enables fast H2D/D2H transfers. |
| Storage (NVMe/SSD) | System-wide | Via GPUDirect Storage. 10-100μs latency. Coalescing even more critical! |
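To put the table's tiers on one scale, a quick conversion of the text's figures into wall-clock time, assuming a 1.5 GHz GPU clock (an illustrative assumption; the 400-800 cycle and 10-100 μs figures come from the text):

```python
CLOCK_GHZ = 1.5  # assumed GPU clock for cycle-to-time conversion

def cycles_to_ns(cycles):
    return cycles / CLOCK_GHZ

print(f"global memory: ~{cycles_to_ns(400):.0f}-{cycles_to_ns(800):.0f} ns")
print("NVMe storage:  ~10,000-100,000 ns")
# Storage latency is tens to hundreds of times global memory latency,
# which is why the table calls coalescing "even more critical" there.
```

Each step down the table buys capacity at the price of latency, and the relative cost of an uncoalesced access grows with it.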