
CUDA Execution Concepts

A Visual Guide to GPU Performance Optimization

Learning Path

1. 📊 Occupancy

How fully utilized is the GPU? The foundation of performance.

Occupancy = Active Warps per SM ÷ Maximum Warps per SM

Example: 48 active warps / 64 max warps = 75% occupancy

Figure: a Streaming Multiprocessor (SM) with 64 warp slots (0-63). 48 slots hold active warps and 16 are empty = 75% occupancy.

What Limits Occupancy?

🔢 Registers per Thread

An SM has 65,536 registers total. If a kernel uses 64 registers/thread: max threads = 65,536 ÷ 64 = 1,024 threads = 1,024 ÷ 32 = 32 warps (only 50% occupancy!)

💾 Shared Memory per Block

An SM has 48 KB of shared memory. If a block uses 24 KB of shared memory: max blocks/SM = 48 KB ÷ 24 KB = 2 blocks. With 512 threads per block, that is 2 × 512 = 1,024 threads = 32 warps (50% occupancy).
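Putting the two limits together: here is a minimal host-side sketch (plain C++, hard-coding the example SM from above rather than querying a real device; production code would read cudaDeviceProp or call cudaOccupancyMaxActiveBlocksPerMultiprocessor) that computes the warp cap imposed by registers and shared memory:

```cpp
#include <algorithm>

// Occupancy cap for the example SM above: 65,536 registers,
// 48 KB shared memory, 64 warp slots, 32 threads per warp.
int maxActiveWarps(int regsPerThread, int smemPerBlock, int threadsPerBlock) {
    const int kRegsPerSM = 65536, kSmemPerSM = 48 * 1024;
    const int kMaxWarps = 64, kWarpSize = 32;

    // Register limit: how many threads fit, rounded down to whole warps.
    int regWarps = (kRegsPerSM / regsPerThread) / kWarpSize;

    // Shared-memory limit: whole blocks that fit, times warps per block.
    int blocks = kSmemPerSM / smemPerBlock;
    int smemWarps = blocks * (threadsPerBlock / kWarpSize);

    return std::min({regWarps, smemWarps, kMaxWarps});
}
```

With 64 registers/thread and 24 KB shared memory at 512 threads/block, both limits independently cap the SM at 32 warps (50% occupancy), matching the arithmetic above.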
💡 Why Occupancy Matters

Higher occupancy means more warps available to hide memory latency. When one warp stalls waiting for data, the scheduler can switch to another ready warp. Low occupancy = fewer warps to switch to = cores sit idle waiting. But 100% isn't always optimal — sometimes using more registers for fewer warps gives better per-thread performance!

2. 🔄 Warp Scheduling

How GPUs hide memory latency by switching between warps

Figure: warp-scheduling timeline across 8 CUDA cores (t=0 to t=7). Warp A executes (COMPUTE) for two steps, then stalls waiting on memory (~400 cycles of latency). The scheduler immediately switches to Warp B, which computes while A waits; Warp C sits ready and runs when B stalls in turn; Warp A resumes as soon as its data arrives. The warp scheduler picks from a ready-warp pool (Warps A, B, C, D, and more). Switching is zero-cost: there is no context save/restore, because every warp's state lives permanently in registers.
🚀 Latency Hiding = Throughput

Memory takes ~400-800 cycles. But with enough warps, the GPU is always doing useful work. This is why occupancy matters — more resident warps = more opportunities to hide latency = higher throughput.
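How many warps is "enough"? A toy model (not how the hardware scheduler actually accounts cycles, but a useful back-of-envelope): if each warp computes for some cycles and then stalls for the memory latency, you need enough other warps to fill the stall with compute.

```cpp
// Toy latency-hiding model: each warp computes for `computeCycles`,
// then stalls `latencyCycles` on a memory request. To keep the cores
// busy, other warps must cover the stall with their own compute.
int warpsToHideLatency(int latencyCycles, int computeCycles) {
    // 1 warp currently executing, plus enough to cover the stall
    // (rounding the division up).
    return 1 + (latencyCycles + computeCycles - 1) / computeCycles;
}
```

At ~400 cycles of latency with only 10 cycles of compute between loads, the model asks for about 41 resident warps; this is the quantitative reason occupancy matters for memory-bound kernels.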

3. 🧵 Thread Coarsening

Fewer threads doing more work each — trading occupancy for efficiency

BEFORE Fine-Grained: 1 Thread = 1 Element

8 threads launched for 8 elements: T0-T7 map one-to-one onto elements [0]-[7]. Overhead per thread: ❌ register allocation and scheduling overhead ❌ redundant instruction fetch ❌ address calculation per thread

int i = blockIdx.x * blockDim.x + threadIdx.x;
out[i] = process(in[i]); // 1 element per thread

AFTER Coarsened: 1 Thread = 4 Elements

Only 2 threads launched (4× coarsening factor): Thread 0 handles [0]-[3], Thread 1 handles [4]-[7]. Amortized overhead: ✓ 4× less thread management ✓ better instruction-cache reuse ✓ more registers available per thread

int base = threadIdx.x * COARSEN_FACTOR;
for (int j = 0; j < COARSEN_FACTOR; j++)
    out[base + j] = process(in[base + j]); // COARSEN_FACTOR elements per thread
The occupancy vs. efficiency trade-off:
• 1 element/thread: high occupancy, high overhead
• 2-8 elements/thread: the sweet spot, balanced
• 100+ elements/thread: occupancy too low to hide latency
⚓️ The Trade-off

Coarsening reduces overhead but also reduces occupancy (fewer warps). You're trading latency-hiding ability for per-thread efficiency. Profile to find the optimal balance — typically 2-8 elements per thread works well.
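A quick host-side sanity check of the coarsened index math (plain C++ simulating the kernel's mapping, with `tid` standing in for threadIdx.x): each simulated thread should touch its COARSEN_FACTOR consecutive elements exactly once, with no gaps or overlaps.

```cpp
#include <vector>

// Simulate the coarsened mapping: thread `tid` handles elements
// [tid * coarsenFactor, tid * coarsenFactor + coarsenFactor).
// Returns how many times each element was "processed".
std::vector<int> coverage(int numThreads, int coarsenFactor) {
    std::vector<int> touched(numThreads * coarsenFactor, 0);
    for (int tid = 0; tid < numThreads; ++tid) {   // one iteration = one thread
        int base = tid * coarsenFactor;            // same index math as the kernel
        for (int j = 0; j < coarsenFactor; ++j)
            touched[base + j] += 1;                // "process" element base+j
    }
    return touched;
}
```

One caveat the simulation hides: on a real GPU, consecutive-elements-per-thread means adjacent threads no longer read adjacent addresses, which can hurt coalescing; a block-strided variant (thread `tid` reads `tid`, `tid + numThreads`, ...) keeps accesses coalesced while still coarsening.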

4. Warp Divergence

When threads in a warp take different execution paths

if (tid < 24) // Path A
else          // Path B
→ 2 serial passes!

Figure: all 32 threads start active. Pass 1 executes Path A with threads 0-23 active and threads 24-31 masked off. Pass 2 executes Path B with threads 24-31 active and threads 0-23 masked off. The warp then reconverges with all 32 threads active. A divergent warp executes both paths sequentially, a 50% efficiency loss in this example.
⚠️ SIMT Execution Model

All 32 threads in a warp share one instruction pointer. When they diverge on a branch, the GPU must execute both paths serially while masking inactive threads. Design kernels so threads within a warp take the same branch whenever possible.
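The serialization rule can be modeled in a few lines of host C++ (an illustrative sketch of the SIMT model above, not real hardware behavior): count how many distinct branch outcomes appear among the 32 lanes, since each outcome taken by at least one lane costs one serial pass.

```cpp
#include <set>

// Passes needed for one two-way branch in a 32-lane warp:
// 1 if all lanes agree (uniform warp), 2 if the warp diverges.
int divergentPasses(const bool lanesTakeA[32]) {
    std::set<bool> outcomes;
    for (int lane = 0; lane < 32; ++lane)
        outcomes.insert(lanesTakeA[lane]);
    return static_cast<int>(outcomes.size());
}
```

The `tid < 24` split from the figure yields 2 passes. Branching on a warp-aligned condition such as `tid / 32 == 0` makes every lane in a warp agree, so each warp runs a single pass.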

5. 💾 Memory Coalescing

How memory access patterns determine bandwidth efficiency

Row-Major vs Column-Major Access

GOOD Row-Major Access (Coalesced)

Threads T0-T7 access consecutive addresses in the row-major matrix: [0,0] [0,1] [0,2] [0,3] [0,4] [0,5] [0,6] [0,7]. Result: 1 × 128-byte transaction, 100% bandwidth utilization.

A[row][threadIdx.x] // consecutive!

BAD Column-Major Access (Strided)

Threads T0-T7 access addresses a full row-width apart: T0 → [0,0], T1 → [1,0], T2 → [2,0], and so on. Result: 8 separate transactions = 12.5% efficiency.

A[threadIdx.x][col] // stride = row_width!

Why This Happens: DRAM Chip Architecture

GPU memory = multiple DRAM chips in parallel. The memory controller drives a 512-bit memory bus organized as 8 × 64-bit channels, one per chip, with addresses interleaved in 8-byte units: Chip 0 serves addresses 0-7 (then 64-71, and so on), Chip 1 serves 8-15, Chip 2 serves 16-23, ... Chip 7 serves 56-63.

✓ Coalesced access (addresses 0-63): all 8 chips activate simultaneously, each providing 8 bytes (64 bits), for a total of 8 × 8 = 64 bytes in ONE transaction.
✗ Strided access (addresses 0, 512, 1024, ...): only Chip 0 is useful per transaction; the other 7 chips fetch wasted data, so 8 separate transactions are needed = 8× slower.
🚀 The Key Insight

GPU memory is interleaved across multiple DRAM chips. Consecutive addresses go to different chips, allowing parallel access. When threads access consecutive addresses, all chips work together → maximum bandwidth. When threads access strided addresses, you're only using one chip at a time → wasted bandwidth.
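The transaction counts quoted above can be reproduced with a small host-side model (an approximation of the coalescing rule: one transaction per distinct 128-byte segment touched by the warp; real hardware has additional subtleties like sector granularity):

```cpp
#include <set>

// Count 128-byte transactions for a warp of 32 threads, each loading
// one 4-byte word at a stride of `strideWords` elements.
int transactions(int strideWords) {
    std::set<int> segments;
    for (int lane = 0; lane < 32; ++lane) {
        int byteAddr = lane * strideWords * 4;  // 4-byte elements
        segments.insert(byteAddr / 128);        // 128-byte segment index
    }
    return static_cast<int>(segments.size());
}
```

Stride 1 (consecutive floats) touches a single 128-byte segment: one transaction. A stride of 32 words (a 128-byte row width, as in the column-major example) puts every lane in its own segment: 32 transactions, one per thread.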

💾 GPU-Storage Connection

This same principle applies to NVMe storage! Scattered 4KB reads become separate I/O commands with 10-100μs latency each. Sequential reads can be merged into large transfers, maximizing SSD throughput. GPUDirect Storage benefits most when access patterns are coalesced.
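The storage analogy can be sketched the same way (a simplified model of request merging, not the GPUDirect Storage API: assume 4 KB reads and that only byte-contiguous reads merge into one command):

```cpp
#include <algorithm>
#include <vector>

// Merge adjacent 4 KB reads into larger I/O commands: after sorting,
// contiguous offsets collapse into one command; any gap starts a new one.
int ioCommands(std::vector<long> offsets) {  // byte offsets of 4 KB reads
    if (offsets.empty()) return 0;
    std::sort(offsets.begin(), offsets.end());
    int cmds = 1;
    for (std::size_t i = 1; i < offsets.size(); ++i)
        if (offsets[i] != offsets[i - 1] + 4096)  // not contiguous
            ++cmds;
    return cmds;
}
```

Four sequential 4 KB reads merge into a single 16 KB transfer, while four reads scattered 1 MB apart cost four commands, each paying the full 10-100 μs latency.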

6. 📋 GPU Memory Scope Reference

Who can see/share each memory type (NVIDIA/CUDA-style)

Memory Type | Scope | Notes
Registers | Per-thread | Fastest. Private to each thread. Spills go to local memory.
Shared Memory | Per-block (CTA) | On-chip SRAM. Shared by threads in the same block. ~100× faster than global.
L1 Cache | Per-SM | Shared by warps on the same SM. Not coherent across SMs.
L2 Cache | Device-wide | Shared by all SMs. Caches global memory accesses.
Global Memory (HBM/GDDR) | Device-wide (all kernels) | Main GPU memory. Persistent across kernels. Coalescing critical!
Unified Memory (UVM) | System-wide (CPU + GPU) | Automatically migrates between CPU/GPU. Convenient but has overhead.
Host Pinned Memory | CPU memory, GPU DMA access | Page-locked CPU memory. Enables fast H2D/D2H transfers.
Storage (NVMe/SSD) | System-wide | Via GPUDirect Storage. 10-100 μs latency. Coalescing even more critical!