A choreographed journey through the complete GPU architecture stack — from high-level PyTorch operations down to tensor cores and matrix units
Watch tensors flow from Python through CUDA to silicon
Comprehensive overviews that tie the entire stack together
High-level APIs, parallelism strategies, and training orchestration
cuBLAS, cuDNN, Flash Attention, and optimized kernels
NCCL, RCCL, AllReduce, AllGather, and inter-GPU communication
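The AllReduce collective named above is typically implemented as a ring: a reduce-scatter pass followed by an all-gather, so each rank only ever talks to its neighbor. A minimal pure-Python sketch of that algorithm (purely illustrative — real NCCL/RCCL move chunks over NVLink/InfiniBand, and none of these names are NCCL's API):

```python
def ring_allreduce(buffers):
    """Sum per-rank buffers so every rank ends with the full total.

    buffers: one equal-length list per simulated rank (GPU).
    """
    n = len(buffers)            # number of ranks in the ring
    c = len(buffers[0]) // n    # each rank owns one chunk of the buffer

    def chunk(r, i):
        return buffers[r][i * c:(i + 1) * c]

    # Phase 1 — reduce-scatter: after n-1 steps, rank r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all "sends" happen simultaneously.
        sends = [(r, (r - step) % n, list(chunk(r, (r - step) % n)))
                 for r in range(n)]
        for r, i, data in sends:
            dst = (r + 1) % n
            for k in range(c):
                buffers[dst][i * c + k] += data[k]

    # Phase 2 — all-gather: circulate each finished chunk around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(chunk(r, (r + 1 - step) % n)))
                 for r in range(n)]
        for r, i, data in sends:
            dst = (r + 1) % n
            buffers[dst][i * c:(i + 1) * c] = data
    return buffers
```

Each rank sends and receives 2(n-1) chunks regardless of ring size, which is why the ring pattern keeps per-GPU bandwidth constant as the cluster grows.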
SMs, warps, memory hierarchy, and execution model
The silicon that does the actual matrix math
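A tensor core's primitive operation is a small fixed-size matrix multiply-accumulate, D = A·B + C, and a full GEMM decomposes into many such tile operations. A minimal pure-Python sketch of that decomposition (the tile size and function names here are illustrative, not real hardware or CUDA WMMA code):

```python
def mma_tile(A, B, C):
    """One tensor-core-style MMA on small square tiles: D = A @ B + C."""
    t = len(A)
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(t))
             for j in range(t)] for i in range(t)]

def tiled_matmul(A, B, t=2):
    """Decompose an n x n GEMM into t x t tile MMAs, accumulating over k."""
    n = len(A)
    D = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            acc = [[0.0] * t for _ in range(t)]   # accumulator tile
            for k0 in range(0, n, t):
                At = [[A[i0 + i][k0 + k] for k in range(t)] for i in range(t)]
                Bt = [[B[k0 + k][j0 + j] for j in range(t)] for k in range(t)]
                acc = mma_tile(At, Bt, acc)       # one MMA per k-tile
            for i in range(t):
                for j in range(t):
                    D[i0 + i][j0 + j] = acc[i][j]
    return D
```

Keeping the accumulator tile resident while streaming A and B tiles through it is the same data-reuse pattern the hardware exploits: one small high-precision accumulator, many low-precision operand tiles.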
18 additional visualizations covering tensor cores, matrix units, TMEM, AGPRs, and more
⚡ Open Tensor Core Library →