A choreographed journey through the complete GPU architecture stack — from high-level PyTorch operations down to tensor cores and matrix units
Watch tensors flow from Python through CUDA to silicon
Comprehensive overviews that tie the entire stack together
High-level APIs, parallelism strategies, and training orchestration
cuBLAS, cuDNN, Flash Attention, and optimized kernels
NCCL, RCCL, AllReduce, AllGather, and inter-GPU communication
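The AllReduce collective named above is typically implemented as a ring: a reduce-scatter pass followed by an all-gather, so each rank only ever talks to its neighbor. A minimal pure-Python sketch of that algorithm (purely illustrative — real NCCL/RCCL move chunks over NVLink/InfiniBand, and none of these names are NCCL's API):

```python
def ring_allreduce(buffers):
    """Sum per-rank buffers so every rank ends with the full total.

    buffers: one equal-length list per simulated rank (GPU).
    """
    n = len(buffers)            # number of ranks in the ring
    c = len(buffers[0]) // n    # each rank owns one chunk of the buffer

    def chunk(r, i):
        return buffers[r][i * c:(i + 1) * c]

    # Phase 1 — reduce-scatter: after n-1 steps, rank r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all "sends" happen simultaneously.
        sends = [(r, (r - step) % n, list(chunk(r, (r - step) % n)))
                 for r in range(n)]
        for r, i, data in sends:
            dst = (r + 1) % n
            for k in range(c):
                buffers[dst][i * c + k] += data[k]

    # Phase 2 — all-gather: circulate each finished chunk around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(chunk(r, (r + 1 - step) % n)))
                 for r in range(n)]
        for r, i, data in sends:
            dst = (r + 1) % n
            buffers[dst][i * c:(i + 1) * c] = data
    return buffers
```

Each rank sends and receives 2(n-1) chunks regardless of ring size, which is why the ring pattern keeps per-GPU bandwidth constant as the cluster grows.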
SMs, warps, memory hierarchy, and execution model
The silicon that does the actual matrix math
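A tensor core's primitive operation is a small fixed-size matrix multiply-accumulate, D = A·B + C, and a full GEMM decomposes into many such tile operations. A minimal pure-Python sketch of that decomposition (the tile size and function names here are illustrative, not real hardware or CUDA WMMA code):

```python
def mma_tile(A, B, C):
    """One tensor-core-style MMA on small square tiles: D = A @ B + C."""
    t = len(A)
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(t))
             for j in range(t)] for i in range(t)]

def tiled_matmul(A, B, t=2):
    """Decompose an n x n GEMM into t x t tile MMAs, accumulating over k."""
    n = len(A)
    D = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            acc = [[0.0] * t for _ in range(t)]   # accumulator tile
            for k0 in range(0, n, t):
                At = [[A[i0 + i][k0 + k] for k in range(t)] for i in range(t)]
                Bt = [[B[k0 + k][j0 + j] for j in range(t)] for k in range(t)]
                acc = mma_tile(At, Bt, acc)       # one MMA per k-tile
            for i in range(t):
                for j in range(t):
                    D[i0 + i][j0 + j] = acc[i][j]
    return D
```

Keeping the accumulator tile resident while streaming A and B tiles through it is the same data-reuse pattern the hardware exploits: one small high-precision accumulator, many low-precision operand tiles.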
18 additional visualizations covering tensor cores, matrix units, TMEM, AGPRs, and more
⚡ Open Tensor Core Library →