Featured — Start Here
01
NVIDIA Evolution: LDGSTS → TMA → TMEM
Complete comparison of A100, H100, and B200 with code examples
02
HipKittens: AMD Optimization Patterns
8-Wave Ping-Pong, 4-Wave Interleave, and XCD Grouping explained
NVIDIA Deep Dives
16
NVIDIA Tensor Cores — Deep Architecture Guide
Complete technical breakdown from Volta to Blackwell with PTX examples
03
NVIDIA Tensor Core Timeline
Volta → Ampere → Hopper → Blackwell evolution
04
Feeding the Tensor Cores — Complete Guide
Comprehensive breakdown of data movement strategies
05
Warpgroups & TMEM Explained
128 threads working as one unit, and how TMEM slices are owned
10
TMEM & Register Pressure
How TMEM relieves register pressure by holding tensor tiles outside the register file
AMD Deep Dives
17
AMD Matrix Cores — Deep Architecture Guide
MI300X MFMA instructions, AGPRs, and CDNA3 internals
06
AMD MFMA Scheduling Deep Dive
Understanding MFMA blocking behavior and scheduling challenges
09
AMD AGPR Restrictions
Accumulator registers and their unique constraints
Comparisons
18
NVIDIA vs AMD — 2026 Architecture Comparison
Head-to-head: Blackwell B200 vs MI300X tensor/matrix cores
07
Wave Specialization: NVIDIA vs AMD
Why NVIDIA's wave-specialization patterns don't carry over to AMD hardware
08
Memory Layouts & Swizzling
NVIDIA XOR swizzle vs AMD shape-specific layouts
12
NVIDIA vs AMD Complete Comparison
Side-by-side architectural comparison
Animations & Visuals
11
Tensor Architecture Visual
Complete visual breakdown of tensor core architecture
13
Tensor Data Flow Animation
Animated visualization of data movement through the SM
14
TMEM Flow Animation
Blackwell TMEM data path visualization
15
Hologram — Original Concept
The original hologram visualization