From silicon interconnects to training loops — a unified visual guide to multi-GPU deep learning infrastructure. Hardware, communication, parallelism strategies, and production configurations.
The physical foundation — how GPUs connect to each other at the silicon level
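Before reaching for topology diagrams, it helps to see what the software stack actually reports about the silicon. A minimal sketch, assuming PyTorch with CUDA and at least two visible GPUs: it probes which device pairs have a direct peer-to-peer path (NVLink or PCIe P2P), the same information `nvidia-smi topo -m` renders as a matrix.

```python
# Minimal sketch: probe direct peer-to-peer (NVLink or PCIe P2P) reachability
# between every GPU pair. Assumes PyTorch with CUDA and >= 2 visible GPUs.
import torch

def print_p2p_matrix() -> None:
    n = torch.cuda.device_count()
    print("      " + "".join(f"GPU{j:<4}" for j in range(n)))
    for i in range(n):
        cells = []
        for j in range(n):
            if i == j:
                cells.append("self   ")
            elif torch.cuda.can_device_access_peer(i, j):
                cells.append("P2P    ")  # direct path, no bounce through host memory
            else:
                cells.append("none   ")  # traffic must cross the CPU/PCIe root complex
        print(f"GPU{i:<3}" + "".join(cells))

if __name__ == "__main__":
    print_p2p_matrix()
```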
NCCL & RCCL — software that orchestrates GPU collective operations
How to distribute the model and data across GPUs for training
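The three parallelism degrees in the table below (DP, TP, PP) multiply out to the total GPU count, and each rank occupies one coordinate in that 3-D grid. A minimal sketch of the bookkeeping, assuming one common convention (TP varies fastest, then PP, then DP); real frameworks such as Megatron-LM define their own group orderings.

```python
# Minimal sketch: decompose a flat rank into (DP, PP, TP) coordinates.
# Assumes TP innermost, then PP, then DP (one common convention, not the
# only one). The degrees must satisfy dp * pp * tp == world_size,
# e.g. 64 * 2 * 8 = 1024 for the GPT-3 row in the table below.
from dataclasses import dataclass

@dataclass(frozen=True)
class Coord:
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel shard index

def rank_to_coord(rank: int, dp: int, pp: int, tp: int) -> Coord:
    assert 0 <= rank < dp * pp * tp, "rank out of range for this grid"
    return Coord(dp=rank // (pp * tp), pp=(rank // tp) % pp, tp=rank % tp)

if __name__ == "__main__":
    # GPT-3-style grid from the table: DP=64, PP=2, TP=8 -> 1,024 GPUs.
    print(rank_to_coord(0, 64, 2, 8))     # Coord(dp=0, pp=0, tp=0)
    print(rank_to_coord(1023, 64, 2, 8))  # Coord(dp=63, pp=1, tp=7)
```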
How all the layers work together in production training:
| Model | Parameters | GPUs | DP | TP | PP | MFU | Training time |
|---|---|---|---|---|---|---|---|
| GPT-3 | 175B | 1,024 V100 | 64 | 8 | 2 | 46% | 34 days |
| LLaMA 2 | 70B | 2,048 A100 | 256 | 8 | 1 | 55% | 21 days |
| Llama 3.1 | 405B | 16,384 H100 | 256 | 8 | 8 | 38% | 30.84M GPU-hours |
| DeepSeek-V3 | 671B (MoE) | 2,048 H800 | 128 | — | 16 | 52% | 55 days |

DP/TP/PP are the data-, tensor-, and pipeline-parallel degrees; DP × TP × PP equals the GPU count. DeepSeek-V3 used no tensor parallelism, relying instead on 64-way expert parallelism and ZeRO-1 data parallelism. MFU is model FLOPs utilization.
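MFU ties the whole table together: it is the ratio of useful model FLOPs to the FLOPs the hardware could theoretically deliver over the same wall-clock time. Below is a minimal sketch using the standard 6 · params · tokens approximation for dense-transformer training FLOPs; the model size, token count, GPU count, MFU, and the 989 TFLOPS BF16 peak per H100 in the example are illustrative assumptions, not values taken from the table.

```python
# Minimal sketch: estimate training time from the standard
# 6 * params * tokens approximation for dense-transformer training FLOPs.
# All example inputs below are illustrative assumptions.

def train_days(params: float, tokens: float, n_gpus: int,
               peak_tflops: float, mfu: float) -> float:
    model_flops = 6.0 * params * tokens            # forward + backward, dense model
    sustained = n_gpus * peak_tflops * 1e12 * mfu  # FLOP/s actually achieved
    return model_flops / sustained / 86_400.0      # seconds -> days

if __name__ == "__main__":
    # Hypothetical run: 10B dense model, 1T tokens, 256 H100s
    # (~989 TFLOPS BF16 peak each), 45% MFU.
    print(f"{train_days(10e9, 1e12, 256, 989.0, 0.45):.1f} days")  # ~6.1 days
```

Running the same formula against the table rows requires each model's training-token count, which the table omits; published MFU figures also vary with the accounting, for example whether activation recomputation counts as useful FLOPs.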