How model weights and data are distributed across GPUs
Data parallelism: splits the dataset into smaller subsets across multiple GPUs. Each GPU trains a complete replica of the model on its data subset, then gradients are synchronized (all-reduced) across replicas.
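A minimal NumPy sketch of the data-parallel step above (toy sizes; the all-reduce that a real system would do over NCCL is simulated here with `np.mean`): averaging per-shard gradients of a mean-squared-error loss reproduces the full-batch gradient exactly when shards are equal-sized.

```python
import numpy as np

# Toy data-parallel step: each "GPU" holds a full copy of the weights
# and computes the gradient on its own shard of the batch.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 4)), rng.normal(size=8)
w = np.zeros(4)

def grad(Xs, ys, w):
    # Gradient of the mean squared error on one shard
    return 2.0 / len(ys) * Xs.T @ (Xs @ w - ys)

# Simulated all-reduce: average gradients from 4 workers (2 examples each)
shard_grads = [grad(X[i::4], y[i::4], w) for i in range(4)]
avg_grad = np.mean(shard_grads, axis=0)

full_grad = grad(X, y, w)
assert np.allclose(avg_grad, full_grad)  # same update as full-batch training
```

This equivalence is why data parallelism changes throughput but not the optimization trajectory (up to batch-size effects).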
Model parallelism: divides the model itself across multiple GPUs. Different GPUs hold different layers or blocks of the model, passing activations between stages.
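A toy two-stage sketch of this layer split (devices simulated in NumPy; in PyTorch each stage would live on its own device via `.to("cuda:0")` / `.to("cuda:1")`, and the activation hand-off would be a device-to-device copy):

```python
import numpy as np

# Two-stage model parallelism: stage 0 "lives" on GPU 0, stage 1 on GPU 1.
rng = np.random.default_rng(1)
W0 = rng.normal(size=(4, 8))   # stage 0: first block of layers
W1 = rng.normal(size=(8, 2))   # stage 1: remaining layers

def stage0(x):
    return np.maximum(x @ W0, 0)   # ReLU block, computed on "device 0"

def stage1(h):
    return h @ W1                   # final block, computed on "device 1"

x = rng.normal(size=(3, 4))
h = stage0(x)      # activation produced on device 0...
out = stage1(h)    # ...is sent across the interconnect to device 1
assert np.allclose(out, np.maximum(x @ W0, 0) @ W1)  # same as monolithic model
```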
Pipeline parallelism: combines model parallelism with micro-batching. Micro-batches flow through the stages in pipeline fashion, reducing idle time (the pipeline bubble) compared to naive model parallelism.
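The bubble reduction can be made concrete with standard GPipe-style schedule arithmetic (`bubble_fraction` is an illustrative helper, not a library API): with S stages and M micro-batches, a pass takes S + M - 1 steps, and each stage is busy for only M of them.

```python
# GPipe-style accounting: idle fraction of a pipeline stage is
# (S - 1) / (S + M - 1) for S stages and M micro-batches.
def bubble_fraction(stages, micro_batches):
    total_steps = stages + micro_batches - 1
    busy_steps = micro_batches            # each stage works for M steps
    return (total_steps - busy_steps) / total_steps

assert bubble_fraction(4, 1) == 0.75      # naive model parallelism: 75% idle
assert bubble_fraction(4, 16) < 0.16      # micro-batching shrinks the bubble
```

Increasing M drives the bubble toward zero, which is why pipeline implementations split each mini-batch into many micro-batches.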
Tensor parallelism: splits individual tensor operations across GPUs. Matrix multiplications are partitioned column-wise or row-wise, which requires a high-bandwidth interconnect such as NVLink.
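A column-wise split sketched in NumPy (two simulated GPUs; the concatenation stands in for the all-gather a real implementation performs):

```python
import numpy as np

# Column-wise tensor parallelism for Y = X @ W: each GPU holds one
# column shard of W and computes its slice of Y locally.
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 6))
W = rng.normal(size=(6, 8))

shards = np.split(W, 2, axis=1)              # W = [W0 | W1], one per GPU
Y_parallel = np.concatenate([X @ Ws for Ws in shards], axis=1)  # "all-gather"

assert np.allclose(Y_parallel, X @ W)        # identical to the unsharded matmul
```

A row-wise split instead partitions W's rows (and X's columns), producing partial products that must be summed with an all-reduce; Megatron-style transformer blocks alternate the two layouts to minimize communication.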
Mixture of Experts (expert parallelism): routes tokens to specific expert FFN blocks based on a gating function. Only a subset of experts processes each token, so parameter count can scale massively without a proportional increase in compute per token.
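A toy top-1 router in NumPy (the gate and experts are random stand-ins; real MoE layers add load-balancing losses and capacity limits on top of this core mechanic):

```python
import numpy as np

# Top-1 token routing: the gate scores each token against the experts;
# only the highest-scoring expert's FFN runs for that token.
rng = np.random.default_rng(3)
tokens = rng.normal(size=(5, 4))                       # 5 tokens, d_model = 4
gate_W = rng.normal(size=(4, 3))                       # scores for 3 experts
experts = [rng.normal(size=(4, 4)) for _ in range(3)]  # toy expert "FFNs"

choice = np.argmax(tokens @ gate_W, axis=1)            # expert index per token
out = np.empty_like(tokens)
for e in range(3):
    mask = choice == e
    out[mask] = tokens[mask] @ experts[e]              # only routed tokens hit expert e

assert out.shape == tokens.shape
```

Each token touches exactly one expert's weights, so adding experts grows capacity while per-token FLOPs stay flat.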
ZeRO: the Zero Redundancy Optimizer partitions optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across GPUs, dramatically reducing the per-GPU memory footprint.
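The savings follow from simple bookkeeping. Using the mixed-precision-Adam accounting from the ZeRO paper (2Ψ bytes of fp16 weights + 2Ψ fp16 gradients + 12Ψ fp32 optimizer states = 16Ψ bytes for Ψ parameters; `bytes_per_gpu` is an illustrative helper that ignores activations and fragmentation):

```python
# Per-GPU training memory for a Psi-parameter model under ZeRO stages 0-3.
def bytes_per_gpu(params, n_gpus, stage):
    weights, grads, opt = 2 * params, 2 * params, 12 * params
    if stage >= 1: opt /= n_gpus         # ZeRO-1: shard optimizer states
    if stage >= 2: grads /= n_gpus       # ZeRO-2: also shard gradients
    if stage >= 3: weights /= n_gpus     # ZeRO-3: also shard parameters
    return weights + grads + opt

psi = 7.5e9                              # e.g. a 7.5B-parameter model
assert bytes_per_gpu(psi, 64, 0) == 16 * psi        # 120 GB: exceeds any single GPU
assert bytes_per_gpu(psi, 64, 3) == 16 * psi / 64   # under 2 GB per GPU
```

ZeRO-3 makes the footprint scale as 1/N in the number of GPUs, at the cost of extra parameter gather/scatter communication.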
Visual comparison of parallelism strategies
- Data parallelism: model weights replicated on all GPUs; data split across GPUs.
- Model parallelism: model weights split by layer; the same batch on all GPUs.
- 2D parallelism (model × data): model and data parallelism combined.
- Expert parallelism: unique experts per GPU; data routed token-by-token.
- 3D parallelism (expert × model × data): expert, model, and data parallelism combined.
Specialized techniques for specific use cases
3D parallelism: combines data, tensor, and pipeline parallelism for maximum scale. Used by Megatron-DeepSpeed to train trillion-parameter models.
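The three degrees must tile the cluster: every GPU belongs to exactly one tensor-, one pipeline-, and one data-parallel group, so the world size factors as tp × pp × dp (the 8 × 16 × 8 layout below is purely illustrative):

```python
# 3D layout arithmetic: the parallelism grid must tile the cluster exactly.
def world_size(tp, pp, dp):
    return tp * pp * dp

def data_parallel_degree(n_gpus, tp, pp):
    assert n_gpus % (tp * pp) == 0, "grid must tile the cluster exactly"
    return n_gpus // (tp * pp)

assert world_size(8, 16, 8) == 1024
assert data_parallel_degree(1024, 8, 16) == 8   # remaining factor goes to data
```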
CPU/NVMe offloading (ZeRO-Offload, ZeRO-Infinity): moves optimizer states or parameters to CPU RAM or NVMe when they are not needed, enabling larger models on limited GPU memory.
Asynchronous training: GPUs compute independently without synchronization barriers. Faster per step, but may introduce stale gradients.
Federated learning: trains across decentralized devices while keeping data local. Gradients or model updates are aggregated centrally.
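The canonical aggregation rule, FedAvg, is just an example-count-weighted average of the clients' locally trained weights (toy NumPy sketch; real deployments add secure aggregation and client sampling):

```python
import numpy as np

# FedAvg: the server's new weights are the dataset-size-weighted
# average of each client's locally trained weights.
rng = np.random.default_rng(4)
client_weights = [rng.normal(size=3) for _ in range(3)]  # one vector per client
client_sizes = np.array([10, 30, 60])                    # local dataset sizes

coef = client_sizes / client_sizes.sum()                 # weights sum to 1
global_w = sum(c * w for c, w in zip(coef, client_weights))

assert np.allclose(global_w,
                   np.average(client_weights, axis=0, weights=client_sizes))
```

Only the weight vectors cross the network; the raw training examples never leave the clients.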
Sequence parallelism: splits long sequences across GPUs for memory efficiency in attention layers. Complements tensor parallelism.
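For position-wise operations (LayerNorm, dropout, the MLP) the split is exact with no communication, as this NumPy sketch shows; attention itself needs gathers across chunks, which is where the real engineering lives:

```python
import numpy as np

# Sequence parallelism for position-wise ops: split the activation tensor
# along the sequence axis, process each chunk on its own GPU, concatenate.
rng = np.random.default_rng(5)
acts = rng.normal(size=(16, 8))                 # (sequence, hidden)

def pointwise(x):                               # any per-position op, e.g. ReLU
    return x * (x > 0)

chunks = np.split(acts, 4, axis=0)              # one chunk per GPU
out = np.concatenate([pointwise(c) for c in chunks], axis=0)

assert np.allclose(out, pointwise(acts))        # identical result...
assert chunks[0].shape[0] == acts.shape[0] // 4 # ...with 1/4 the activations held
```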
Activation (gradient) checkpointing: trades compute for memory by recomputing activations during the backward pass instead of storing them.
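The classic trade-off in arithmetic form (`peak_memory` is an illustrative cost model counting activation tensors, not a library function): keeping a checkpoint every √L layers cuts peak activation memory from O(L) to O(√L), at the cost of roughly one extra forward pass.

```python
import math

# Cost model: checkpoints kept for the whole network, plus the activations
# recomputed inside the one segment currently being backpropagated.
def peak_memory(layers, checkpoint_every):
    return layers // checkpoint_every + checkpoint_every

L = 64
assert peak_memory(L, 1) == 65                    # store (almost) everything
assert peak_memory(L, int(math.sqrt(L))) == 16    # 64/8 + 8: far smaller peak
```

In PyTorch this strategy is exposed as `torch.utils.checkpoint`; DeepSpeed and Megatron ship equivalent wrappers for transformer blocks.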
Decision guide based on your constraints