DISTRIBUTED TRAINING
2026

Parallelism Strategies

How model weights and data are distributed across GPUs

📊 Data Parallelism

Splits the dataset into smaller subsets across multiple GPUs. Each GPU trains a complete replica of the model on its own subset; gradients are then averaged across replicas (AllReduce) so every copy applies the same update.

[Diagram: GPU 0–3 each hold a full model replica and train on their own shard (Data₀–Data₃); an AllReduce synchronizes gradients. Same model, different data batches.]
Large Datasets · Gradient Sync · Model Fits in Memory
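A minimal sketch of the synchronization step, in pure Python with a toy quadratic loss — the "GPUs" are just lists, and `all_reduce_mean` stands in for the real AllReduce collective; no communication library is involved:

```python
# Each "GPU" holds a full model replica and computes gradients on its own
# data shard; an all-reduce then averages the gradients so every replica
# applies the same update.

def local_gradients(weights, batch):
    # Toy gradient: derivative of 0.5*(w*x - x)^2 summed over the shard.
    return [sum((w * x - x) * x for x in batch) for w in weights]

def all_reduce_mean(grads_per_gpu):
    # Average corresponding gradient entries across all replicas.
    n = len(grads_per_gpu)
    return [sum(g[i] for g in grads_per_gpu) / n
            for i in range(len(grads_per_gpu[0]))]

weights = [0.0, 2.0]                      # replicated on every "GPU"
shards = [[1.0, 2.0], [3.0], [4.0, 5.0]]  # dataset split across 3 "GPUs"

grads = [local_gradients(weights, shard) for shard in shards]
synced = all_reduce_mean(grads)           # identical on every replica
weights = [w - 0.01 * g for w, g in zip(weights, synced)]
```

Because every replica applies the same averaged gradient, the models stay bit-identical without ever exchanging weights.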
🧩 Model Parallelism

Divides the model itself across multiple GPUs. Different GPUs handle different layers or blocks of the model, passing activations between stages.

[Diagram: layers 0–5 on GPU 0, 6–11 on GPU 1, 12–17 on GPU 2, 18–23 on GPU 3; activations flow forward, gradients flow backward. Model split across GPUs, same data.]
Large Models · Sequential Execution · Memory Efficient
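A toy sketch of the stage-to-stage handoff (assumed pure Python, no real devices): each "GPU" owns a slice of the layer list, and a batch must pass through the stages sequentially:

```python
# The model's layers are partitioned across "GPUs"; each stage consumes the
# previous stage's activations, so execution is sequential for one batch.

def make_layer(scale):
    return lambda x: [scale * v for v in x]

layers = [make_layer(s) for s in (2, 3, 1, 5)]
stages = [layers[0:2], layers[2:4]]   # layers 0-1 on GPU 0, layers 2-3 on GPU 1

def forward(stages, batch):
    acts = batch
    for stage in stages:              # activations passed GPU -> GPU
        for layer in stage:
            acts = layer(acts)
    return acts

out = forward(stages, [1.0, -1.0])    # 1 * 2 * 3 * 1 * 5 = 30 per element
```

Note that while GPU 1 computes, GPU 0 sits idle — the inefficiency that pipeline parallelism (next) addresses.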
🔄 Pipeline Parallelism

Combines model parallelism with micro-batching. Micro-batches flow through the stages in pipeline fashion, reducing idle time (the "bubble") compared to naive model parallelism.

[Diagram: GPipe-style timeline of micro-batches μB0–μB3 overlapping across GPU0–GPU3, with idle "bubble" slots at the schedule edges. Micro-batches overlap stages.]
Deep Models · Reduced Bubbles · GPipe / PipeDream
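The idle fraction of a GPipe-style schedule has a standard closed form, bubble = (p − 1) / (m + p − 1) for p stages and m micro-batches, which makes the benefit easy to quantify:

```python
# Fraction of time the pipeline sits idle (the "bubble") under a
# GPipe-style schedule with p stages and m micro-batches.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

naive = bubble_fraction(4, 1)    # naive model parallelism: 0.75 (75% idle)
piped = bubble_fraction(4, 4)    # 4 micro-batches as in the diagram: ~0.43
many  = bubble_fraction(4, 32)   # more micro-batches shrink the bubble: ~0.086
```

This is why schedules chase large micro-batch counts: with m ≫ p the bubble becomes negligible, at the cost of smaller per-micro-batch work.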
🔢 Tensor Parallelism

Splits individual tensor operations across GPUs. Matrix multiplications are partitioned column-wise or row-wise, requiring a high-bandwidth interconnect such as NVLink.

[Diagram: A × B = C with B sharded across GPU 0–2 and the partial results combined via AllReduce. Row/column-parallel GEMM operations; high NVLink bandwidth required.]
Transformers · Megatron-LM · High Bandwidth
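A pure-Python sketch of the column-parallel case (assumed toy matrices as nested lists): B's columns are split across "GPUs", each computes its slice of C independently, and the slices are concatenated. A row-parallel split would instead sum partial results with an AllReduce:

```python
# Column-parallel GEMM: shard B's columns, compute A @ B_shard per "GPU",
# then concatenate the column blocks of C.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def split_columns(B, parts):
    cols = list(zip(*B))
    step = len(cols) // parts
    return [[list(c) for c in zip(*cols[i*step:(i+1)*step])]
            for i in range(parts)]

A = [[1, 2], [3, 4]]
B = [[5, 6, 7, 8], [9, 10, 11, 12]]

shards = split_columns(B, 2)                    # GPU 0: cols 0-1, GPU 1: cols 2-3
partials = [matmul(A, s) for s in shards]       # each "GPU" computes its slice
C = [sum(rows, []) for rows in zip(*partials)]  # concatenate column blocks
```

The concatenation (or AllReduce, for row-parallel) after every sharded layer is the communication that demands NVLink-class bandwidth.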
🎯 Expert Parallelism (MoE)

Routes each token to specific expert FFN blocks via a gating function. Only a subset of experts processes each token, so parameter count scales massively while compute per token stays roughly constant.

[Diagram: a router dispatches tokens to Expert 0–3 on GPU 0–3; an All-to-All combines the outputs. Top-K routing activates a subset of experts.]
Mixtral / GPT-4 · Sparse Activation · All-to-All Comm
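A sketch of top-k gating with hard-coded toy scores (no learned router; ties break by expert index). Each token activates only its top-k experts, which is what keeps compute sparse:

```python
# Top-k routing: each token's gate scores select its k experts; only those
# experts run for that token.

def top_k(scores, k=2):
    return sorted(range(len(scores)), key=lambda e: -scores[e])[:k]

# toy gate scores, one row per token, over 4 experts
gate = [
    [0.1, 0.7, 0.1, 0.1],   # token 0 strongly prefers expert 1
    [0.4, 0.1, 0.4, 0.1],   # token 1 is split between experts 0 and 2
]
routes = [top_k(s, k=2) for s in gate]

# Count tokens assigned per expert -- uneven load is the classic MoE
# balancing problem that auxiliary losses and capacity factors address.
load = {e: 0 for e in range(4)}
for r in routes:
    for e in r:
        load[e] += 1
```

In a real system each expert lives on a different GPU, so the dispatch and combine steps become the All-to-All communication noted above.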
💾 ZeRO Parallelism

Zero Redundancy Optimizer partitions optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across GPUs, dramatically reducing memory footprint.

[Diagram: memory per GPU by DeepSpeed ZeRO stage — Standard DP: params Ψ + grads Ψ + optimizer state 2Ψ = 4Ψ total; ZeRO-1: ~2.5Ψ; ZeRO-2: ~1.5Ψ; ZeRO-3: 4Ψ/N.]
DeepSpeed · Memory Efficient · GPT-3 Training
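The staged savings follow directly from which of the three buffers get sharded. A sketch using the figure's simplified accounting (params Ψ, grads Ψ, optimizer state 2Ψ; the ZeRO paper's byte-level mixed-precision accounting differs in constants but not in shape), evaluated here for an assumed N = 8 GPUs:

```python
# Per-GPU memory in units of Psi (parameter count). Each ZeRO stage shards
# one more buffer across the N data-parallel GPUs.

def zero_memory(stage: int, n_gpus: int) -> float:
    params, grads, opt = 1.0, 1.0, 2.0   # figure's 4-Psi baseline
    if stage >= 1:
        opt /= n_gpus                    # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= n_gpus                  # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= n_gpus                 # ZeRO-3: also shard parameters
    return params + grads + opt

baseline = zero_memory(0, 8)   # 4.0 Psi
stage1   = zero_memory(1, 8)   # 2.25 Psi
stage3   = zero_memory(3, 8)   # 0.5 Psi = 4 Psi / N
```

Only ZeRO-3 makes per-GPU memory shrink linearly with N, which is what made fitting GPT-3-scale models on commodity clusters practical.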

How Weights & Data Split Over Cores

Visual comparison of parallelism strategies

Data Parallelism
Model Weights: same full replica on every GPU
Data: split into shards D₀–D₈ across GPUs

Model Parallelism
Model Weights: layers split across GPUs (L0-3 / L4-7 / L8-11)
Data: same batch on all GPUs

Model + Data (2D parallelism)
Model Weights: each shard M₀–M₃ replicated across the data-parallel groups
Data: shards D₀/D₁ replicated across the model-parallel groups

Expert + Data
Model Weights: unique experts E₀–E₈, one per GPU
Data: tokens T₀–T₈ routed to the matching experts

Expert + Model + Data (3D parallelism: Expert × Model × Data)
Model Weights: each expert E₀–E₃ split across model-parallel GPU pairs
Data: shards D₀/D₁ per expert replica group
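The combined layouts above amount to addressing each GPU by a coordinate in a device mesh. A sketch of the 2D model × data case (assumed row-major layout; the names are illustrative):

```python
# Each GPU gets a (model_shard, data_shard) coordinate. GPUs sharing a
# model_shard form a data-parallel all-reduce group; GPUs sharing a
# data_shard form a model-parallel group.

def device_mesh(model_parallel: int, data_parallel: int):
    return {gpu: (gpu // data_parallel, gpu % data_parallel)
            for gpu in range(model_parallel * data_parallel)}

mesh = device_mesh(4, 2)   # 8 GPUs: model shards M0-M3, data shards D0-D1
# GPU 0 -> (0, 0): holds M0, trains on D0
# GPU 5 -> (2, 1): holds M2, trains on D1
```

Adding an expert dimension extends the coordinate to three axes, giving the Expert × Model × Data layout in the last column above.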

Additional Strategies

Specialized techniques for specific use cases

🔀 Hybrid Parallelism

Combines Data + Tensor + Pipeline parallelism for maximum scale. Used by Megatron-DeepSpeed for trillion-parameter models.

Use Case: GPT-4, Llama-3 405B training

💿 Memory Offloading

Moves optimizer states or parameters to CPU RAM or NVMe when not needed, enabling larger models on limited GPU memory.

Use Case: Single-GPU fine-tuning of large models

🔁 Asynchronous Parallelism

GPUs compute independently without synchronization barriers. Faster but may introduce stale gradients.

Use Case: Distributed RL, Federated Learning

🌐 Federated Learning

Training across decentralized devices while keeping data local. Gradients or model updates are aggregated centrally.

Use Case: Privacy-preserving mobile ML

🎛️ Sequence Parallelism

Splits long sequences across GPUs for memory efficiency in attention layers. Complements tensor parallelism.

Use Case: Long-context transformers (128K+)

Activation Checkpointing

Trades compute for memory by recomputing activations during backward pass instead of storing them.

Use Case: Training with limited GPU memory
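A toy sketch of the trade (assumed plain Python functions, no autograd): a checkpointed segment keeps only its input, then re-runs its forward pass when the backward pass needs the intermediate activations:

```python
# Without checkpointing, every intermediate activation is kept for backward.
# With checkpointing, only the segment's input is kept; intermediates are
# recomputed on demand -- extra compute, much less memory.

def forward_no_ckpt(layers, x):
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts            # stores len(layers)+1 activations

def forward_ckpt(layers, x):
    for f in layers:
        x = f(x)
    return x               # stores only the segment's input and output

def recompute(layers, saved_input):
    # Re-run the segment's forward during backward to regenerate activations.
    return forward_no_ckpt(layers, saved_input)

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
full = forward_no_ckpt(layers, 5)      # [5, 6, 12, 9]
out = forward_ckpt(layers, 5)          # 9, intermediates discarded
assert recompute(layers, 5) == full    # recomputation restores them exactly
```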

Choosing the Right Strategy

Decision guide based on your constraints

Does your model fit on a single GPU?
→ Yes: Data Parallelism
→ No: Is it a dense or sparse (MoE) model?
  → Dense: Tensor + Pipeline + ZeRO
  → MoE/Sparse: Expert + Data Parallelism
Small model, large data → Data Parallel
Large model → Model/Tensor/Pipeline
Sparse/MoE → Expert Parallelism
Memory limited → ZeRO / Offloading
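The decision guide above fits in a few lines; a sketch with illustrative names (real deployments usually mix these rather than picking exactly one):

```python
# Encode the two-question decision tree: model size, then density.

def choose_strategy(fits_on_one_gpu: bool, is_moe: bool) -> str:
    if fits_on_one_gpu:
        return "data parallelism"
    if is_moe:
        return "expert + data parallelism"
    return "tensor + pipeline + ZeRO"

choose_strategy(True, False)    # -> 'data parallelism'
choose_strategy(False, True)    # -> 'expert + data parallelism'
```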