Chapter Fourteen

Deployment Patterns

Real-world architectures for AI clusters: topology design, high availability, scaling strategies, and production-grade deployment configurations.

  • Typical topology: 3-tier
  • Availability target: 99.999%
  • Failover time: <50ms
  • GPU scale: 100K+

Common Deployment Topologies

Each topology offers different trade-offs between latency, bandwidth, cost, and failure domains. Spine-leaf is the dominant choice for AI clusters due to predictable latency and easy scaling.

[Diagram: two-tier spine-leaf fabric with two spines, four leaves, and DPU-attached servers]

Spine-Leaf (2-Tier)

The gold standard for AI clusters. Every leaf connects to every spine, providing equal-cost multipath and predictable 3-hop maximum latency.

  • ~500ns latency
  • Up to 10K nodes
  • Non-blocking
  • N+1 redundancy
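The scaling claim can be made concrete with a little port arithmetic. The sketch below is illustrative (the switch radices are assumed, not from the chapter): in a non-blocking two-tier fabric, half of each leaf's ports face servers and half face spines.

```python
def spine_leaf_capacity(leaf_ports: int, spine_ports: int) -> dict:
    """Non-blocking (1:1) two-tier sizing: half of each leaf's ports
    face servers, half face spines."""
    down = leaf_ports // 2        # server-facing ports per leaf
    up = leaf_ports - down        # uplinks per leaf = number of spines
    return {
        "spines": up,
        "max_leaves": spine_ports,          # one spine port per leaf
        "max_servers": spine_ports * down,
        "ecmp_paths": up,                   # equal-cost paths leaf to leaf
    }

# 64-port leaves and spines: 32 spines, 64 leaves, 2,048 servers.
plan = spine_leaf_capacity(64, 64)
```

Radix-128 switches push the same formula to 8,192 servers, which is why a two-tier fabric tops out around the ~10K-node mark.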
[Diagram: 3-tier Clos with a super-spine layer (Spines A, B, C) aggregating Pods 1 through 5]

3-Tier Clos (Fat-Tree)

Extended spine-leaf with super-spine layer for massive scale. Required when GPU count exceeds single spine-leaf capacity (~10K nodes).

  • ~800ns latency
  • Up to 100K+ nodes
  • Full bisection BW
  • Pod isolation
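The classic k-ary fat-tree formulas show why adding a super-spine layer lifts the ceiling so dramatically; the sketch below is standard Clos arithmetic, not a configuration from the chapter.

```python
def fat_tree(k: int) -> dict:
    """Classic 3-tier k-ary fat-tree built from k-port switches:
    k pods, (k/2)^2 core (super-spine) switches, and k^3/4 hosts
    at full bisection bandwidth."""
    assert k % 2 == 0, "fat-tree radix must be even"
    return {
        "pods": k,
        "core_switches": (k // 2) ** 2,
        "agg_per_pod": k // 2,
        "edge_per_pod": k // 2,
        "hosts": k ** 3 // 4,
    }

# Radix-64 switches already clear the two-tier ceiling: 65,536 hosts.
scale = fat_tree(64)
```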
[Diagram: rail-optimized design with Rail 0, Rail 1, and Rail 2 fabrics; DPU-attached servers connect GPU Rail 0, Rail 1, and Rail 2 to their matching fabrics]

Rail-Optimized

Separates GPU traffic by rail number: the NIC of GPU i on every server connects to a dedicated per-rail fabric, eliminating incast at ToR switches during all-reduce (intra-server hops stay on NVLink).

  • ~400ns latency
  • 8-16 rails typical
  • No incast
  • NVIDIA DGX optimized
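The wiring rule is simple enough to state in code. This is a minimal sketch of the rail assignment (the helper name and server counts are illustrative): GPU i on every server lands on fabric i, so all-reduce peers with the same GPU index never cross rails.

```python
def rail_groups(n_servers: int, gpus_per_server: int = 8) -> dict:
    """Rail-optimized wiring: the NIC of GPU i on every server
    attaches to fabric i, so same-index GPUs across servers share
    one rail and never contend at a common ToR."""
    return {rail: [(server, rail) for server in range(n_servers)]
            for rail in range(gpus_per_server)}

# Rail 0 carries the (server, gpu) pairs (0,0), (1,0), (2,0), (3,0).
groups = rail_groups(4)
```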
[Diagram: Dragonfly+ with Groups A, B, and C of DPU routers connected by global links]

Dragonfly+

HPC-inspired topology with local and global links. Nodes within a group communicate directly; inter-group uses high-bandwidth global links.

  • ~600ns local
  • 100K+ scale
  • Low cable cost
  • Adaptive routing
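For intuition on how groups and global links multiply out, the classic dragonfly sizing rule (Kim et al., 2008) is sketched below. Note this is the original dragonfly arithmetic; Dragonfly+ replaces the intra-group all-to-all with a bipartite spine-leaf, which grows scale further, and the parameters here are illustrative.

```python
def dragonfly_size(p: int, a: int, h: int) -> dict:
    """Classic dragonfly sizing: a routers per group, p hosts per
    router, h global links per router; at most a*h + 1 groups,
    since every group needs a global link to every other group."""
    groups = a * h + 1
    return {"groups": groups, "hosts": p * a * groups}

# Balanced configuration (a = 2p = 2h) with 16 routers per group:
size = dragonfly_size(p=8, a=16, h=8)   # 129 groups, 16,512 hosts
```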

Scaling Strategies

DPU-based isolation enables different scaling approaches depending on cluster size, growth trajectory, and workload characteristics.

Pod Scale

8-64 GPUs

Single-rack deployment with direct connectivity. Perfect for development and small training jobs.

Cluster Scale

64-1,024 GPUs

Multi-rack spine-leaf topology. Supports multiple concurrent training jobs with full isolation.

Datacenter Scale

1,024-100K+ GPUs

Multi-tier fabric with super-spine. Foundation model training at frontier scale.
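The three tiers above reduce to a simple selector; the boundaries below are taken directly from the GPU ranges in the text, and the function name is illustrative.

```python
def deployment_tier(gpus: int) -> str:
    """Map GPU count to the deployment tiers described above."""
    if gpus <= 64:
        return "pod"          # single rack, direct connectivity
    if gpus <= 1024:
        return "cluster"      # multi-rack spine-leaf
    return "datacenter"       # multi-tier fabric with super-spine

# A 512-GPU job fits a cluster-scale spine-leaf deployment.
tier = deployment_tier(512)
```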

High Availability Architecture

Five-nines availability requires redundancy at every layer: dual DPUs, multiple paths, and sub-50ms failover with no tenant-visible impact.

Active-Active DPU Redundancy with ECMP

[Diagram: network fabric with Spine A and Spine B (32× 400G ports each) feeding MLAG leaf pairs, one pair per rack (Leaf 1A/1B, 2A/2B, 3A/3B); the DPU layer pairs an active 400G primary with a hot standby in front of GPU servers HGX-1 through HGX-3, each with 8× H100 80GB]
Failover sequence:
  1. Detect: 15-50ms
  2. Notify: 5-10ms
  3. Switch: 10-20ms
  4. Verify: 100-500ms
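Summing the phases bounds the outage a tenant could observe. This is an arithmetic sketch of the sequence above; note that verification runs after traffic has already moved, so only the first three phases are tenant-visible.

```python
# Failover phase durations in ms, from the sequence above.
PHASES = {"detect": (15, 50), "notify": (5, 10),
          "switch": (10, 20), "verify": (100, 500)}

def visible_failover_ms():
    """Bound the tenant-visible outage: detect + notify + switch."""
    visible = ("detect", "notify", "switch")
    lo = sum(PHASES[p][0] for p in visible)
    hi = sum(PHASES[p][1] for p in visible)
    return lo, hi

# Bounds are 30-80 ms: meeting the sub-50ms target therefore requires
# detection to complete near the low end of its 15-50ms range.
bounds = visible_failover_ms()
```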

Topology Comparison

Selecting the right deployment pattern depends on scale requirements, budget constraints, and operational complexity tolerance.

Topology            Scale               Latency      Best For
2-Tier Spine-Leaf   Up to 10K nodes     ~500ns       Enterprise AI clusters
3-Tier Clos         100K+ nodes         ~800ns       Hyperscale training
Rail-Optimized      Any (GPU-focused)   ~400ns       LLM training clusters
Dragonfly+          Exascale            ~600ns-2μs   HPC / Supercomputing

Deployment Best Practices

Lessons learned from production AI cluster deployments with DPU-based tenant isolation.

Plan for Full Bisection

AI workloads require non-blocking fabrics. Oversubscription kills training performance.

  • Use 1:1 oversubscription for training
  • Calculate worst-case all-reduce bandwidth
  • Plan spine capacity for N×leaf uplinks
  • Consider rail-optimized for 8+ GPU/server
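The "worst-case all-reduce bandwidth" bullet can be computed with the standard ring all-reduce formula; the sketch below is illustrative (the model size, GPU count, and NIC speed are assumed, not from the chapter).

```python
def ring_allreduce_bytes_per_gpu(n_gpus: int, grad_bytes: int) -> int:
    """Bytes each GPU transmits in one ring all-reduce:
    2*(N-1)/N of the gradient size (reduce-scatter + all-gather)."""
    return 2 * (n_gpus - 1) * grad_bytes // n_gpus

# Example: ~140 GB of fp16 gradients on 1,024 GPUs, drained through
# one 400 Gb/s (50 GB/s) NIC per GPU.
volume = ring_allreduce_bytes_per_gpu(1024, 140 * 10**9)
seconds_at_line_rate = volume / (50 * 10**9)   # ~5.6 s if non-blocking
```

Any oversubscription stretches that transfer time directly, which is why the fabric must be sized for the full per-GPU volume, not the average.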

Right-Size DPU Deployment

Match DPU count to isolation requirements and traffic patterns.

  • 1 DPU per 4-8 servers typical
  • Dual DPUs for high availability
  • Consider BlueField-4 for >200G/server
  • Pre-provision 10% spare capacity
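The four rules of thumb above combine into a rough sizing calculation, sketched here (the midpoint ratio and the `dpu_plan` helper are illustrative choices, not a vendor formula).

```python
import math

def dpu_plan(servers: int, servers_per_dpu: int = 6,
             dual: bool = True, spare_frac: float = 0.10) -> int:
    """Rough DPU count: one DPU per 4-8 servers (6 as a midpoint),
    doubled for active/standby pairs, plus ~10% pre-provisioned
    spares, rounding up at each step."""
    base = math.ceil(servers / servers_per_dpu)
    if dual:
        base *= 2
    return base + math.ceil(base * spare_frac)

# A 128-server deployment with dual DPUs and spares needs 49 DPUs.
count = dpu_plan(128)
```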

Define Isolation Boundaries

Establish tenant segmentation strategy before deployment.

  • Map tenants to VLAN/VxLAN ranges
  • Define bandwidth guarantees per tenant
  • Plan burst allowances and limits
  • Document emergency override procedures
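A tenant-to-VXLAN map can be captured as plain data and sanity-checked before deployment. The tenant names, VNI blocks, and rate figures below are hypothetical placeholders for illustration.

```python
# Hypothetical tenant plan: each tenant gets a VXLAN VNI block plus a
# guaranteed rate and a burst ceiling enforced at the DPU.
TENANTS = {
    "team-a": {"vni_range": (10000, 10099),
               "guarantee_gbps": 100, "burst_gbps": 200},
    "team-b": {"vni_range": (10100, 10199),
               "guarantee_gbps": 50, "burst_gbps": 100},
}

def vni_ranges_disjoint(tenants: dict) -> bool:
    """Sanity check: no two tenants may share a VNI."""
    spans = sorted(t["vni_range"] for t in tenants.values())
    return all(a[1] < b[0] for a, b in zip(spans, spans[1:]))

# Verify segmentation before pushing config to the fabric.
ok = vni_ranges_disjoint(TENANTS)
```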

Deploy Telemetry First

Comprehensive monitoring is a prerequisite for optimization.

  • Enable DPU hardware counters
  • Deploy μs-resolution flow telemetry
  • Set up alerting before production
  • Baseline performance before tenants
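The "baseline before tenants" bullet implies a simple comparison once tenants arrive; a minimal sketch, with a hypothetical `regressed` helper and a 10% alert threshold chosen for illustration:

```python
def regressed(baseline: float, current: float, tol: float = 0.10) -> bool:
    """Compare a live telemetry reading against the pre-tenant
    baseline, alerting only once it exceeds `tol` (10%) headroom."""
    return current > baseline * (1 + tol)

# A 500 ns fabric-latency baseline tolerates up to 550 ns silently.
alert = regressed(500, 560)
```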