Chapter Fourteen

Deployment Patterns

Real-world architectures for AI clusters: topology design, high availability, scaling strategies, and production-grade deployment configurations.

  • Typical topology: 3-tier
  • Availability target: 99.999%
  • Failover time: <50ms
  • GPU scale: 100K+

Common Deployment Topologies

Each topology offers different trade-offs between latency, bandwidth, cost, and failure domains. Spine-leaf is the dominant choice for AI clusters due to predictable latency and easy scaling.

[Diagram: two-tier spine-leaf fabric with two spines, four leaves, and DPU-attached servers]

Spine-Leaf (2-Tier)

The gold standard for AI clusters. Every leaf connects to every spine, providing equal-cost multipath and predictable 3-hop maximum latency.

  • ~500ns latency
  • Up to 10K nodes
  • Non-blocking
  • N+1 redundancy
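The scaling claim can be made concrete with a little port arithmetic. The sketch below is illustrative (the switch radices are assumed, not from the chapter): in a non-blocking two-tier fabric, half of each leaf's ports face servers and half face spines.

```python
def spine_leaf_capacity(leaf_ports: int, spine_ports: int) -> dict:
    """Non-blocking (1:1) two-tier sizing: half of each leaf's ports
    face servers, half face spines."""
    down = leaf_ports // 2        # server-facing ports per leaf
    up = leaf_ports - down        # uplinks per leaf = number of spines
    return {
        "spines": up,
        "max_leaves": spine_ports,          # one spine port per leaf
        "max_servers": spine_ports * down,
        "ecmp_paths": up,                   # equal-cost paths leaf to leaf
    }

# 64-port leaves and spines: 32 spines, 64 leaves, 2,048 servers.
plan = spine_leaf_capacity(64, 64)
```

Radix-128 switches push the same formula to 8,192 servers, which is why a two-tier fabric tops out around the ~10K-node mark.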
[Diagram: 3-tier Clos with a super-spine layer (Spines A, B, C) aggregating Pods 1 through 5]

3-Tier Clos (Fat-Tree)

Extended spine-leaf with super-spine layer for massive scale. Required when GPU count exceeds single spine-leaf capacity (~10K nodes).

  • ~800ns latency
  • Up to 100K+ nodes
  • Full bisection BW
  • Pod isolation
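The classic k-ary fat-tree formulas show why adding a super-spine layer lifts the ceiling so dramatically; the sketch below is standard Clos arithmetic, not a configuration from the chapter.

```python
def fat_tree(k: int) -> dict:
    """Classic 3-tier k-ary fat-tree built from k-port switches:
    k pods, (k/2)^2 core (super-spine) switches, and k^3/4 hosts
    at full bisection bandwidth."""
    assert k % 2 == 0, "fat-tree radix must be even"
    return {
        "pods": k,
        "core_switches": (k // 2) ** 2,
        "agg_per_pod": k // 2,
        "edge_per_pod": k // 2,
        "hosts": k ** 3 // 4,
    }

# Radix-64 switches already clear the two-tier ceiling: 65,536 hosts.
scale = fat_tree(64)
```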
[Diagram: rail-optimized design with Rail 0, Rail 1, and Rail 2 fabrics; DPU-attached servers connect GPU Rail 0, Rail 1, and Rail 2 to their matching fabrics]

Rail-Optimized

Separates GPU traffic by rail number: the NIC of GPU i on every server connects to a dedicated per-rail fabric, eliminating incast at ToR switches during all-reduce (intra-server hops stay on NVLink).

  • ~400ns latency
  • 8-16 rails typical
  • No incast
  • NVIDIA DGX optimized
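The wiring rule is simple enough to state in code. This is a minimal sketch of the rail assignment (the helper name and server counts are illustrative): GPU i on every server lands on fabric i, so all-reduce peers with the same GPU index never cross rails.

```python
def rail_groups(n_servers: int, gpus_per_server: int = 8) -> dict:
    """Rail-optimized wiring: the NIC of GPU i on every server
    attaches to fabric i, so same-index GPUs across servers share
    one rail and never contend at a common ToR."""
    return {rail: [(server, rail) for server in range(n_servers)]
            for rail in range(gpus_per_server)}

# Rail 0 carries the (server, gpu) pairs (0,0), (1,0), (2,0), (3,0).
groups = rail_groups(4)
```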
[Diagram: Dragonfly+ with Groups A, B, and C of DPU routers connected by global links]

Dragonfly+

HPC-inspired topology with local and global links. Nodes within a group communicate directly; inter-group uses high-bandwidth global links.

  • ~600ns local
  • 100K+ scale
  • Low cable cost
  • Adaptive routing
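For intuition on how groups and global links multiply out, the classic dragonfly sizing rule (Kim et al., 2008) is sketched below. Note this is the original dragonfly arithmetic; Dragonfly+ replaces the intra-group all-to-all with a bipartite spine-leaf, which grows scale further, and the parameters here are illustrative.

```python
def dragonfly_size(p: int, a: int, h: int) -> dict:
    """Classic dragonfly sizing: a routers per group, p hosts per
    router, h global links per router; at most a*h + 1 groups,
    since every group needs a global link to every other group."""
    groups = a * h + 1
    return {"groups": groups, "hosts": p * a * groups}

# Balanced configuration (a = 2p = 2h) with 16 routers per group:
size = dragonfly_size(p=8, a=16, h=8)   # 129 groups, 16,512 hosts
```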

Scaling Strategies

DPU-based isolation enables different scaling approaches depending on cluster size, growth trajectory, and workload characteristics.

Pod Scale

8-64 GPUs

Single-rack deployment with direct connectivity. Perfect for development and small training jobs.

Cluster Scale

64-1,024 GPUs

Multi-rack spine-leaf topology. Supports multiple concurrent training jobs with full isolation.

Datacenter Scale

1,024-100K+ GPUs

Multi-tier fabric with super-spine. Foundation model training at frontier scale.
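The three tiers above reduce to a simple selector; the boundaries below are taken directly from the GPU ranges in the text, and the function name is illustrative.

```python
def deployment_tier(gpus: int) -> str:
    """Map GPU count to the deployment tiers described above."""
    if gpus <= 64:
        return "pod"          # single rack, direct connectivity
    if gpus <= 1024:
        return "cluster"      # multi-rack spine-leaf
    return "datacenter"       # multi-tier fabric with super-spine

# A 512-GPU job fits a cluster-scale spine-leaf deployment.
tier = deployment_tier(512)
```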

High Availability Architecture

Five-nines availability requires redundancy at every layer: dual DPUs, multiple paths, and sub-50ms failover with no tenant-visible impact.

Active-Active DPU Redundancy with ECMP

[Diagram: network fabric with Spine A and Spine B (32× 400G ports each) feeding MLAG leaf pairs, one pair per rack (Leaf 1A/1B, 2A/2B, 3A/3B); the DPU layer pairs an active 400G primary with a hot standby in front of GPU servers HGX-1 through HGX-3, each with 8× H100 80GB]
Failover sequence:
  1. Detect: 15-50ms
  2. Notify: 5-10ms
  3. Switch: 10-20ms
  4. Verify: 100-500ms
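Summing the phases bounds the outage a tenant could observe. This is an arithmetic sketch of the sequence above; note that verification runs after traffic has already moved, so only the first three phases are tenant-visible.

```python
# Failover phase durations in ms, from the sequence above.
PHASES = {"detect": (15, 50), "notify": (5, 10),
          "switch": (10, 20), "verify": (100, 500)}

def visible_failover_ms():
    """Bound the tenant-visible outage: detect + notify + switch."""
    visible = ("detect", "notify", "switch")
    lo = sum(PHASES[p][0] for p in visible)
    hi = sum(PHASES[p][1] for p in visible)
    return lo, hi

# Bounds are 30-80 ms: meeting the sub-50ms target therefore requires
# detection to complete near the low end of its 15-50ms range.
bounds = visible_failover_ms()
```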

Topology Comparison

Selecting the right deployment pattern depends on scale requirements, budget constraints, and operational complexity tolerance.

Topology            Scale               Latency      Best For
2-Tier Spine-Leaf   Up to 10K nodes     ~500ns       Enterprise AI clusters
3-Tier Clos         100K+ nodes         ~800ns       Hyperscale training
Rail-Optimized      Any (GPU-focused)   ~400ns       LLM training clusters
Dragonfly+          Exascale            ~600ns-2μs   HPC / Supercomputing

Deployment Best Practices

Lessons learned from production AI cluster deployments with DPU-based tenant isolation.

Plan for Full Bisection

AI workloads require non-blocking fabrics. Oversubscription kills training performance.

  • Use 1:1 oversubscription for training
  • Calculate worst-case all-reduce bandwidth
  • Plan spine capacity for N×leaf uplinks
  • Consider rail-optimized for 8+ GPU/server
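The "worst-case all-reduce bandwidth" bullet can be computed with the standard ring all-reduce formula; the sketch below is illustrative (the model size, GPU count, and NIC speed are assumed, not from the chapter).

```python
def ring_allreduce_bytes_per_gpu(n_gpus: int, grad_bytes: int) -> int:
    """Bytes each GPU transmits in one ring all-reduce:
    2*(N-1)/N of the gradient size (reduce-scatter + all-gather)."""
    return 2 * (n_gpus - 1) * grad_bytes // n_gpus

# Example: ~140 GB of fp16 gradients on 1,024 GPUs, drained through
# one 400 Gb/s (50 GB/s) NIC per GPU.
volume = ring_allreduce_bytes_per_gpu(1024, 140 * 10**9)
seconds_at_line_rate = volume / (50 * 10**9)   # ~5.6 s if non-blocking
```

Any oversubscription stretches that transfer time directly, which is why the fabric must be sized for the full per-GPU volume, not the average.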

Right-Size DPU Deployment

Match DPU count to isolation requirements and traffic patterns.

  • 1 DPU per 4-8 servers typical
  • Dual DPUs for high availability
  • Consider BlueField-4 for >200G/server
  • Pre-provision 10% spare capacity
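The four rules of thumb above combine into a rough sizing calculation, sketched here (the midpoint ratio and the `dpu_plan` helper are illustrative choices, not a vendor formula).

```python
import math

def dpu_plan(servers: int, servers_per_dpu: int = 6,
             dual: bool = True, spare_frac: float = 0.10) -> int:
    """Rough DPU count: one DPU per 4-8 servers (6 as a midpoint),
    doubled for active/standby pairs, plus ~10% pre-provisioned
    spares, rounding up at each step."""
    base = math.ceil(servers / servers_per_dpu)
    if dual:
        base *= 2
    return base + math.ceil(base * spare_frac)

# A 128-server deployment with dual DPUs and spares needs 49 DPUs.
count = dpu_plan(128)
```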

Define Isolation Boundaries

Establish tenant segmentation strategy before deployment.

  • Map tenants to VLAN/VxLAN ranges
  • Define bandwidth guarantees per tenant
  • Plan burst allowances and limits
  • Document emergency override procedures
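A tenant-to-VXLAN map can be captured as plain data and sanity-checked before deployment. The tenant names, VNI blocks, and rate figures below are hypothetical placeholders for illustration.

```python
# Hypothetical tenant plan: each tenant gets a VXLAN VNI block plus a
# guaranteed rate and a burst ceiling enforced at the DPU.
TENANTS = {
    "team-a": {"vni_range": (10000, 10099),
               "guarantee_gbps": 100, "burst_gbps": 200},
    "team-b": {"vni_range": (10100, 10199),
               "guarantee_gbps": 50, "burst_gbps": 100},
}

def vni_ranges_disjoint(tenants: dict) -> bool:
    """Sanity check: no two tenants may share a VNI."""
    spans = sorted(t["vni_range"] for t in tenants.values())
    return all(a[1] < b[0] for a, b in zip(spans, spans[1:]))

# Verify segmentation before pushing config to the fabric.
ok = vni_ranges_disjoint(TENANTS)
```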

Deploy Telemetry First

Comprehensive monitoring is a prerequisite for optimization.

  • Enable DPU hardware counters
  • Deploy μs-resolution flow telemetry
  • Set up alerting before production
  • Baseline performance before tenants
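The "baseline before tenants" bullet implies a simple comparison once tenants arrive; a minimal sketch, with a hypothetical `regressed` helper and a 10% alert threshold chosen for illustration:

```python
def regressed(baseline: float, current: float, tol: float = 0.10) -> bool:
    """Compare a live telemetry reading against the pre-tenant
    baseline, alerting only once it exceeds `tol` (10%) headroom."""
    return current > baseline * (1 + tol)

# A 500 ns fabric-latency baseline tolerates up to 550 ns silently.
alert = regressed(500, 560)
```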