AWS re:Invent 2025 - SageMaker HyperPod: Checkpointless & Elastic Training for AI Models (AIM3338)

Overview of Amazon SageMaker HyperPod

  • HyperPod is purpose-built infrastructure for foundation model training and deployment
  • It offers:
    • A resilient training environment with automatic health checks and node management
    • A scalable, topology-aware cluster design for optimal training performance
    • Support for a variety of GPU- and Trainium-based instance types
    • Compatibility with popular deep learning frameworks and libraries such as PyTorch, TensorFlow, and NVIDIA NeMo

Challenges with Large-Scale Model Training

  • As model complexity and cluster sizes increase, the probability of node failures also rises significantly
  • With traditional checkpoint-based recovery, cluster downtime can be 15-30 minutes for large clusters (2,000+ GPUs)
  • Larger clusters are often shared by multiple teams, leading to uneven utilization and idle capacity

Elastic Training on Amazon SageMaker HyperPod

  • Enables training jobs to dynamically scale up and down based on available cluster capacity
  • Automatically scales up when free capacity becomes available, and scales back down when resources are reclaimed by higher-priority workloads
  • Preserves training convergence by maintaining a constant global batch size
  • Simplifies operations by eliminating the need to manually manage cluster utilization
  • Architectural details:
    • Continuous cluster monitoring for scale-up notifications
    • Graceful preemption to allow lower-priority jobs to run with reduced resources
    • Automatic training stack reconfiguration to adjust parameters for different world sizes
    • Automated workload management to respect administrator-defined policies and quotas
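The convergence-preserving behavior above hinges on one invariant: as the world size changes, the per-device micro-batch and gradient-accumulation steps are rebalanced so the global batch size stays constant. A minimal sketch of that rebalancing logic (the function name and parameters are illustrative, not HyperPod's actual API):

```python
def reconfigure(global_batch: int, world_size: int, max_micro_batch: int):
    """Pick a per-device micro-batch size and gradient-accumulation step count
    so that world_size * micro_batch * accum_steps == global_batch,
    keeping the effective global batch constant across scale events."""
    assert global_batch % world_size == 0, "global batch must divide evenly across devices"
    per_device = global_batch // world_size
    # Largest micro-batch <= max_micro_batch (e.g. a memory limit) that
    # divides the per-device share evenly.
    micro = max(m for m in range(1, max_micro_batch + 1) if per_device % m == 0)
    accum = per_device // micro
    return micro, accum

# Example: scaling down from 64 to 32 devices doubles accumulation steps,
# so the global batch of 1024 is unchanged.
print(reconfigure(1024, 64, 8))
print(reconfigure(1024, 32, 8))
```

When the scheduler preempts or grants nodes, calling a routine like this before resuming the loop keeps the optimizer's effective batch statistics (and thus the learning-rate schedule) valid.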

Checkpointless Training on Amazon SageMaker HyperPod

  • Eliminates the need for traditional checkpoint-based recovery, which can take hours on large clusters
  • Enables sub-minute recovery times, reducing training downtime and improving overall cluster utilization
  • Key innovations:
    • Optimized collective communication initialization using peer-to-peer connections
    • Memory-mapped data loading to avoid repeated data preprocessing during recovery
    • In-process recovery by replacing failed processes with hot spares
    • Checkpointless recovery using direct memory hydration from healthy processes
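The memory-mapped data loading idea above can be illustrated with the standard library alone: tokenized data is written to a flat binary file once, and after a failure the recovering process maps the file into memory instead of re-running preprocessing. This is a simplified stand-in for HyperPod's actual loader, not its implementation:

```python
import array
import mmap
import os
import tempfile

def write_token_file(path: str, tokens: list[int]) -> None:
    """One-time preprocessing: persist token ids as raw int32."""
    with open(path, "wb") as f:
        array.array("i", tokens).tofile(f)

def map_tokens(path: str) -> memoryview:
    """Recovery path: map the file read-only instead of re-tokenizing.
    The OS page cache makes repeated opens near-instant."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mm).cast("i")

path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
write_token_file(path, list(range(10)))
view = map_tokens(path)
print(view[3])  # -> 3
```

Because the mapping is lazy, a restarted worker can begin serving batches immediately rather than waiting for a full preprocessing pass.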

Adoption and Implementation

  • For standard architectures such as Llama, Qwen, and DeepSeek, users can get started with zero code changes using HyperPod recipes
  • For custom training scripts, users can incrementally adopt the elastic and checkpointless features by:
    • Setting environment variables for optimized collective communication
    • Integrating the memory-mapped data loader library
    • Modifying the training loop to use the HyperPod-provided checkpointless strategy and wrapper
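The last adoption step, wrapping the training loop in a checkpointless recovery strategy, follows a pattern like the sketch below. All class and method names here are hypothetical stand-ins; the real HyperPod library's API will differ, but the shape of the change to a custom script is similar: each step runs under a strategy that intercepts failures and triggers in-process recovery instead of restarting from a checkpoint.

```python
from contextlib import contextmanager

class CheckpointlessStrategy:
    """Hypothetical stand-in for a recovery strategy: on a step failure it
    would re-hydrate model/optimizer state from a healthy peer, rather than
    reloading a checkpoint from storage."""
    def __init__(self):
        self.recoveries = 0

    @contextmanager
    def step(self):
        try:
            yield
        except RuntimeError:
            # In a real implementation this is where peer-to-peer state
            # hydration and collective re-initialization would occur.
            self.recoveries += 1

def train(strategy: CheckpointlessStrategy, steps: int) -> int:
    """Toy loop: count steps that complete; a simulated fault at step 2
    is absorbed by the strategy instead of crashing the job."""
    done = 0
    for i in range(steps):
        with strategy.step():
            if i == 2 and strategy.recoveries == 0:
                raise RuntimeError("simulated device fault")
            done += 1
    return done
```

The point of the pattern is that failure handling moves out of the user's script and into the strategy object, so an existing loop needs only the wrapping context, not a rewrite.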

Salesforce's Use Case and LZ Penalty

  • Salesforce AI Research has been using HyperPod for over 2 years, managing a heterogeneous workload of LLM training, fine-tuning, RL, and batch inference
  • Developed an LZ-based penalty to address the problem of degenerate repetitions in language model sampling
    • LZ penalty leverages the universal data compression properties of the LZ algorithm to detect and penalize repetitive outputs
    • Empirically shown to eliminate repetitions while maintaining model accuracy, with negligible impact on inference performance

Key Takeaways

  • Elastic and checkpointless training on HyperPod enable unprecedented levels of training efficiency and resilience
  • These features address fundamental challenges of large-scale model training, such as node failures and uneven cluster utilization
  • Adoption is streamlined through pre-built recipes and modular integration with custom training scripts
  • Real-world use cases like Salesforce's demonstrate the practical benefits of these innovations in production environments
