AWS re:Invent 2025 - SageMaker HyperPod: Checkpointless & Elastic Training for AI Models (AIM3338)
Overview of Amazon SageMaker HyperPod
HyperPod is purpose-built infrastructure for foundation model training and deployment
It offers:
Resilient training environment with automatic health checks and management
Scalable cluster topology optimized for training performance
Support for a variety of GPU and Trainium-based instance types
Compatibility with popular DL frameworks like PyTorch, TensorFlow, and NVIDIA NeMo
Challenges with Large-Scale Model Training
As model complexity and cluster sizes increase, the probability of node failures also rises significantly
With traditional checkpoint-based recovery, cluster downtime can be 15-30 minutes for large clusters (2,000+ GPUs)
Larger clusters are often shared by multiple teams, leading to uneven utilization and idle capacity
Elastic Training on Amazon SageMaker HyperPod
Enables training jobs to dynamically scale up and down based on available cluster capacity
Automatically scales up when free resources become available, and scales back down when resources are needed by higher priority workloads
Preserves training convergence by maintaining a constant global batch size
Simplifies operations by eliminating the need to manually manage cluster utilization
Architectural details:
Continuous cluster monitoring for scale-up notifications
Graceful preemption to allow lower-priority jobs to run with reduced resources
Automatic training stack reconfiguration to adjust parameters for different world sizes
Automated workload management to respect administrator-defined policies and quotas
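The constant-global-batch-size point above can be made concrete with a small sketch. This is an illustrative calculation, not HyperPod's actual reconfiguration code: when the number of data-parallel workers changes, gradient-accumulation steps are recomputed so the effective global batch size (and therefore convergence behavior) stays fixed.

```python
def rebalance(global_batch_size: int, micro_batch_size: int, world_size: int) -> int:
    """Return the gradient-accumulation steps that keep the global batch
    size constant as the number of data-parallel workers changes."""
    per_step = micro_batch_size * world_size  # samples consumed per optimizer micro-step
    if global_batch_size % per_step != 0:
        raise ValueError("global batch size must divide evenly across workers")
    return global_batch_size // per_step

# 2048-sample global batch, micro-batch of 4 per GPU:
print(rebalance(2048, 4, 64))  # 64 GPUs -> 8 accumulation steps
print(rebalance(2048, 4, 32))  # preempted to 32 GPUs -> 16 steps, same global batch
```

Scaling from 64 to 32 GPUs doubles the accumulation steps, so each optimizer update still sees 2,048 samples.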
Checkpointless Training on Amazon SageMaker HyperPod
Eliminates the need for traditional checkpoint-based recovery, which can take hours on large clusters
Enables sub-minute recovery times, reducing training downtime and improving overall cluster utilization
Key innovations:
Optimized collective communication initialization using peer-to-peer connections
Memory-mapped data loading to avoid repeated data preprocessing during recovery
In-process recovery by replacing failed processes with hot spares
Checkpointless recovery using direct memory hydration from healthy processes
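The memory-mapped data loading idea can be sketched with NumPy's `memmap`. This is a minimal illustration of the principle, not the HyperPod data-loader library itself: preprocessed tokens are written to disk once, and a recovering process maps the file back in lazily instead of re-running preprocessing.

```python
import numpy as np

# Write pre-tokenized samples once, up front.
tokens = np.arange(10_000, dtype=np.int32)
tokens.tofile("/tmp/tokens.bin")

# Memory-mapped view: the OS pages data in on demand, so a recovering
# process can resume reading mid-epoch without a full-file read or
# repeating tokenization.
mm = np.memmap("/tmp/tokens.bin", dtype=np.int32, mode="r")
seq_len = 128
batch = mm[0:seq_len]  # slice materializes only the pages it touches
print(batch.shape)
```

Because the mapping is read-only and backed by the file, many worker processes on the same host can share the same physical pages.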
Adoption and Implementation
For standard architectures like Llama, Qwen, and DeepSeek, users can get started with zero code changes using HyperPod recipes
For custom training scripts, users can incrementally adopt the elastic and checkpointless features by:
Setting environment variables for optimized collective communication
Integrating the memory-mapped data loader library
Modifying the training loop to use the HyperPod-provided checkpointless strategy and wrapper
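The training-loop modification in the last step might look roughly like the sketch below. All names here (`CheckpointlessStrategy`, `recover_from_peer`) are placeholders, not documented HyperPod APIs; the point is only the shape of the integration: wrap the step function so a failure triggers peer-based state recovery instead of a checkpoint reload.

```python
import os

# Step 1 (illustrative): env vars tuned for faster collective init.
# NCCL_SOCKET_IFNAME is a real NCCL setting; the value is an example.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

class CheckpointlessStrategy:
    """Stand-in for a HyperPod-provided strategy object: on failure it
    would re-hydrate model/optimizer state from a healthy peer rather
    than reloading a checkpoint from storage."""

    def wrap_step(self, step_fn):
        def wrapped(batch):
            try:
                return step_fn(batch)
            except RuntimeError:
                self.recover_from_peer()  # hypothetical recovery hook
                return step_fn(batch)     # retry the step after hydration
        return wrapped

    def recover_from_peer(self):
        pass  # placeholder: real logic copies state over the network

strategy = CheckpointlessStrategy()
train_step = strategy.wrap_step(lambda batch: sum(batch))
print(train_step([1, 2, 3]))
```

The existing loop body stays intact; only the step invocation is routed through the strategy's wrapper.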
Salesforce's Use Case and LZ Penalty
Salesforce AI Research has been using HyperPod for over 2 years, managing a heterogeneous workload of LLM training, fine-tuning, RL, and batch inference
Developed an LZ-based penalty to address the problem of degenerate repetitions in language model sampling
LZ penalty leverages the universal data compression properties of the LZ algorithm to detect and penalize repetitive outputs
Empirically shown to eliminate repetitions while maintaining model accuracy, with negligible impact on inference performance
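A toy version of an LZ-style repetition penalty can convey the mechanism. This is an illustrative sketch, not Salesforce's exact formulation: a candidate token is penalized in proportion to how long an LZ77-style copy of earlier output it would extend, so degenerate repetitions become progressively less likely.

```python
def lz_match_length(history, candidate):
    """Length of the longest suffix of history + [candidate] that also
    occurs earlier in the sequence, i.e. how far an LZ77-style copy of
    past output this candidate would extend."""
    seq = history + [candidate]
    hay = seq[:-1]  # only earlier positions count as match sources
    for n in range(len(seq) - 1, 0, -1):  # try longest suffix first
        suffix = seq[-n:]
        for i in range(len(hay) - n + 1):
            if hay[i:i + n] == suffix:
                return n
    return 0

def lz_penalty(logits, history, alpha=0.5):
    """Subtract alpha * match_length from each candidate token's logit
    (token ids are taken to be the indices of `logits`)."""
    return [l - alpha * lz_match_length(history, t)
            for t, l in enumerate(logits)]

# Token 3 would complete a repeat of the earlier "1 2 3" run, so it is
# penalized hardest.
print(lz_penalty([1.0, 1.0, 1.0, 1.0], history=[1, 2, 3, 1, 2]))
```

The brute-force suffix search here is O(n^2) per token; a production implementation would maintain an incremental LZ dictionary, which is what keeps the inference overhead negligible.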
Key Takeaways
Elastic and checkpointless training on HyperPod enable unprecedented levels of training efficiency and resilience
These features address fundamental challenges of large-scale model training, such as node failures and uneven cluster utilization
Adoption is streamlined through pre-built recipes and modular integration with custom training scripts
Real-world use cases like Salesforce's demonstrate the practical benefits of these innovations in production environments