AWS re:Invent 2025 - SageMaker HyperPod: Checkpointless & Elastic Training for AI Models (AIM3338)

Overview of Amazon SageMaker HyperPod

  • HyperPod is purpose-built infrastructure for foundation model training and deployment
  • It offers:
    • A resilient training environment with automatic health checks and node management
    • A scalable, topology-aware cluster design for optimal training performance
    • Support for a variety of GPU- and Trainium-based instance types
    • Compatibility with popular deep learning frameworks and libraries such as PyTorch, TensorFlow, and NVIDIA NeMo

Challenges with Large-Scale Model Training

  • As model complexity and cluster sizes increase, the probability of node failures also rises significantly
  • With traditional checkpoint-based recovery, cluster downtime can be 15-30 minutes for large clusters (2,000+ GPUs)
  • Larger clusters are often shared by multiple teams, leading to uneven utilization and idle capacity

Elastic Training on Amazon SageMaker HyperPod

  • Enables training jobs to dynamically scale up and down based on available cluster capacity
  • Automatically scales up when free capacity becomes available, and scales back down when resources are reclaimed by higher-priority workloads
  • Preserves training convergence by maintaining a constant global batch size
  • Simplifies operations by eliminating the need to manually manage cluster utilization
  • Architectural details:
    • Continuous cluster monitoring for scale-up notifications
    • Graceful preemption to allow lower-priority jobs to run with reduced resources
    • Automatic training stack reconfiguration to adjust parameters for different world sizes
    • Automated workload management to respect administrator-defined policies and quotas
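The convergence-preserving behavior above hinges on one invariant: as the world size changes, the per-device micro-batch and gradient-accumulation steps are rebalanced so the global batch size stays constant. A minimal sketch of that rebalancing logic (the function name and parameters are illustrative, not HyperPod's actual API):

```python
def reconfigure(global_batch: int, world_size: int, max_micro_batch: int):
    """Pick a per-device micro-batch size and gradient-accumulation step count
    so that world_size * micro_batch * accum_steps == global_batch,
    keeping the effective global batch constant across scale events."""
    assert global_batch % world_size == 0, "global batch must divide evenly across devices"
    per_device = global_batch // world_size
    # Largest micro-batch <= max_micro_batch (e.g. a memory limit) that
    # divides the per-device share evenly.
    micro = max(m for m in range(1, max_micro_batch + 1) if per_device % m == 0)
    accum = per_device // micro
    return micro, accum

# Example: scaling down from 64 to 32 devices doubles accumulation steps,
# so the global batch of 1024 is unchanged.
print(reconfigure(1024, 64, 8))
print(reconfigure(1024, 32, 8))
```

When the scheduler preempts or grants nodes, calling a routine like this before resuming the loop keeps the optimizer's effective batch statistics (and thus the learning-rate schedule) valid.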

Checkpointless Training on Amazon SageMaker HyperPod

  • Eliminates the need for traditional checkpoint-based recovery, which can take hours on large clusters
  • Enables sub-minute recovery times, reducing training downtime and improving overall cluster utilization
  • Key innovations:
    • Optimized collective communication initialization using peer-to-peer connections
    • Memory-mapped data loading to avoid repeated data preprocessing during recovery
    • In-process recovery by replacing failed processes with hot spares
    • Checkpointless recovery using direct memory hydration from healthy processes
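The memory-mapped data loading idea above can be illustrated with the standard library alone: tokenized data is written to a flat binary file once, and after a failure the recovering process maps the file into memory instead of re-running preprocessing. This is a simplified stand-in for HyperPod's actual loader, not its implementation:

```python
import array
import mmap
import os
import tempfile

def write_token_file(path: str, tokens: list[int]) -> None:
    """One-time preprocessing: persist token ids as raw int32."""
    with open(path, "wb") as f:
        array.array("i", tokens).tofile(f)

def map_tokens(path: str) -> memoryview:
    """Recovery path: map the file read-only instead of re-tokenizing.
    The OS page cache makes repeated opens near-instant."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mm).cast("i")

path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
write_token_file(path, list(range(10)))
view = map_tokens(path)
print(view[3])  # -> 3
```

Because the mapping is lazy, a restarted worker can begin serving batches immediately rather than waiting for a full preprocessing pass.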

Adoption and Implementation

  • For standard architectures such as Llama, Qwen, and DeepSeek, users can get started with zero code changes using HyperPod recipes
  • For custom training scripts, users can incrementally adopt the elastic and checkpointless features by:
    • Setting environment variables for optimized collective communication
    • Integrating the memory-mapped data loader library
    • Modifying the training loop to use the HyperPod-provided checkpointless strategy and wrapper
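The last adoption step, wrapping the training loop in a checkpointless recovery strategy, follows a pattern like the sketch below. All class and method names here are hypothetical stand-ins; the real HyperPod library's API will differ, but the shape of the change to a custom script is similar: each step runs under a strategy that intercepts failures and triggers in-process recovery instead of restarting from a checkpoint.

```python
from contextlib import contextmanager

class CheckpointlessStrategy:
    """Hypothetical stand-in for a recovery strategy: on a step failure it
    would re-hydrate model/optimizer state from a healthy peer, rather than
    reloading a checkpoint from storage."""
    def __init__(self):
        self.recoveries = 0

    @contextmanager
    def step(self):
        try:
            yield
        except RuntimeError:
            # In a real implementation this is where peer-to-peer state
            # hydration and collective re-initialization would occur.
            self.recoveries += 1

def train(strategy: CheckpointlessStrategy, steps: int) -> int:
    """Toy loop: count steps that complete; a simulated fault at step 2
    is absorbed by the strategy instead of crashing the job."""
    done = 0
    for i in range(steps):
        with strategy.step():
            if i == 2 and strategy.recoveries == 0:
                raise RuntimeError("simulated device fault")
            done += 1
    return done
```

The point of the pattern is that failure handling moves out of the user's script and into the strategy object, so an existing loop needs only the wrapping context, not a rewrite.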

Salesforce's Use Case and LZ Penalty

  • Salesforce AI Research has been using HyperPod for over 2 years, managing a heterogeneous workload of LLM training, fine-tuning, RL, and batch inference
  • Developed an LZ-based penalty to address the problem of degenerate repetitions in language model sampling
    • LZ penalty leverages the universal data compression properties of the LZ algorithm to detect and penalize repetitive outputs
    • Empirically shown to eliminate repetitions while maintaining model accuracy, with negligible impact on inference performance

Key Takeaways

  • Elastic and checkpointless training on HyperPod enable unprecedented levels of training efficiency and resilience
  • These features address fundamental challenges of large-scale model training, such as node failures and uneven cluster utilization
  • Adoption is streamlined through pre-built recipes and modular integration with custom training scripts
  • Real-world use cases like Salesforce's demonstrate the practical benefits of these innovations in production environments
