AWS re:Invent 2025 - Accelerate AI workloads with UltraServers on Amazon SageMaker HyperPod (AIM362)

Accelerating AI Workloads with UltraServers on Amazon SageMaker HyperPod

Challenges in Generative AI Development

  • Difficulty accessing compute resources on-demand due to high demand for accelerated instances
  • Inefficient allocation of compute resources across teams leading to underutilization
  • Memory and communication bottlenecks when training large foundation models
  • Hardware failures disrupting distributed training workflows and causing downtime

Amazon SageMaker HyperPod

  • Flexible training plans to access compute resources on-demand for short-term needs
  • Task governance capabilities to efficiently allocate and prioritize compute across teams
  • Training recipes for pre-optimized distributed training configurations
  • Proactive health monitoring and auto-recovery to mitigate hardware failures
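The auto-recovery bullet above rests on a simple pattern: training state is checkpointed periodically so that when a node is replaced after a failure, the job resumes from the last checkpoint rather than from step zero. Below is a minimal, framework-free sketch of that pattern; the function names and JSON checkpoint format are illustrative, not the HyperPod API.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    # Write to a temp file and rename atomically, so a crash mid-write
    # never leaves a corrupted checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, {}  # fresh start: no checkpoint yet
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, fail_at=None):
    # Resume from the last checkpoint (or step 0 on first launch).
    step, state = load_checkpoint(path)
    while step < total_steps:
        if step == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training update
        step += 1
        save_checkpoint(path, step, state)
    return step, state

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        train(path, total_steps=10, fail_at=6)  # first attempt "fails" at step 6
    except RuntimeError:
        pass
    step, _ = train(path, total_steps=10)  # replacement node resumes from step 6
    print(step)  # finishes at step 10 without redoing the first 6 steps
```

In a real distributed job the state would be model and optimizer shards written to shared storage, and the orchestrator (here, HyperPod) handles detecting the failure and relaunching the job.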

EC2 UltraServers

  • Based on AWS-designed Trainium chips or NVIDIA GPUs
  • High-speed interconnect fabric (e.g., NVIDIA NVLink via NVSwitch) linking multiple servers into a single high-bandwidth compute domain
  • NVIDIA Superchips pairing an Arm-based CPU with GPUs for coherent, efficient memory access
  • Topology-aware scheduling in SageMaker HyperPod to co-locate data and compute

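The topology-aware scheduling bullet can be made concrete with a toy rank-grouping sketch: ranks on the same UltraServer (the NVLink domain) form one group for bandwidth-heavy collectives, while one leader rank per server joins a cross-server group that communicates over EFA. The function name and grouping scheme below are illustrative assumptions, not a HyperPod or NCCL API.

```python
# Toy sketch of topology-aware communication grouping.
# Assumption: ranks are numbered contiguously per server, which is the
# usual convention in distributed launchers.

def build_groups(world_size, gpus_per_server):
    # Intra-server groups: ranks sharing a fast NVLink domain.
    intra = [list(range(s, s + gpus_per_server))
             for s in range(0, world_size, gpus_per_server)]
    # Inter-server group: one leader rank per server carries EFA traffic.
    inter = [g[0] for g in intra]
    return intra, inter

if __name__ == "__main__":
    intra, inter = build_groups(world_size=8, gpus_per_server=4)
    print(intra)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(inter)  # [0, 4]
```

The design point is the one the bullet makes: a scheduler that knows the topology can place the chattiest ranks inside one NVLink domain and reserve the slower inter-server links for aggregated traffic.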
Mixture of Experts on UltraServers

  • Partitioning neural networks across multiple GPU "experts" to specialize on different domains
  • Leveraging UltraServer's high-bandwidth, low-latency interconnect to enable efficient expert-to-expert communication
  • Scaling to 25,000+ GPUs by interconnecting multiple UltraServer clusters using Elastic Fabric Adapter (EFA)
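The MoE bullets above can be sketched with a tiny, framework-free top-k routing example: a gating function scores every expert for a token, only the top-k experts run, and their outputs are combined with softmax-normalized gate weights. The expert and gate definitions here are toy assumptions for illustration; in a real deployment the experts live on different GPUs, and routing a token to a remote expert is exactly the all-to-all traffic that UltraServer's interconnect accelerates.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, experts, gate_weights, k=2):
    # Gate: one score per expert (a simple dot product with the token here).
    scores = [sum(w * t for w, t in zip(gw, token)) for gw in gate_weights]
    # Select the k highest-scoring experts for this token.
    topk = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in topk])
    # Only the selected experts run; outputs are probability-weighted.
    out = [0.0] * len(token)
    for p, i in zip(probs, topk):
        expert_out = experts[i](token)
        out = [o + p * e for o, e in zip(out, expert_out)]
    return out, topk

if __name__ == "__main__":
    # Toy experts: each just scales its input by a fixed factor.
    experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
    gate_weights = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.2], [0.0, 0.8]]
    out, chosen = moe_forward([1.0, 1.0], experts, gate_weights, k=2)
    print(chosen)  # experts 1 and 3 win the gate for this token
```

Because only k of N experts run per token, total parameter count can grow far beyond what any single forward pass touches, which is why the talk pairs MoE with the UltraServer interconnect: the compute stays sparse while the routing traffic stays fast.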

Key Takeaways

  • SageMaker HyperPod provides a comprehensive solution to address the unique challenges of generative AI development
  • EC2 UltraServers deliver unprecedented compute power by combining specialized hardware and high-speed interconnects
  • Mixture of Experts can further boost training efficiency by partitioning models across specialized GPU "experts"
  • Customers like Ryder have seen 3x speedups in their model training by leveraging SageMaker HyperPod

Resources

  • AI on SageMaker HyperPod GitHub repository: Guides and examples for deploying HyperPod clusters
  • Awesome Distributed Training GitHub repository: Benchmarks and sample code for distributed training on HyperPod
