AWS re:Invent 2025 - Accelerate AI workloads with UltraServers on Amazon SageMaker HyperPod (AIM362)

Accelerating AI Workloads with UltraServers on Amazon SageMaker HyperPod

Challenges in Generative AI Development

  • Difficulty accessing compute resources on-demand due to high demand for accelerated instances
  • Inefficient allocation of compute resources across teams leading to underutilization
  • Memory and communication bottlenecks when training large foundation models
  • Hardware failures disrupting distributed training workflows and causing downtime

Amazon SageMaker HyperPod

  • Flexible training plans to access compute resources on-demand for short-term needs
  • Task governance capabilities to efficiently allocate and prioritize compute across teams
  • Training recipes for pre-optimized distributed training configurations
  • Proactive health monitoring and auto-recovery to mitigate hardware failures
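The auto-recovery bullet above rests on a simple pattern: training state is checkpointed periodically so that when a node is replaced after a failure, the job resumes from the last checkpoint rather than from step zero. Below is a minimal, framework-free sketch of that pattern; the function names and JSON checkpoint format are illustrative, not the HyperPod API.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    # Write to a temp file and rename atomically, so a crash mid-write
    # never leaves a corrupted checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, {}  # fresh start: no checkpoint yet
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, fail_at=None):
    # Resume from the last checkpoint (or step 0 on first launch).
    step, state = load_checkpoint(path)
    while step < total_steps:
        if step == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training update
        step += 1
        save_checkpoint(path, step, state)
    return step, state

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        train(path, total_steps=10, fail_at=6)  # first attempt "fails" at step 6
    except RuntimeError:
        pass
    step, _ = train(path, total_steps=10)  # replacement node resumes from step 6
    print(step)  # finishes at step 10 without redoing the first 6 steps
```

In a real distributed job the state would be model and optimizer shards written to shared storage, and the orchestrator (here, HyperPod) handles detecting the failure and relaunching the job.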

EC2 UltraServers

  • Based on AWS-designed Trainium chips or NVIDIA GPUs
  • High-speed interconnect fabric (e.g., NVIDIA NVLink via NVSwitch) linking multiple servers into a single high-bandwidth compute domain
  • NVIDIA Superchips pairing an Arm-based CPU with GPUs for coherent, efficient memory access
  • Topology-aware scheduling in SageMaker HyperPod to co-locate data and compute

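The topology-aware scheduling bullet can be made concrete with a toy rank-grouping sketch: ranks on the same UltraServer (the NVLink domain) form one group for bandwidth-heavy collectives, while one leader rank per server joins a cross-server group that communicates over EFA. The function name and grouping scheme below are illustrative assumptions, not a HyperPod or NCCL API.

```python
# Toy sketch of topology-aware communication grouping.
# Assumption: ranks are numbered contiguously per server, which is the
# usual convention in distributed launchers.

def build_groups(world_size, gpus_per_server):
    # Intra-server groups: ranks sharing a fast NVLink domain.
    intra = [list(range(s, s + gpus_per_server))
             for s in range(0, world_size, gpus_per_server)]
    # Inter-server group: one leader rank per server carries EFA traffic.
    inter = [g[0] for g in intra]
    return intra, inter

if __name__ == "__main__":
    intra, inter = build_groups(world_size=8, gpus_per_server=4)
    print(intra)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(inter)  # [0, 4]
```

The design point is the one the bullet makes: a scheduler that knows the topology can place the chattiest ranks inside one NVLink domain and reserve the slower inter-server links for aggregated traffic.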
Mixture of Experts on UltraServers

  • Partitioning neural networks across multiple GPU "experts" to specialize on different domains
  • Leveraging UltraServer's high-bandwidth, low-latency interconnect to enable efficient expert-to-expert communication
  • Scaling to 25,000+ GPUs by interconnecting multiple UltraServer clusters using Elastic Fabric Adapter (EFA)
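The MoE bullets above can be sketched with a tiny, framework-free top-k routing example: a gating function scores every expert for a token, only the top-k experts run, and their outputs are combined with softmax-normalized gate weights. The expert and gate definitions here are toy assumptions for illustration; in a real deployment the experts live on different GPUs, and routing a token to a remote expert is exactly the all-to-all traffic that UltraServer's interconnect accelerates.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, experts, gate_weights, k=2):
    # Gate: one score per expert (a simple dot product with the token here).
    scores = [sum(w * t for w, t in zip(gw, token)) for gw in gate_weights]
    # Select the k highest-scoring experts for this token.
    topk = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in topk])
    # Only the selected experts run; outputs are probability-weighted.
    out = [0.0] * len(token)
    for p, i in zip(probs, topk):
        expert_out = experts[i](token)
        out = [o + p * e for o, e in zip(out, expert_out)]
    return out, topk

if __name__ == "__main__":
    # Toy experts: each just scales its input by a fixed factor.
    experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
    gate_weights = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.2], [0.0, 0.8]]
    out, chosen = moe_forward([1.0, 1.0], experts, gate_weights, k=2)
    print(chosen)  # experts 1 and 3 win the gate for this token
```

Because only k of N experts run per token, total parameter count can grow far beyond what any single forward pass touches, which is why the talk pairs MoE with the UltraServer interconnect: the compute stays sparse while the routing traffic stays fast.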

Key Takeaways

  • SageMaker HyperPod provides a comprehensive solution to address the unique challenges of generative AI development
  • EC2 UltraServers deliver unprecedented compute power by combining specialized hardware and high-speed interconnects
  • Mixture of Experts can further boost training efficiency by partitioning models across specialized GPU "experts"
  • Customers like Ryder have seen 3x speedups in their model training by leveraging SageMaker HyperPod

Resources

  • AI on SageMaker HyperPod GitHub repository: Guides and examples for deploying HyperPod clusters
  • Awesome Distributed Training GitHub repository: Benchmarks and sample code for distributed training on HyperPod
