AWS re:Invent 2025 - Accelerate AI workloads with UltraServers on Amazon SageMaker HyperPod (AIM362)
Challenges in Generative AI Development
Difficulty accessing compute resources on-demand due to high demand for accelerated instances
Inefficient allocation of compute resources across teams leading to underutilization
Memory and communication bottlenecks when training large foundation models
Hardware failures disrupting distributed training workflows and causing downtime
Amazon SageMaker HyperPod
Flexible training plans to access compute resources on-demand for short-term needs
Task governance capabilities to efficiently allocate and prioritize compute across teams
Training recipes for pre-optimized distributed training configurations
Proactive health monitoring and auto-recovery to mitigate hardware failures
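As a rough sketch of how a HyperPod cluster is provisioned, the snippet below assembles a request for the boto3 SageMaker `create_cluster` API. All names, instance types, counts, and S3/IAM paths are placeholder assumptions; the code only builds the request dictionary and does not call AWS:

```python
# Sketch: assembling a SageMaker HyperPod create_cluster request.
# Cluster name, instance types/counts, S3 URIs, and the IAM role ARN
# are placeholder assumptions for illustration.
import json


def build_hyperpod_cluster_request(cluster_name: str) -> dict:
    """Build the request body for sagemaker.create_cluster (HyperPod)."""
    lifecycle = {
        # Lifecycle scripts bootstrap the scheduler, health checks, etc.
        "SourceS3Uri": "s3://my-bucket/hyperpod/lifecycle/",
        "OnCreate": "on_create.sh",
    }
    role = "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            {
                "InstanceGroupName": "controller",
                "InstanceType": "ml.m5.xlarge",    # head/controller node
                "InstanceCount": 1,
                "LifeCycleConfig": lifecycle,
                "ExecutionRole": role,
            },
            {
                "InstanceGroupName": "workers",
                "InstanceType": "ml.p5.48xlarge",  # GPU training nodes
                "InstanceCount": 4,
                "LifeCycleConfig": lifecycle,
                "ExecutionRole": role,
            },
        ],
    }


if __name__ == "__main__":
    request = build_hyperpod_cluster_request("demo-cluster")
    print(json.dumps(request, indent=2))
    # To actually create the cluster (requires AWS credentials and quota):
    # import boto3
    # boto3.client("sagemaker").create_cluster(**request)
```

HyperPod's health monitoring and auto-resume then operate on the cluster this request defines, replacing faulty nodes and restarting jobs from the last checkpoint.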
EC2 UltraServers
Based on AWS-designed Trainium chips or NVIDIA GPUs
Dedicated high-speed interconnects (NVLink/NVSwitch on NVIDIA-based systems, NeuronLink on Trainium) that link multiple servers into a single UltraServer
NVIDIA Grace Blackwell Superchips pairing an Arm-based Grace CPU with Blackwell GPUs for efficient CPU-GPU memory access
Topology-aware scheduling in SageMaker HyperPod to co-locate data and compute
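The co-location idea can be illustrated with a toy rank-mapping helper (this is not a HyperPod API; the per-server GPU count of 72 is an assumption matching an NVL72-style UltraServer): peers whose global ranks map to the same server can use the fast in-server interconnect, while cross-server traffic goes over EFA.

```python
# Toy illustration of topology-aware placement (not a HyperPod API):
# map global worker ranks to (server, local_rank) so communication-heavy
# peers land on the same UltraServer and use its fast interconnect.


def place(rank: int, gpus_per_server: int = 72) -> tuple[int, int]:
    """Return (server_id, local_rank) for a global worker rank."""
    return rank // gpus_per_server, rank % gpus_per_server


def same_server(a: int, b: int, gpus_per_server: int = 72) -> bool:
    """True if two ranks share an UltraServer (fast intra-server path)."""
    return place(a, gpus_per_server)[0] == place(b, gpus_per_server)[0]


if __name__ == "__main__":
    print(place(0))      # (0, 0)
    print(place(71))     # (0, 71) -- same server as rank 0
    print(place(72))     # (1, 0)  -- first rank on the next server
    print(same_server(0, 71), same_server(0, 72))  # True False
```

A topology-aware scheduler applies the same principle at a higher level: it places the chattiest process groups within one UltraServer and reserves the slower inter-server fabric for less frequent collectives.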
Mixture of Experts on UltraServers
Partitioning a model into multiple "expert" subnetworks placed across GPUs, with a learned router sending each token to the expert(s) best suited to it
Leveraging UltraServer's high-bandwidth, low-latency interconnect to enable efficient expert-to-expert communication
Scaling to 25,000+ GPUs by interconnecting multiple UltraServer clusters using Elastic Fabric Adapter (EFA)
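The routing mechanism described above can be sketched in a few lines. The following is a minimal Mixture-of-Experts forward pass with top-1 gating in NumPy; the dimensions and the single-expert routing scheme are illustrative assumptions, not the talk's exact configuration:

```python
# Minimal Mixture-of-Experts forward pass with top-1 gating (NumPy sketch).
# Sizes and the routing scheme are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 16

# Each "expert" is a small feed-forward layer; at scale each expert lives on
# its own GPU, and routed tokens cross the UltraServer's fast interconnect.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # gating weights


def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router                            # [tokens, experts]
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)      # softmax gate
    top1 = probs.argmax(axis=1)                    # route each token to 1 expert
    out = np.empty_like(x)
    for e in range(n_experts):
        mask = top1 == e                           # tokens assigned to expert e
        if mask.any():
            # scale by the gate probability so routing stays differentiable
            out[mask] = (x[mask] @ experts[e]) * probs[mask, e : e + 1]
    return out


x = rng.standard_normal((n_tokens, d_model))
y = moe_forward(x)
print(y.shape)  # (16, 8)
```

Because each token activates only one expert, compute per token stays roughly constant as experts are added; what grows instead is the expert-to-expert (all-to-all) communication, which is exactly what the UltraServer interconnect is sized for.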
Key Takeaways
SageMaker HyperPod provides a comprehensive solution to address the unique challenges of generative AI development
EC2 UltraServers deliver large-scale, tightly coupled compute by combining specialized accelerators with high-speed interconnects
Mixture of Experts can further boost training efficiency by partitioning models across specialized GPU "experts"
Customers like Ryder have seen 3x speedups in their model training by leveraging SageMaker HyperPod
Resources
AI on SageMaker HyperPod GitHub repository: Guides and examples for deploying HyperPod clusters
Awesome Distributed Training GitHub repository: Benchmarks and sample code for distributed training on HyperPod