Train large models on Amazon SageMaker for scale and performance (AIM308)

A summary of the session transcript, with the key takeaways grouped into sections.

Overview of Generative AI and Challenges for Large Model Training

  • Generative AI has taken the world by storm, with rapid consumer adoption.
  • Organizations are equally excited about training such models on their own data, which can differentiate them within their industries.
  • As the size of models and datasets grows, the compute needs for training have increased exponentially, leading to new challenges for customers.

Introducing Amazon SageMaker HyperPod

Training Performance Optimization

  • HyperPod is purpose-built infrastructure for training large models, leveraging:
    • High-bandwidth networking with Elastic Fabric Adapter (EFA)
    • A high-performance distributed file system (Amazon FSx for Lustre)
    • Optimized distributed training libraries
  • Together, these features can reduce training time by up to 40%.
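The core operation those distributed training libraries accelerate is gradient averaging across workers each step, which is why interconnect bandwidth (EFA) matters so much. A minimal sketch of that all-reduce averaging, simulated in plain Python (illustrative only; real libraries run this as a collective over the network):

```python
def all_reduce_mean(worker_grads):
    """Average per-worker gradient vectors element-wise,
    as an all-reduce collective would across nodes."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(length)]

# Four simulated workers, each holding a local gradient for two parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(all_reduce_mean(grads))  # [4.0, 5.0]
```

Because every worker must exchange its full gradient every step, the volume of traffic scales with model size, and a faster fabric translates directly into shorter step times.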

Resiliency and Reliability

  • HyperPod provides:
    • Pre-flight health checks on instances to detect faulty hardware
    • Continuous monitoring of cluster health, with self-healing replacement of bad nodes
    • Automatic reload from the last checkpoint and job restart after a node failure
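Auto-resume only works if the training job persists its state periodically and can pick up from the latest checkpoint on restart. A minimal sketch of that pattern (file name and state fields here are hypothetical, not HyperPod's actual format):

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    """Persist training progress so a restarted job can resume."""
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint(path):
    """Return (step, state) from the latest checkpoint, or a fresh start."""
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {}

ckpt_path = os.path.join(tempfile.gettempdir(), "demo_ckpt.json")
save_checkpoint(ckpt_path, 500, {"loss": 0.42})

# After a simulated node failure, the restarted job resumes at step 500,
# not step 0.
step, state = load_checkpoint(ckpt_path)
print(step)  # 500
```

In a real large-model run the checkpoint holds model weights and optimizer state and is written to shared storage such as FSx for Lustre, so any replacement node can read it.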

Ease of Use and Flexibility

  • HyperPod provides APIs and console integration for easy cluster management.
  • Supports multiple job orchestration options (Slurm, Amazon EKS) and observability tools (Prometheus, Grafana).
  • Allows full customization of the compute environment through SSH access to the nodes.
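With Slurm as the orchestrator, jobs are submitted as batch scripts. A hypothetical sketch of what submitting a multi-node training job could look like (a non-executable config fragment; the job name, node count, and script paths are all assumptions, not from the session):

```shell
#!/bin/bash
# Hypothetical Slurm batch script for a multi-node training job.
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=8        # e.g. one task per GPU
#SBATCH --output=logs/%x_%j.out

# srun launches one training process per task across all allocated nodes.
srun python train.py --config config/llm_pretrain.yaml
```

Submitted with `sbatch script.sh`, Slurm queues the job until the requested nodes are free; on HyperPod, the self-healing described above replaces failed nodes so the queue keeps draining.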

Case Study: Hippocratic AI's Use of HyperPod

  • Hippocratic AI is building AI-powered "autopilot agents" for healthcare applications such as patient outreach and clinical task automation.
  • They use a "constellation architecture" in which multiple specialized models work together for safety and reliability.
  • They upgraded their largest model from 70 billion to 405 billion parameters, which required optimizations for low-latency inference.
  • They used HyperPod for training, leveraging features such as fast storage, monitoring, and automated cluster management.
  • Looking ahead, they are interested in HyperPod's elastic GPU compute capabilities for handling variable workloads.
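Back-of-the-envelope arithmetic shows why the 70B-to-405B jump forces multi-GPU sharding for inference. Assuming bf16 weights (2 bytes per parameter) and 80 GB of memory per accelerator (both assumptions for illustration; the session did not give these figures):

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Memory for model weights alone, in GB, at the given precision."""
    return params_billion * 1e9 * bytes_per_param / 1e9

for size in (70, 405):
    gb = weight_memory_gb(size)
    min_gpus = -(-gb // 80)  # ceiling division over 80 GB devices
    print(f"{size}B params: ~{gb:.0f} GB of weights, needs >= {min_gpus:.0f} GPUs")
```

Weights alone for a 405B model come to roughly 810 GB, an order of magnitude beyond a single device, before counting the KV cache and activations; hence the low-latency inference optimizations the team describes.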

Key Takeaways

  • HyperPod can reduce training time for large models by up to 40% through infrastructure optimization.
  • It provides built-in resiliency and reliability features to handle hardware failures.
  • It offers flexible, customizable, easy-to-use infrastructure for both training and inference.
  • Companies of all sizes and industries are using HyperPod to accelerate their generative AI development.
