AWS re:Invent 2025 - Train high-performing AI models at scale on AWS (AIM365)

Scaling AI Model Training on AWS with SageMaker

Importance of Large AI Models

Customers and users increasingly expect AI-powered experiences

Companies are training and customizing large AI models to better serve their use cases

These large models require significant compute, with training runs reaching up to 10^24 FLOPs (5 yottaFLOPs, equivalent to 1,000 P5 GPUs running for a month)
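
As a rough sanity check on the scale quoted above, the total compute of a run can be estimated as GPUs × peak throughput × utilization × time. The sketch below is a back-of-the-envelope calculation only: the per-GPU peak (about 1e15 FLOP/s) and the 40% utilization are assumptions, not figures from the talk, and the result shifts considerably with numeric precision and achieved utilization.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def training_flops(num_gpus, peak_flops_per_gpu, utilization, days):
    """Total floating-point operations delivered over a training run."""
    return num_gpus * peak_flops_per_gpu * utilization * days * SECONDS_PER_DAY


# Assumed values (not from the talk): 1e15 FLOP/s peak per GPU, 40% utilization.
total = training_flops(
    num_gpus=1000,
    peak_flops_per_gpu=1e15,
    utilization=0.4,
    days=30,
)
print(f"{total:.2e} FLOPs")  # about 1e24 under these assumptions
```

Under these assumptions, 1,000 GPUs for a month land in the 10^24 range, which is consistent with the order of magnitude cited.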

SageMaker Training Capabilities

SageMaker Training Jobs

Fully managed API for training models

Automatically spins up a cluster, trains the model, and delivers the artifact

Ephemeral compute: pay only for what you use
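
To make the "fully managed API" concrete, here is a minimal sketch of the request a training job takes through the boto3 `create_training_job` call. The job name, image URI, role ARN, bucket, instance count, and volume size are hypothetical placeholders, not values from the talk, and a real job typically needs more configuration (input data channels, hyperparameters, networking).

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.p5.48xlarge",
                               instance_count=2,
                               max_runtime_seconds=24 * 60 * 60):
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # container holding the training code
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},  # artifact destination
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 500,            # assumed size, adjust as needed
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


def submit(request):
    """Submit the job; the cluster is created for the run and released afterwards."""
    import boto3  # imported lazily so the sketch runs without AWS access
    return boto3.client("sagemaker").create_training_job(**request)


# All identifiers below are hypothetical placeholders.
request = build_training_job_request(
    job_name="example-llm-pretrain",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-train:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    output_s3_uri="s3://example-bucket/output/",
)
```

Calling `submit(request)` would start the job; it is left uncalled here because it requires AWS credentials and incurs cost.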

SageMaker HyperPod

Persistent clusters for training and inference

Supports Slurm and Kubernetes (EKS) orchestration

Provides more granular control and observability

Key Dimensions for Large Model Training

Compute Availability:

Wide selection of GPUs and accelerators (H100, GB200)
Flexible capacity options (on-demand, spot, reserved, flexible training plans)
Optimized utilization through training job management and task governance

Performance:

Handling models larger than available GPU memory through distributed training
Data parallelism (data sharded across GPUs) and model parallelism (model sharded across GPUs)
High-speed networking (NVLink, Elastic Fabric Adapter) for efficient communication
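
The data-parallel pattern described above can be sketched in a few lines of plain Python. This is a toy illustration, not a distributed implementation: each "worker" is just a list slice, and `all_reduce_mean` is a hypothetical stand-in for the collective operation that real frameworks run over NVLink or EFA. The model, data, and learning rate are invented for the example.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across workers."""
    return sum(values) / len(values)


# Toy dataset generated by y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Data parallelism: every worker holds the full model (w) and an equal shard of data.
num_workers = 2
shard_size = len(data) // num_workers
shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]

w = 0.0
for step in range(50):
    local_grads = [gradient(w, shard) for shard in shards]  # computed independently
    w -= 0.05 * all_reduce_mean(local_grads)                # same update on every worker

print(round(w, 6))
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so every worker applies the same update. Model parallelism differs in that the parameters themselves are split across devices, which is what allows a model larger than one GPU's memory to be trained at all.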

Resiliency:

Mitigation: Checkpoint-based recovery, managed checkpointing
Prevention: Health checks, instance isolation
Detection: Continuous health monitoring
Recovery: Automatic node replacement, job restart from checkpoint
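
Checkpoint-based recovery, as listed above, comes down to periodically persisting training state and resuming from the last saved step after a failure. The following is a minimal local sketch of that loop, assuming a JSON file as the checkpoint and a simulated failure; it does not represent the managed checkpointing feature itself, and the function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a crash mid-write never corrupts the last good checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at=None):
    state = load_checkpoint(path)  # resume from the last checkpoint if present
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}  # placeholder "training"
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_steps=100, checkpoint_every=10, fail_at=57)
except RuntimeError:
    pass  # a real system would replace the faulty node here

resumed_from = load_checkpoint(ckpt)["step"]
final = train(ckpt, total_steps=100, checkpoint_every=10)
print(resumed_from, final["step"])
```

The work between the last checkpoint and the failure is lost and recomputed, which is the trade-off behind choosing a checkpoint interval.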

Observability:

Comprehensive dashboards for cluster and task-level metrics
Prometheus and Grafana-based observability
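
For readers unfamiliar with Prometheus, metrics are scraped as plain text in a simple exposition format, which Grafana then charts. The sketch below renders a gauge in that format by hand to show its shape; in practice a client library or exporter would produce it. The metric name and label names are hypothetical and are not taken from the talk.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Labels are emitted in sorted order so the output is deterministic.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical task-level metric: per-GPU utilization on two nodes.
output = render_gauge(
    "training_gpu_utilization_ratio",
    "Fraction of time the GPU was busy.",
    [
        ({"node": "node-0", "gpu": "0"}, 0.93),
        ({"node": "node-1", "gpu": "0"}, 0.41),
    ],
)
print(output)
```

A low value on one node relative to its peers, as in the second sample, is the kind of signal that cluster-level and task-level dashboards are meant to surface.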

Ease of Use:

Seamless integration with popular frameworks (PyTorch, TensorFlow, Ray)
Managed MLflow for tracking training metrics
Kubernetes and Slurm orchestration options

Cost Optimization:

Efficient utilization of compute resources
Leveraging spot instances and reserved capacity

Roblox's Use Case

Roblox is a platform with 150 million daily active users and 45 million peak concurrent users

Roblox's AI infrastructure supports over 1 million queries per second across 350+ models

Roblox used SageMaker HyperPod to train a 4D foundation model with 1-70 billion parameters

Key benefits of SageMaker HyperPod:

Ease of setup and integration with existing EKS infrastructure
Access to high-performance GPU capacity
Improved resiliency and stability for large-scale training
Flexibility to use preferred frameworks (Unicorn, Ray) and tooling

Future Developments

Roblox is exploring a "decentralized compute" approach to access GPU capacity across multiple regions and clusters

Roblox is seeking further integration between SageMaker HyperPod and multi-cluster, multi-region capabilities

Key Takeaways

SageMaker provides comprehensive capabilities for training large AI models at scale, addressing compute availability, performance, resiliency, observability, and ease of use

SageMaker HyperPod offers a flexible, persistent cluster solution with advanced orchestration and observability features

Roblox successfully used SageMaker HyperPod to train a large-scale 4D foundation model, benefiting from the platform's ease of use, performance, and resiliency

Future developments aim to further expand the distributed training capabilities across multiple regions and clusters
