AWS re:Invent 2025 - Train high-performing AI models at scale on AWS (AIM365)

Scaling AI Model Training on AWS with SageMaker

Importance of Large AI Models

  • Customers and users are increasingly expecting AI-powered experiences
  • Companies are training and customizing large AI models to better serve their use cases
  • These large models require significant compute: a training run can reach the order of 10^24 floating-point operations (FLOPs). The talk cites 5 yottaFLOPs, which it equates to 1,000 P5 GPUs running for a month
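To make the scale in the last bullet concrete, here is a minimal back-of-the-envelope sketch. It relates a total training budget (in FLOPs) to device count, duration, and sustained per-device throughput. The function names and the 30-day month are assumptions for illustration; only the 5-yottaFLOP (5 × 10^24) budget and the 1,000-device count come from the bullet above.

```python
SECONDS_PER_DAY = 86_400


def required_throughput_per_device(total_flops, devices, days):
    """Sustained FLOP/s each device must deliver to finish in `days`."""
    return total_flops / (devices * days * SECONDS_PER_DAY)


def devices_needed(total_flops, sustained_flops_per_device, days):
    """Device count needed at a given sustained per-device throughput."""
    return total_flops / (sustained_flops_per_device * days * SECONDS_PER_DAY)


if __name__ == "__main__":
    budget = 5e24  # 5 yottaFLOPs, the figure quoted in the talk summary
    # Assumption: "a month" is taken as 30 days.
    per_device = required_throughput_per_device(budget, devices=1_000, days=30)
    print(f"Sustained throughput per device: {per_device:.2e} FLOP/s")
    # Halving the sustained throughput doubles the number of devices needed.
    print(f"Devices at half that rate: {devices_needed(budget, per_device / 2, 30):.0f}")
```

The sustained rate is what matters here: real jobs run below a device's peak rating, so the device count for a fixed deadline grows as utilization drops.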

SageMaker Training Capabilities

SageMaker Training Jobs

  • Fully managed API for training models
  • Automatically spins up a cluster, trains the model, and delivers the artifact
  • Ephemeral compute: you pay only for what you use
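As a rough illustration of the "fully managed API" idea, the sketch below assembles a request body for the SageMaker CreateTrainingJob API as a plain Python dictionary. The job name, image URI, role ARN, and S3 path are hypothetical placeholders, and the instance type, instance count, volume size, and runtime limit are assumed values; none of them come from the talk. The code only builds the request and does not call AWS.

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.p5.48xlarge",
                               instance_count=2,
                               max_runtime_seconds=86_400):
    """Return a dict shaped like a CreateTrainingJob request (illustrative)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # container holding the training code
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,                 # IAM role the job assumes
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},  # model artifact destination
        "ResourceConfig": {                  # the ephemeral cluster for this job
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 500,           # assumed value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


if __name__ == "__main__":
    # All identifiers below are hypothetical placeholders.
    request = build_training_job_request(
        job_name="demo-training-job",
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-train:latest",
        role_arn="arn:aws:iam::123456789012:role/DemoSageMakerRole",
        output_s3_uri="s3://demo-bucket/output/",
    )
    print(request["ResourceConfig"])
```

With AWS credentials configured, a request like this could be submitted with `boto3.client("sagemaker").create_training_job(**request)`; the cluster described in `ResourceConfig` exists only for the duration of the job.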

SageMaker HyperPod

  • Persistent clusters for training and inference
  • Supports Slurm and Kubernetes (EKS) orchestration
  • Provides more granular control and observability

Key Dimensions for Large Model Training

  1. Compute Availability:

    • Wide selection of GPUs and accelerators (H100, GB200)
    • Flexible capacity options (on-demand, spot, reserved, flexible training plans)
    • Optimized utilization through training job management and task governance
  2. Performance:

    • Handling models larger than available GPU memory through distributed training
    • Data parallelism (data sharded across GPUs) and model parallelism (model sharded across GPUs)
    • High-speed networking (NVLink, Elastic Fabric Adapter) for efficient communication
  3. Resiliency:

    • Mitigation: Checkpoint-based recovery, managed checkpointing
    • Prevention: Health checks, instance isolation
    • Detection: Continuous health monitoring
    • Recovery: Automatic node replacement, job restart from checkpoint
  4. Observability:

    • Comprehensive dashboards for cluster and task-level metrics
    • Prometheus and Grafana-based observability
  5. Ease of Use:

    • Seamless integration with popular frameworks (PyTorch, TensorFlow, Ray)
    • Managed MLflow for tracking training metrics
    • Kubernetes and Slurm orchestration options
  6. Cost Optimization:

    • Efficient utilization of compute resources
    • Leveraging spot instances and reserved capacity
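The two parallelism strategies listed under Performance can be illustrated with a toy, single-process sketch. Workers and devices are simulated with Python lists; the function names, the scalar linear model, and the plain-average "all-reduce" are assumptions for illustration, not how SageMaker or any framework implements them.

```python
def local_gradient(w, shard):
    """Gradient of mean((w*x - y)^2) with respect to w over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def shard_data(data, num_workers):
    """Data parallelism: split the batch into equal shards, one per worker."""
    if num_workers < 1 or len(data) % num_workers != 0:
        raise ValueError("data must divide evenly across workers")
    return [data[i::num_workers] for i in range(num_workers)]


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)


def data_parallel_step(w, data, num_workers, lr):
    """Each worker holds a full copy of w and sees only its own shard."""
    grads = [local_gradient(w, shard) for shard in shard_data(data, num_workers)]
    return w - lr * all_reduce_mean(grads)


def model_parallel_dot(weights, x, num_devices):
    """Model parallelism: each device holds a slice of the weights,
    computes a partial dot product, and the partial results are summed."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    chunk = -(-len(weights) // num_devices)  # ceiling division
    partials = [
        sum(wi * xi for wi, xi in zip(weights[i:i + chunk], x[i:i + chunk]))
        for i in range(0, len(weights), chunk)
    ]
    return sum(partials)


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x
    w = 0.0
    for _ in range(50):
        w = data_parallel_step(w, data, num_workers=4, lr=0.01)
    print(f"learned weight: {w:.4f}")
    print(model_parallel_dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], num_devices=2))
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so adding workers changes where the work happens rather than the result. In both strategies the exchange step (the average or the final sum) is the part that relies on fast interconnects.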
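The checkpoint-based recovery listed under Resiliency can be sketched as a training loop that periodically saves its state and, after a failure, restarts from the most recent checkpoint. This is a minimal local sketch: the JSON file, the `fail_at_step` parameter, and the constant "weight update" are hypothetical stand-ins, not part of any SageMaker API.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a crash mid-write
    never leaves a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at_step=None):
    """Run (or resume) training; optionally simulate a node failure."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at_step is not None and state["step"] == fail_at_step:
            raise RuntimeError("simulated node failure")
        state["weight"] += 0.5  # stand-in for a real optimizer update
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        try:
            train(ckpt, total_steps=10, checkpoint_every=3, fail_at_step=7)
        except RuntimeError as err:
            print(f"job failed: {err}")
        print("resuming from step", load_checkpoint(ckpt)["step"])
        print("final state:", train(ckpt, total_steps=10, checkpoint_every=3))
```

The checkpoint interval sets how much work is lost per failure: anything done after the last save is repeated on restart, so shorter intervals trade extra write overhead for less recomputation.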

Roblox's Use Case

  • Roblox is a platform with 150 million daily active users and a peak of 45 million concurrent users
  • Roblox's AI infrastructure supports over 1 million queries per second across 350+ models
  • Roblox used SageMaker HyperPod to train a 4D foundation model, at sizes from 1 billion to 70 billion parameters
  • Key benefits of SageMaker HyperPod:
    • Ease of setup and integration with existing EKS infrastructure
    • Access to high-performance GPU capacity
    • Improved resiliency and stability for large-scale training
    • Flexibility to use preferred frameworks and tooling (Apache YuniKorn, Ray)

Future Developments

  • Roblox is exploring a "decentralized compute" approach to access GPU capacity across multiple regions and clusters
  • Roblox is seeking further integration between SageMaker HyperPod and multi-cluster, multi-region capabilities

Key Takeaways

  • SageMaker provides comprehensive capabilities for training large AI models at scale, addressing compute availability, performance, resiliency, observability, and ease of use
  • SageMaker HyperPod offers a flexible, persistent cluster solution with advanced orchestration and observability features
  • Roblox used SageMaker HyperPod to train a large-scale 4D foundation model, benefiting from the platform's ease of use, performance, and resiliency
  • Future developments aim to further expand the distributed training capabilities across multiple regions and clusters
