AWS re:Invent 2025 - Master AI model development with Amazon SageMaker AI (AIM272)

Mastering AI Model Development with Amazon SageMaker

Key Challenges in AI Model Development

  • Customers struggle to differentiate their AI models from competitors who have access to the same generic models
  • Models need to deeply understand the customer's business, data, and domain expertise to create real value and differentiation

How Models Learn

  • Models go through a multi-stage learning process similar to human learning:
    1. Pre-training: Models learn general world knowledge by consuming vast amounts of data
    2. Instruction Following: Models learn to follow instructions and provide structured explanations
    3. Preference Optimization: Models learn to generate responses aligned with human preferences and social norms
    4. Reasoning: Models practice multi-step problem solving and structured thinking
    5. Application: Models apply their learning to real-world tasks

Challenges in Pre-Training Models

  1. Efficient Scaling: Large models require distributed training across many GPU or AWS Trainium accelerators
  2. Resiliency: Failures in any part of the distributed training cluster can halt the entire process
  3. Utilization: Ensuring high utilization of the expensive AI infrastructure to avoid cost overruns
  4. Productivity: Providing the right tools so engineers can focus on value-added work rather than infrastructure
  5. Observability: Gaining visibility and insights across the entire training stack

SageMaker Capabilities for Pre-Training

  • SageMaker Training Jobs: Ephemeral training jobs that spin up infrastructure as needed
  • SageMaker HyperPod:
    • Persistent cluster of AI accelerators for training
    • Automated failover and recovery to improve resiliency
    • Integrated observability through Prometheus, Grafana, and CloudWatch

Innovations in SageMaker HyperPod

  1. Checkpointless Training: Enables recovery from failures in seconds by swapping failed components without restarting the entire cluster
    • Leverages peer-to-peer recovery of model and optimizer state instead of relying on checkpoints
    • Reduces recovery time from hours to 1-2 minutes, achieving 95% cluster utilization
  2. Task Governance: Automatically runs jobs based on priority, reduces idle compute, and maximizes cluster utilization
  3. Elastic Training: Allows training jobs to scale up or down compute resources while continuing to make progress
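The peer-to-peer recovery idea behind checkpointless training can be pictured with a toy simulation. This is only a conceptual sketch in plain Python, not HyperPod's implementation or API: the `Worker` class, the `recover_from_peer` helper, and the dictionary used as "training state" are all hypothetical names made up for illustration, and it assumes simple data parallelism where every replica holds the same model and optimizer state.

```python
import copy


class Worker:
    """Hypothetical data-parallel replica holding a full copy of the training state."""

    def __init__(self, rank, state):
        self.rank = rank
        self.state = copy.deepcopy(state)
        self.healthy = True

    def train_step(self):
        # Stand-in for a real optimizer step: every replica applies the same update.
        self.state["step"] += 1
        self.state["weights"] = [w - 0.1 for w in self.state["weights"]]


def recover_from_peer(workers, failed_rank):
    """Swap in a replacement worker that copies live state from a healthy peer."""
    peer = next(w for w in workers if w.healthy and w.rank != failed_rank)
    replacement = Worker(failed_rank, peer.state)
    workers[failed_rank] = replacement
    return replacement


initial = {"step": 0, "weights": [1.0, 2.0]}
workers = [Worker(rank, initial) for rank in range(4)]

for _ in range(3):
    for w in workers:
        w.train_step()

workers[2].healthy = False  # simulate a hardware fault on rank 2
replacement = recover_from_peer(workers, failed_rank=2)

# The replacement resumes at the current step instead of an older checkpoint.
print(replacement.state["step"])  # prints 3
```

In this toy model, only the failed rank is replaced and the healthy workers keep their in-memory state, which is the property the talk credits for the shorter recovery time.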

Model Customization Techniques

  1. Supervised Fine-Tuning: Model learns from labeled input-output data to improve task performance and understand the domain
  2. Reinforcement Learning (RL): Model generates outputs, receives rewards/penalties, and learns to optimize for desired behavior
    • RL from Human Feedback, RL from AI Feedback, RL from Verifiable Rewards
  3. Direct Preference Optimization (DPO): Model learns preferences directly from pairs of preferred and rejected responses, without training a separate reward model
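To make the DPO item concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not SageMaker code: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions. The inputs are the summed log-probabilities that the model being trained (the policy) and a frozen reference model assign to the preferred ("chosen") and rejected responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_shift - rejected_shift)
    # Numerically stable form of -log(sigmoid(x)) = log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet, loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-3.0, -7.0, -5.0, -5.0)

print(neutral, improved)
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model; `beta` controls how strongly the policy is pushed away from the reference.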

SageMaker Capabilities for Model Customization

  • Broad choice of pre-trained models to customize, including Amazon Nova models
  • Supports all popular fine-tuning techniques (supervised, RL, DPO)
  • Fully managed experience with serverless training and model evaluation
  • Integrated experiment tracking with serverless MLflow
  • Flexible interfaces: code-based, UI-guided, and agent-guided customization

Building with Amazon Nova on SageMaker

  • Nova Forge: Bridges the gap between foundation model knowledge and organizational/domain knowledge
  • Key capabilities:
    1. Access to model checkpoints across all phases of development
    2. Blend proprietary data with Amazon Nova curated data
    3. Perform reinforcement learning with custom reward functions
    4. Use push-button recipes to accelerate model development
    5. Leverage responsible AI toolkit
  • Examples:
    • Reddit built a highly accurate content moderation solution using Nova Forge
    • Customers can dial up/down customizable content moderation settings
  • AWS AI Innovation Center provides assistance in integrating business DNA into custom models
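As an illustration of what a custom reward function for reinforcement learning might look like in a content moderation setting, the sketch below scores a model's output against a known label. It is a hypothetical example only: the function name `moderation_reward`, the label set, and the reward values are assumptions, and it does not reflect the actual Nova Forge interface or Reddit's solution.

```python
# Hypothetical label set for a moderation task (an assumption for this sketch).
ALLOWED_LABELS = ("allow", "flag", "remove")


def moderation_reward(model_output: str, expected_label: str) -> float:
    """Verifiable reward: compare the model's moderation decision to a known label."""
    prediction = model_output.strip().lower()
    if prediction not in ALLOWED_LABELS:
        return -1.0  # penalize malformed output that is not a valid label
    return 1.0 if prediction == expected_label else 0.0


rewards = [
    moderation_reward("Remove", "remove"),          # correct decision
    moderation_reward("allow", "flag"),             # valid label, wrong decision
    moderation_reward("I think it is fine", "allow"),  # not a valid label
]
print(rewards)  # prints [1.0, 0.0, -1.0]
```

Because the reward is computed by a deterministic check rather than by human raters, it falls under the "RL from Verifiable Rewards" style mentioned earlier; a stricter or more lenient moderation policy would be expressed by changing the labels or reward values.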

Key Takeaways

  • SageMaker provides a comprehensive platform to address the challenges in AI model development and customization
  • Innovations like checkpointless training, task governance, and elastic scaling improve efficiency and resiliency of pre-training
  • Model customization techniques like supervised fine-tuning, reinforcement learning, and direct preference optimization enable differentiation
  • Nova Forge on SageMaker allows customers to bridge the gap between foundation models and their unique business knowledge
