AWS re:Invent 2025 - Master AI model development with Amazon SageMaker AI (AIM272)
Mastering AI Model Development with Amazon SageMaker
Key Challenges in AI Model Development
Customers struggle to differentiate their AI models from competitors who have access to the same generic models
Models need to deeply understand the customer's business, data, and domain expertise to create real value and differentiation
How Models Learn
Models go through a multi-stage learning process similar to human learning:
Pre-training : Models learn general world knowledge by consuming vast amounts of data
Instruction Following : Models learn to follow instructions and provide structured explanations
Preference Optimization : Models learn to generate responses aligned with human preferences and social norms
Reasoning : Models practice multi-step problem solving and structured thinking
Application : Models apply their learning to real-world tasks
Challenges in Pre-Training Models
Efficient Scaling : Large models require distributed training across many GPU or AWS Trainium accelerators
Resiliency : Failures in any part of the distributed training cluster can halt the entire process
Utilization : Ensuring high utilization of the expensive AI infrastructure to avoid cost overruns
Productivity : Providing the right tools so engineers can focus on value-added work rather than infrastructure
Observability : Gaining visibility and insights across the entire training stack
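A back-of-the-envelope sketch can show why resiliency becomes the dominant concern as clusters grow. It assumes every accelerator fails independently with the same daily probability, which is a simplification; the function name and the 0.1% failure rate are illustrative assumptions, not figures from the talk.

```python
def prob_job_interrupted(num_accelerators: int,
                         daily_failure_prob: float,
                         days: float = 1.0) -> float:
    """Probability that at least one accelerator fails during the run.

    Assumes independent, identical failure probabilities per accelerator
    (an illustrative simplification, not a measured failure model).
    """
    if not 0.0 <= daily_failure_prob <= 1.0:
        raise ValueError("daily_failure_prob must be between 0 and 1")
    # Chance that a single accelerator survives the whole run.
    survive_one = (1.0 - daily_failure_prob) ** days
    # The job is interrupted unless every accelerator survives.
    return 1.0 - survive_one ** num_accelerators


if __name__ == "__main__":
    # Hypothetical 0.1% daily failure probability per accelerator.
    for n in (8, 512, 4096):
        p = prob_job_interrupted(n, daily_failure_prob=0.001)
        print(f"{n:>5} accelerators: {p:.1%} chance of an interruption per day")
```

Under these assumptions the chance of an interruption rises steeply with cluster size, which is why a single failure halting the whole job is so costly.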
SageMaker Capabilities for Pre-Training
SageMaker Training Jobs: Ephemeral training jobs that spin up infrastructure as needed
SageMaker HyperPod:
Persistent cluster of AI accelerators for training
Automated failover and recovery to improve resiliency
Integrated observability through Prometheus, Grafana, and CloudWatch
Innovations in SageMaker HyperPod
Checkpointless Training : Recovers from failures by swapping out failed components without restarting the entire cluster
Leverages peer-to-peer recovery of model and optimizer state instead of relying on checkpoints
Reduces recovery time from hours to 1-2 minutes, achieving 95% cluster utilization
Task Governance : Automatically runs jobs based on priority, reduces idle compute, and maximizes cluster utilization
Elastic Training : Allows training jobs to scale up or down compute resources while continuing to make progress
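The link between recovery time and utilization can be illustrated with simple arithmetic. The sketch below treats each failure cycle as a stretch of productive training followed by a recovery period; the three-hour interval between failures is a hypothetical input, and the function is an illustration rather than how HyperPod measures utilization.

```python
def cluster_utilization(hours_between_failures: float,
                        recovery_hours: float) -> float:
    """Fraction of wall-clock time spent training rather than recovering.

    Each cycle is modeled as productive time followed by recovery time.
    recovery_hours should include restart time and any recomputed work.
    """
    if hours_between_failures <= 0 or recovery_hours < 0:
        raise ValueError("expected positive interval and non-negative recovery")
    return hours_between_failures / (hours_between_failures + recovery_hours)


if __name__ == "__main__":
    # Hypothetical cluster that hits a failure every 3 hours of training.
    slow = cluster_utilization(3.0, recovery_hours=1.0)       # hour-long recovery
    fast = cluster_utilization(3.0, recovery_hours=2 / 60)    # 2-minute recovery
    print(f"hour-long recovery: {slow:.1%} utilization")
    print(f"2-minute recovery:  {fast:.1%} utilization")
```

The shorter the recovery, the less the failure rate matters, which is the motivation for recovering state from peers instead of reloading a checkpoint.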
Model Customization Techniques
Supervised Fine-Tuning : Model learns from labeled input-output data to improve task performance and understand the domain
Reinforcement Learning (RL) : Model generates outputs, receives rewards/penalties, and learns to optimize for desired behavior
Variants include RL from Human Feedback (RLHF), RL from AI Feedback (RLAIF), and RL from Verifiable Rewards (RLVR)
Direct Preference Optimization (DPO) : Model learns preferences directly from pairs of preferred and rejected responses, without training a separate reward model
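To make the preference-learning idea concrete, here is a minimal sketch of the standard DPO loss for a single pair of responses (one preferred, one rejected). It works on plain floats rather than tensors, and the function name, argument names, and beta value are illustrative assumptions; it is not SageMaker's implementation.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (preferred, rejected) response pair.

    Inputs are summed log-probabilities of each response under the model
    being trained (policy) and under a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(logits)), computed in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy shifts probability toward the preferred response: loss drops.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model.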
SageMaker Capabilities for Model Customization
Broad choice of pre-trained models to customize, including Amazon Nova models
Supports all popular fine-tuning techniques (supervised, RL, DPO)
Fully managed experience with serverless training and model evaluation
Integrated experiment tracking with serverless MLflow
Flexible interfaces: code-based, UI-guided, and agent-guided customization
Building with Amazon Nova on SageMaker
Nova Forge: Bridges the gap between foundation model knowledge and organizational/domain knowledge
Key capabilities:
Access to model checkpoints across all phases of development
Blend proprietary data with Amazon Nova curated data
Perform reinforcement learning with custom reward functions
Use push-button recipes to accelerate model development
Leverage responsible AI toolkit
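As an illustration of what a custom reward function for RL with verifiable rewards might look like, the sketch below scores a model's answer to an arithmetic question by checking the last number in its output. The function name, signature, and scoring rule are hypothetical; the actual Nova Forge reward-function interface is not described in this summary.

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def verifiable_reward(model_output: str,
                      expected_answer: float,
                      tolerance: float = 1e-6) -> float:
    """Return 1.0 if the last number in the output matches the expected answer.

    Hypothetical reward for tasks with a checkable numeric answer; outputs
    with no number at all receive 0.0.
    """
    numbers = _NUMBER.findall(model_output)
    if not numbers:
        return 0.0
    predicted = float(numbers[-1])
    return 1.0 if abs(predicted - expected_answer) <= tolerance else 0.0


if __name__ == "__main__":
    print(verifiable_reward("12 * 4 = 48, so the answer is 48", 48))
    print(verifiable_reward("I think it is 50.", 48))
```

A binary reward like this is easy to verify automatically; a real setup would likely add partial credit or format checks suited to the domain.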
Examples:
Reddit built a highly accurate content moderation solution using Nova Forge
Customers can dial customizable content moderation settings up or down
AWS AI Innovation Center provides assistance in integrating business DNA into custom models
Key Takeaways
SageMaker provides a comprehensive platform to address the challenges in AI model development and customization
Innovations like checkpointless training, task governance, and elastic scaling improve efficiency and resiliency of pre-training
Model customization techniques like supervised fine-tuning, reinforcement learning, and direct preference optimization enable differentiation
Nova Forge on SageMaker allows customers to bridge the gap between foundation models and their unique business knowledge