AWS re:Invent 2025 - Build, fine-tune & deploy AI models with SageMaker HyperPod CLI & SDK (AIM371)
Introduction to SageMaker HyperPod
SageMaker HyperPod is a managed service for creating and operating persistent clusters to train, fine-tune, and deploy foundation models
Key benefits of HyperPod:
Resilient with built-in failure recovery and remediation
Highly scalable with low-latency networking and managed autoscaling
Customizable at every layer from hardware to frameworks and observability
Efficient through integration with task governance capabilities
HyperPod CLI and SDK
HyperPod CLI and SDK provide an abstraction layer on top of Kubernetes for training and deploying AI models
Motivations for building the CLI and SDK:
Simplify the complex Kubernetes workflow for data scientists and researchers
Provide both a command-line and programmatic interface
Leverage HyperPod's observability and optimization capabilities
Getting Started with HyperPod
Simplified cluster creation experience from the AWS Console
Quick setup with opinionated defaults or custom configuration
Installing the HyperPod CLI and SDK via pip (published as the sagemaker-hyperpod package); a quick connectivity check is sketched below
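Once the package is installed, cluster connectivity can be sanity-checked directly from Python. A minimal sketch using boto3's SageMaker client (the ListClusters and DescribeCluster APIs; the cluster name is a placeholder):

```python
import boto3

# HyperPod clusters are managed through the standard SageMaker control-plane APIs.
sm = boto3.client("sagemaker")

# List the HyperPod clusters in this account/region.
for summary in sm.list_clusters()["ClusterSummaries"]:
    print(summary["ClusterName"], summary["ClusterStatus"])

# Inspect one cluster's instance groups (cluster name is a placeholder).
cluster = sm.describe_cluster(ClusterName="my-hyperpod-cluster")
for group in cluster["InstanceGroups"]:
    print(group["InstanceGroupName"], group["InstanceType"], group["CurrentCount"])
```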
Training AI Models with HyperPod
HyperPod Training Operator provides advanced features:
Fast failure recovery: training processes are restarted in place rather than tearing down and rescheduling containers on failures
Custom monitoring to detect issues in training job logs
Integration with HyperPod's task governance system
Submitting Training Jobs
Using the HyperPod CLI to create and submit training jobs with a single command
Demonstrating fault resilience by simulating a node failure and observing the operator's quick remediation
Showcasing the HyperPod SDK for a programmatic interface to submit and manage training jobs
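For reference, a hedged sketch of programmatic submission, modeled on the sagemaker-hyperpod package's HyperPodPytorchJob interface; the image URI, job name, and script path are placeholders, and the exact module paths and config fields may differ between SDK versions:

```python
from sagemaker.hyperpod.common.config import Metadata
from sagemaker.hyperpod.training import (
    Containers, HyperPodPytorchJob, ReplicaSpec, Resources, Spec, Template,
)

# One replica spec: a pod template whose container launches torchrun.
replica_specs = [
    ReplicaSpec(
        name="pod",
        template=Template(
            spec=Spec(
                containers=[
                    Containers(
                        name="trainer",
                        image="<account>.dkr.ecr.<region>.amazonaws.com/my-training:latest",
                        command=["torchrun", "--nproc_per_node=8", "/opt/train.py"],
                        resources=Resources(
                            requests={"nvidia.com/gpu": "8"},
                            limits={"nvidia.com/gpu": "8"},
                        ),
                    )
                ]
            )
        ),
    )
]

# Submit the job; the HyperPod training operator schedules and supervises it.
job = HyperPodPytorchJob(metadata=Metadata(name="demo-job"), replica_specs=replica_specs)
job.create()
```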
Deploying AI Models with HyperPod
HyperPod Inference Operator simplifies the deployment of AI models:
Handles the setup of load balancers, SSL termination, and autoscaling
Supports deploying custom models or pre-trained models from SageMaker JumpStart
Provides autoscaling capabilities integrated with CloudWatch or Prometheus
Deploying an Inference Endpoint
Demonstrating the creation of an inference endpoint using the HyperPod CLI and SDK
Verifying the deployment and invoking the inference endpoint
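A hedged sketch of the SDK flow for a JumpStart model, based on the sagemaker-hyperpod package's inference interface; the model ID, instance type, and endpoint name are placeholders, and extra options such as TLS configuration are omitted:

```python
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import (
    Model, SageMakerEndpoint, Server,
)
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

# Choose a JumpStart model and the instance type to serve it on (placeholders).
model = Model(model_id="deepseek-llm-r1-distill-qwen-1-5b")
server = Server(instance_type="ml.g5.8xlarge")
endpoint = SageMakerEndpoint(name="my-endpoint")

# The inference operator provisions the load balancer, SSL termination, and autoscaling.
js_endpoint = HPJumpStartEndpoint(model=model, server=server, sage_maker_endpoint=endpoint)
js_endpoint.create()

# Invoke the endpoint once it is in service.
payload = '{"inputs": "What is SageMaker HyperPod?"}'
print(js_endpoint.invoke(body=payload).body.read())
```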
Optimizing Compute Utilization with HyperPod Task Governance
HyperPod Task Governance allows defining policies to prioritize and allocate compute resources
Configuring priority classes and compute allocations for different teams (Kubernetes namespaces)
Demonstrating how higher-priority workloads can preempt lower-priority ones, and how idle compute can be borrowed
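Governance policies are configured through the SageMaker control plane. A minimal boto3 sketch, assuming the CreateClusterSchedulerConfig and CreateComputeQuota APIs; the names, weights, and quotas below are illustrative:

```python
import boto3

sm = boto3.client("sagemaker")
cluster_arn = "arn:aws:sagemaker:<region>:<account>:cluster/<id>"  # placeholder

# Cluster-wide scheduling policy: named priority classes plus fair-share scheduling.
sm.create_cluster_scheduler_config(
    Name="demo-scheduler",
    ClusterArn=cluster_arn,
    SchedulerConfig={
        "PriorityClasses": [
            {"Name": "inference", "Weight": 90},
            {"Name": "training", "Weight": 70},
            {"Name": "experimentation", "Weight": 50},
        ],
        "FairShare": "Enabled",
    },
)

# Per-team quota (the team maps to a Kubernetes namespace): reserve capacity,
# lend idle instances to other teams, and borrow theirs when they are idle.
sm.create_compute_quota(
    Name="team-a-quota",
    ClusterArn=cluster_arn,
    ComputeQuotaConfig={
        "ComputeQuotaResources": [{"InstanceType": "ml.g5.8xlarge", "Count": 4}],
        "ResourceSharingConfig": {"Strategy": "LendAndBorrow", "BorrowLimit": 50},
        "PreemptTeamTasks": "LowerPriority",
    },
    ComputeQuotaTarget={"TeamName": "team-a", "FairShareWeight": 100},
)
```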
Running IDEs on HyperPod
HyperPod supports running IDEs (SageMaker Studio, VS Code, Jupyter Lab) on the cluster
Reduces startup latency by pre-caching container images
Integrates with task governance to allow prioritizing and partitioning compute resources for IDEs
Key Takeaways
SageMaker HyperPod provides a comprehensive platform for training, deploying, and optimizing AI models at scale
The HyperPod CLI and SDK abstract away the complexity of Kubernetes, making it easier for data scientists and researchers to leverage the platform
HyperPod's advanced features, such as failure recovery, custom monitoring, and task governance, help maximize the efficiency and utilization of compute resources
The ability to run IDEs directly on the HyperPod cluster enables a seamless development and deployment experience
Resources
All the code and examples demonstrated in the session are available for download via the provided QR code.