AWS re:Invent 2025 - Build, fine-tune & deploy AI models with SageMaker HyperPod CLI & SDK (AIM371)

Introduction to SageMaker HyperPod

  • SageMaker HyperPod is a managed service for creating and operating persistent clusters used to train and deploy foundation models
  • Key benefits of HyperPod:
    • Resilient with built-in failure recovery and remediation
    • Highly scalable with low-latency networking and managed autoscaling
    • Customizable at every layer from hardware to frameworks and observability
    • Efficient through integration with task governance capabilities

HyperPod CLI and SDK

  • HyperPod CLI and SDK provide an abstraction layer on top of Kubernetes for training and deploying AI models
  • Motivations for building the CLI and SDK:
    • Simplify the complex Kubernetes workflow for data scientists and researchers
    • Provide both a command-line and programmatic interface
    • Leverage HyperPod's observability and optimization capabilities

Getting Started with HyperPod

  • Simplified cluster creation experience from the AWS Console
  • Quick setup with opinionated defaults or custom configuration
  • Installing the HyperPod CLI and SDK via pip install
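Once the CLI and SDK are installed, clusters can also be inspected programmatically. The sketch below uses the real SageMaker `ListClusters` API via boto3 (AWS credentials required); the pip package name mentioned in the session is not reproduced here, and the helper function is our own illustration, not part of the SDK.

```python
# Sketch: list HyperPod clusters with boto3's SageMaker ListClusters API.
# The pure helper is illustrative; the API call itself is real.

def cluster_names(list_clusters_response):
    """Pure helper: pull cluster names out of a ListClusters response dict."""
    return [c["ClusterName"]
            for c in list_clusters_response.get("ClusterSummaries", [])]

def list_hyperpod_clusters(region="us-west-2"):
    """Call the SageMaker ListClusters API (requires AWS credentials)."""
    import boto3  # imported here so the sketch loads even without boto3 installed
    client = boto3.client("sagemaker", region_name=region)
    return cluster_names(client.list_clusters())

if __name__ == "__main__":
    print(list_hyperpod_clusters())
```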

Training AI Models with HyperPod

  • HyperPod Training Operator provides advanced features:
    • Fast failure recovery: containers are restarted in place rather than torn down on failure
    • Custom monitoring to detect issues in training job logs
    • Integration with HyperPod's task governance system

Submitting Training Jobs

  • Using the HyperPod CLI to create and submit training jobs with a single command
  • Demonstrating fault resilience by simulating a node failure and observing the operator's quick remediation
  • Showcasing the HyperPod SDK for a programmatic interface to submit and manage training jobs
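One way to picture the single-command submission is the job specification the CLI assembles on your behalf. The field names below are hypothetical, modeled loosely on Kubernetes-style job specs; they are a sketch of what the CLI abstracts away, not its actual schema.

```python
def build_training_job_spec(job_name, image, command,
                            node_count=2,
                            instance_type="ml.p5.48xlarge"):
    """Assemble an illustrative Kubernetes-style training job spec.
    All field names here are hypothetical -- the point is that the
    HyperPod CLI turns a few flags into a full spec like this one."""
    if node_count < 1:
        raise ValueError("node_count must be >= 1")
    return {
        "name": job_name,
        "image": image,            # training container image
        "command": command,        # entrypoint, e.g. ["torchrun", "train.py"]
        "replicas": node_count,    # number of nodes
        "instanceType": instance_type,
    }
```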

Deploying AI Models with HyperPod

  • HyperPod Inference Operator simplifies the deployment of AI models:
    • Handles the setup of load balancers, SSL termination, and autoscaling
    • Supports deploying custom models or pre-trained models from SageMaker JumpStart
    • Provides autoscaling capabilities integrated with CloudWatch or Prometheus

Deploying an Inference Endpoint

  • Demonstrating the creation of an inference endpoint using the HyperPod CLI and SDK
  • Verifying the deployment and invoking the inference endpoint
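Once the endpoint is up, it can be invoked through the standard SageMaker runtime `InvokeEndpoint` API, which is real boto3; the payload shape, however, depends on the model server, so the `"inputs"`/`"parameters"` fields below are a common convention, not a guarantee.

```python
import json

def build_payload(prompt, max_tokens=128):
    """Pure helper: JSON payload for a text-generation model.
    The field names ("inputs", "parameters") follow a common model-server
    convention but are an assumption -- check your model's contract."""
    return json.dumps({"inputs": prompt,
                       "parameters": {"max_new_tokens": max_tokens}})

def invoke(endpoint_name, prompt, region="us-west-2"):
    """Call the SageMaker runtime InvokeEndpoint API (needs credentials)."""
    import boto3  # imported here so the sketch loads without boto3 installed
    runtime = boto3.client("sagemaker-runtime", region_name=region)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(prompt),
    )
    return response["Body"].read().decode()
```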

Optimizing Compute Utilization with HyperPod Task Governance

  • HyperPod Task Governance allows defining policies to prioritize and allocate compute resources
  • Configuring priority classes and compute allocations for different teams (Kubernetes namespaces)
  • Demonstrating how higher-priority workloads can preempt lower-priority ones, and how idle compute can be borrowed
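The preempt-and-borrow semantics above can be illustrated with a toy model. This is a sketch of the behavior, not the Task Governance implementation, which operates on Kubernetes namespaces, priority classes, and compute quotas; here, idle capacity is simply free GPUs that any team's job may use.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    team: str       # stands in for a Kubernetes namespace
    priority: int   # higher number = higher priority
    gpus: int

def admit(new_job, running, capacity):
    """Toy priority-based admission: if free GPUs are short, preempt the
    lowest-priority running jobs that rank below the new job.
    Returns (admitted, still_running, preempted)."""
    free = capacity - sum(j.gpus for j in running)  # idle, borrowable compute
    preempted = []
    for victim in sorted(running, key=lambda j: j.priority):
        if free >= new_job.gpus or victim.priority >= new_job.priority:
            break
        preempted.append(victim)
        free += victim.gpus
    if free < new_job.gpus:
        return False, running, []  # cannot admit; nothing is preempted
    still_running = [j for j in running if j not in preempted]
    return True, still_running + [new_job], preempted
```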

Running IDEs on HyperPod

  • HyperPod supports running IDEs (SageMaker Studio, VS Code, Jupyter Lab) on the cluster
  • Provides fast startup latency by pre-caching container images
  • Integrates with task governance to allow prioritizing and partitioning compute resources for IDEs

Key Takeaways

  • SageMaker HyperPod provides a comprehensive platform for training, deploying, and optimizing AI models at scale
  • The HyperPod CLI and SDK abstract away the complexity of Kubernetes, making it easier for data scientists and researchers to leverage the platform
  • HyperPod's advanced features, such as failure recovery, custom monitoring, and task governance, help maximize the efficiency and utilization of compute resources
  • The ability to run IDEs directly on the HyperPod cluster enables a seamless development and deployment experience

Resources

  • All the code and examples demonstrated in the session are available for download via the provided QR code.
