AWS re:Invent 2025 - NVIDIA Run:ai & Amazon SageMaker HyperPod Integration for Distributed Training
NVIDIA Run:ai Overview
NVIDIA Run:ai is a Kubernetes-based GPU orchestration and scheduling platform
Key capabilities:
GPU infrastructure pooling and heterogeneous environment management
Policy-driven governance and resource management
Advanced GPU utilization techniques such as fractional GPUs and dynamic GPU memory
Seamless user experience with on-demand access to compute
Open, API-first architecture to integrate with existing tools and frameworks
Run:ai architecture:
Control plane manages multiple distributed Kubernetes/EKS clusters
Clusters aggregate GPU resources into a large compute pool
Run:ai integrates with the clusters to provide advanced scheduling and orchestration (a minimal submission sketch follows below)
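To make the architecture concrete, here is a minimal sketch of handing a training pod to Run:ai on an EKS cluster using the official Kubernetes Python client. The scheduler name (`runai-scheduler`) and the `project` label are assumptions for illustration; the exact values depend on your Run:ai deployment.

```python
# Minimal sketch: submitting a GPU pod to a Run:ai-managed EKS cluster via the
# official Kubernetes Python client. The scheduler name and project label are
# assumptions for illustration; consult the Run:ai docs for your deployment.
from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context (e.g., an EKS cluster)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="train-job-0",
        labels={"project": "team-a"},  # hypothetical project label for quota accounting
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # assumed scheduler name; hands the pod to Run:ai
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.05-py3",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "8"}  # one full node's worth of GPUs
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```

Because the platform is API-first, an equivalent submission could go through kubectl or any other Kubernetes-native tooling rather than this client.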
GPU Utilization Optimizations
Fractional GPU technologies:
NVIDIA vGPU, MIG, and Run:ai's own CUDA-based fractional GPU sharing
Enables multiple containers/workloads to share a single physical GPU
Improves user density for development and inference workloads (see the sketch below)
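As a sketch of how a fractional request might look, the pod below asks for half of one physical GPU via a pod annotation. The annotation key (`gpu-fraction`) and the scheduler name are assumptions for illustration; verify them against your Run:ai version.

```python
# Illustrative sketch: an interactive notebook pod requesting half of one
# physical GPU. The annotation key ("gpu-fraction") and scheduler name are
# assumptions -- verify them against your Run:ai version.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="notebook-0",
        annotations={"gpu-fraction": "0.5"},  # assumed key: half a GPU
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # assumed scheduler name, as above
        containers=[
            client.V1Container(
                name="jupyter",
                image="jupyter/base-notebook",
                # No nvidia.com/gpu limit here: with fractional sharing the
                # platform injects the GPU share instead of a whole device.
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```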
Dynamic GPU memory:
Allows containers to request a dynamic range of GPU memory
Enables workloads to scale GPU memory usage on-demand
Reduces the need to restart jobs when data or model sizes grow (see the sketch below)
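A dynamic memory request is a range rather than a fixed size. The annotation keys below are hypothetical placeholders, not documented Run:ai keys; the point is the shape of the request, a guaranteed floor plus a burst ceiling.

```python
# Hypothetical annotation keys illustrating a dynamic GPU memory request:
# the workload is guaranteed a floor but may grow toward a ceiling without
# a restart. Substitute the real keys from your Run:ai documentation.
annotations = {
    "gpu-memory": "4G",         # guaranteed floor (hypothetical key)
    "gpu-memory-limit": "16G",  # burst ceiling (hypothetical key)
}
```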
GPU memory swap:
Transparently swaps idle GPU memory to host RAM
Allows suspending and resuming GPU workloads to improve utilization
Can push GPU utilization beyond the 85-90% typically achievable without swapping (a manual analogue is sketched below)
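Run:ai performs the swap transparently at the platform level. As a rough manual analogue of the idea, this PyTorch sketch parks a model's weights in host RAM while it is idle and restores them to the GPU on resume, freeing the device for other workloads in between.

```python
# Manual analogue of GPU memory swap (Run:ai does this transparently):
# park idle weights in host RAM, free the device, restore on resume.
import torch

def suspend_to_host(model: torch.nn.Module) -> None:
    model.to("cpu")           # weights now live in host RAM
    torch.cuda.empty_cache()  # return freed device memory to other workloads

def resume_on_gpu(model: torch.nn.Module) -> None:
    model.to("cuda")          # weights copied back to the GPU

if torch.cuda.is_available():
    model = torch.nn.Linear(4096, 4096).to("cuda")
    suspend_to_host(model)    # idle period: the GPU is free for other jobs
    resume_on_gpu(model)      # resume where the workload left off
```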
Scheduling and Resource Management
The default Kubernetes scheduler is not well suited to batch and distributed GPU workloads
Run:ai implements an HPC-inspired scheduler with features like:
Multiple queues, preemption, and reclamation
Guaranteed GPU quotas for teams and projects
Topology-aware scheduling to optimize network and GPU locality (a toy placement sketch follows this list)
Quotas give developers reliable access to GPUs while letting admins shift capacity between teams
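As a toy illustration (not Run:ai code) of topology-aware placement, the sketch below prefers a set of nodes that share a network block, so a multi-node job minimizes cross-block traffic. The block labels are hypothetical.

```python
# Toy illustration of topology-aware placement: given nodes labeled with
# their network block, prefer a node set that shares a block so a multi-node
# job avoids cross-block traffic. Labels are hypothetical.
from collections import defaultdict

def pick_nodes(nodes: dict[str, str], needed: int) -> list[str]:
    """nodes maps node name -> network block label; returns co-located nodes."""
    by_block = defaultdict(list)
    for name, block in nodes.items():
        by_block[block].append(name)
    # Prefer a block that can host the whole job on its own.
    for block, members in sorted(by_block.items(), key=lambda kv: -len(kv[1])):
        if len(members) >= needed:
            return members[:needed]
    raise RuntimeError("no single network block can host the job")

nodes = {"n1": "block-a", "n2": "block-a", "n3": "block-b", "n4": "block-a"}
print(pick_nodes(nodes, needed=3))  # -> three nodes in block-a
```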
Tight integration with Amazon SageMaker HyperPod:
HyperPod provides automated health checking and hardware replacement
Run:ai schedules workloads to leverage the resilient HyperPod infrastructure
Demonstration Scenarios
Hardware Fault Tolerance:
Run:ai workload continues running despite a simulated GPU failure
HyperPod automatically replaces the faulty node, reintegrating it into the cluster
Run:ai scales down the workload, then scales it back up on the replacement node (see the node-health sketch below)
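The cluster-side view of this demo can be approximated with standard Kubernetes APIs alone: the sketch below polls node conditions and flags GPU nodes that have gone NotReady, i.e., the nodes HyperPod's health checks would detect, replace, and rejoin.

```python
# Sketch: report GPU nodes that have gone NotReady, using only standard
# Kubernetes APIs. These are the nodes HyperPod would replace automatically.
from kubernetes import client, config

config.load_kube_config()

for node in client.CoreV1Api().list_node().items:
    if "nvidia.com/gpu" not in (node.status.capacity or {}):
        continue  # only GPU nodes matter for this demo
    ready = next(
        (c.status for c in (node.status.conditions or []) if c.type == "Ready"),
        "Unknown",
    )
    if ready != "True":
        print(f"GPU node {node.metadata.name} is NotReady; "
              f"HyperPod would replace and rejoin it automatically")
```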
Multi-Tenant Resource Sharing:
Different teams have guaranteed GPU quotas within the cluster
A team can burst beyond their quota when capacity is available
When a new team requests its guaranteed resources, Run:ai preempts over-quota, lower-priority workloads (simulated in the sketch below)
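The following toy simulation (not Run:ai's actual scheduler) walks through exactly this scenario: team-a bursts past its guaranteed quota while GPUs are idle, then team-b's submission reclaims capacity by preempting the over-quota workload.

```python
# Toy simulation of the multi-tenant demo: burst beyond quota while GPUs are
# idle, then preempt over-quota workloads when another team claims its share.
TOTAL_GPUS = 16
QUOTA = {"team-a": 8, "team-b": 8}
running: list[tuple[str, int]] = []  # (team, gpus) per workload

def used(team: str) -> int:
    return sum(g for t, g in running if t == team)

def free() -> int:
    return TOTAL_GPUS - sum(g for _, g in running)

def submit(team: str, gpus: int) -> None:
    # Preempt over-quota workloads (most recent first), but only on behalf of
    # a request that fits within the submitter's guaranteed quota.
    while free() < gpus and used(team) + gpus <= QUOTA[team]:
        victim = next(
            (w for w in reversed(running) if used(w[0]) > QUOTA[w[0]]), None
        )
        if victim is None:
            break
        running.remove(victim)
        print(f"preempted over-quota workload of {victim[0]} ({victim[1]} GPUs)")
    if free() >= gpus:
        running.append((team, gpus))
        print(f"scheduled {team}: {gpus} GPUs (now using {used(team)})")
    else:
        print(f"queued {team}: {gpus} GPUs")

submit("team-a", 8)  # within quota
submit("team-a", 8)  # burst: the cluster is idle, so team-a takes all 16 GPUs
submit("team-b", 8)  # team-b claims its quota; team-a's burst is preempted
```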
Additional Capabilities
KAI Scheduler (Kubernetes AI Scheduler) - open-sourced scheduling engine
Model Streamer - Optimizes cold starts for large language models
Dynamo - Advanced model serving and inference platform integration
Business Impact
Increased GPU utilization and return on infrastructure investment
Faster time-to-market for AI/ML projects through reliable access to resources
Centralized visibility and control over GPU consumption and allocation
Resilient, fault-tolerant GPU clusters with automated hardware management
Real-World Examples
Customers leveraging fractional GPUs to improve developer density and inference efficiency
Enterprises using GPU memory swap to maximize utilization of large GPU servers
Teams dynamically shifting GPU quotas based on changing business priorities