AWS re:Invent 2025 - NVIDIA Run:ai & Amazon SageMaker HyperPod Integration for Distributed Training
NVIDIA Run:ai Overview
NVIDIA Run:ai is a Kubernetes-based GPU orchestration and scheduling platform
Key capabilities:
GPU infrastructure pooling and heterogeneous environment management
Policy-driven governance and resource management
Advanced GPU utilization techniques such as fractional GPUs and dynamic GPU memory
Seamless user experience with on-demand access to compute
Open, API-first architecture to integrate with existing tools and frameworks
Run:ai architecture:
Control plane manages multiple distributed Kubernetes/EKS clusters
Clusters aggregate GPU resources into a large compute pool
Run:ai integrates with the clusters to provide advanced scheduling and orchestration (a minimal submission sketch follows below)
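To make the architecture concrete, here is a minimal sketch of handing a training pod to Run:ai on an EKS cluster using the official Kubernetes Python client. The scheduler name (`runai-scheduler`) and the `project` label are assumptions for illustration; the exact values depend on your Run:ai deployment.

```python
# Minimal sketch: submitting a GPU pod to a Run:ai-managed EKS cluster via the
# official Kubernetes Python client. The scheduler name and project label are
# assumptions for illustration; consult the Run:ai docs for your deployment.
from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context (e.g., an EKS cluster)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="train-job-0",
        labels={"project": "team-a"},  # hypothetical project label for quota accounting
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # assumed scheduler name; hands the pod to Run:ai
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.05-py3",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "8"}  # one full node's worth of GPUs
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```

Because the platform is API-first, an equivalent submission could go through kubectl or any other Kubernetes-native tooling rather than this client.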
GPU Utilization Optimizations
Fractional GPU technologies:
NVIDIA vGPU, MIG, and Run:ai's own CUDA-based fractional GPU sharing
Enables multiple containers/workloads to share a single physical GPU
Improves user density for development and inference workloads (see the sketch below)
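As a sketch of how a fractional request might look, the pod below asks for half of one physical GPU via a pod annotation. The annotation key (`gpu-fraction`) and the scheduler name are assumptions for illustration; verify them against your Run:ai version.

```python
# Illustrative sketch: an interactive notebook pod requesting half of one
# physical GPU. The annotation key ("gpu-fraction") and scheduler name are
# assumptions -- verify them against your Run:ai version.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="notebook-0",
        annotations={"gpu-fraction": "0.5"},  # assumed key: half a GPU
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # assumed scheduler name, as above
        containers=[
            client.V1Container(
                name="jupyter",
                image="jupyter/base-notebook",
                # No nvidia.com/gpu limit here: with fractional sharing the
                # platform injects the GPU share instead of a whole device.
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```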
Dynamic GPU memory:
Allows containers to request a dynamic range of GPU memory
Enables workloads to scale GPU memory usage on-demand
Reduces the need to restart jobs when data or model sizes grow (see the sketch below)
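A dynamic memory request is a range rather than a fixed size. The annotation keys below are hypothetical placeholders, not documented Run:ai keys; the point is the shape of the request, a guaranteed floor plus a burst ceiling.

```python
# Hypothetical annotation keys illustrating a dynamic GPU memory request:
# the workload is guaranteed a floor but may grow toward a ceiling without
# a restart. Substitute the real keys from your Run:ai documentation.
annotations = {
    "gpu-memory": "4G",         # guaranteed floor (hypothetical key)
    "gpu-memory-limit": "16G",  # burst ceiling (hypothetical key)
}
```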
GPU memory swap:
Transparently swaps idle GPU memory to host RAM
Allows suspending and resuming GPU workloads to improve utilization
Can push GPU utilization beyond the 85-90% typically achievable without swapping (a manual analogue is sketched below)
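Run:ai performs the swap transparently at the platform level. As a rough manual analogue of the idea, this PyTorch sketch parks a model's weights in host RAM while it is idle and restores them to the GPU on resume, freeing the device for other workloads in between.

```python
# Manual analogue of GPU memory swap (Run:ai does this transparently):
# park idle weights in host RAM, free the device, restore on resume.
import torch

def suspend_to_host(model: torch.nn.Module) -> None:
    model.to("cpu")           # weights now live in host RAM
    torch.cuda.empty_cache()  # return freed device memory to other workloads

def resume_on_gpu(model: torch.nn.Module) -> None:
    model.to("cuda")          # weights copied back to the GPU

if torch.cuda.is_available():
    model = torch.nn.Linear(4096, 4096).to("cuda")
    suspend_to_host(model)    # idle period: the GPU is free for other jobs
    resume_on_gpu(model)      # resume where the workload left off
```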
Scheduling and Resource Management
The default Kubernetes scheduler is not well suited to batch and distributed GPU workloads
Run:ai implements an HPC-inspired scheduler with features like:
Multiple queues, preemption, and reclamation
Guaranteed GPU quotas for teams and projects
Topology-aware scheduling to optimize network and GPU locality (a toy placement sketch follows this list)
Quotas give developers reliable access to GPUs while letting admins shift capacity between teams
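As a toy illustration (not Run:ai code) of topology-aware placement, the sketch below prefers a set of nodes that share a network block, so a multi-node job minimizes cross-block traffic. The block labels are hypothetical.

```python
# Toy illustration of topology-aware placement: given nodes labeled with
# their network block, prefer a node set that shares a block so a multi-node
# job avoids cross-block traffic. Labels are hypothetical.
from collections import defaultdict

def pick_nodes(nodes: dict[str, str], needed: int) -> list[str]:
    """nodes maps node name -> network block label; returns co-located nodes."""
    by_block = defaultdict(list)
    for name, block in nodes.items():
        by_block[block].append(name)
    # Prefer a block that can host the whole job on its own.
    for block, members in sorted(by_block.items(), key=lambda kv: -len(kv[1])):
        if len(members) >= needed:
            return members[:needed]
    raise RuntimeError("no single network block can host the job")

nodes = {"n1": "block-a", "n2": "block-a", "n3": "block-b", "n4": "block-a"}
print(pick_nodes(nodes, needed=3))  # -> three nodes in block-a
```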
Tight integration with Amazon SageMaker HyperPod:
HyperPod provides automated health checking and hardware replacement
Run:ai schedules workloads to leverage the resilient HyperPod infrastructure
Demonstration Scenarios
Hardware Fault Tolerance:
Run:ai workload continues running despite a simulated GPU failure
HyperPod automatically replaces the faulty node, reintegrating it into the cluster
Run:ai scales down the workload, then scales it back up on the replacement node (see the node-health sketch below)
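The cluster-side view of this demo can be approximated with standard Kubernetes APIs alone: the sketch below polls node conditions and flags GPU nodes that have gone NotReady, i.e., the nodes HyperPod's health checks would detect, replace, and rejoin.

```python
# Sketch: report GPU nodes that have gone NotReady, using only standard
# Kubernetes APIs. These are the nodes HyperPod would replace automatically.
from kubernetes import client, config

config.load_kube_config()

for node in client.CoreV1Api().list_node().items:
    if "nvidia.com/gpu" not in (node.status.capacity or {}):
        continue  # only GPU nodes matter for this demo
    ready = next(
        (c.status for c in (node.status.conditions or []) if c.type == "Ready"),
        "Unknown",
    )
    if ready != "True":
        print(f"GPU node {node.metadata.name} is NotReady; "
              f"HyperPod would replace and rejoin it automatically")
```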
Multi-Tenant Resource Sharing:
Different teams have guaranteed GPU quotas within the cluster
A team can burst beyond their quota when capacity is available
When a new team requests its guaranteed resources, Run:ai preempts over-quota, lower-priority workloads (simulated in the sketch below)
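The following toy simulation (not Run:ai's actual scheduler) walks through exactly this scenario: team-a bursts past its guaranteed quota while GPUs are idle, then team-b's submission reclaims capacity by preempting the over-quota workload.

```python
# Toy simulation of the multi-tenant demo: burst beyond quota while GPUs are
# idle, then preempt over-quota workloads when another team claims its share.
TOTAL_GPUS = 16
QUOTA = {"team-a": 8, "team-b": 8}
running: list[tuple[str, int]] = []  # (team, gpus) per workload

def used(team: str) -> int:
    return sum(g for t, g in running if t == team)

def free() -> int:
    return TOTAL_GPUS - sum(g for _, g in running)

def submit(team: str, gpus: int) -> None:
    # Preempt over-quota workloads (most recent first), but only on behalf of
    # a request that fits within the submitter's guaranteed quota.
    while free() < gpus and used(team) + gpus <= QUOTA[team]:
        victim = next(
            (w for w in reversed(running) if used(w[0]) > QUOTA[w[0]]), None
        )
        if victim is None:
            break
        running.remove(victim)
        print(f"preempted over-quota workload of {victim[0]} ({victim[1]} GPUs)")
    if free() >= gpus:
        running.append((team, gpus))
        print(f"scheduled {team}: {gpus} GPUs (now using {used(team)})")
    else:
        print(f"queued {team}: {gpus} GPUs")

submit("team-a", 8)  # within quota
submit("team-a", 8)  # burst: the cluster is idle, so team-a takes all 16 GPUs
submit("team-b", 8)  # team-b claims its quota; team-a's burst is preempted
```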
Additional Capabilities
KAI Scheduler (Kubernetes AI Scheduler) - open-sourced scheduling engine
Model Streamer - Optimizes cold starts for large language models
Dynamo - Advanced model serving and inference platform integration
Business Impact
Increased GPU utilization and return on infrastructure investment
Faster time-to-market for AI/ML projects through reliable access to resources
Centralized visibility and control over GPU consumption and allocation
Resilient, fault-tolerant GPU clusters with automated hardware management
Real-World Examples
Customers leveraging fractional GPUs to improve developer density and inference efficiency
Enterprises using GPU memory swap to maximize utilization of large GPU servers
Teams dynamically shifting GPU quotas based on changing business priorities