Scaling gen AI: Tackling operational challenges with Kubernetes (KUB202)
Scaling AI Workloads on Kubernetes: Challenges and Solutions
Infrastructure and AI
GPU scarcity has a direct business impact: without available GPU resources, AI applications cannot run or scale.
Decisions about running workloads on-premises or in the cloud must weigh factors such as cloud egress charges and data storage costs.
AI Inference
AI inference happens after the model has been trained: new data is fed to the trained model to produce predictions, keeping the model relevant in production.
Inference workloads require efficient GPU and storage solutions to ensure quick response times.
The Role of Kubernetes in AI Inference
Kubernetes provides scalability, flexibility, and resource utilization optimization for AI workloads.
Key Kubernetes components for operationalizing AI workload scaling:
Horizontal Pod Autoscaler (HPA)
Vertical Pod Autoscaler (VPA)
Cluster Autoscaler
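As a concrete illustration of the first of these components, the sketch below shows a minimal HorizontalPodAutoscaler for a hypothetical inference Deployment (the names "inference-server" and the replica bounds are illustrative assumptions, not from the talk):

```yaml
# Sketch: an HPA scaling a hypothetical "inference-server" Deployment
# between 1 and 10 replicas, targeting 70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

The VPA adjusts per-pod requests instead of replica counts, and the Cluster Autoscaler adds or removes nodes when pods cannot be scheduled; the three operate at different layers and are commonly combined.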
Challenges of Running AI Inference on Kubernetes
Resource management challenges with GPUs: Kubernetes schedules GPUs only as whole devices and has no native concept of fractional GPU usage.
Scalability issues due to GPU resource scarcity and cost concerns.
Operational complexity in managing horizontal scaling, vertical scaling, and infrastructure scaling.
Storage challenges, including managing large training datasets, high-performance requirements, and data availability/protection.
Addressing the Challenges
Leveraging resource requests and limits to right-size containers.
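A minimal sketch of requests and limits for an inference container follows (the image name and sizing values are assumptions for illustration). Note that Kubernetes accepts `nvidia.com/gpu` only in whole units, with request and limit equal, which is exactly the constraint that motivates the fractioning techniques below:

```yaml
# Sketch: right-sizing an inference container with requests/limits.
# GPU request and limit must be equal whole integers.
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: model-server
    image: example.com/model-server:latest   # hypothetical image
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: 1
```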
Employing custom Kubernetes schedulers designed for AI and ML workloads.
Utilizing GPU fractioning techniques:
Time-sharing GPU fractioning
Multi-Instance GPU (MIG) fractioning
Multi-Process Service GPU (MPS) fractioning
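For the time-sharing approach, the NVIDIA Kubernetes device plugin supports a time-slicing configuration that advertises each physical GPU as multiple schedulable replicas; a minimal sketch (the replica count of 4 is an illustrative assumption):

```yaml
# Sketch: time-slicing config for the NVIDIA Kubernetes device plugin,
# advertising each physical GPU as 4 schedulable "replicas".
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```

Time-slicing provides no memory isolation between sharing pods, whereas MIG partitions the GPU in hardware with isolated memory and compute slices, and MPS multiplexes CUDA processes on the GPU concurrently; the right choice depends on the isolation and utilization requirements of the workload.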
Addressing the challenges of autoscaling with GPU fractioning:
Monitoring and understanding fractional GPU usage
Cluster Autoscaler awareness of GPU fractions
Resource visibility for fractional GPU requests
NetApp and AWS Solutions
NetApp's goal is to provide a unified experience for running AI workloads across on-premises, hybrid, and cloud environments.
NetApp Spot Ocean is a serverless compute engine that automates infrastructure optimization, including GPU fractioning, cost visibility, and automated right-sizing.
Spot Ocean leverages extended Kubernetes resources to manage GPU fractions and provides a seamless experience for DevOps teams.
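The extended-resources mechanism mentioned above lets a cluster component advertise a custom countable resource that pods can then request. A minimal sketch of what a fractional-GPU request could look like (the resource name `example.com/gpu-fraction` is purely hypothetical, not Spot Ocean's actual API):

```yaml
# Sketch: a pod requesting a hypothetical fractional-GPU extended
# resource; extended resources are counted in whole integers.
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-pod
spec:
  containers:
  - name: model-server
    image: example.com/model-server:latest   # hypothetical image
    resources:
      limits:
        example.com/gpu-fraction: 1   # one fraction of a shared GPU
```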
Future Outlook
Making the GPU fractioning experience fully seamless for DevOps teams.
Ensuring scarce GPU resources remain obtainable at the best cost and highest availability.
Abstracting away the operational complexity of scaling AI workloads on Kubernetes.