Scaling gen AI: Tackling operational challenges with Kubernetes (KUB202)

Scaling AI Workloads on Kubernetes: Challenges and Solutions

Infrastructure and AI

  • A lack of available GPU resources to power AI applications has a significant business impact.
  • Decisions about running workloads on-premises or in the cloud must weigh factors such as cloud egress charges and data storage costs.

AI Inference

  • AI inference happens after a model has been trained: new data is run through the trained model to produce predictions, keeping its outputs relevant.
  • Inference workloads require efficient GPU and storage resources to deliver quick response times.

The Role of Kubernetes in AI Inference

  • Kubernetes provides scalability, flexibility, and optimized resource utilization for AI workloads.
  • Key Kubernetes components for operationalizing AI workload scaling (a minimal HPA sketch follows this list):
    • Horizontal Pod Autoscaler (HPA)
    • Vertical Pod Autoscaler (VPA)
    • Cluster Autoscaler
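
A minimal sketch of the first of these, assuming a Deployment named inference-server and a working metrics pipeline (the names and thresholds here are illustrative, not from the session):

```yaml
# Minimal HPA sketch: scales the illustrative "inference-server"
# Deployment between 2 and 10 replicas based on the average CPU
# utilization reported by the metrics pipeline.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```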

Challenges of Running AI Inference on Kubernetes

  • Resource management challenges with GPUs: Kubernetes doesn't natively understand fractional GPU usage, so devices are requested in whole units (see the pod sketch after this list).
  • Scalability issues due to GPU resource scarcity and cost concerns.
  • Operational complexity in managing horizontal scaling, vertical scaling, and infrastructure scaling.
  • Storage challenges, including managing large training datasets, high-performance requirements, and data availability/protection.
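
To make the GPU point concrete: with the NVIDIA device plugin, pods claim devices through the nvidia.com/gpu resource, which only accepts whole integers, so even a small inference container occupies an entire GPU. A minimal sketch (pod name and image are illustrative):

```yaml
# Sketch of a pod claiming a GPU via the NVIDIA device plugin.
# nvidia.com/gpu is counted in whole devices, so this container
# reserves an entire GPU even if it needs only a fraction of one.
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # illustrative
      resources:
        limits:
          nvidia.com/gpu: 1  # whole device; 0.5 would be rejected
```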

Addressing the Challenges

  • Leveraging resource requests and limits to right-size containers.
  • Employing custom Kubernetes schedulers designed for AI and ML workloads (the second sketch after this list shows where a custom scheduler plugs in).
  • Utilizing GPU fractioning techniques (time-slicing and MIG are sketched after this list):
    • Time-slicing (time-sharing) GPU fractioning
    • Multi-Instance GPU (MIG) fractioning
    • Multi-Process Service (MPS) GPU fractioning
  • Addressing the challenges of autoscaling with GPU fractioning:
    • Monitoring and understanding fractional GPU usage
    • Cluster Autoscaler awareness of GPU fractions
    • Resource visibility for fractional GPU requests
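
As an example of the first technique, the NVIDIA k8s-device-plugin can be configured for time-slicing through its sharing stanza; the sketch below (the replica count is illustrative) advertises each physical GPU as four schedulable nvidia.com/gpu resources:

```yaml
# Sketch of an NVIDIA k8s-device-plugin config enabling time-slicing:
# each physical GPU is advertised as 4 nvidia.com/gpu resources, so up
# to 4 pods can share one device by interleaving their work in time.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # illustrative; tune to the workload
```

Note that time-slicing provides no memory isolation between the sharing pods, which is one reason MIG is preferred when hard isolation matters.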
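With MIG, on supported hardware such as the A100, fractions are exposed as distinct resource names. The sketch below requests one MIG slice and also shows where a custom AI-aware scheduler would plug in; the MIG profile depends on the hardware configuration, and the scheduler name is hypothetical:

```yaml
# Sketch of a pod requesting a MIG slice (mixed-strategy resource
# naming) and opting into a custom scheduler. The MIG profile and
# the scheduler name are both illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference-pod
spec:
  schedulerName: gpu-aware-scheduler  # hypothetical custom scheduler
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # illustrative
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # one 1g.5gb slice of an A100
```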

NetApp and AWS Solutions

  • NetApp's goal is to provide a unified experience for running AI workloads across on-premises, hybrid, and cloud environments.
  • NetApp Spot Ocean is a serverless compute engine that automates infrastructure optimization, including GPU fractioning, cost visibility, and automated right-sizing.
  • Spot Ocean leverages extended Kubernetes resources to manage GPU fractions (the pattern is illustrated below) and provides a seamless experience for DevOps teams.
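
Extended resources let a cluster advertise arbitrary named quantities that pods can request like built-in resources. As a hedged illustration of that pattern only (the resource name below is invented, not Spot Ocean's actual API), a fractional-GPU request could look like:

```yaml
# Hypothetical illustration of the extended-resource pattern: a pod
# requests a named GPU-fraction resource that a platform component
# advertises on nodes. "example.com/gpu-fraction" is invented here.
apiVersion: v1
kind: Pod
metadata:
  name: fractional-inference-pod
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # illustrative
      resources:
        limits:
          example.com/gpu-fraction: "5"  # e.g. 5 tenths of a GPU under this invented scheme
```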

Future Outlook

  • Continuing to make GPU fractioning, and the broader complexity of AI workload scaling on Kubernetes, seamless for DevOps teams.
  • Ensuring scarce GPU resources remain available at the best possible cost and highest availability.
