AWS re:Invent 2025 - Generative and Agentic AI on Amazon EKS (CNS344)

Leveraging Kubernetes for Generative and Agentic AI

The AI Technology Revolution

  • We are in the midst of a rapid AI technology revolution that is transforming the world
  • This revolution is moving much faster than previous technology shifts, with new foundation models emerging roughly every two days
  • The pace of change is overwhelming, causing anxiety for many businesses and individuals

Running AI on Kubernetes

  • Customers are increasingly choosing Kubernetes (Amazon EKS) to run both AI agents and inference/fine-tuning workloads
  • Key reasons include:
    1. Control and optimization of underlying infrastructure
    2. Portability across clouds and on-premises
    3. Single platform for all workloads (business apps, AI agents, inference, fine-tuning)

Deploying AI Agents on EKS

  • AI agents can solve problems that require reasoning, something traditional deterministic software cannot do
  • Agentic frameworks like Strands Agents provide a Python-based approach to building AI agents
  • Agents interact with large language models (LLMs) and external APIs/data sources via "tools"
  • Agents can be containerized and deployed to EKS like any other service
  • Key capabilities include authentication, observability (logs, metrics, traces), and memory management
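The agent-tool loop described above (model call, tool invocation, model call again) can be sketched in plain Python. This is a stubbed illustration of the pattern, not the Strands API: `FakeModel`, `get_weather`, and the JSON tool-call convention are hypothetical stand-ins for a real LLM and real tools.

```python
import json

# Hypothetical tool: in a real agent this would call an external API or data source.
def get_weather(city: str) -> str:
    """Return a canned weather report for the given city."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

class FakeModel:
    """Stand-in for an LLM: first asks for a tool call, then answers."""
    def __init__(self):
        self.turn = 0

    def chat(self, messages):
        self.turn += 1
        if self.turn == 1:
            # Model decides to call a tool (JSON tool-call convention, assumed here).
            return json.dumps({"tool": "get_weather", "args": {"city": "Las Vegas"}})
        # Model produces the final answer from the last tool result.
        return f"Answer: {messages[-1]['content']}"

def run_agent(model, prompt: str) -> str:
    """Minimal agent loop: call the model, execute any requested tool, repeat."""
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = model.chat(messages)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text means the model produced its final answer
        result = TOOLS[call["tool"]](**call["args"])
        messages.append({"role": "tool", "content": result})
```

Because the loop is just ordinary application code, the agent can be containerized and deployed to EKS like any other service, with logging, metrics, and tracing wrapped around `run_agent`.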

Running Inference and Fine-Tuning on EKS

  • Inference and fine-tuning workloads have different patterns than agents:
    • Inference is variable and bursty, while fine-tuning requires steady capacity
    • Both run on GPUs, requiring careful selection and optimization
  • Selecting the right GPU instance:
    • Determine model size and memory requirements
    • Use quantization techniques to reduce GPU memory needs
    • Choose from AWS GPU instance families (G5, G6) based on requirements
  • Provisioning GPUs in EKS:
    • Use EKS Auto Mode or the open-source Cluster Autoscaler
    • Leverage EKS AMIs with pre-configured GPU drivers and software
    • Optimize container pull and model loading times
  • Observing and managing GPU health:
    • Monitor GPU utilization, temperature, power
    • Use node health monitoring and auto-repair features
  • Scaling inference workloads:
    • Use custom metrics and the Horizontal Pod Autoscaler (HPA)
    • Leverage inference frameworks such as AIBrix, Ray, and Deta
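The sizing guidance above reduces to a back-of-envelope calculation: weights need roughly (parameters × bytes per parameter), and quantization shrinks the bytes per parameter. A rough sketch of that arithmetic follows; the 20% overhead factor for KV cache and activations is an assumption for illustration, not a figure from the talk.

```python
# Approximate bytes per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_gpu_memory_gb(params_billions: float, dtype: str = "fp16",
                           overhead: float = 0.2) -> float:
    """Rough serving-memory estimate in GB: model weights plus an assumed
    ~20% overhead for KV cache, activations, and framework buffers."""
    weights_gb = params_billions * BYTES_PER_PARAM[dtype]
    return weights_gb * (1.0 + overhead)

def fits_on_gpu(params_billions: float, dtype: str, gpu_memory_gb: float) -> bool:
    """Check whether the estimate fits in a single GPU's memory."""
    return estimate_gpu_memory_gb(params_billions, dtype) <= gpu_memory_gb

# A 70B-parameter model needs ~168 GB at FP16, but ~42 GB after INT4
# quantization, small enough for a single 48 GB GPU.
```

This kind of estimate is what drives the choice between GPU instance families: a quantized model that fits on one GPU can run on a smaller G-family instance instead of a multi-GPU node.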
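The HPA-based scaling mentioned above follows Kubernetes' documented rule: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the configured replica bounds. A small sketch (the function name and default bounds are illustrative):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_value: float,
                         target_value: float,
                         min_replicas: int = 1,
                         max_replicas: int = 20) -> int:
    """Core HPA scaling rule: ceil(current * currentValue / targetValue),
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

# e.g., 4 replicas averaging 150 in-flight requests against a target of 100
# scale out to ceil(4 * 1.5) = 6 replicas
```

For inference, the custom metric fed into this formula is typically something GPU-aware, such as in-flight requests or queue depth, rather than CPU utilization.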

Recent EKS Innovations for AI Workloads

  • New GPU instance types: GB200- and GB300-based instances, P6-B200, P6-B300, P54X
  • Dynamic Resource Allocation (DRA) for flexible GPU allocation
  • Fast container pulls with SOCI (Seekable OCI) for improved model loading
  • Enhancements to the Cluster Autoscaler and EKS Auto Mode
  • EKS Provisioned Control Plane for high-scale AI workloads
  • ALB Target Optimizer for efficient inference load balancing
  • Hosted EKS MCP Server for AI agent observability and control

The Future of AI on EKS

  • Continued focus on providing a reliable, optimized foundation for AI workloads
  • Extending existing tools and automation to support GPU-based AI
  • Introducing higher-level AI-specific capabilities out-of-the-box
  • Leveraging AI to make EKS itself more intelligent and automated

Getting Started

  • Attend AWS workshops on inference and agentic AI on EKS
  • Use the AI/ML EKS user guide and Terraform blueprints
  • Check out related talks at re:Invent for more details
