AWS re:Invent 2025 - Streamline Amazon EKS operations with Agentic AI (CNS421)

Streamlining Amazon EKS Operations with Agentic AI

Introduction to Agentic AI

  • The presenters, Sai Vennam and Lucas Duarte, are Principal Solutions Architects at AWS.
  • They showcase a new "kube agent": an AI-powered troubleshooting agent that automatically identifies and remediates issues in Amazon EKS (Kubernetes) clusters.
  • The agent is designed as a "teammate" rather than a simple chatbot, with the goal of reducing mean time to remediation.

Evolution from Retrieval Augmented Generation (RAG) to Agentic AI

  • Last year, the presenters demonstrated a RAG-based approach to troubleshooting, which involved:
    • Chunking log data into a vector store
    • Using the vector store to retrieve relevant context when a user asked a question
    • Passing the context and original question to a large language model to generate a response
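  • However, this approach had limitations:
    • The model's context window limited how much log data could be included in a request
    • The model's knowledge was bounded by its training data
    • The log data was ingested periodically rather than in real time, so answers could lag behind the live cluster state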
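
The RAG flow described above (chunk, embed, retrieve, prompt) can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the word-count `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and the log lines and function names are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption: a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk(lines: list[str], size: int = 2) -> list[str]:
    """Group log lines into fixed-size chunks."""
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def retrieve(question: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine retrieved context and the original question into one prompt for the model."""
    joined = "\n---\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


# Hypothetical log lines, invented for this example.
logs = [
    "pod checkout-7d9 OOMKilled in namespace shop",
    "container restarted after memory limit exceeded",
    "node ip-10-0-1-12 became Ready",
    "kubelet started on node ip-10-0-1-12",
]
store = [(c, embed(c)) for c in chunk(logs)]  # the "vector store"
question = "Why was the checkout pod OOMKilled?"
prompt = build_prompt(question, retrieve(question, store))
print(prompt)
```

The sketch also makes the limitations visible: only `k` chunks fit into the prompt, and `store` is a snapshot that goes stale until the logs are re-ingested.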

Introducing Strands and Microagents

  • To address these limitations, the presenters are using the open-source Strands SDK to build a more sophisticated agentic architecture.
  • Key features:
    • Ability to use multiple specialized agents, each with the right language model for the task
    • Agents can communicate with each other and with external systems like AWS resources and MCP servers
    • Customizable prompts and communication patterns between agents
  • The architecture consists of:
    1. An Orchestrator Agent that receives messages from Slack and routes them to the appropriate specialist agent
    2. A Kubernetes Specialist Agent that can directly interact with the EKS cluster and MCP servers
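
The orchestrator and specialist split can be illustrated with plain Python. This sketch deliberately does not use the Strands SDK API; the class names, the keyword-based routing, and `fake_kubernetes_specialist` are all hypothetical stand-ins meant only to show the routing pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SpecialistAgent:
    """Stand-in for a model-backed agent; `handle` would wrap an LLM call with tools."""
    name: str
    handle: Callable[[str], str]


@dataclass
class Orchestrator:
    """Receives chat messages and forwards each one to a matching specialist."""
    keywords: dict[str, str] = field(default_factory=dict)  # keyword -> specialist name
    specialists: dict[str, SpecialistAgent] = field(default_factory=dict)

    def register(self, agent: SpecialistAgent, keywords: list[str]) -> None:
        self.specialists[agent.name] = agent
        for word in keywords:
            self.keywords[word.lower()] = agent.name

    def route(self, message: str) -> Optional[SpecialistAgent]:
        """Pick the first specialist whose keyword appears in the message."""
        lowered = message.lower()
        for word, name in self.keywords.items():
            if word in lowered:
                return self.specialists[name]
        return None

    def on_message(self, message: str) -> Optional[str]:
        agent = self.route(message)
        if agent is None:
            return None  # unrelated chatter is ignored
        return f"[{agent.name}] {agent.handle(message)}"


def fake_kubernetes_specialist(message: str) -> str:
    # Hypothetical placeholder: a real agent would query the cluster here.
    return f"investigating: {message}"


orchestrator = Orchestrator()
orchestrator.register(
    SpecialistAgent("kubernetes-specialist", fake_kubernetes_specialist),
    keywords=["pod", "node", "deployment"],
)
reply = orchestrator.on_message("The checkout pod keeps restarting")
print(reply)
```

Because the orchestrator only knows specialists by name, another specialist can be registered later without changing the routing code.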

Integrating MCP for Live Data Access

  • To give the agent real-time data, the presenters use the new AWS-hosted EKS MCP server.
    • This lets the Kubernetes Specialist Agent query the cluster directly for live data, with no need to set up and manage an MCP server locally.
    • Pointing the agent at the hosted MCP server takes only a few lines of code.
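
For readers unfamiliar with MCP, the sketch below shows the shape of a tool invocation on the wire. MCP messages are JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The tool name `list_pods` and its arguments are hypothetical and are not taken from the EKS MCP server's actual tool list; transport and authentication are omitted.

```python
import itertools
import json

_request_ids = itertools.count(1)  # JSON-RPC requests carry a unique id


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call("list_pods", {"namespace": "demo-app"})
payload = json.dumps(request)
print(payload)
```

In practice an agent framework's MCP client builds and sends these messages, so the agent code only needs to know where the server lives.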

Enhancing Message Classification with Nova Micro

  • The presenters identify a limitation in their initial approach, where the agent would only respond to messages containing specific keywords.
  • To improve this, they add a step in which the Orchestrator Agent uses the lightweight Amazon Nova Micro model to classify the intent of each incoming message.
    • If the message is determined to be related to troubleshooting, it is passed to the Kubernetes Specialist Agent.
    • Otherwise, the message is ignored, improving efficiency and reducing unnecessary processing.
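
The classify-then-route step can be sketched as a small gate in front of the specialist. This is an assumption-laden illustration: `CLASSIFY_PROMPT`, the one-word label scheme, and `fake_classifier` are hypothetical. In a real setup the `classify` callable would send the prompt to a small model such as Nova Micro; here it is a local stub so the example runs on its own.

```python
from typing import Callable, Optional

# Hypothetical prompt; the wording used in the session is not known.
CLASSIFY_PROMPT = (
    "Classify the chat message below. Reply with exactly one word: "
    "TROUBLESHOOTING or OTHER.\n"
    "Message: {message}"
)


def is_troubleshooting(raw_reply: str) -> bool:
    """Parse the model's one-word reply, tolerating case and surrounding whitespace."""
    return raw_reply.strip().upper().startswith("TROUBLESHOOTING")


def handle_message(
    message: str,
    classify: Callable[[str], str],
    specialist: Callable[[str], str],
) -> Optional[str]:
    """Forward the message to the specialist only if the classifier flags it."""
    reply = classify(CLASSIFY_PROMPT.format(message=message))
    if not is_troubleshooting(reply):
        return None  # ignored: no call to the larger model is made
    return specialist(message)


def fake_classifier(prompt: str) -> str:
    # Stub standing in for a call to a lightweight model.
    text = prompt.rsplit("Message:", 1)[-1].lower()
    symptoms = ("crash", "failing", "not starting", "error")
    return "TROUBLESHOOTING" if any(s in text for s in symptoms) else "OTHER"


def specialist(message: str) -> str:
    return f"looking into: {message}"


handled = handle_message("My pods are failing health checks", fake_classifier, specialist)
ignored = handle_message("Lunch at noon?", fake_classifier, specialist)
print(handled, ignored)
```

Keeping the classifier behind a plain callable means the model can be swapped without touching the gate logic.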

Capturing Tribal Knowledge with a Memory Agent

  • The presenters recognize the importance of capturing "tribal knowledge" - the insights and best practices shared by engineers in Slack conversations.
  • To address this, they introduce a Memory Agent:
    • The Memory Agent is a standalone agent that can store and retrieve solutions and tips shared by engineers.
    • It uses Amazon S3 Vectors to store and search the embedded text.
    • The Orchestrator Agent can interact with the Memory Agent to store new solutions and retrieve relevant information when needed.
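
The store-and-recall behaviour of a memory agent can be sketched as follows. This is only an illustration of the idea: an in-memory list stands in for S3 Vectors, the word-count `embed` function stands in for a real embedding model, and `MemoryStore`, `remember`, `recall`, and the sample tips are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption: a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class MemoryStore:
    """In-memory stand-in for a vector index of tips shared in chat."""
    entries: list = field(default_factory=list)

    def remember(self, text: str, author: str) -> None:
        """Embed a tip and keep it together with who shared it."""
        self.entries.append({"text": text, "author": author, "vector": embed(text)})

    def recall(self, query: str, min_score: float = 0.2, k: int = 3) -> list[str]:
        """Return up to k stored tips whose similarity to the query passes the threshold."""
        q = embed(query)
        scored = [(cosine(q, e["vector"]), e["text"]) for e in self.entries]
        scored = [s for s in scored if s[0] >= min_score]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [text for _, text in scored[:k]]


memory = MemoryStore()
# Hypothetical tips, invented for this example.
memory.remember("Use the pinned node exporter image from the internal registry", author="alice")
memory.remember("Always set CPU and memory requests and limits on deployments", author="bob")

hits = memory.recall("Which node exporter image should we use?")
print(hits)
```

The `min_score` threshold matters here: without it, every query would return some stored tip, even an irrelevant one.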

Demonstration and Key Capabilities

  • The presenters walk through a live demonstration, showcasing the agent's ability to:
    • Automatically detect and troubleshoot a monitoring agent issue in an EKS cluster
    • Retrieve a previously stored solution for a node exporter image recommendation
    • Automatically store a new tip about CPU and memory resource definitions
    • Detect and remediate multiple issues in a demo application namespace

Business Impact and Real-World Applications

  • The agentic AI approach demonstrated by the presenters can have significant business impact by:
    • Reducing mean-time-to-remediation for Kubernetes issues
    • Empowering network and operations engineers with intelligent troubleshooting tools
    • Capturing and leveraging tribal knowledge to improve efficiency and consistency
  • The modular, microagent-based architecture allows for scalable and adaptable solutions, where agents can be added, removed, or scaled independently based on the specific needs of the organization.

Key Takeaways

  1. Agentic AI, powered by the Strands SDK, enables the creation of intelligent, specialized agents that can work together to solve complex problems.
  2. Integrating with hosted MCP servers provides easy access to live cluster data, improving the agent's ability to diagnose and remediate issues.
  3. Enhancing message classification with lightweight models like Nova Micro improves efficiency and reduces unnecessary processing.
  4. Capturing tribal knowledge through a dedicated Memory Agent can significantly improve troubleshooting and knowledge sharing within an organization.
  5. The modular, microagent-based architecture allows for scalable and adaptable solutions, tailored to the specific needs of the organization.

Resources

  • The presenters have provided a GitHub repository with the sample code and resources from the session, which can be accessed by scanning the QR code shown at the end of the presentation.
