AWS re:Invent 2025 - Streamline Amazon EKS operations with Agentic AI (CNS421)
Streamlining Amazon EKS Operations with Agentic AI
Introduction to Agentic AI
The presenters, Sai Vennam and Lucas Duarte, are Principal Solutions Architects at AWS.
They showcase a new "kube agent": an AI-powered troubleshooting agent that can automatically identify and remediate issues in Amazon EKS (Kubernetes) clusters.
The agent is designed to act as a "teammate" rather than a simple chatbot, with the goal of reducing mean time to remediation.
Evolution from Retrieval Augmented Generation (RAG) to Agentic AI
Last year, the presenters demonstrated a RAG-based approach to troubleshooting, which involved:
Chunking log data into a vector store
Using the vector store to retrieve relevant context when a user asked a question
Passing the context and original question to a large language model to generate a response
However, this approach had limitations:
The context window was limited
The model was limited by its training data
The log data was only updated periodically, not in real-time
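The three RAG steps listed above (chunk, retrieve, prompt) can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the word-count "embedding", the in-memory list used as a vector store, and the sample log lines are all assumptions made so the example runs without external services.

```python
import math
import re
from collections import Counter


def chunk_lines(log_text: str, lines_per_chunk: int = 2) -> list[str]:
    """Split raw log text into fixed-size chunks of consecutive lines."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k stored chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine retrieved context and the original question into one LLM prompt."""
    return "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {question}"


# Hypothetical log snapshot; the store is built once, so it goes stale
# as soon as the cluster changes (the third limitation listed above).
LOGS = """\
10:01 kubelet pod web-1 started
10:02 kubelet pod web-1 ready
10:05 kubelet pod metrics-agent failed: ImagePullBackOff
10:06 kubelet pod metrics-agent back-off pulling image
"""

store = [(chunk, embed(chunk)) for chunk in chunk_lines(LOGS)]
question = "Why is the metrics-agent pod failing?"
prompt = build_prompt(question, retrieve(question, store))
```

The resulting `prompt` is what would be sent to the language model. Because only the top-ranked chunks fit into it, the answer can be no better than the snapshot the store was built from.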
Introducing Strands and Microagents
To address these limitations, the presenters build a multi-agent architecture with the open-source Strands Agents SDK.
Key features:
Ability to use multiple specialized agents, each with the right language model for the task
Agents can communicate with each other and with external systems such as AWS resources and Model Context Protocol (MCP) servers
Customizable prompts and communication patterns between agents
The architecture consists of:
An Orchestrator Agent that receives messages from Slack and routes them to the appropriate specialist agent
A Kubernetes Specialist Agent that can directly interact with the EKS cluster and MCP servers
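The orchestrator-to-specialist hand-off can be pictured with a small plain-Python sketch. It deliberately does not use the Strands API. The `Orchestrator` class, the keyword-based routing, and the `kubernetes_specialist` function are hypothetical names chosen for illustration, and the specialist only echoes the message instead of touching a cluster.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A specialist is modelled as a callable: message text in, reply text out.
Specialist = Callable[[str], str]


def kubernetes_specialist(message: str) -> str:
    """Hypothetical stand-in for the agent that talks to the EKS cluster."""
    return f"[k8s-specialist] investigating: {message}"


@dataclass
class Orchestrator:
    """Receives chat messages and hands each one to a matching specialist."""
    routes: dict[str, Specialist] = field(default_factory=dict)

    def register(self, keyword: str, specialist: Specialist) -> None:
        """Associate a trigger keyword with a specialist."""
        self.routes[keyword.lower()] = specialist

    def handle(self, message: str) -> Optional[str]:
        """Return the first matching specialist's reply, or None if nothing matches."""
        text = message.lower()
        for keyword, specialist in self.routes.items():
            if keyword in text:
                return specialist(message)
        return None  # no specialist applies, so the bot stays silent


orchestrator = Orchestrator()
orchestrator.register("pod", kubernetes_specialist)
orchestrator.register("node", kubernetes_specialist)

reply = orchestrator.handle("The checkout pod keeps restarting")
ignored = orchestrator.handle("Anyone up for lunch?")
```

Keeping specialists behind a simple callable interface is one way to let agents be added or swapped without changing the orchestrator itself.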
Integrating MCP for Live Data Access
To address the issue of not having real-time data, the presenters leverage the new AWS-hosted EKS MCP server.
This allows the Kubernetes Specialist Agent to directly query the cluster for live data, without the need to set up and manage an MCP server locally.
The agent is configured to use the hosted MCP server, which is integrated through a few lines of code.
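The summary does not reproduce the integration code, so the sketch below does not attempt to. It only shows the general shape of the request an MCP client sends when an agent invokes a server-side tool: a JSON-RPC 2.0 message with the `tools/call` method. The tool name `list_pods` and its argument keys are assumptions for illustration and may not match the hosted EKS MCP server's actual tool schema.

```python
import json
from itertools import count

# Monotonically increasing JSON-RPC request ids.
_request_ids = count(1)


def mcp_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Tool name and argument keys are illustrative, not the real server schema.
payload = mcp_tool_call(
    "list_pods",
    {"cluster_name": "demo-cluster", "namespace": "monitoring"},
)
```

Because the server answers each call against the live cluster, the agent sees current state rather than a periodically refreshed log snapshot.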
Enhancing Message Classification with Nova Micro
The presenters identify a limitation in their initial approach, where the agent would only respond to messages containing specific keywords.
To improve this, the Orchestrator Agent first uses the lightweight Amazon Nova Micro model to classify the intent of each incoming message.
If the message is determined to be related to troubleshooting, it is passed to the Kubernetes Specialist Agent.
Otherwise, the message is ignored, which avoids unnecessary processing.
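A classification gate of this kind might look like the sketch below. The prompt wording, the two labels, and the `stub_model` function are assumptions. The real system would call a small hosted model where `invoke_model` is injected, and the stub exists only so the example runs offline.

```python
from typing import Callable, Optional

LABELS = ("troubleshooting", "other")

PROMPT_TEMPLATE = (
    "Classify the Slack message into exactly one label: "
    "troubleshooting or other. Reply with the label only.\n\n"
    "Message: {message}"
)


def classify_intent(message: str, invoke_model: Callable[[str], str]) -> str:
    """Ask a small model for a label; treat anything unexpected as 'other'."""
    raw = invoke_model(PROMPT_TEMPLATE.format(message=message))
    label = raw.strip().lower().rstrip(".")
    return label if label in LABELS else "other"


def handle(
    message: str,
    invoke_model: Callable[[str], str],
    specialist: Callable[[str], str],
) -> Optional[str]:
    """Forward troubleshooting messages to the specialist; ignore the rest."""
    if classify_intent(message, invoke_model) == "troubleshooting":
        return specialist(message)
    return None


def stub_model(prompt: str) -> str:
    """Offline stand-in for the lightweight model (canned answers only)."""
    return "Troubleshooting." if "restarting" in prompt else "I am not sure"


def stub_specialist(message: str) -> str:
    """Hypothetical specialist that just echoes what it would diagnose."""
    return f"diagnosing: {message}"


forwarded = handle("my service keeps restarting after deploy", stub_model, stub_specialist)
dropped = handle("great demo yesterday!", stub_model, stub_specialist)
```

Normalizing the model's reply and defaulting to `other` on anything unrecognized keeps a malformed answer from waking the more expensive specialist.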
Capturing Tribal Knowledge with a Memory Agent
The presenters recognize the importance of capturing "tribal knowledge" - the insights and best practices shared by engineers in Slack conversations.
To address this, they introduce a Memory Agent:
The Memory Agent is a standalone agent that can store and retrieve solutions and tips shared by engineers.
It uses Amazon S3 Vectors to store and search the embedded text.
The Orchestrator Agent can interact with the Memory Agent to store new solutions and retrieve relevant information when needed.
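The store-and-retrieve behaviour can be illustrated with an in-memory stand-in. This sketch does not use Amazon S3 Vectors or real embeddings: it ranks tips by Jaccard token overlap instead, and the `MemoryStore` class, the author names, and the threshold value are all hypothetical.

```python
import re
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class MemoryStore:
    """In-memory stand-in for a vector index of engineer-contributed tips."""
    entries: list[tuple[str, str]] = field(default_factory=list)  # (author, tip)

    def remember(self, author: str, tip: str) -> None:
        """Store a tip together with who shared it."""
        self.entries.append((author, tip))

    def recall(self, query: str, min_score: float = 0.15) -> list[str]:
        """Return tips whose Jaccard overlap with the query meets min_score, best first."""
        q = tokens(query)
        scored = []
        for author, tip in self.entries:
            t = tokens(tip)
            union = q | t
            score = len(q & t) / len(union) if union else 0.0
            if score >= min_score:
                scored.append((score, f"{tip} (from {author})"))
        scored.sort(key=lambda item: item[0], reverse=True)
        return [text for _, text in scored]


memory = MemoryStore()
memory.remember("alice", "Always set CPU and memory requests on every container")
memory.remember("bob", "Pin the node exporter image to a tested version")

hits = memory.recall("which node exporter image should we use")
```

A production version would replace the token overlap with embedding similarity, but the interface (remember a tip, recall the closest ones) stays the same.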
Demonstration and Key Capabilities
The presenters walk through a live demonstration, showcasing the agent's ability to:
Automatically detect and troubleshoot a monitoring agent issue in an EKS cluster
Retrieve a previously stored solution for a node exporter image recommendation
Automatically store a new tip about CPU and memory resource definitions
Detect and remediate multiple issues in a demo application namespace
Business Impact and Real-World Applications
The agentic AI approach demonstrated by the presenters can have significant business impact by:
Reducing mean-time-to-remediation for Kubernetes issues
Empowering network and operations engineers with intelligent troubleshooting tools
Capturing and leveraging tribal knowledge to improve efficiency and consistency
The modular, microagent-based architecture allows for scalable and adaptable solutions, where agents can be added, removed, or scaled independently based on the specific needs of the organization.
Key Takeaways
Agentic AI, powered by the Strands SDK, enables the creation of intelligent, specialized agents that can work together to solve complex problems.
Integrating with hosted MCP servers provides easy access to live cluster data, improving the agent's ability to diagnose and remediate issues.
Classifying messages with a lightweight model such as Amazon Nova Micro filters out unrelated messages and reduces unnecessary processing.
Capturing tribal knowledge through a dedicated Memory Agent can significantly improve troubleshooting and knowledge sharing within an organization.
The modular, microagent-based architecture allows for scalable and adaptable solutions, tailored to the specific needs of the organization.
Resources
The presenters have provided a GitHub repository with the sample code and resources from the session, which can be accessed by scanning the QR code shown at the end of the presentation.