AWS re:Invent 2025 - Reimagining AWS operations with autonomous AI agents (DEV207)
Reimagining AWS Operations with Autonomous AI Agents
Overview
This presentation discusses how autonomous AI agents can streamline and automate cloud operations tasks, focusing on three use cases from a customer engagement. The speaker, Gurug from Mantel Group, a consulting partner based in Australia and New Zealand, covers the value proposition of autonomous cloud operations, the technical architecture, and the key challenges and lessons learned in implementing these AI-powered solutions.
The Need for Autonomous Cloud Operations
As organizations continue to migrate and modernize their applications on the cloud, the complexity of cloud operations has increased significantly.
This has led to a rise in the number of services, environments, logs, support tickets, and compliance/security requirements that need to be managed.
Manually triaging and fixing issues is time-consuming, and identifying the root cause is often the hardest part.
The future vision is to have autonomous AI agents that can understand the cloud, code, and policies (compliance, security, etc.) and handle these tasks autonomously, with a human-in-the-loop approach when needed.
This allows human operators to focus on more innovative and strategic tasks, rather than being bogged down by tedious, repetitive work.
Use Cases
The presentation covers three specific use cases where autonomous AI agents were implemented:
1. Automated Compliance Enforcement
The customer was undergoing a large-scale migration of over 500 servers, with strict compliance requirements (e.g., PCI-DSS) that needed to be met.
The compliance criteria were documented in a wiki, and the team used an AI agent to automatically assess each migrated server or environment against the compliance requirements.
The agent would create a Git issue for traceability and then raise a pull request to make the necessary changes in the infrastructure code (Terraform) to address any compliance gaps.
This automated the compliance enforcement process, making the migrations faster and providing better traceability.
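The assessment step described above can be sketched roughly as follows. This is a minimal illustration, not the customer's implementation: the `Rule` type, the rule IDs, and the configuration keys (`ebs_encrypted`, `public_ip`, `audit_logging`) are all hypothetical, and in the talk the criteria came from a wiki and the evaluation was done by an AI agent rather than by fixed rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    key: str           # configuration attribute to inspect (hypothetical names)
    required: object   # value the attribute must have to be compliant
    description: str


# Hypothetical rule set standing in for the criteria documented in the wiki.
RULES = [
    Rule("ENC-01", "ebs_encrypted", True, "Volumes must be encrypted at rest"),
    Rule("NET-01", "public_ip", False, "Servers must not have a public IP"),
    Rule("LOG-01", "audit_logging", True, "Audit logging must be enabled"),
]


def find_gaps(server, rules=RULES):
    """Return the rules that the server's configuration does not satisfy.

    A missing attribute counts as a gap, because compliance cannot be shown.
    """
    return [r for r in rules if server.get(r.key) != r.required]


def draft_issue(server_name, gaps):
    """Build the issue text that would be filed before a pull request is raised."""
    lines = [f"Compliance gaps for {server_name}:"]
    lines += [
        f"- [{r.rule_id}] {r.description} (set {r.key} = {r.required})"
        for r in gaps
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    server = {"ebs_encrypted": True, "public_ip": True}  # audit_logging absent
    print(draft_issue("app-server-01", find_gaps(server)))
```

Keeping the findings as structured data before turning them into issue text is one way to preserve the traceability the talk emphasizes: the same list can drive both the issue and the follow-up Terraform change.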
2. Automated Incident Resolution
During a modernization project to containerize and migrate a Java monolith application to EKS, the team encountered various configuration and security-related issues.
An AI agent was deployed to monitor the CloudWatch logs and automatically identify the root cause of these issues, creating a Git issue and raising a pull request with the necessary fixes.
This reduced the incident resolution time by over 90%, as the agent could handle the initial triage and fix, and the human team only needed to review and approve the pull request.
It also reduced the reliance on the development team, as the agent could handle most of the minor issues independently.
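As a rough sketch of the initial triage step, the snippet below maps log lines to a likely root-cause category and picks the most frequent one. The signatures and category names are assumptions made for illustration; the talk describes an AI agent reading CloudWatch logs, not a fixed pattern table, and the log lines here are invented samples.

```python
import re
from collections import Counter

# Hypothetical signatures mapping log patterns to a likely root-cause category.
SIGNATURES = [
    (re.compile(r"AccessDenied|not authorized", re.I), "iam-permissions"),
    (re.compile(r"Connection refused|timed out", re.I), "network-or-security-group"),
    (re.compile(r"OutOfMemoryError|OOMKilled", re.I), "memory-limits"),
    (re.compile(r"Could not resolve placeholder|Missing required property", re.I),
     "missing-configuration"),
]


def classify(line):
    """Return the first matching root-cause category, or None if nothing matches."""
    for pattern, cause in SIGNATURES:
        if pattern.search(line):
            return cause
    return None


def likely_root_cause(lines):
    """Return the most frequent category across the lines, or None if none match."""
    counts = Counter(c for c in map(classify, lines) if c is not None)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    sample_logs = [
        "ERROR Could not resolve placeholder 'db.url'",
        "WARN retrying startup",
        "ERROR Could not resolve placeholder 'db.user'",
        "ERROR Connection refused: db:5432",
    ]
    print(likely_root_cause(sample_logs))
```

A category like this would only be the starting point for the issue and pull request; the human review step described above still decides whether the proposed fix is merged.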
3. Automating Low-value Requests in EKS
After the customer went live with their new EKS environment, their platform team was overwhelmed with a high number of support tickets from application teams for low-value tasks, such as creating new namespaces or adjusting resource quotas.
An AI agent was integrated with Slack so that application teams could trigger it directly. The agent would read the relevant documentation and EKS cluster information (in read-only mode), prepare the required changes, and raise a pull request for approval.
This reduced the support team's workload by 20% and provided faster resolution times (under 5 minutes) for these routine tasks, as the platform team only needed to review and approve the pull requests.
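To make the namespace workflow concrete, here is a minimal sketch of turning a chat request into Kubernetes manifests that could be committed in a pull request. The request syntax (`create namespace <name> cpu=<n> memory=<size>`) and the function names are hypothetical; the talk does not describe the message format, and the real agent interprets free-form requests.

```python
import re

# DNS-1123 label: the naming rule Kubernetes applies to namespace names.
NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")
# Hypothetical request syntax, assumed for this sketch only.
REQUEST_RE = re.compile(r"create namespace (\S+)((?: \w+=\S+)*)")


def parse_request(text):
    """Parse 'create namespace <name> cpu=<n> memory=<size>' into a dict."""
    m = REQUEST_RE.fullmatch(text.strip())
    if not m:
        raise ValueError("unsupported request")
    name = m.group(1)
    if len(name) > 63 or not NAME_RE.match(name):
        raise ValueError(f"invalid namespace name: {name!r}")
    quotas = dict(pair.split("=", 1) for pair in m.group(2).split())
    return {"name": name, "quotas": quotas}


def render_manifests(req):
    """Render a Namespace and, if quotas were given, a ResourceQuota as YAML text."""
    name = req["name"]
    docs = [f"apiVersion: v1\nkind: Namespace\nmetadata:\n  name: {name}\n"]
    if req["quotas"]:
        hard = "".join(
            f'    requests.{key}: "{value}"\n'
            for key, value in sorted(req["quotas"].items())
        )
        docs.append(
            "apiVersion: v1\nkind: ResourceQuota\nmetadata:\n"
            f"  name: {name}-quota\n  namespace: {name}\n"
            f"spec:\n  hard:\n{hard}"
        )
    return "---\n".join(docs)


if __name__ == "__main__":
    request = parse_request("create namespace payments-dev cpu=4 memory=8Gi")
    print(render_manifests(request))
```

Validating the name before rendering anything keeps malformed requests from ever reaching the repository, which fits the read-only, pull-request-only access model described above.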
Technical Architecture
The presentation outlines the high-level architecture of the AI agent-based solution:
The system uses API Gateway and CloudWatch Logs to trigger the AI agents.
A central orchestrator agent decides which specialized agent (e.g., Confluence, Jira, AWS) to invoke based on the request.
The agents run on the Amazon Bedrock AgentCore runtime, which maintains state and draws on previous requests to improve its responses.
The system integrates with various tools and repositories (Git, Jira, Confluence) to execute the necessary actions and provide traceability.
Observability and monitoring are crucial for fine-tuning the system prompts and understanding the performance and cost of the AI agents.
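The orchestrator pattern in this architecture can be illustrated with a small dispatcher. This is only a sketch under stated assumptions: the three agent functions are stubs, and keyword matching is a stand-in for the routing decision, which in the talk's design is made by the orchestrator agent itself. None of the names below come from the presentation.

```python
# Stub specialized agents; real ones would call Confluence, Jira, or AWS APIs.
def confluence_agent(request):
    return f"[confluence] looked up documentation for: {request}"


def jira_agent(request):
    return f"[jira] handled ticket work for: {request}"


def aws_agent(request):
    return f"[aws] inspected cloud resources for: {request}"


AGENTS = {
    "confluence": confluence_agent,
    "jira": jira_agent,
    "aws": aws_agent,
}

# Hypothetical keyword routes; a stand-in for the orchestrator's own decision.
ROUTES = [
    (("runbook", "wiki", "documentation"), "confluence"),
    (("ticket", "story", "backlog"), "jira"),
    (("ec2", "eks", "cloudwatch", "s3"), "aws"),
]


def route(request, default="aws"):
    """Pick the name of the specialized agent for a request."""
    text = request.lower()
    for keywords, agent_name in ROUTES:
        if any(keyword in text for keyword in keywords):
            return agent_name
    return default  # assumed fallback when nothing matches


def orchestrate(request):
    """Dispatch the request to the chosen specialized agent and return its reply."""
    return AGENTS[route(request)](request)


if __name__ == "__main__":
    print(orchestrate("Find the runbook for namespace creation"))
```

Separating `route` from `orchestrate` makes the routing decision easy to log on its own, which helps with the observability and prompt-tuning concerns noted above.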
Challenges and Lessons Learned
The key challenges and lessons learned in implementing the autonomous AI agent solution include:
Fine-tuning the system prompts for the agents, which is an iterative process of trial and error to achieve the desired outcomes.
Guarding against agents becoming too "smart" and making unauthorized changes, such as directly applying fixes to the main branch.
Ensuring the agents only focus on the specific issue at hand, rather than trying to fix everything in the codebase.
Handling federated permissions and building trust with business stakeholders.
Emphasizing the importance of robust monitoring and observability to understand the agents' performance and costs.
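Two of the guardrails above (no direct changes to the main branch, and staying within the scope of the issue) could be enforced with a check along these lines before any agent-proposed change is accepted. This is a hedged sketch: the `ProposedChange` type, the protected branch names, the path prefixes, and the file-count limit are all assumptions, and the talk does not say how the team implemented its controls.

```python
from dataclasses import dataclass

PROTECTED_BRANCHES = {"main", "master"}  # assumed branch names


@dataclass
class ProposedChange:
    target_branch: str      # branch the agent wants its change to land on
    via_pull_request: bool  # True if the change goes through review
    files: list             # paths the agent wants to modify


def violations(change, allowed_prefixes, max_files=5):
    """Return a list of guardrail violations; an empty list means the change may proceed."""
    problems = []
    if change.target_branch in PROTECTED_BRANCHES and not change.via_pull_request:
        problems.append("direct push to protected branch")
    out_of_scope = [
        path for path in change.files
        if not path.startswith(tuple(allowed_prefixes))
    ]
    if out_of_scope:
        problems.append(f"files outside issue scope: {out_of_scope}")
    if len(change.files) > max_files:
        problems.append("too many files changed for a single issue")
    return problems


if __name__ == "__main__":
    change = ProposedChange(
        target_branch="main",
        via_pull_request=False,
        files=["terraform/eks/quota.tf", "app/src/Main.java"],
    )
    for problem in violations(change, allowed_prefixes=["terraform/eks/"]):
        print(problem)
```

A deterministic check like this complements prompt instructions rather than replacing them: the prompt asks the agent to stay in scope, and the check rejects the change if it does not.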
Conclusion
The presentation concludes by highlighting the key lessons for organizations looking to implement autonomous AI agents:
Start with high-volume, low-value workflows, such as compliance enforcement, troubleshooting, and support tasks.
Focus on well-documented processes and workflows to build a strong foundation for the AI agents.
Iterate, gather feedback, and gradually expand to more complex workflows as the system matures.
Understand that AI agents are not meant to replace humans, but rather to amplify their impact and free them up for more strategic and innovative work.