AWS re:Invent 2025 - Ditch your old SRE playbook: AI SRE for root cause in minutes (AIM260)
Summary of AWS re:Invent 2025 Presentation: "Ditch your old SRE playbook: AI SRE for root cause in minutes (AIM260)"
Introduction
Presenters: Peter Kantos (Founder & CEO of Resolve AI), Jos, and Angelo
Resolve AI builds AI agents that can use production tools like a software engineer to accelerate incident troubleshooting, remediation, and broader production management
The Challenge of Managing Modern Production Systems
Software engineering has two key parts: coding/building and running production
AI has significantly impacted the coding/building side (e.g. GitHub Copilot, autonomous code generation), but the production/operations side remains a major challenge
Key issues with managing modern production systems:
Complexity of the full software stack, with siloed and unintegrated tools
Dynamic and distributed infrastructure, often across multiple clouds
Need for diverse expertise (application, infrastructure, platform)
Lack of complete documentation, with knowledge spread across tools and human minds
This leads to high costs, frequent war rooms/escalations, and infrastructure over-provisioning, especially for brownfield applications
The Rise of AI for Site Reliability Engineering (SRE)
Humans are no longer the sole operators of production tools and systems
The model is shifting from humans as the operators to agents as the operators, with humans managing the agents
The next step is for agents to become both the operators and the "glue" that stitches together information from disparate sources
Key requirements for effective AI SRE agents:
Understand the full production environment like a human
Learn from every interaction and improve over time
Operate production tools with the same capabilities as humans
Discover and encode human expertise into the agents
Resolve AI's Approach
Resolve AI is a multi-agent system, with agents specialized for code, logs, metrics, infrastructure, knowledge, and documents
Agents work together to accomplish tasks like incident troubleshooting and remediation
Agents learn from past incidents and interactions to improve their performance over time
Agents can operate production tools, gather evidence, and provide recommendations, allowing humans to focus on higher-level tasks
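The multi-agent arrangement described above can be pictured with a minimal Python sketch. It is not Resolve AI's implementation: the Evidence type, the three stub specialists, the investigate function, and the sample findings are all hypothetical, chosen only to show one coordinator fanning an incident out to specialized agents and ranking what they return.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Evidence:
    source: str        # which specialist produced this finding
    finding: str       # human-readable observation
    confidence: float  # 0.0 to 1.0, the specialist's own estimate


# A specialist is any callable that maps an incident description to evidence.
Specialist = Callable[[str], List[Evidence]]


def logs_agent(incident: str) -> List[Evidence]:
    # Hypothetical stub: a real agent would query a log backend here.
    return [Evidence("logs", "error rate spike in one service", 0.7)]


def metrics_agent(incident: str) -> List[Evidence]:
    # Hypothetical stub: a real agent would query a metrics backend here.
    return [Evidence("metrics", "p99 latency rose after a deploy", 0.6)]


def code_agent(incident: str) -> List[Evidence]:
    # Hypothetical stub: a real agent would inspect recent commits here.
    return [Evidence("code", "recent commit changed a pool size", 0.8)]


def investigate(incident: str, specialists: Dict[str, Specialist]) -> List[Evidence]:
    """Send the incident to every specialist and rank the combined evidence."""
    evidence: List[Evidence] = []
    for agent in specialists.values():
        evidence.extend(agent(incident))
    # Highest-confidence findings first, so a human reviews the best leads first.
    return sorted(evidence, key=lambda e: e.confidence, reverse=True)


if __name__ == "__main__":
    team = {"logs": logs_agent, "metrics": metrics_agent, "code": code_agent}
    for item in investigate("checkout requests are slow", team):
        print(f"{item.confidence:.1f} [{item.source}] {item.finding}")
```

Keeping each specialist behind the same callable signature means a new data source can be added without changing the coordinator, which is one plausible reason to split the work across agents this way.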
Coinbase's Experience with Resolve AI
Coinbase, a large cryptocurrency exchange, has been using Resolve AI to address production incidents
Key principles for Coinbase's adoption:
Validate the agent's performance against real-world incidents
Test the agent's ability to handle vague prompts and lack of domain knowledge
Integrate the agent with key data sources like Kubernetes and Datadog
Provide the agent with general and team-specific knowledge
Resolve AI has identified incident root causes faster than human responders, reaching the correct root cause in over 50% of cases
Future plans include further integration to allow the agent to propose and execute remediations automatically
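Validating an agent against real-world incidents, as in the principles above, amounts to replaying past incidents with known root causes and counting how often the agent agrees. The sketch below is a hypothetical harness, not Coinbase's tooling: PastIncident, root_cause_accuracy, the substring matcher, and the toy agent are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PastIncident:
    prompt: str            # what the on-call engineer would have typed
    known_root_cause: str  # root cause recorded in the postmortem


def contains_match(answer: str, expected: str) -> bool:
    # Assumed scoring rule: case-insensitive substring match.
    # A real evaluation would likely rely on human review instead.
    return expected.lower() in answer.lower()


def root_cause_accuracy(
    agent: Callable[[str], str],
    incidents: List[PastIncident],
    matches: Callable[[str, str], bool] = contains_match,
) -> float:
    """Fraction of past incidents where the agent's answer matches the known cause."""
    if not incidents:
        raise ValueError("need at least one past incident to evaluate")
    correct = sum(
        1 for inc in incidents if matches(agent(inc.prompt), inc.known_root_cause)
    )
    return correct / len(incidents)


def toy_agent(prompt: str) -> str:
    # Hypothetical stand-in for the agent under test; it always blames a deploy.
    return "Likely cause: bad deploy"


if __name__ == "__main__":
    history = [
        PastIncident("api is slow", "bad deploy"),
        PastIncident("something is broken", "expired certificate"),
        PastIncident("errors in checkout", "bad deploy"),
    ]
    print(f"accuracy: {root_cause_accuracy(toy_agent, history):.2f}")
```

Deliberately vague prompts such as "something is broken" can sit in the same replay set, which covers the principle of testing the agent without domain hints.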
Key Takeaways
AI agents can be as effective in production operations as they have been on the coding/building side of software engineering
Effective AI SRE requires agents that can understand the full production environment, learn from interactions, and operate production tools with human-level capabilities
Coinbase's experience demonstrates the potential for AI to accelerate incident resolution and free up engineers to focus on higher-value work
Adoption of AI SRE is expected to grow rapidly, with the potential for transformative productivity gains in the software industry