AWS re:Invent 2025 - Ditch your old SRE playbook: AI SRE for root cause in minutes (AIM260)

Summary of AWS re:Invent 2025 Presentation: "Ditch your old SRE playbook: AI SRE for root cause in minutes (AIM260)"

Introduction

  • Presenters: Spiros Xanthos (Founder & CEO of Resolve AI), Jos, and Angelo
  • Resolve AI builds AI agents that can use production tools like a software engineer to accelerate incident troubleshooting, remediation, and broader production management

The Challenge of Managing Modern Production Systems

  • Software engineering has two key parts: coding/building and running production
  • AI has significantly impacted the coding/building side (e.g. GitHub Copilot, autonomous code generation), but the production/operations side remains a major challenge
  • Key issues with managing modern production systems:
    • Complexity of the full software stack, with siloed and unintegrated tools
    • Dynamic and distributed infrastructure, often across multiple clouds
    • Need for diverse expertise (application, infrastructure, platform)
    • Lack of complete documentation, with knowledge spread across tools and human minds
  • This leads to high costs, frequent war rooms/escalations, and infrastructure over-provisioning, especially for brownfield applications

The Rise of AI for Site Reliability Engineering (SRE)

  • Humans are no longer the sole operators of production tools and systems
  • Transition from humans as the operators to agents as the operators, with humans managing the agents
  • Next step is for agents to become both the operators and the "glue" that stitches together information from disparate sources
  • Key requirements for effective AI SRE agents:
    • Understand the full production environment like a human
    • Learn from every interaction and improve over time
    • Operate production tools with the same capabilities as humans
    • Discover and encode human expertise into the agents

Resolve AI's Approach

  • Resolve AI is a multi-agent system, with agents specialized for code, logs, metrics, infrastructure, knowledge, and documents
  • Agents work together to accomplish tasks like incident troubleshooting and remediation
  • Agents learn from past incidents and interactions to improve their performance over time
  • Agents can operate production tools, gather evidence, and provide recommendations, allowing humans to focus on higher-level tasks
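The multi-agent pattern described above can be illustrated with a minimal sketch. This is not Resolve AI's implementation, which the talk summary does not detail. The `Evidence` type, the three specialist functions, and the `investigate` coordinator are all hypothetical names, and the specialists return canned findings where a real agent would query production tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Evidence:
    source: str   # which specialist produced the finding
    finding: str  # human-readable observation


# Hypothetical specialists. Each takes an incident description and returns
# evidence; the bodies are placeholders for real log/metric/code queries.
def logs_agent(incident: str) -> List[Evidence]:
    return [Evidence("logs", f"error spike in logs related to: {incident}")]


def metrics_agent(incident: str) -> List[Evidence]:
    return [Evidence("metrics", f"latency regression around: {incident}")]


def code_agent(incident: str) -> List[Evidence]:
    return [Evidence("code", f"recent deploy touching: {incident}")]


SPECIALISTS: Dict[str, Callable[[str], List[Evidence]]] = {
    "logs": logs_agent,
    "metrics": metrics_agent,
    "code": code_agent,
}


def investigate(
    incident: str,
    specialists: Dict[str, Callable[[str], List[Evidence]]] = SPECIALISTS,
) -> List[Evidence]:
    """Fan the incident out to every specialist and pool the evidence."""
    evidence: List[Evidence] = []
    for agent in specialists.values():
        evidence.extend(agent(incident))
    return evidence


if __name__ == "__main__":
    for item in investigate("checkout service 5xx errors"):
        print(f"[{item.source}] {item.finding}")
```

The coordinator here only pools evidence. Ranking hypotheses and proposing a remediation would sit on top of that pooled list, with a human reviewing the result.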

Coinbase's Experience with Resolve AI

  • Coinbase, a large cryptocurrency exchange, has been using Resolve AI to address production incidents
  • Key principles for Coinbase's adoption:
    • Validate the agent's performance against real-world incidents
    • Test how well the agent copes with vague prompts and missing domain knowledge
    • Integrate the agent with key data sources like Kubernetes and Datadog
    • Provide the agent with general and team-specific knowledge
  • Resolve AI has identified the root causes of incidents faster than human responders, and its diagnosis has been correct in over 50% of cases
  • Future plans include further integration to allow the agent to propose and execute remediations automatically
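Validating an agent against real-world incidents, as in the first principle above, can be sketched as a replay of past incidents with known root causes. This is an assumed harness, not Coinbase's actual evaluation: `root_cause_accuracy`, the `(prompt, known_root_cause)` pair format, and the stub agent are hypothetical, and exact string matching stands in for what would realistically be human or rubric-based grading.

```python
from typing import Callable, List, Tuple


def root_cause_accuracy(
    incidents: List[Tuple[str, str]],
    agent: Callable[[str], str],
) -> float:
    """Return the fraction of past incidents where the agent's answer
    matches the known root cause (case-insensitive exact match)."""
    if not incidents:
        return 0.0
    correct = sum(
        1
        for prompt, known_cause in incidents
        if agent(prompt).strip().lower() == known_cause.strip().lower()
    )
    return correct / len(incidents)


# Hypothetical stub agent used only to exercise the harness.
def stub_agent(prompt: str) -> str:
    return "bad deploy" if "deploy" in prompt else "unknown"


# Made-up example incidents; the vague second prompt mirrors the
# "vague prompts" test mentioned in the list above.
PAST_INCIDENTS = [
    ("errors started right after the deploy", "Bad deploy"),
    ("something is slow", "database connection pool exhausted"),
]

if __name__ == "__main__":
    print(root_cause_accuracy(PAST_INCIDENTS, stub_agent))
```

Keeping deliberately vague prompts in the replay set lets the same harness cover both the accuracy check and the robustness check.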

Key Takeaways

  • AI agents can be as effective in production operations as they have been in the coding/building side of software engineering
  • Effective AI SRE requires agents that can understand the full production environment, learn from interactions, and operate production tools with human-level capabilities
  • Coinbase's experience demonstrates the potential for AI to accelerate incident resolution and free up engineers to focus on higher-value work
  • Adoption of AI SRE is expected to grow rapidly, with the potential for transformative productivity gains in the software industry
