AWS re:Invent 2025 - Building multi-agent AI SRE: from root cause to vibe debugging (AIM394)

Building Multi-Agent AI SRE: From Root Cause to Vibe Debugging

Overview

  • Resolve AI is an AI system that automates incident troubleshooting, on-call work, and other production activities
  • Resolve aims to address the challenges of managing complex, constantly-changing production systems and fragmented operational knowledge

Key Challenges in Production Systems

  1. Complexity and Dynamism: Modern production systems are highly complex, composed of many interdependent components that are constantly changing

    • Hundreds of microservices, legacy components, diverse infrastructure (cloud, on-prem, containers, etc.)
    • Siloed telemetry data (logs, metrics, traces) across many tools
    • Production knowledge scattered across various sources
  2. Cross-Team Coordination:

    • Incident investigations often require expertise and collaboration across multiple teams (application, infrastructure, networking, etc.)
    • Significant overhead in passing context and maintaining shared understanding between teams
  3. Fragmented Knowledge:

    • Much of the operational knowledge is undocumented, residing in the heads of senior engineers
    • Existing documentation and runbooks quickly become outdated as systems evolve

Resolve AI Architecture

To address these challenges, Resolve AI was designed with three key pillars:

  1. Comprehensive Production Understanding:

    • Resolve plugs into all the tools and data sources in the production environment
    • Builds a dynamic, time-versioned model of the entire system, including dependencies, topology, and telemetry
    • Operates these tools and queries the right data sources efficiently, much as an expert engineer would
  2. Cross-Team Expertise Synthesis:

    • Resolve combines the expertise and knowledge of engineers across different teams and domains
    • Orchestrates complex, multi-step investigation workflows, dynamically adapting based on new evidence
    • Collaborates with human engineers, incorporating their feedback to improve its capabilities
  3. Continuous Learning and Collaboration:

    • Resolve captures both explicit (documented) and implicit (tribal) knowledge from the organization
    • Learns from human engineers through live interactions, improving its performance over time
    • Provides a natural language interface for engineers to interact with Resolve
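
As a rough illustration of the "time-versioned model" idea in the first pillar, the sketch below stores dependency snapshots by timestamp and compares the topology at two points in time. It is not Resolve's data model, which the talk summary does not describe. The class name `VersionedTopology`, its methods, and the service names are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class VersionedTopology:
    """Toy time-versioned dependency graph (hypothetical, not Resolve's model)."""

    # Snapshots kept in increasing timestamp order:
    # (timestamp, frozenset of (caller, callee) edges).
    _snapshots: list = field(default_factory=list)

    def record(self, timestamp: int, edges: set) -> None:
        """Store the full edge set observed at `timestamp`."""
        if self._snapshots and timestamp <= self._snapshots[-1][0]:
            raise ValueError("timestamps must be strictly increasing")
        self._snapshots.append((timestamp, frozenset(edges)))

    def as_of(self, timestamp: int) -> frozenset:
        """Return the latest snapshot taken at or before `timestamp`."""
        times = [t for t, _ in self._snapshots]
        idx = bisect_right(times, timestamp)
        return self._snapshots[idx - 1][1] if idx else frozenset()

    def diff(self, before: int, after: int) -> tuple:
        """Return (added_edges, removed_edges) between two points in time."""
        old, new = self.as_of(before), self.as_of(after)
        return new - old, old - new


# Assumed example data: a deploy at t=200 adds a cache dependency.
topo = VersionedTopology()
topo.record(100, {("checkout", "payments"), ("payments", "postgres")})
topo.record(200, {("checkout", "payments"), ("payments", "postgres"),
                  ("checkout", "cache")})

added, removed = topo.diff(before=150, after=250)
print("added:", sorted(added), "removed:", sorted(removed))
```

Keeping full snapshots makes "what did the system look like when the alert fired?" a single lookup. A production system would more plausibly store deltas and attach telemetry to each node, which this sketch omits.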

Resolve AI in Action

The presentation demonstrated Resolve AI's capabilities through several use cases:

  1. Infrastructure and Service Exploration:

    • Resolve provided a comprehensive overview of the AWS infrastructure and application services in the demo environment
    • Generated visualizations and insights about the RDS database, including dependencies, performance, and problematic queries
  2. Incident Root Cause Analysis:

    • Resolve autonomously investigated a Postgres deadlock incident, analyzing metrics, logs, and code to identify the root cause
    • Collaborated with engineers, incorporating their feedback to refine its analysis and provide a detailed timeline and remediation plan
  3. Proactive Operations and Optimization:

    • Resolve assisted with capacity planning and deployment scheduling for an unusual workload that did not fit typical patterns
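
The summary does not say how the Postgres deadlock in the second use case arose or how Resolve detected it. As general background, a deadlock corresponds to a cycle in the "wait-for" graph of transactions, where each transaction is blocked on a lock held by another. The sketch below finds such a cycle with a depth-first search. The function name `find_deadlock_cycle` and the transaction ids are hypothetical, and the input is assumed to have already been extracted from lock data.

```python
def find_deadlock_cycle(waits_for: dict) -> list:
    """Return one cycle of transaction ids in the wait-for graph, or [] if none.

    `waits_for` maps a transaction id to the ids it is blocked on
    (an assumed input format, not a Postgres or Resolve API).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color = {}
    path = []

    def visit(txn):
        color[txn] = GREY
        path.append(txn)
        for blocker in waits_for.get(txn, ()):
            state = color.get(blocker, WHITE)
            if state == GREY:
                # Reached a transaction already on the current path: a cycle.
                return path[path.index(blocker):]
            if state == WHITE:
                cycle = visit(blocker)
                if cycle:
                    return cycle
        path.pop()
        color[txn] = BLACK
        return []

    for txn in list(waits_for):
        if color.get(txn, WHITE) == WHITE:
            cycle = visit(txn)
            if cycle:
                return cycle
    return []


# Assumed example: txn_a and txn_b each hold a lock the other needs.
deadlocked = find_deadlock_cycle(
    {"txn_a": ["txn_b"], "txn_b": ["txn_a"], "txn_c": ["txn_a"]}
)
healthy = find_deadlock_cycle({"txn_a": ["txn_b"], "txn_b": []})
print(deadlocked, healthy)
```

Note that `txn_c` is blocked but not part of the cycle. Separating the transactions that form the cycle from those merely queued behind it is the kind of distinction a root cause analysis needs to make.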

Business Impact and Key Takeaways

  • Resolve AI aims to significantly reduce the time and effort spent by engineers on toil and grunt work in production, allowing them to focus on feature development and innovation
  • By encapsulating the expertise of senior engineers and continuously learning, Resolve can provide faster incident resolution and more proactive operational support
  • Resolve's architecture, which combines a deep understanding of production systems with coordination of cross-team expertise, lets it tackle complex, real-world production challenges rather than only simple "hello world" scenarios
  • Partnering with a vendor like Resolve can provide organizations access to the latest advancements in AI-powered operations, without the need to build such capabilities in-house

Zscaler's Experience with Resolve AI

  • Zscaler, a large-scale cybersecurity platform, has been using Resolve AI to help manage its complex, globally distributed production environment
  • Key benefits include:
    • Autonomous investigation of alerts, providing root cause analysis before incidents escalate
    • Faster debugging, drawing on Resolve's comprehensive understanding of the system
    • Fewer engineers pulled into incident response, leaving them free to focus on core engineering tasks

Future Outlook

  • The presenters believe that the adoption of AI-powered operations, beyond just development workflows, will accelerate in the coming year
  • Resolve AI and similar technologies will play a crucial role in automating toil, improving reliability, and freeing up engineers to focus on high-impact work
