AWS re:Invent 2025 - Building multi-agent AI SRE: from root cause to vibe debugging (AIM394)
Building Multi-Agent AI SRE: From Root Cause to Vibe Debugging
Overview
Resolve AI is an AI system that automates incident troubleshooting, on-call work, and other production activities
Resolve aims to address the challenges of managing complex, constantly changing production systems and fragmented operational knowledge
Key Challenges in Production Systems
Complexity and Dynamism: Modern production systems are highly complex, composed of many interdependent components that are constantly changing
Hundreds of microservices, legacy components, diverse infrastructure (cloud, on-prem, containers, etc.)
Siloed telemetry data (logs, metrics, traces) across many tools
Production knowledge scattered across various sources
Cross-Team Coordination:
Incident investigations often require expertise and collaboration across multiple teams (application, infrastructure, networking, etc.)
Significant overhead in passing context and maintaining shared understanding between teams
Fragmented Knowledge:
Much of the operational knowledge is undocumented, residing in the heads of senior engineers
Existing documentation and runbooks quickly become outdated as systems evolve
Resolve AI Architecture
To address these challenges, Resolve AI was designed with three key pillars:
Comprehensive Production Understanding:
Resolve plugs into all the tools and data sources in the production environment
Builds a dynamic, time-versioned model of the entire system, including dependencies, topology, and telemetry
Operates these tools and queries the right data sources efficiently, as an expert engineer would
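To make the idea of a time-versioned system model more concrete, here is a minimal sketch in Python. It is not Resolve's implementation, which the talk summary does not describe; the class name `TimeVersionedTopology`, its methods, and the service names are all hypothetical. It only shows one simple way to record when dependency edges appear and disappear so that the topology can be queried as of a past timestamp.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TimeVersionedTopology:
    """Hypothetical model: records when dependency edges appear and disappear."""

    # (source, target) -> sorted list of (timestamp, present) change events
    _events: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, ts: int, source: str, target: str, present: bool) -> None:
        """Note that the edge source -> target was added or removed at time ts."""
        events = self._events[(source, target)]
        events.append((ts, present))
        events.sort()

    def dependencies_at(self, ts: int, source: str) -> set:
        """Return the services that `source` depended on at time `ts`."""
        result = set()
        for (src, dst), events in self._events.items():
            if src != source:
                continue
            # Index of the last change event at or before ts, or -1 if none.
            idx = bisect_right(events, (ts, True)) - 1
            if idx >= 0 and events[idx][1]:
                result.add(dst)
        return result


# Illustrative data only: "checkout" moves from one database to another.
topo = TimeVersionedTopology()
topo.record(100, "checkout", "payments-db", True)
topo.record(200, "checkout", "payments-db", False)
topo.record(200, "checkout", "orders-db", True)
print(topo.dependencies_at(150, "checkout"))
print(topo.dependencies_at(250, "checkout"))
```

Being able to ask what the topology looked like at a given time is useful during an investigation, because the dependencies at the moment an alert fired may differ from the ones in place now.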
Cross-Team Expertise Synthesis:
Resolve combines the expertise and knowledge of engineers across different teams and domains
Orchestrates complex, multi-step investigation workflows, dynamically adapting based on new evidence
Collaborates with human engineers, incorporating their feedback to improve its capabilities
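The multi-step, evidence-driven workflow described above can be sketched as a small coordinator loop. This is an illustrative assumption rather than Resolve's actual design: the `Finding` type, the `investigate` function, and the three stub specialists are hypothetical. The point is that each finding decides which specialist is consulted next, so the plan adapts as evidence arrives.

```python
from collections import deque
from typing import NamedTuple


class Finding(NamedTuple):
    summary: str      # what the specialist observed
    follow_ups: list  # names of specialists worth consulting next


def investigate(alert: str, specialists: dict, start: list, max_steps: int = 10) -> list:
    """Run specialists in turn, letting each finding choose the next steps."""
    queue = deque(start)
    seen = set()
    timeline = []
    while queue and len(timeline) < max_steps:
        name = queue.popleft()
        if name in seen:
            continue  # do not consult the same specialist twice
        seen.add(name)
        finding = specialists[name](alert, timeline)
        timeline.append((name, finding.summary))
        queue.extend(finding.follow_ups)
    return timeline


# Stub specialists with hard-coded findings, for illustration only.
def metrics_agent(alert, timeline):
    return Finding("database latency spiked", ["logs"])


def logs_agent(alert, timeline):
    return Finding("deadlock errors appear in database logs", ["code"])


def code_agent(alert, timeline):
    return Finding("two handlers lock the same rows in opposite order", [])


specialists = {"metrics": metrics_agent, "logs": logs_agent, "code": code_agent}
for step in investigate("checkout errors", specialists, start=["metrics"]):
    print(step)
```

In a real system each specialist would query live tools instead of returning fixed strings, and a human engineer's feedback could be treated as one more source of follow-up steps.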
Continuous Learning and Collaboration:
Resolve captures both explicit (documented) and implicit (tribal) knowledge from the organization
Learns from human engineers through live interactions, improving its performance over time
Provides a natural language interface for engineers to interact with Resolve
Resolve AI in Action
The presentation demonstrated Resolve AI's capabilities through several use cases:
Infrastructure and Service Exploration:
Resolve provided a comprehensive overview of the AWS infrastructure and application services in the demo environment
Generated visualizations and insights about the RDS database, including dependencies, performance, and problematic queries
Incident Root Cause Analysis:
Resolve autonomously investigated a Postgres deadlock incident, analyzing metrics, logs, and code to identify the root cause
Collaborated with engineers, incorporating their feedback to refine its analysis and provide a detailed timeline and remediation plan
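For readers less familiar with this class of incident, a deadlock corresponds to a cycle in the wait-for graph, where each transaction is waiting on a lock held by another. The sketch below shows that general idea only. It is not how Resolve or Postgres detects deadlocks, and the function name, the transaction ids, and the simplification that each transaction waits on at most one other are assumptions made for illustration.

```python
def find_deadlock_cycle(waits_for: dict) -> list:
    """Return one cycle of transaction ids in a wait-for graph, or [] if none.

    `waits_for` maps a transaction id to the transaction it is blocked on,
    or to None if it is not blocked (simplifying assumption: one blocker each).
    """
    visited = set()
    for start in waits_for:
        if start in visited:
            continue
        path = []   # transactions on the current walk, in order
        index = {}  # transaction id -> its position in `path`
        node = start
        while node is not None and node not in visited:
            visited.add(node)
            index[node] = len(path)
            path.append(node)
            node = waits_for.get(node)
        if node in index:
            # The walk returned to a transaction on the current path: a cycle.
            return path[index[node]:]
    return []


# Hypothetical snapshot: tx1 and tx2 block each other, tx3 waits behind tx1.
print(find_deadlock_cycle({"tx1": "tx2", "tx2": "tx1", "tx3": "tx1"}))
print(find_deadlock_cycle({"tx1": "tx2", "tx2": None}))
```

Finding the cycle identifies which transactions are involved. Explaining why they took locks in conflicting order still requires the kind of log and code analysis described above.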
Proactive Operations and Optimization:
Resolve assisted with capacity planning and deployment scheduling for an atypical workload that did not fit the usual patterns
Business Impact and Key Takeaways
Resolve AI aims to significantly reduce the time and effort spent by engineers on toil and grunt work in production, allowing them to focus on feature development and innovation
By encapsulating the expertise of senior engineers and continuously learning, Resolve can provide faster incident resolution and more proactive operational support
Resolve's architecture, which combines a deep understanding of production systems with coordination of cross-team expertise, lets it tackle complex, real-world production challenges rather than only simple "hello world" scenarios
Partnering with a vendor like Resolve can give organizations access to the latest advances in AI-powered operations without having to build such capabilities in-house
Zscaler's Experience with Resolve AI
Zscaler, a large-scale cybersecurity platform, has been using Resolve AI to help manage their complex, globally distributed production environment
Key benefits include:
Autonomous investigation of alerts, providing root cause analysis before incidents escalate
Ability to quickly debug issues by leveraging Resolve's comprehensive understanding of the system
Reducing the number of engineers involved in incident response, allowing them to focus on core engineering tasks
Future Outlook
The presenters believe that the adoption of AI-powered operations, beyond just development workflows, will accelerate in the coming year
Resolve AI and similar technologies will play a crucial role in automating toil, improving reliability, and freeing up engineers to focus on high-impact work