AWS re:Invent 2025 - Building agentic workflows for augmented observability (COP405)
Building Agentic Workflows for Augmented Observability
Introduction
Many engineers have experienced being woken up at 3 AM to investigate a production outage by looking through logs
The presentation discusses how to use new "agentic" tools to improve observability and avoid these disruptive incidents
The Evolving Observability Landscape
In the past, customers often lacked the necessary metrics, logs, and traces to effectively solve problems
However, with the rise of tools like OpenTelemetry and CloudWatch, it is now easier than ever to generate observability data
The new challenge is that customers now have so much data that finding the relevant signal quickly is hard
The Role of AI in Observability
AI is a powerful tool for observability, as it can help process and analyze the large volumes of data
The key is to provide the AI agent with the right context and access to the necessary observability tools and data
Defining the Agent's Purpose and Capabilities
The agent is set up as an "observability expert" with the ability to:
Discover and analyze AWS resources and their relevant metrics
Correlate data from various observability sources
Create CloudWatch alarms based on identified issues
Generate actionable reports and recommendations
Providing the Agent with Necessary Context
The agent is given detailed instructions on how to approach the analysis, including:
Discovering resources by examining observability data rather than calling all APIs
Analyzing metrics, trends, health status, active alarms, and recent incidents
Linking to relevant postmortem information for any identified issues
Creating new alarms only for genuine gaps, not duplicates
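The capabilities and instructions above could be captured in a system prompt along the following lines. This is a minimal sketch, not the prompt shown in the talk: the wording and the names CAPABILITIES, INSTRUCTIONS, and build_system_prompt are illustrative assumptions.

```python
# Hypothetical system prompt for the observability agent.
# The wording is illustrative; the talk's actual prompt is not reproduced here.
CAPABILITIES = [
    "Discover and analyze AWS resources and their relevant metrics",
    "Correlate data from various observability sources",
    "Create CloudWatch alarms based on identified issues",
    "Generate actionable reports and recommendations",
]

INSTRUCTIONS = [
    "Discover resources by examining observability data rather than calling all APIs",
    "Analyze metrics, trends, health status, active alarms, and recent incidents",
    "Link to relevant postmortem information for any identified issues",
    "Create new alarms only for genuine gaps, not duplicates",
]


def build_system_prompt(capabilities, instructions):
    """Assemble the role, the capability list, and the ordered instructions."""
    lines = ["You are an observability expert for AWS workloads.", "", "You can:"]
    lines += [f"- {capability}" for capability in capabilities]
    lines += ["", "When analyzing, follow these steps in order:"]
    lines += [f"{i}. {step}" for i, step in enumerate(instructions, start=1)]
    return "\n".join(lines)


SYSTEM_PROMPT = build_system_prompt(CAPABILITIES, INSTRUCTIONS)
```

Keeping the capabilities and instructions as lists makes it easy to review or change one rule without rewriting the whole prompt.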
Implementing the Agentic Workflow
Natural Language Query: The analysis is triggered by a natural language query, either manually or from an alarm
Agent Instantiation: The Strands Agents SDK is used to quickly create the observability agent with the defined context and access to necessary tools
Observability Data Analysis: The agent uses the provided MCP (Model Context Protocol) tools to query CloudWatch, CloudTrail, and Application Signals, analyzing the data for insights
Report Generation: The agent generates a structured report with an executive summary, identified issues, operational audit, recommendations, and links to relevant incidents
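The four steps above can be sketched as a small pipeline. The sketch assumes the trigger is a CloudWatch alarm state-change event delivered through EventBridge, and it replaces the real agent with a stub callable so that it runs without AWS access. The names query_from_alarm, run_analysis, fake_agent, and REPORT_SECTIONS are hypothetical.

```python
# Report sections, as listed in the Report Generation step.
REPORT_SECTIONS = (
    "executive_summary",
    "issues",
    "operational_audit",
    "recommendations",
    "incident_links",
)


def query_from_alarm(event):
    """Turn a CloudWatch alarm state-change event into a natural language query."""
    detail = event.get("detail", {})
    state = detail.get("state", {})
    return (
        f"Alarm '{detail.get('alarmName', 'unknown alarm')}' is in state "
        f"{state.get('value', 'UNKNOWN')}. "
        f"Reason: {state.get('reason', 'not provided')}. "
        "Investigate the affected resources and report your findings."
    )


def run_analysis(query, agent):
    """Run the agent and normalize its answer into the fixed report structure."""
    raw = agent(query)
    report = {"executive_summary": raw.get("executive_summary", "")}
    for section in REPORT_SECTIONS[1:]:
        # Sections the agent did not fill in become empty lists.
        report[section] = list(raw.get(section, []))
    return report


def fake_agent(query):
    """Stand-in for the real agent; returns placeholder findings only."""
    return {
        "executive_summary": f"Analyzed: {query}",
        "issues": ["placeholder issue"],
    }
```

In a real deployment the stub would be replaced by a call to the agent, and a manually typed query could be passed to run_analysis directly instead of going through query_from_alarm.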
Technical Implementation Details
The MCP manager is used to provide the agent with access to CloudWatch, CloudTrail, and Application Signals tools, allowing it to query the necessary data
A custom "CreateCloudWatchAlarm" tool is created to ensure the agent creates alarms with the appropriate configuration, rather than attempting to call the CloudWatch API directly
The report is generated using a Jinja template, allowing the agent's findings to be formatted into a human-readable HTML document
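A custom alarm tool of the kind described above might look roughly like the following. This is a sketch under stated assumptions, not the tool from the talk: the function name build_alarm_request, the default period and evaluation settings, and the "agent-" name prefix are all hypothetical. It only builds the keyword arguments that would be passed to boto3's CloudWatch put_metric_alarm call, and it returns None when an existing alarm (in the shape returned by describe_alarms) already covers the same metric, reflecting the "genuine gaps, not duplicates" rule.

```python
ALLOWED_COMPARISONS = {
    "GreaterThanThreshold",
    "GreaterThanOrEqualToThreshold",
    "LessThanThreshold",
    "LessThanOrEqualToThreshold",
}


def build_alarm_request(metric_name, namespace, threshold, existing_alarms,
                        dimensions=None, comparison="GreaterThanThreshold",
                        period=300, evaluation_periods=3):
    """Return kwargs for put_metric_alarm, or None if the metric is already covered."""
    if comparison not in ALLOWED_COMPARISONS:
        raise ValueError(f"unsupported comparison operator: {comparison}")

    wanted_dims = sorted((dimensions or {}).items())
    for alarm in existing_alarms:
        alarm_dims = sorted((d["Name"], d["Value"]) for d in alarm.get("Dimensions", []))
        same_metric = (alarm.get("Namespace") == namespace
                       and alarm.get("MetricName") == metric_name)
        if same_metric and alarm_dims == wanted_dims:
            return None  # duplicate: an alarm on this metric already exists

    return {
        # Hypothetical naming scheme; slashes are replaced for readability.
        "AlarmName": f"agent-{namespace.replace('/', '-')}-{metric_name}",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in wanted_dims],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": comparison,
        "TreatMissingData": "missing",
    }
```

Restricting the agent to a narrow function like this means it supplies only the metric, threshold, and dimensions, while the period, statistic, and naming stay under the tool author's control.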
Business Impact and Real-World Applications
Automating the observability analysis and report generation helps avoid disruptive 3 AM incidents by identifying and addressing issues proactively
The agentic workflow can be deployed to run periodically or triggered by alarms, ensuring the organization is constantly monitoring for potential problems
The detailed, actionable reports provide valuable insights to stakeholders, enabling data-driven decision-making and continuous improvement of the observability posture
Example Use Case
The demonstration uses a fictional pet adoption website as an example, with the agent analyzing the observability data and generating a comprehensive report
The report identifies performance issues, security concerns, and opportunities for cost optimization, providing clear recommendations for immediate, short-term, and long-term actions
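To illustrate how findings grouped into immediate, short-term, and long-term actions could be turned into an HTML report, here is a small sketch. The talk uses a Jinja template; this version substitutes Python's standard-library string.Template so that it runs without extra dependencies, and the page layout and the names PAGE, HORIZONS, and render_report are assumptions.

```python
from html import escape
from string import Template

# Stand-in for the Jinja template mentioned in the talk; the layout is illustrative.
PAGE = Template("""<html><body>
<h1>Observability report</h1>
<h2>Executive summary</h2>
<p>$summary</p>
<h2>Recommendations</h2>
$recommendations
</body></html>""")

HORIZONS = ("immediate", "short-term", "long-term")


def render_report(summary, recommendations):
    """Render a summary and recommendations grouped by time horizon as HTML."""
    blocks = []
    for horizon in HORIZONS:
        # Escape agent-written text so it cannot inject markup into the page.
        items = "".join(
            f"<li>{escape(text)}</li>" for text in recommendations.get(horizon, [])
        )
        blocks.append(f"<h3>{horizon.capitalize()}</h3><ul>{items}</ul>")
    return PAGE.substitute(
        summary=escape(summary),
        recommendations="\n".join(blocks),
    )
```

Escaping the agent's output before substitution matters because the text comes from a model and should not be trusted as markup.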