AWS re:Invent 2025 - Building agentic workflows for augmented observability (COP405)

Building Agentic Workflows for Augmented Observability

Introduction

  • Many engineers know the experience of being woken at 3 AM to dig through logs during a production outage
  • The presentation shows how new "agentic" tools can improve observability and help avoid these disruptive incidents

The Evolving Observability Landscape

  • In the past, customers often lacked the necessary metrics, logs, and traces to effectively solve problems
  • However, with the rise of tools like OpenTelemetry and CloudWatch, it is now easier than ever to generate observability data
  • The new challenge is the opposite: customers now have so much data that finding the relevant signal quickly is hard

The Role of AI in Observability

  • AI is a powerful tool for observability, as it can help process and analyze the large volumes of data
  • The key is to provide the AI agent with the right context and access to the necessary observability tools and data

Defining the Agent's Purpose and Capabilities

  • The agent is set up as an "observability expert" with the ability to:
    • Discover and analyze AWS resources and their relevant metrics
    • Correlate data from various observability sources
    • Create CloudWatch alarms based on identified issues
    • Generate actionable reports and recommendations

Providing the Agent with Necessary Context

  • The agent is given detailed instructions on how to approach the analysis, including:
    • Discovering resources by examining observability data rather than calling all APIs
    • Analyzing metrics, trends, health status, active alarms, and recent incidents
    • Linking to relevant postmortem information for any identified issues
    • Creating new alarms only for genuine gaps, not duplicates
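
The instructions above can be sketched as a system prompt. The summary does not show the actual prompt, so the wording, the `ROLE` and `INSTRUCTIONS` names, and the `build_system_prompt` helper below are all assumptions made for illustration.

```python
# Hypothetical wording: the talk describes these instructions,
# but the exact prompt text is not given in the summary.
ROLE = "You are an observability expert for AWS workloads."

INSTRUCTIONS = [
    "Discover resources from observability data instead of calling every service API.",
    "Analyze metrics, trends, health status, active alarms, and recent incidents.",
    "Link each identified issue to relevant postmortem information.",
    "Create a new alarm only when no existing alarm covers the gap.",
]


def build_system_prompt(role: str, instructions: list[str]) -> str:
    """Join the role statement and numbered instructions into one prompt string."""
    numbered = [f"{i}. {text}" for i, text in enumerate(instructions, start=1)]
    return "\n".join([role, "", "Follow these steps:", *numbered])


SYSTEM_PROMPT = build_system_prompt(ROLE, INSTRUCTIONS)
print(SYSTEM_PROMPT)
```

Keeping the instructions in a list makes it easy to review or reorder individual steps without editing one long string.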

Implementing the Agentic Workflow

  1. Natural Language Query: The analysis is triggered by a natural language query, either manually or from an alarm
  2. Agent Instantiation: The Strands Agents SDK is used to quickly create the observability agent with the defined context and access to the necessary tools
  3. Observability Data Analysis: The agent uses the provided MCP (Model Context Protocol) tools to query CloudWatch, CloudTrail, and Application Signals, analyzing the data for insights
  4. Report Generation: The agent generates a structured report with an executive summary, identified issues, operational audit, recommendations, and links to relevant incidents
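
The four steps above can be sketched end to end as follows. This is a minimal, dependency-free outline rather than the presenters' code: the `Report` fields mirror the sections named in step 4, while `stub_agent`, `render_html`, and `run_workflow` are hypothetical names. In the talk, the agent role is filled by a Strands agent with MCP tools; here a stub returns canned placeholder findings so the flow can run without AWS access.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Callable


@dataclass
class Report:
    """Structured report with the sections named in step 4."""
    executive_summary: str
    issues: list[str] = field(default_factory=list)
    operational_audit: list[str] = field(default_factory=list)
    recommendations: list[str] = field(default_factory=list)
    incident_links: list[str] = field(default_factory=list)


# In this sketch an "agent" is any callable that maps a query to a Report.
AgentFn = Callable[[str], Report]


def stub_agent(query: str) -> Report:
    """Stand-in for the real agent: returns canned placeholders, calls no AWS APIs."""
    return Report(
        executive_summary=f"Canned summary for query: {query}",
        issues=["placeholder issue: <service> latency is elevated"],
        recommendations=["placeholder recommendation: add a latency alarm"],
    )


def _section(title: str, items: list[str]) -> str:
    """Render one report section as a heading plus a bullet list."""
    bullets = "".join(f"<li>{escape(item)}</li>" for item in items)
    return f"<h2>{escape(title)}</h2><ul>{bullets}</ul>"


def render_html(report: Report) -> str:
    """Format the report as simple HTML, escaping all agent-provided text."""
    parts = [
        "<h1>Observability report</h1>",
        f"<h2>Executive summary</h2><p>{escape(report.executive_summary)}</p>",
        _section("Identified issues", report.issues),
        _section("Operational audit", report.operational_audit),
        _section("Recommendations", report.recommendations),
        _section("Related incidents", report.incident_links),
    ]
    return "\n".join(parts)


def run_workflow(query: str, agent: AgentFn) -> str:
    """Steps 1-4: take a natural language query, run the agent, render the report."""
    report = agent(query)
    return render_html(report)


if __name__ == "__main__":
    print(run_workflow("Why is checkout slow?", stub_agent))
```

Passing the agent in as a callable keeps the trigger (manual query or alarm) and the report formatting independent of whichever agent framework is used.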

Technical Implementation Details

  • The MCP manager is used to provide the agent with access to CloudWatch, CloudTrail, and Application Signals tools, allowing it to query the necessary data
  • A custom "CreateCloudWatchAlarm" tool is provided so that the agent creates alarms with the appropriate configuration, rather than attempting to create them directly
  • The report is generated using a Jinja template, allowing the agent's findings to be formatted into a human-readable HTML document
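
A custom alarm tool of the kind described above might look like the following sketch. The function names, defaults, and the duplicate check are assumptions, not the presenters' implementation. The request keys match the parameters of the CloudWatch `PutMetricAlarm` API, and the optional `client` is expected to be a boto3 CloudWatch client. When no client is passed, nothing is created.

```python
from typing import Any, Optional

ALLOWED_OPERATORS = {
    "GreaterThanThreshold",
    "GreaterThanOrEqualToThreshold",
    "LessThanThreshold",
    "LessThanOrEqualToThreshold",
}


def build_alarm_request(
    name: str,
    namespace: str,
    metric_name: str,
    threshold: float,
    comparison_operator: str = "GreaterThanThreshold",
    period_seconds: int = 60,
    evaluation_periods: int = 3,
    statistic: str = "Average",
) -> dict[str, Any]:
    """Validate inputs and build keyword arguments for PutMetricAlarm."""
    if comparison_operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {comparison_operator}")
    # Conservative rule for this sketch: only multiples of 60 seconds.
    if period_seconds <= 0 or period_seconds % 60 != 0:
        raise ValueError("period_seconds must be a positive multiple of 60")
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": statistic,
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": comparison_operator,
        "TreatMissingData": "missing",
    }


def create_cloudwatch_alarm(
    request: dict[str, Any],
    existing_alarms: list[dict[str, Any]],
    client: Optional[Any] = None,
) -> str:
    """Create the alarm only if no existing alarm watches the same metric.

    existing_alarms is a list of dicts with AlarmName, Namespace, and
    MetricName keys, such as the MetricAlarms entries from describe_alarms.
    The duplicate check is simplified: it ignores dimensions.
    """
    for alarm in existing_alarms:
        same_metric = (
            alarm.get("Namespace") == request["Namespace"]
            and alarm.get("MetricName") == request["MetricName"]
        )
        if same_metric:
            return f"skipped: {alarm.get('AlarmName')} already covers this metric"
    if client is None:
        return f"dry-run: would create {request['AlarmName']}"
    client.put_metric_alarm(**request)
    return f"created: {request['AlarmName']}"
```

Routing alarm creation through one validated function gives the agent a narrow, predictable interface and a single place to enforce the "no duplicates" rule.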

Business Impact and Real-World Applications

  • By automating the observability analysis and report generation, the solution can help avoid disruptive 3 AM incidents by proactively identifying and addressing issues
  • The agentic workflow can be deployed to run periodically or be triggered by alarms, so that potential problems are checked for continuously
  • The detailed, actionable reports provide valuable insights to stakeholders, enabling data-driven decision-making and continuous improvement of the observability posture

Example Use Case

  • The demonstration uses a fictional pet adoption website as an example, with the agent analyzing the observability data and generating a comprehensive report
  • The report identifies performance issues, security concerns, and opportunities for cost optimization, providing clear recommendations for immediate, short-term, and long-term actions
