AWS re:Invent 2025 - From Alert to Resolution: Supercharge AWS Ops with the Agentic AI SRE (AIM225)

Overview

  • The presentation covers how an "Agentic AI SRE" helps with production problems by automating the investigation and resolution of issues.
  • Key goals are to reduce mean time to resolution (MTTR), decrease outages, and improve system availability and uptime.
  • The Agentic AI SRE, named "Hawkeye", can be deployed as a SaaS product or, for security-conscious organizations, within the customer's own VPC.
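
MTTR, the first goal above, is the average time from detecting an incident to resolving it. The following is a minimal Python sketch of that arithmetic, not anything shown in the talk; the function name and the (detected, resolved) pair layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average resolution time over (detected_at, resolved_at) pairs.

    The pair layout is a hypothetical input format chosen for this sketch.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("at least one incident is required")
    # sum() needs a timedelta start value because the default start is the int 0
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    sample = [
        (datetime(2025, 12, 1, 3, 0), datetime(2025, 12, 1, 3, 30)),
        (datetime(2025, 12, 2, 9, 0), datetime(2025, 12, 2, 10, 30)),
    ]
    print(mean_time_to_resolution(sample))
```

Shortening the investigation phase reduces each incident's duration, which is how automated investigation would lower this average.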

Challenges Addressed

  • Dynamic cloud environments with multiple layers, telemetry sources, and alerts lead to complex troubleshooting workflows.
  • Manual processes of logging into various tools, analyzing dashboards, and coordinating teams are time-consuming and stressful.
  • Finger-pointing and lack of clear root cause analysis often occur during incident response.

Agentic AI SRE Capabilities

  • Automatically triggers investigations based on alerts from monitoring tools such as Datadog and Prometheus.
  • Analyzes a wide range of data sources, including traces, logs, metrics, and configuration details, to determine root cause.
  • Provides a detailed summary of the incident, including the description, timeline, evidence, root cause, and recommended corrective actions.
  • Delivers the analysis and recommendations through various channels, such as Slack, Teams, ServiceNow, or directly within coding assistants like Claude Code.
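
The flow described above (an alert triggers an investigation, telemetry is examined, and a structured report comes out) can be sketched as a small Python example. This is an illustrative toy and not Hawkeye's implementation or API: the `Alert` and `InvestigationReport` classes, the `investigate` and `render` functions, and the keyword-matching heuristic are all assumptions. Only the report sections (description, timeline, evidence, root cause, corrective actions) come from the summary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Alert:
    """Hypothetical alert payload, roughly what a monitoring tool might send."""
    source: str        # e.g. "datadog" or "prometheus"
    service: str
    message: str
    fired_at: datetime


@dataclass
class InvestigationReport:
    """Report sections as listed in the talk summary."""
    description: str
    timeline: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    root_cause: str = "undetermined"
    corrective_actions: list[str] = field(default_factory=list)


def investigate(alert: Alert, telemetry: dict[str, list[str]]) -> InvestigationReport:
    """Toy investigation: collect telemetry lines that mention the alerting service."""
    report = InvestigationReport(
        description=f"[{alert.source}] {alert.service}: {alert.message}"
    )
    report.timeline.append(f"{alert.fired_at.isoformat()} alert fired")
    for kind in ("traces", "logs", "metrics", "config"):
        for line in telemetry.get(kind, []):
            if alert.service in line:
                report.evidence.append(f"{kind}: {line}")
    # Placeholder heuristic (assumption): a real agent would reason over the evidence.
    if any(item.startswith("config:") for item in report.evidence):
        report.root_cause = "recent configuration change"
        report.corrective_actions.append("review or roll back the configuration change")
    return report


def render(report: InvestigationReport) -> str:
    """Format the report as plain text, e.g. for posting to a chat channel."""
    sections = [
        ("Description", [report.description]),
        ("Timeline", report.timeline),
        ("Evidence", report.evidence),
        ("Root cause", [report.root_cause]),
        ("Corrective actions", report.corrective_actions),
    ]
    lines = []
    for title, items in sections:
        lines.append(f"{title}:")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)


if __name__ == "__main__":
    alert = Alert("prometheus", "checkout", "p99 latency high",
                  datetime(2025, 12, 1, 3, 0, tzinfo=timezone.utc))
    telemetry = {
        "logs": ["checkout timeout calling payments", "search ok"],
        "config": ["checkout pool_size changed 50 -> 5"],
    }
    print(render(investigate(alert, telemetry)))
```

Keeping the report as a plain data structure, separate from rendering, is one way the same analysis could be delivered to several channels.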

Business Impact

  • Significant productivity gains for SRE teams by automating the investigation process.
  • Reduced outages and improved system availability, leading to cost savings for customers.
  • Examples of use cases:
    • Insurance company with complex on-premises and AWS footprint, focusing on reducing incident costs.
    • Shipping and logistics company managing data pipelines and nightly jobs, ensuring timely completion.
    • Bank in Australia using Agentic AI SRE for backup management and regulatory compliance.

Integration with Coding Assistants

  • Agentic AI SRE integrates with coding assistants like Claude Code by exposing a self-documenting API through an MCP server.
  • SREs can use the coding assistant to onboard Agentic AI SRE, manage connections, initiate investigations, and automate remediation steps.
  • It can also summarize past investigations, identify patterns, and help address recurring issues proactively.
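
As a rough illustration of the pattern-finding idea above, the sketch below counts how often the same service and root cause appear together across past investigations. It is a hedged example only: the record fields (`service`, `root_cause`) and the function name are hypothetical, and the talk summary does not describe how the product actually detects patterns.

```python
from collections import Counter


def recurring_root_causes(investigations, min_count=2):
    """Return (service, root_cause) pairs seen at least min_count times.

    Each investigation is assumed to be a dict with "service" and
    "root_cause" keys; this layout is an assumption for the sketch.
    """
    counts = Counter((inv["service"], inv["root_cause"]) for inv in investigations)
    # most_common() orders pairs from most to least frequent
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    history = [
        {"service": "checkout", "root_cause": "connection pool exhausted"},
        {"service": "search", "root_cause": "disk full"},
        {"service": "checkout", "root_cause": "connection pool exhausted"},
        {"service": "checkout", "root_cause": "bad deploy"},
    ]
    for (service, cause), n in recurring_root_causes(history):
        print(f"{service}: {cause} ({n} times)")
```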

Deployment and Pricing

  • Self-service onboarding available through the AWS Marketplace, with a free trial and pay-as-you-go model.
  • Enterprise customers can deploy Agentic AI SRE within their VPC, use their own language models, and integrate with custom observability and incident management tools.

Conclusion

  • Agentic AI SRE, or "Hawkeye," aims to speed up incident response in production environments by automating the investigation process.
  • The solution integrates with various monitoring and observability tools, providing a comprehensive and streamlined approach to root cause analysis and resolution.
  • By empowering SRE teams with AI-driven insights and automation, Agentic AI SRE promises to deliver significant productivity gains, reduced outages, and improved system availability.
