Accelerate innovation with AI-powered operations (COP315)
Evolution of AI-Ops Services
Rule-based systems:
Precise with low error rates (e.g., AWS Config, Trusted Advisor)
Machine Learning for Anomaly Detection:
Provides probabilities, not certainties (like weather forecasts)
Current Industry State: AI for Incident Management
Ensemble of techniques (rule-based, ML, AI, generative AI) for incident correlation, root cause analysis, and remediation suggestions
Future Expectations:
Predictive insights to forecast future disruptions and capacity needs
Fully automated, self-healing systems (long-term goal)
Challenges Faced by Customers
Overwhelming data volume: Difficulty in knowing where to start investigations and identifying "what changed" in the application.
Proactive detection: Challenges in predicting and identifying unknown issues before they become critical.
Existing AI-Ops Capabilities in Amazon CloudWatch
Pattern Analytics: Automatically surfaces key patterns in log query results, condensing large volumes of log data into a short list of recurring message shapes.
Compare Mode: Visually compares pattern trends across two time periods to show what changed.
Anomaly Detection: Continuously evaluates incoming logs, compares them to historical baselines, and proactively notifies users of anomalous trends.
Natural Language Query Generation: Lets users generate CloudWatch Logs Insights and Metrics Insights queries from natural-language prompts.
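The ideas behind the first three capabilities can be illustrated with a minimal Python sketch. This is not how CloudWatch implements them. The masking rule, the function names (to_pattern, pattern_counts, compare, anomaly_score), and the z-score baseline are all assumptions chosen for illustration: variable tokens are masked so similar log lines collapse into one pattern, pattern counts are diffed across two time periods, and a new count is scored against its historical baseline.

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Hypothetical masking rule: treat hex ids and numbers as variable tokens.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_pattern(line: str) -> str:
    """Collapse variable tokens so similar log lines share one pattern."""
    return _VARIABLE.sub("<*>", line)


def pattern_counts(lines):
    """Count how many log lines fall under each pattern."""
    return Counter(to_pattern(line) for line in lines)


def compare(previous, current):
    """Return {pattern: change in count} for patterns whose count differs
    between the previous and the current time period."""
    prev, cur = pattern_counts(previous), pattern_counts(current)
    return {
        p: cur[p] - prev[p]
        for p in prev.keys() | cur.keys()
        if cur[p] != prev[p]
    }


def anomaly_score(history, observed):
    """z-score of an observed count against its historical baseline.
    A larger magnitude means 'more unusual'; it is a signal, not a verdict."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if observed == mu else float("inf")
    return (observed - mu) / sigma


if __name__ == "__main__":
    previous = ["request 12 took 30 ms", "request 13 took 45 ms"]
    current = ["request 14 took 31 ms"] + ["timeout after 5000 ms"] * 3
    print(compare(previous, current))
    print(anomaly_score([10, 12, 11, 9, 13], 30))
```

In this sketch the comparison shows a new "timeout after <*> ms" pattern in the current period, which is the kind of "what changed" signal Compare Mode is described as highlighting. The score is a degree of unusualness rather than a yes/no answer, in line with the talk's point that ML-based detection provides probabilities, not certainties.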
Introducing Amazon DevOps Guru
Motivation:
Lack of comprehensive data for triage
Difficulty in correlating signals to identify root causes
Alarm and tool fatigue
Scattered data across multiple tools
Working Backwards Process:
Identified the following key themes for an ideal AI-Ops experience:
Automatic instrumentation and data collection
Detecting and prioritizing alerts
Guided root cause analysis and operational troubleshooting
Recommended actions
Continuous learning and improvement
Demonstration of Amazon DevOps Guru
Incident scenario: Increased load on a multi-tenant application, causing issues in the visit booking service
Amazon DevOps Guru automatically analyzes various telemetry signals and provides guided troubleshooting
Identifies root cause, suggests mitigation actions, and provides AWS-managed runbooks
Getting Started with Amazon DevOps Guru
Essentials:
Create an Investigation Group for common configuration
Enable Application Signals and AWS X-Ray, and upgrade the CloudWatch and Fluent Bit agents
Best Practices:
Enable CloudTrail logs for change detection
Configure alarms to automatically create investigations
Entry Points:
Auto-triggered investigations from CloudWatch alarms
Start investigations manually from Amazon Q chat or from embedded dashboards
Key Takeaways:
AI-Ops is evolving from rule-based systems toward AI-powered incident management
Amazon CloudWatch provides AI-Ops capabilities like Pattern Analytics, Compare Mode, and Anomaly Detection
Amazon DevOps Guru offers a guided, AI-powered operational troubleshooting experience
Getting started involves setting up the essentials and adopting best practices
Amazon DevOps Guru can be accessed through alarms, Q chat, and embedded dashboards