Accelerate innovation with AI-powered operations (COP315)

Evolution of AI-Ops Services

  1. Rule-based systems:

    • Precise with low error rates (e.g., AWS Config, Trusted Advisor)
  2. Machine Learning for Anomaly Detection:

    • Provides probabilities, not certainties (like weather forecasts)
  3. Current Industry State: AI for Incident Management

    • Ensemble of techniques (rule-based, ML, AI, generative AI) for incident correlation, root cause analysis, and remediation suggestions
  4. Future Expectations:

    • Predictive insights to forecast future disruptions and capacity needs
    • Fully automated, self-healing systems (long-term goal)
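
As a rough illustration of the "probabilities, not certainties" point in item 2, the Python sketch below scores a new data point against a historical baseline. It assumes the history is roughly normally distributed. The function name `anomaly_score` and the sample latencies are made up for illustration, and this is not a description of how any AWS service computes anomalies.

```python
from statistics import NormalDist, mean, stdev

def anomaly_score(history, value):
    """Return a score in [0, 1]: how unusual `value` is relative to `history`.

    Assumption: `history` has at least two points and is roughly normal.
    The score is the share of the fitted distribution that lies closer to
    the mean than `value` does, so values near the mean score close to 0.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: anything different from the constant is maximally unusual.
        return 0.0 if value == mu else 1.0
    z = abs(value - mu) / sigma
    return 2 * NormalDist().cdf(z) - 1

# Made-up latency samples in milliseconds.
latencies_ms = [102, 98, 105, 99, 101, 97, 103, 100]

print(anomaly_score(latencies_ms, 101))  # close to the baseline, low score
print(anomaly_score(latencies_ms, 140))  # far from the baseline, score near 1
```

The output is a degree of suspicion rather than a verdict, which is why an alerting threshold (for example, only flag scores above 0.99) remains a judgment call, much like acting on a weather forecast.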

Challenges Faced by Customers

  1. Overwhelming data volume: Difficulty in knowing where to start investigations and identifying "what changed" in the application.
  2. Proactive detection: Challenges in predicting and identifying unknown issues before they become critical.

Existing AI-Ops Capabilities in Amazon CloudWatch

  1. Pattern Analytics: Automatically surfaces key patterns in log query results, helping to synthesize large volumes of data.
  2. Compare Mode: Visually compares pattern trends across time periods to identify changes.
  3. Anomaly Detection: Continuously evaluates incoming logs, compares them to historical baselines, and proactively notifies users of anomalous trends.
  4. Natural Language Query Generation: Allows users to generate CloudWatch Logs Insights and Metrics Insights queries using natural language.
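
The ideas behind Pattern Analytics and Compare Mode can be sketched in a few lines of Python: collapse variable tokens so similar log lines share one pattern, then diff the pattern counts between two time periods. This is only a conceptual sketch. The masking rules, the function names (`to_pattern`, `pattern_counts`, `compare`), and the sample log lines are all assumptions for illustration, not the CloudWatch implementation.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable tokens with placeholders.
_TOKEN_RULES = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]

def to_pattern(line):
    """Reduce a log line to its pattern by masking IDs and numbers."""
    for regex, placeholder in _TOKEN_RULES:
        line = regex.sub(placeholder, line)
    return line

def pattern_counts(lines):
    """Count how often each pattern occurs in a batch of log lines."""
    return Counter(to_pattern(line) for line in lines)

def compare(baseline, current):
    """Return (pattern, count change) pairs, largest absolute change first."""
    before, after = pattern_counts(baseline), pattern_counts(current)
    diffs = {p: after[p] - before[p] for p in set(before) | set(after)}
    changed = [(p, d) for p, d in diffs.items() if d != 0]
    return sorted(changed, key=lambda pd: (-abs(pd[1]), pd[0]))

# Made-up log lines for two time periods.
baseline = [
    "GET /visits took 12 ms",
    "GET /visits took 15 ms",
    "user 42 logged in",
]
current = [
    "GET /visits took 900 ms",
    "timeout after 30 s calling visits",
    "timeout after 30 s calling visits",
    "user 7 logged in",
]

for pattern, change in compare(baseline, current):
    print(f"{change:+d}  {pattern}")
```

Here the new timeout pattern surfaces first because its count changed the most, while the login pattern drops out because its frequency is unchanged between the two periods.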

Introducing Amazon DevOps Guru

  1. Motivation:

    • Lack of comprehensive data for triage
    • Difficulty in correlating signals to identify root causes
    • Alarm and tool fatigue
    • Scattered data across multiple tools
  2. Working Backwards Process:

    • Identified key themes for an ideal AI-Ops experience:
        • Automatic instrumentation and data collection
        • Detecting and prioritizing alerts
        • Guided root cause analysis and operational troubleshooting
        • Recommended actions
        • Continuous learning and improvement
  3. Demonstration of Amazon DevOps Guru

    • Incident scenario: Increased load on a multi-tenant application, causing issues in the visit booking service
    • Amazon DevOps Guru automatically analyzes various telemetry signals and provides guided troubleshooting
    • Identifies root cause, suggests mitigation actions, and provides AWS-managed runbooks

Getting Started with Amazon DevOps Guru

  1. Essentials:

    • Create an Investigation Group for common configuration
    • Enable Application Signals and AWS X-Ray, and upgrade the CloudWatch and Fluent Bit agents
  2. Best Practices:

    • Enable CloudTrail logs for change detection
    • Configure alarms to automatically create investigations
  3. Entry Points:

    • Auto-triggered investigations from CloudWatch alarms
    • Initiate investigations from Q chat or embedded dashboards
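
To make the change-detection best practice concrete, the Python sketch below shows the kind of "what changed before the alarm fired?" filtering that CloudTrail data makes possible. It is a simplified, self-contained illustration: the `ChangeEvent` type, the `recent_changes` helper, the read-only prefix list, and the sample events are assumptions for this example and do not reflect how the service itself correlates changes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class ChangeEvent:
    time: datetime
    name: str       # an API event name, as CloudTrail would record it
    resource: str   # made-up resource identifier

# Assumption: calls starting with these prefixes only read state.
READ_ONLY_PREFIXES = ("Get", "List", "Describe")

def recent_changes(events, alarm_time, lookback=timedelta(hours=1)):
    """Return mutating events in the lookback window before the alarm, newest first."""
    start = alarm_time - lookback
    hits = [
        e for e in events
        if start <= e.time <= alarm_time
        and not e.name.startswith(READ_ONLY_PREFIXES)
    ]
    return sorted(hits, key=lambda e: e.time, reverse=True)

# Made-up sample data.
alarm_time = datetime(2024, 12, 3, 12, 0, tzinfo=timezone.utc)
events = [
    ChangeEvent(alarm_time - timedelta(minutes=5), "DescribeInstances", "i-0abc"),
    ChangeEvent(alarm_time - timedelta(minutes=20), "UpdateFunctionConfiguration", "visits-booking-fn"),
    ChangeEvent(alarm_time - timedelta(hours=3), "UpdateService", "visits-service"),
]

for event in recent_changes(events, alarm_time):
    print(event.time.isoformat(), event.name, event.resource)
```

Only the configuration update 20 minutes before the alarm is returned: the read-only call is ignored and the older deployment falls outside the one-hour window. Without change logs enabled, there is nothing to feed this kind of filter.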

Key Takeaways:

  • AI-Ops is an evolving landscape, moving from rule-based systems to AI-powered incident management
  • Amazon CloudWatch provides AI-Ops capabilities like Pattern Analytics, Compare Mode, and Anomaly Detection
  • Amazon DevOps Guru offers a guided, AI-powered operational troubleshooting experience
  • Getting started involves setting up the essentials and adopting best practices
  • Amazon DevOps Guru can be accessed through alarms, Q chat, and embedded dashboards
