[NEW LAUNCH] Investigate operational issues faster with AI (COP379-NEW)
Leveraging AI for Faster Incident Investigation and Resolution
Understanding the Challenges of Reactive Monitoring
The presenter, Ana, shares a personal experience of dealing with an incident as an operations engineer in a previous role:
One day, there was an unexpected issue with the service, leading to a sudden drop in requests.
Ana struggled to identify the root cause because traces, metrics, and logs lived in separate tools, which made the signals hard to correlate.
The team tried various fixes, including rebooting the infrastructure, but the issue persisted until they finally identified the problem: a client was unable to retrieve a long-lived token.
This experience highlights the common challenges in incident response, such as:
Gathering and correlating data from different sources
Identifying the root cause of issues in complex, distributed systems
Feeling stressed and desperate when trying to resolve the problem
Addressing Challenges with AI-Powered Operations (AIOps)
AIOps is not a magical solution, but a set of algorithms and tools that leverage AI and machine learning to assist operators.
To effectively leverage AIOps, you need to have the right foundations in place:
Instrument your applications to collect the necessary telemetry data
Use standardized conventions for metrics, logs, and tracing
Observe your system from multiple perspectives (inside-out and outside-in)
Implement real-user monitoring and synthetic canaries
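As a minimal sketch of what "standardized conventions" can look like for logs, the Python snippet below emits every log line as a JSON object with the same top-level keys, so that logs from different services can be filtered and correlated on shared fields. The field names, the `JsonFormatter` class, and the `booking-service` logger name are illustrative assumptions, not a schema prescribed in the talk.

```python
import json
import logging
import uuid

# Hypothetical shared field names; the talk does not prescribe a schema.
STANDARD_FIELDS = ("timestamp", "level", "service", "operation", "trace_id", "message")


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with the same top-level keys."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": int(record.created * 1000),  # epoch milliseconds
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "operation": getattr(record, "operation", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("booking-service")
# Per-request context travels in `extra`, so each line carries the same keys.
logger.info(
    "token fetch failed",
    extra={
        "service": "booking-service",
        "operation": "GetToken",
        "trace_id": uuid.uuid4().hex,
    },
)
```

Carrying a trace identifier on every line is one way to link a log entry back to the matching trace, which is the kind of correlation that was hard to do by hand in the incident described earlier.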
Existing AIOps Tools in AWS
CloudWatch Metric Anomaly Detection:
Leverages machine learning to detect anomalies in metrics, including custom metrics.
Considerations include choosing the right metric, tuning the width of the expected band (the number of standard deviations), and handling sparse data.
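The considerations above map onto a few alarm parameters. The sketch below builds the arguments for the CloudWatch `put_metric_alarm` call using the `ANOMALY_DETECTION_BAND` metric math function: the second argument of the expression is the band width in standard deviations, and `TreatMissingData` controls how sparse data is handled. The namespace, metric name, alarm name, and default values are hypothetical, and alarming only on the lower bound is an assumption chosen to match a "sudden drop in requests" scenario.

```python
def anomaly_alarm_params(namespace, metric_name, band_width=2.0,
                         period_seconds=300, treat_missing_data="missing"):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    band_width is the number of standard deviations used for the expected
    band: a larger value gives a wider band and fewer alarms.
    """
    if band_width <= 0:
        raise ValueError("band_width must be positive")
    return {
        "AlarmName": f"{metric_name}-anomaly",  # hypothetical naming scheme
        # Alarm only when the metric falls below the expected band.
        "ComparisonOperator": "LessThanLowerThreshold",
        "EvaluationPeriods": 3,
        # How periods without data points are evaluated (sparse metrics).
        "TreatMissingData": treat_missing_data,
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": period_seconds,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "ReturnData": True,
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width:g})",
                "Label": "expected range",
            },
        ],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical custom metric published by the application.
params = anomaly_alarm_params("BookingService", "RequestCount", band_width=3)
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to review and unit test before anything is created in an account.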
CloudWatch Logs Insights:
Allows you to search and analyze log data across multiple log groups and accounts.
Helps detect patterns in log data and identify changes or anomalies.
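To make this concrete, here is a hedged sketch of running a query across several log groups from Python. It builds the arguments for the CloudWatch Logs `start_query` call and polls `get_query_results` for the outcome. The two query strings, the log group name, and the `ERROR` filter are illustrative assumptions, not queries shown in the talk.

```python
import time
from datetime import datetime, timedelta, timezone

# Count error lines in 5-minute buckets to spot a change in error rate.
ERROR_TREND_QUERY = (
    "fields @timestamp, @message\n"
    "| filter @message like /ERROR/\n"
    "| stats count(*) as errors by bin(5m)"
)

# Cluster similar error lines into patterns.
ERROR_PATTERN_QUERY = (
    "filter @message like /ERROR/\n"
    "| pattern @message"
)


def start_query_params(log_groups, query, lookback_minutes=60, now=None):
    """Build keyword arguments for the CloudWatch Logs start_query call."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    return {
        "logGroupNames": list(log_groups),
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": query,
        "limit": 100,
    }


def run_query(params, poll_seconds=1.0):
    """Run the query and return result rows (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return response["results"]
        time.sleep(poll_seconds)


# Hypothetical log group name.
params = start_query_params(["/app/booking-service"], ERROR_PATTERN_QUERY, 30)
```

Running the trend query first and the pattern query second is one possible workflow: the first shows when the error rate changed, and the second groups what the errors have in common.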
New AIOps Features in AWS
Explore Related:
Automatically identifies related resources, logs, and metrics for a given resource or telemetry.
Provides a unified view of the relevant data to aid in investigation and troubleshooting.
Amazon DevOps Guru for operational investigations:
Powered by generative AI and conventional machine learning techniques.
Scans various data sources (logs, metrics, traces, etc.) to identify the root cause of issues and provide remediation steps.
Can be triggered automatically by CloudWatch alarms or initiated manually from the console.
Demonstration
The presenters showcase a demo using the new AIOps features in AWS:
A sample application with a distributed architecture is introduced, where a sudden increase in booking requests leads to DynamoDB throttling and degraded service availability.
The demonstration shows how Amazon DevOps Guru can be used to quickly identify the root cause (a single customer driving high traffic) and suggest mitigation steps (increasing DynamoDB capacity).
The integration with Slack is also demonstrated, where Amazon DevOps Guru provides updates and suggestions directly in the operations channel.
Key Takeaways
Establish the right foundations for observability, such as instrumentation, standardized conventions, and multi-perspective monitoring.
Leverage existing AIOps tools like CloudWatch Metric Anomaly Detection and CloudWatch Logs Insights to detect and investigate issues.
Explore the new AIOps features in AWS, such as Explore Related and Amazon DevOps Guru, to simplify incident response and root cause analysis.
Integrate AIOps tools with your existing communication channels (e.g., Slack) to streamline the incident investigation and resolution process.