[NEW LAUNCH] Investigate operational issues faster with AI (COP379-NEW)

Understanding the Challenges of Reactive Monitoring

The presenter, Ana, shares a personal experience of dealing with an incident as an operations engineer in a previous role:

One day, there was an unexpected issue with the service, leading to a sudden drop in requests.
Ana struggled to identify the root cause, as she was using multiple tools for tracing, metrics, and logs, making it difficult to correlate the information.
The team tried various fixes, including rebooting the infrastructure, but the issue persisted until they finally identified the problem: a client was unable to retrieve a long-lived token.
This experience highlights the common challenges in incident response, such as:
- Gathering and correlating data from different sources
- Identifying the root cause of issues in complex, distributed systems
- Feeling stressed and desperate when trying to resolve the problem

Addressing Challenges with AI-Powered Operations (AIOps)

AIOps is not a magical solution, but a set of algorithms and tools that leverage AI and machine learning to assist operators.

To leverage AIOps effectively, you need the right observability foundations in place, such as instrumentation, standardized conventions, and multi-perspective monitoring.
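
One of the foundations named in the key takeaways is instrumentation with standardized conventions. A minimal sketch of what that could look like for logs, assuming a Python service: every record is emitted as one JSON object with the same keys, so that different tools can correlate it later. The field names (`service`, `request_id`) and the service name `booking-service` are illustrative assumptions, not conventions prescribed in the talk.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with a fixed set of keys."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service,
            # request_id is a hypothetical correlation field passed via `extra`.
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes standardized JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("booking-service")
    log.info("token refresh failed", extra={"request_id": "req-123"})
```

Keeping the key set fixed across services is the point of the sketch: a query or an automated investigation can then filter on the same field everywhere.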

Existing AIOps Tools in AWS

CloudWatch Metric Anomaly Detection:

Leverages machine learning to detect anomalies in metrics, including custom metrics.
Considerations include choosing the right metric, tuning the band threshold (the number of standard deviations treated as normal), and handling sparse data.
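
The three considerations above map onto parameters of a CloudWatch anomaly detection alarm. The sketch below builds the request for the `put_metric_alarm` API as a plain dictionary; the namespace, metric name, alarm name, and the choice of `notBreaching` for missing data are assumptions made for illustration, not settings recommended in the talk.

```python
def anomaly_alarm_params(namespace: str, metric_name: str,
                         band_width: float = 2, period: int = 300) -> dict:
    """Build put_metric_alarm arguments for an anomaly detection band alarm."""
    return {
        "AlarmName": f"{metric_name}-anomaly",  # hypothetical naming scheme
        # Fire when the metric leaves the band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",
        # Sparse data: treat missing datapoints as healthy (an assumption).
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": period,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                # band_width is the number of standard deviations.
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} (expected)",
                "ReturnData": True,
            },
        ],
    }


def create_alarm(params: dict) -> None:
    """Send the request; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    print(anomaly_alarm_params("BookingService", "RequestCount"))
```

A wider band (a larger `band_width`) produces fewer alarms at the cost of missing smaller deviations.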

CloudWatch Logs Insights:

Allows you to search and analyze log data across multiple log groups and accounts.
Helps detect patterns in log data and identify changes or anomalies.
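
As an illustration of pattern detection in logs, the sketch below prepares a query that filters error lines and groups them with the `pattern` command, then runs it through the CloudWatch Logs `start_query` and `get_query_results` APIs. The log group name and the `ERROR` filter are assumptions; the talk does not show a specific query.

```python
import time

# Group error lines into recurring patterns.
ERROR_PATTERN_QUERY = (
    "fields @timestamp, @message\n"
    "| filter @message like /ERROR/\n"
    "| pattern @message"
)


def start_query_params(log_groups, minutes: int = 60, now=None) -> dict:
    """Build start_query arguments covering the last `minutes` minutes."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupNames": list(log_groups),
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": ERROR_PATTERN_QUERY,
    }


def run_query(params: dict, poll_seconds: float = 1.0) -> list:
    """Run the query and wait for it to finish; requires boto3 and credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return response["results"]
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # "/app/booking-service" is a hypothetical log group name.
    print(start_query_params(["/app/booking-service"]))
```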

New AIOps Features in AWS

Explore Related:

Automatically identifies related resources, logs, and metrics for a given resource or telemetry.
Provides a unified view of the relevant data to aid in investigation and troubleshooting.

Amazon DevOps Guru for operational investigations:

Powered by generative AI and conventional machine learning techniques.
Scans various data sources (logs, metrics, traces, etc.) to identify the root cause of issues and provide remediation steps.
Can be triggered automatically by CloudWatch alarms or initiated manually from the console.

Demonstration

The presenters showcase a demo using the new AIOps features in AWS:

A sample application with a distributed architecture is introduced, where a sudden increase in booking requests leads to DynamoDB throttling and a degradation in service availability.
The demonstration shows how Amazon DevOps Guru can be used to quickly identify the root cause (a single customer driving high traffic) and suggest mitigation steps (increasing DynamoDB capacity).
The integration with Slack is also demonstrated, where Amazon DevOps Guru provides updates and suggestions directly in the operations channel.
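
The root cause in the demo, a single customer driving most of the traffic, is the kind of finding an operator would otherwise have to derive by hand. The sketch below is not how the AWS tooling works internally; it only illustrates the manual equivalent, assuming request records that carry a hypothetical `customer_id` field and an arbitrary 50% share threshold.

```python
from collections import Counter


def dominant_caller(records, key: str = "customer_id", threshold: float = 0.5):
    """Return (caller, share) if one caller exceeds `threshold` of all requests.

    Records without the key are ignored. Returns None when there is no
    traffic or no single caller dominates.
    """
    counts = Counter(r[key] for r in records if key in r)
    total = sum(counts.values())
    if total == 0:
        return None
    caller, hits = counts.most_common(1)[0]
    share = hits / total
    return (caller, share) if share >= threshold else None


if __name__ == "__main__":
    # Hypothetical sample: eight requests from one customer, two from others.
    sample = [{"customer_id": "acme"}] * 8 + [
        {"customer_id": "globex"},
        {"customer_id": "initech"},
    ]
    print(dominant_caller(sample))
```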

Key Takeaways

Establish the right foundations for observability, such as instrumentation, standardized conventions, and multi-perspective monitoring.

Leverage existing AIOps tools like CloudWatch Metric Anomaly Detection and CloudWatch Logs Insights to detect and investigate issues.

Explore the new AIOps features in AWS, such as Explore Related and Amazon DevOps Guru, to simplify incident response and root cause analysis.

Integrate AIOps tools with your existing communication channels (e.g., Slack) to streamline the incident investigation and resolution process.

Leveraging AI for Faster Incident Investigation and Resolution