Don't get stuck: How connected telemetry keeps you moving forward (COP322)

Key Takeaways

  • The goal of incident troubleshooting is to determine which of five causes is responsible for the incident: a change, a change in inputs, a breach of limits, a component failure, or a dependency failure.
  • Troubleshooting involves navigating quickly between observability tools and data sources to find the issue without getting stuck or overwhelmed.
  • Comprehensive application instrumentation, using tools like OpenTelemetry, is crucial for fast and effective troubleshooting.
  • The ability to break down metrics and logs by dimensions such as instance, API, and customer is essential for isolating the root cause (a query sketch follows this list).
  • Automating the troubleshooting process with AI-driven tools can help teams investigate and mitigate issues more efficiently.
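
The talk itself doesn't show code, but as a rough illustration of that kind of breakdown, the sketch below runs a CloudWatch Logs Insights query through boto3. The log group name and the `api`, `customer_id`, and `level` fields are assumptions; they depend entirely on how your own logs are structured.

```python
# A hedged sketch: count errors per API and customer with a CloudWatch
# Logs Insights query via boto3. The log group name and the api,
# customer_id, and level fields are hypothetical.
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

query = """
fields @timestamp
| filter level = "ERROR"
| stats count(*) as errors by api, customer_id
| sort errors desc
"""

query_id = logs.start_query(
    logGroupName="/my-service/application",  # hypothetical log group
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=query,
)["queryId"]

# Logs Insights queries run asynchronously; poll until done.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```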

Troubleshooting Approach

  1. Identify the Five Causes

    • Change: Was there a recent change (deployment, configuration, etc.) that could have caused the issue?
    • Change in Inputs: Has the workload or request pattern changed, leading to overload or other issues?
    • Breach of Limits: Have you hit a scaling limit (CPU, memory, etc.) or a dependency limit (certificates, quotas, etc.)?
    • Component Failure: Has a specific component (instance, Availability Zone, etc.) failed, or is it performing worse than its peers?
    • Dependency Failure: Has a dependency (remote service, database, etc.) failed, or is it performing poorly?
  2. Navigate Efficiently

    • Use observability tools to quickly navigate between infrastructure, applications, and dependencies.
    • Leverage "information scent" to follow the most promising leads and avoid getting stuck.
    • Automate navigation as much as possible to reduce the need for manual steps and context switching.
  3. Leverage Comprehensive Instrumentation

    • Instrument applications with OpenTelemetry to capture detailed telemetry (metrics, logs, and traces).
    • Ensure the instrumentation can break metrics and logs down by relevant dimensions (e.g., instance, API, customer); a minimal instrumentation sketch follows this list.
    • Use indexing and other features to enable fast, efficient querying of the telemetry data.
  4. Accelerate Investigation with AI

    • Leverage AI-driven tools that can automatically investigate the issue, follow the five causes, and provide hypotheses and recommended actions.
    • These tools can help teams parallelize the investigation and avoid getting stuck or missing important clues.
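
Picking up step 3: below is a minimal sketch of what such instrumentation could look like with the OpenTelemetry Python SDK. The meter name, metric name, and the `api` and `customer_id` attribute keys are illustrative assumptions, not from the talk.

```python
# A minimal sketch of metric instrumentation with the OpenTelemetry
# Python SDK. The meter name, metric name, and the "api" and
# "customer_id" attribute keys are illustrative assumptions.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export to the console for the sketch; production code would point an
# OTLP exporter at a collector instead.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("example.service")

request_counter = meter.create_counter(
    "requests",
    unit="1",
    description="Count of handled requests",
)

def handle_request(api: str, customer_id: str) -> None:
    # Attaching the dimensions as attributes is what later lets you
    # break the metric down by API or by customer.
    request_counter.add(1, {"api": api, "customer_id": customer_id})

handle_request("CreateBot", "customer-42")
```

Attaching the same attribute keys to spans and log records keeps all three signals sliceable along the same dimensions during an investigation.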

Demonstration

The speaker demonstrated the troubleshooting process using Amazon CloudWatch and related tools:

  1. Navigated from the initial alarm to the load balancer, instances, and application-level metrics and logs to identify the issue.
  2. Recognized that the problem was likely in a dependent service (bot-forge) and shifted the investigation there.
  3. Leveraged the Application Insights feature to visualize the distributed trace and pinpoint the specific error (access denied) in the bot-forge service.
  4. Used CloudTrail to quickly find the recent change, a resource policy update, that was the root cause of the issue (a sketch of this kind of lookup follows the list).
  5. Demonstrated the new CloudWatch Investigator feature, which automatically followed the five causes, identified the root issue, and provided a hypothesis and recommended actions.
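
As a rough sketch of the CloudTrail step (step 4), the snippet below looks up recent resource-policy changes with boto3. The `PutResourcePolicy` event name and the one-hour window are assumptions for illustration; the demo's actual filter wasn't shown.

```python
# A hedged sketch: look up recent change events with boto3's CloudTrail
# API. The "PutResourcePolicy" event name and one-hour window are
# assumptions for illustration.
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeName": "EventName", "AttributeValue": "PutResourcePolicy"}
    ],
    StartTime=start,
    EndTime=end,
)

for event in response["Events"]:
    # Each event records when the change happened and who made it.
    print(event["EventTime"], event["EventName"], event.get("Username"))
```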

The speaker emphasized the importance of comprehensive instrumentation, efficient navigation between observability data sources, and the value of AI-driven troubleshooting tools in accelerating the investigation and mitigation process.
