The incident is over: Now what? (ARC207)

Incident Handling and Postmortem Incident Analysis

Key Takeaways:

  1. Engaging Quickly During Incidents: The organization agrees that engaging early in an incident is good. Leaders and VPs are willing to be paged at any time, and they encourage it because it shortens time to recovery.

  2. Communicating Quickly: The team recommends communicating to customers as soon as they are confident something is happening, even if the details are limited. They can iterate and provide more depth later, but the initial notification should be sent out quickly.

  3. Prioritizing Mitigation over Root Cause: Engineers are often fascinated by the root cause, but the team has to be disciplined in pushing for mitigation first and learning the root cause later. Mitigation is the key priority.

  4. Scaling Learnings Across the Organization: The team tries to ensure that learnings from incidents are not confined to a single team, but are scaled across the organization so that even teams not involved in the incident can benefit from the insights.

  5. Reviewing Incidents with Empathy: The team approaches the postmortem (or "COE", Correction of Error) process with empathy for the people involved, seeking to understand the context and information they had at the time rather than judging their decisions with hindsight.

  6. Defining and Sharing Organizational Tenets: The team defines a set of concise principles or "tenets" that the entire organization follows, which help guide decision-making and ensure consistency across teams.

Incident Detection and Response

  • The team uses various monitoring approaches, including metrics, alarms, canaries, and aggregate alarms, to detect issues.
  • They have a top-level dashboard to quickly identify multi-service issues that require a coordinated incident response.
  • The incident response team and the support team work in parallel to mitigate the issue and communicate with customers, respectively.
  • The focus is on fast mitigation, with techniques like shifting traffic away from failures, rolling back changes, restarting ("bouncing") services, and proactive scaling.
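
The aggregate-alarm idea above can be illustrated with a minimal sketch. It assumes a simple rule that is not stated in the talk: a coordinated response is warranted once alarms are firing in some minimum number of distinct services. The names `ServiceAlarm`, `needs_coordinated_response`, and `min_services`, as well as the service names, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAlarm:
    """One alarm reading for one service (hypothetical structure)."""
    service: str
    in_alarm: bool


def needs_coordinated_response(alarms, min_services=2):
    """Return True when alarms are firing in at least `min_services` distinct services.

    The threshold is an assumption chosen for illustration; several alarms
    in the same service count only once.
    """
    affected = {a.service for a in alarms if a.in_alarm}
    return len(affected) >= min_services


# Hypothetical readings: two storage alarms, one DNS alarm, compute healthy.
alarms = [
    ServiceAlarm("storage", True),
    ServiceAlarm("storage", True),
    ServiceAlarm("compute", False),
    ServiceAlarm("dns", True),
]

if needs_coordinated_response(alarms):
    print("Multi-service issue: start coordinated incident response")
```

Counting distinct services rather than raw alarms keeps one noisy service from escalating on its own, which is the point of an aggregate view.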

Postmortem and Action Items

  • Postmortems (or "COEs") are a distributed effort, with multiple teams responsible for specific components.
  • The postmortem process aims to understand the root cause through a structured analysis, including the "five whys" approach.
  • Action items are categorized as short-term, intermediate, and long-term, with a balance between incremental improvements and more substantial, systemic changes.
  • The team validates the effectiveness of the implemented solutions through activities like "game days" to ensure the issue is fully resolved.
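
One way to picture the action-item bookkeeping described above is the sketch below. It is an illustration under assumptions, not the team's tooling: the `Horizon` and `ActionItem` types, the rule that an item counts as resolved only after it is both implemented and validated (for example in a game day), and the example items are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Horizon(Enum):
    SHORT_TERM = "short-term"
    INTERMEDIATE = "intermediate"
    LONG_TERM = "long-term"


@dataclass
class ActionItem:
    title: str
    horizon: Horizon
    implemented: bool = False
    validated: bool = False  # assumed to mean: confirmed by a game day or similar

    @property
    def resolved(self):
        # Assumption: implementing a fix is not enough; it must also be validated.
        return self.implemented and self.validated


def group_by_horizon(items):
    """Group action items by their time horizon."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.horizon].append(item)
    return dict(grouped)


def unresolved(items):
    """Return titles of items that are not yet implemented and validated."""
    return [item.title for item in items if not item.resolved]


# Hypothetical action items from a single postmortem.
items = [
    ActionItem("Add alarm on queue depth", Horizon.SHORT_TERM, implemented=True, validated=True),
    ActionItem("Automate rollback for config pushes", Horizon.INTERMEDIATE, implemented=True),
    ActionItem("Redesign dependency to fail open", Horizon.LONG_TERM),
]

for title in unresolved(items):
    print("Still open:", title)
```

Keeping `implemented` and `validated` as separate flags makes the gap between shipping a fix and proving it works visible in the tracking data.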

Scaling Learnings Across the Organization

  • The team has an organization-wide "Ops" meeting where they review and discuss selected postmortems, allowing for broader sharing of insights and learnings.
  • They also work to automate and centralize solutions, like building platform-level services (e.g., Amazon Route 53, AWS Certificate Manager) to address common challenges.
  • The team defines and shares organizational "tenets" or principles to ensure consistency in decision-making and problem-solving across the company.
