AWS re:Invent 2025 - The Incident is Over: Now What? (COP216)

Introduction

  • Presenters: Georgia (Principal, AWS Enterprise Support) and Anthony (Principal Engineer, AWS Event Management Team)
  • Combined 20+ years of experience in incident management at AWS
  • Goal: Share lessons learned, tips, and tricks for incident management and post-incident analysis

Incident Detection and Engagement

  • Two main categories of event detection:
    1. Service-driven: Alarms on service metrics, subsystem metrics, and synthetic monitoring ("canaries")
    2. Customer-driven: Monitoring for traffic anomalies and customer impact reports
  • Aggregate alarms that indicate issues across multiple services trigger a full incident response
  • Engagement process:
    1. Single-service events: Engage the service team and AWS Support if customers are impacted
    2. Multi-service events: Engage AWS Incident Response (AIR), AWS Support, and "usual suspects" (core services like authentication, DNS, networking)
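
The engagement rules above can be sketched as a small routing function. This is only an illustration, not AWS's actual tooling: the `Event` type, the `teams_to_engage` function, and the team name strings are hypothetical, and adding the impacted service teams to a multi-service engagement is an assumption drawn from the parallel investigation described for the technical call.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the "usual suspects" named in the talk.
USUAL_SUSPECTS = ("authentication", "dns", "networking")


@dataclass(frozen=True)
class Event:
    impacted_services: frozenset[str]
    customers_impacted: bool


def teams_to_engage(event: Event) -> list[str]:
    """Return the teams to page, in order, for a detected event."""
    services = sorted(event.impacted_services)

    # Single-service event: the owning team, plus Support if customers are impacted.
    if len(services) <= 1:
        teams = [f"{name} service team" for name in services]
        if event.customers_impacted:
            teams.append("AWS Support")
        return teams

    # Multi-service event: AIR and Support first, then the usual suspects,
    # then (assumption) every impacted service team not already listed.
    teams = ["AWS Incident Response (AIR)", "AWS Support"]
    for name in list(USUAL_SUSPECTS) + services:
        team = f"{name} service team"
        if team not in teams:
            teams.append(team)
    return teams
```

For example, `teams_to_engage(Event(frozenset({"s3"}), True))` pages only the owning team and AWS Support, while an event spanning two or more services always starts with AIR.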

Incident Coordination and Communication

  • Technical call: Parallel investigation by all impacted teams to reach mitigation and resolution as quickly as possible
    • Supported by AWS Incident Response (AIR) team, who own the process, mental models, and tooling
    • Call led by a senior "call leader" to make critical decisions
    • Strong etiquette and mindset around what is discussed on the call vs. documented in tickets
  • Support call: Focuses on customer communication, recovery guidance, and tracking customer sentiment
    • Leverages data from the technical call to provide real-time updates to customers
    • Balances speed, accuracy, and depth of communications over time

Mitigation and Resolution

  • Focus is on mitigating customer impact first, then investigating root cause
  • Common mitigation techniques:
    1. Shifting traffic away from the failure (e.g. removing instances from load balancer)
    2. Rollbacks of recent deployments
    3. Scaling up resources to handle increased load
    4. Restarting components (as a last resort)
    5. Deploying configuration changes or new software versions (with high confidence)
  • Risk of recurrence is assessed, and temporary fixes are validated to ensure they are sustainable
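
As a rough illustration of the first technique (shifting traffic away from the failure), the sketch below drops unhealthy targets from a pool. The `shift_traffic_away` function, the `pool` mapping, and the `min_healthy` safeguard are assumptions made for this example and were not described in the talk.

```python
def shift_traffic_away(pool: dict[str, bool], min_healthy: int = 1) -> list[str]:
    """Return the targets that should keep receiving traffic.

    `pool` maps a target name to its latest health-check result.
    """
    healthy = [name for name, is_healthy in pool.items() if is_healthy]

    # Assumed safeguard: if too few targets look healthy, keep the whole pool
    # rather than turn a partial impairment into a complete outage.
    if len(healthy) < min_healthy:
        return list(pool)
    return healthy
```

With `{"a": True, "b": False, "c": True}` the function keeps `a` and `c`. If every target fails its health check, the full pool is returned unchanged, so the mitigation itself cannot remove all capacity.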

Post-Incident Analysis and Lessons Learned

  • Post-incident analysis document (Correction of Error, or COE):
    • Starts with an impact summary and customer experience timeline
    • Dives into root causes using the "5 Whys" approach to uncover multiple contributing factors
    • Identifies learnings and action items to prevent recurrence
  • Balancing speed and quality when writing the COE is critical to regaining customer trust
  • Action items are categorized by implementation timeline:
    1. Short-term (hours) to prevent immediate recurrence
    2. Mid-term (days/weeks) to build sustainable solutions
    3. Long-term (months) for systemic changes and reinvention
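
The three timelines above could be tracked with a simple classifier such as the one below. The cut-offs (under one day, under 30 days) are assumptions chosen for this sketch; the talk only gives the rough scales of hours, days/weeks, and months, and `classify_action_item` is a hypothetical name.

```python
from datetime import timedelta


def classify_action_item(estimated_effort: timedelta) -> str:
    """Bucket an action item by how long it is expected to take."""
    if estimated_effort < timedelta(days=1):
        return "short-term"  # hours: prevent immediate recurrence
    if estimated_effort < timedelta(days=30):
        return "mid-term"  # days/weeks: build a sustainable solution
    return "long-term"  # months: systemic change and reinvention
```

For instance, a six-hour configuration fix is classified as short-term and a 90-day redesign as long-term.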

Learning and Scaling

  • Team-level education on COEs and operational "tenets" (guiding principles)
  • Distributed changes via services like Trusted Advisor to scale learnings to customers
  • Centralized changes through the creation of new AWS services to solve recurring problems
    • Examples: Route 53, Elastic Load Balancing, AWS Certificate Manager, IAM, STS

Key Takeaways

  1. Engage leaders early and often during incidents
  2. Focus on fast, iterative customer communications during the event
  3. Prioritize mitigation over root cause analysis during the incident
  4. Treat incidents as powerful learning opportunities for the entire organization
  5. Define clear "tenets" to guide decision-making and collaboration
