AWS re:Invent 2025 - The incident is over: Now what? (COP216)
Introduction
Presenters: Georgia (Principal, AWS Enterprise Support) and Anthony (Principal Engineer, AWS Event Management Team)
Combined 20+ years of experience in incident management at AWS
Goal: Share lessons learned, tips, and tricks for incident management and post-incident analysis
Incident Detection and Engagement
Two main categories of event detection:
Service-driven : Alarms on service metrics, subsystem metrics, and synthetic monitoring ("canaries")
Customer-driven : Monitoring for traffic anomalies and customer impact reports
Aggregate alarms that fire across multiple services trigger a full incident response
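The service-driven detection above can be sketched as a simple synthetic probe. This is a minimal illustration, not how AWS canaries are actually built (those run via managed tooling such as CloudWatch Synthetics); the health-check callable, probe count, and failure threshold are all hypothetical.

```python
import time

def run_canary(health_check, max_failures=3, probes=5, interval_s=0):
    """Probe an endpoint repeatedly; return True if the alarm should fire.

    health_check: hypothetical callable returning True when the service
    responds correctly. Consecutive failures beyond the threshold count
    as sustained impact rather than transient noise.
    """
    consecutive = 0
    for _ in range(probes):
        try:
            ok = health_check()
        except Exception:
            ok = False
        consecutive = consecutive + 1 if not ok else 0
        if consecutive >= max_failures:
            return True  # sustained failure -> raise the alarm
        if interval_s:
            time.sleep(interval_s)
    return False
```

Requiring several consecutive failures before alarming is one common way to balance fast detection against false pages from a single dropped probe.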
Engagement process:
Single-service events : Engage the service team and AWS Support if customers are impacted
Multi-service events : Engage AWS Incident Response (AIR), AWS Support, and "usual suspects" (core services like authentication, DNS, networking)
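The engagement fan-out described above can be sketched as a routing rule. Team identifiers and the "usual suspects" list here are illustrative stand-ins, not actual AWS rosters or paging targets.

```python
# Core dependencies engaged by default on multi-service events
# (the "usual suspects" from the talk: authentication, DNS, networking).
CORE_SERVICES = {"authentication", "dns", "networking"}

def engage(services_impacted, customers_impacted):
    """Return the set of teams to page for an event.

    Single-service: page the owning team, plus Support if customers
    are impacted. Multi-service: page AIR, Support, and core services.
    """
    teams = set(services_impacted)
    if customers_impacted:
        teams.add("aws-support")
    if len(services_impacted) > 1:
        teams.add("aws-incident-response")  # AIR owns process and tooling
        teams.add("aws-support")
        teams |= CORE_SERVICES
    return teams
```

Engaging core dependencies preemptively on multi-service events reflects the idea that broad impact often traces back to a shared foundation.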
Incident Coordination and Communication
Technical call: Parallel investigation by all impacted teams to deliver fastest mitigation and resolution
Supported by AWS Incident Response (AIR) team, who own the process, mental models, and tooling
Call led by a senior "call leader" to make critical decisions
Strong etiquette and mindset around what is discussed on the call vs. documented in tickets
Support call: Focuses on customer communication, recovery guidance, and tracking customer sentiment
Leverages data from the technical call to provide real-time updates to customers
Balances speed, accuracy, and depth of communications over time
Mitigation and Resolution
Focus is on mitigating customer impact first, then investigating root cause
Common mitigation techniques:
Shifting traffic away from the failure (e.g. removing unhealthy instances from a load balancer)
Rollbacks of recent deployments
Scaling up resources to handle increased load
Restarting components (as a last resort)
Deploying configuration changes or new software versions (with high confidence)
Risk of recurrence is assessed, and temporary fixes are validated to ensure they are sustainable
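The first mitigation technique, shifting traffic away from the failure, can be sketched as draining unhealthy hosts from a routing pool. The pool structure is hypothetical; on AWS this maps to operations like deregistering a target from a load balancer target group. The guard against draining all capacity illustrates the point that a mitigation must not worsen impact.

```python
def shift_away(pool, unhealthy):
    """Return a new routing pool with unhealthy hosts removed.

    pool: list of host identifiers currently taking traffic.
    unhealthy: set of hosts believed to be inside the failure.
    """
    survivors = [h for h in pool if h not in unhealthy]
    if not survivors:
        # Refuse to drain the whole fleet: zero capacity is worse
        # than degraded capacity, so fail loudly instead.
        raise RuntimeError("refusing to remove all capacity from pool")
    return survivors
```

Returning a new pool rather than mutating in place makes the change easy to roll back, which matters when the mitigation itself turns out to be wrong.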
Post-Incident Analysis and Lessons Learned
Post-Incident Analysis (COE) document:
Starts with an impact summary and customer experience timeline
Dives into root causes using the "5 Whys" approach to uncover multiple contributing factors
Identifies learnings and action items to prevent recurrence
Balancing speed and quality in COE creation is critical to regain customer trust
Action items are categorized by implementation timeline:
Short-term (hours) to prevent immediate recurrence
Mid-term (days/weeks) to build sustainable solutions
Long-term (months) for systemic changes and reinvention
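The COE structure above, a "5 Whys" chain of contributing factors plus action items bucketed by time horizon, can be modeled as a small data structure. This is a toy sketch with made-up field names; AWS's actual COE template is internal and far richer.

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    horizon: str  # "short" (hours), "mid" (days/weeks), "long" (months)

@dataclass
class COE:
    impact_summary: str
    # Chain of answers from repeatedly asking "why?" -- each entry is
    # a contributing factor uncovered by the previous one.
    five_whys: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

    def add_why(self, contributing_factor):
        self.five_whys.append(contributing_factor)

    def items_by_horizon(self):
        """Group action items into the three timelines from the talk."""
        buckets = {"short": [], "mid": [], "long": []}
        for item in self.action_items:
            buckets[item.horizon].append(item.description)
        return buckets
```

Keeping the whys as an ordered chain preserves the causal path from symptom to systemic cause, which is what makes the long-term action items defensible.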
Learning and Scaling
Team-level education on COEs and operational "tenets" (guiding principles)
Distributed changes via services like Trusted Advisor to scale learnings to customers
Centralized changes through the creation of new AWS services to solve recurring problems
Examples: Route 53, Elastic Load Balancing, AWS Certificate Manager, IAM, STS
Key Takeaways
Engage leaders early and often during incidents
Focus on fast, iterative customer communications during the event
Prioritize mitigation over root cause analysis during the incident
Treat incidents as powerful learning opportunities for the entire organization
Define clear "tenets" to guide decision-making and collaboration