AWS re:Invent 2025 - The incident is over: Now what? (COP216)
Introduction
Presenters: Georgia (Principal, AWS Enterprise Support) and Anthony (Principal Engineer, AWS Event Management Team)
Combined 20+ years of experience in incident management at AWS
Goal: Share lessons learned, tips, and tricks for incident management and post-incident analysis
Incident Detection and Engagement
Two main categories of event detection:
Service-driven : Alarms on service metrics, subsystem metrics, and synthetic monitoring ("canaries")
Customer-driven : Monitoring for traffic anomalies and customer impact reports
Aggregate alarms that fire across multiple services trigger a full incident response
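The service-driven detection above can be sketched as a simple synthetic probe. This is a minimal illustration, not how AWS canaries are actually built (those run via managed tooling such as CloudWatch Synthetics); the health-check callable, probe count, and failure threshold are all hypothetical.

```python
import time

def run_canary(health_check, max_failures=3, probes=5, interval_s=0):
    """Probe an endpoint repeatedly; return True if the alarm should fire.

    health_check: hypothetical callable returning True when the service
    responds correctly. Consecutive failures beyond the threshold count
    as sustained impact rather than transient noise.
    """
    consecutive = 0
    for _ in range(probes):
        try:
            ok = health_check()
        except Exception:
            ok = False
        consecutive = consecutive + 1 if not ok else 0
        if consecutive >= max_failures:
            return True  # sustained failure -> raise the alarm
        if interval_s:
            time.sleep(interval_s)
    return False
```

Requiring several consecutive failures before alarming is one common way to balance fast detection against false pages from a single dropped probe.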
Engagement process:
Single-service events : Engage the service team and AWS Support if customers are impacted
Multi-service events : Engage AWS Incident Response (AIR), AWS Support, and "usual suspects" (core services like authentication, DNS, networking)
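The engagement fan-out described above can be sketched as a routing rule. Team identifiers and the "usual suspects" list here are illustrative stand-ins, not actual AWS rosters or paging targets.

```python
# Core dependencies engaged by default on multi-service events
# (the "usual suspects" from the talk: authentication, DNS, networking).
CORE_SERVICES = {"authentication", "dns", "networking"}

def engage(services_impacted, customers_impacted):
    """Return the set of teams to page for an event.

    Single-service: page the owning team, plus Support if customers
    are impacted. Multi-service: page AIR, Support, and core services.
    """
    teams = set(services_impacted)
    if customers_impacted:
        teams.add("aws-support")
    if len(services_impacted) > 1:
        teams.add("aws-incident-response")  # AIR owns process and tooling
        teams.add("aws-support")
        teams |= CORE_SERVICES
    return teams
```

Engaging core dependencies preemptively on multi-service events reflects the idea that broad impact often traces back to a shared foundation.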
Incident Coordination and Communication
Technical call: Parallel investigation by all impacted teams to deliver fastest mitigation and resolution
Supported by AWS Incident Response (AIR) team, who own the process, mental models, and tooling
Call led by a senior "call leader" to make critical decisions
Strong etiquette and mindset around what is discussed on the call vs. documented in tickets
Support call: Focuses on customer communication, recovery guidance, and tracking customer sentiment
Leverages data from the technical call to provide real-time updates to customers
Balances speed, accuracy, and depth of communications over time
Mitigation and Resolution
Focus is on mitigating customer impact first, then investigating root cause
Common mitigation techniques:
Shifting traffic away from the failure (e.g. removing unhealthy instances from a load balancer)
Rollbacks of recent deployments
Scaling up resources to handle increased load
Restarting components (as a last resort)
Deploying configuration changes or new software versions (with high confidence)
Risk of recurrence is assessed, and temporary fixes are validated to ensure they are sustainable
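The first mitigation technique, shifting traffic away from the failure, can be sketched as draining unhealthy hosts from a routing pool. The pool structure is hypothetical; on AWS this maps to operations like deregistering a target from a load balancer target group. The guard against draining all capacity illustrates the point that a mitigation must not worsen impact.

```python
def shift_away(pool, unhealthy):
    """Return a new routing pool with unhealthy hosts removed.

    pool: list of host identifiers currently taking traffic.
    unhealthy: set of hosts believed to be inside the failure.
    """
    survivors = [h for h in pool if h not in unhealthy]
    if not survivors:
        # Refuse to drain the whole fleet: zero capacity is worse
        # than degraded capacity, so fail loudly instead.
        raise RuntimeError("refusing to remove all capacity from pool")
    return survivors
```

Returning a new pool rather than mutating in place makes the change easy to roll back, which matters when the mitigation itself turns out to be wrong.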
Post-Incident Analysis and Lessons Learned
Post-Incident Analysis (COE) document:
Starts with an impact summary and customer experience timeline
Dives into root causes using the "5 Whys" approach to uncover multiple contributing factors
Identifies learnings and action items to prevent recurrence
Balancing speed and quality in COE creation is critical to regain customer trust
Action items are categorized by implementation timeline:
Short-term (hours) to prevent immediate recurrence
Mid-term (days/weeks) to build sustainable solutions
Long-term (months) for systemic changes and reinvention
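The COE structure above, a "5 Whys" chain of contributing factors plus action items bucketed by time horizon, can be modeled as a small data structure. This is a toy sketch with made-up field names; AWS's actual COE template is internal and far richer.

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    horizon: str  # "short" (hours), "mid" (days/weeks), "long" (months)

@dataclass
class COE:
    impact_summary: str
    # Chain of answers from repeatedly asking "why?" -- each entry is
    # a contributing factor uncovered by the previous one.
    five_whys: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

    def add_why(self, contributing_factor):
        self.five_whys.append(contributing_factor)

    def items_by_horizon(self):
        """Group action items into the three timelines from the talk."""
        buckets = {"short": [], "mid": [], "long": []}
        for item in self.action_items:
            buckets[item.horizon].append(item.description)
        return buckets
```

Keeping the whys as an ordered chain preserves the causal path from symptom to systemic cause, which is what makes the long-term action items defensible.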
Learning and Scaling
Team-level education on COEs and operational "tenets" (guiding principles)
Distributed changes via services like Trusted Advisor to scale learnings to customers
Centralized changes through the creation of new AWS services to solve recurring problems
Examples: Route 53, Elastic Load Balancing, AWS Certificate Manager, IAM, STS
Key Takeaways
Engage leaders early and often during incidents
Focus on fast, iterative customer communications during the event
Prioritize mitigation over root cause analysis during the incident
Treat incidents as powerful learning opportunities for the entire organization
Define clear "tenets" to guide decision-making and collaboration