AWS re:Invent 2025 - Behind the scenes: How AWS drives operational excellence & reliability (COP415)
Driving Operational Excellence and Reliability at AWS
Defining Operational Excellence
Operational excellence is not about perfection, but about striving for the highest standards
It involves a balance between speed (bias for action) and quality (insisting on high standards)
Operational excellence accepts that mistakes will happen, but focuses on learning from them
Architectural Choices for Reliability
Redundancy and fault isolation at the infrastructure level
Multiple data centers, availability zones, and coffee shops to ensure redundancy
Dependency isolation in API services
Separate thread pools per dependency to limit blast radius
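As an illustration of the per-dependency thread pool idea, here is a minimal Python sketch. The class name DependencyPools, the pool sizes, and the timeout are assumptions made for this example and do not describe AWS's internal implementation. Each dependency gets its own bounded pool, so a slow dependency can only exhaust its own workers.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class DependencyPools:
    """One bounded thread pool per downstream dependency (hypothetical helper)."""

    def __init__(self, pool_sizes):
        # pool_sizes maps a dependency name to its maximum number of worker threads.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def call(self, dependency, fn, *args, timeout=1.0):
        """Run fn on the pool owned by `dependency`, waiting at most `timeout` seconds."""
        future = self._pools[dependency].submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except FutureTimeout:
            # Drop the work if it is still queued; a running call cannot be interrupted.
            future.cancel()
            raise

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=False)
```

With this layout, calls to a healthy dependency keep succeeding even while every worker assigned to a slow dependency is blocked.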
Cellular architecture for the data plane
Multiple copies of stacks routed through a thin layer to reduce impact of issues
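The thin routing layer in front of the cells can be sketched as follows. This is only an assumption about how such a router might look: the CellRouter class, the cell names, and the hash-modulo assignment are hypothetical, and a production router would more likely keep an explicit customer-to-cell mapping so that adding a cell does not move existing customers.

```python
import hashlib


class CellRouter:
    """Maps each customer to exactly one cell (hypothetical sketch)."""

    def __init__(self, cells):
        if not cells:
            raise ValueError("at least one cell is required")
        self._cells = list(cells)

    def cell_for(self, customer_id: str) -> str:
        # A stable hash keeps a customer pinned to the same cell on every request,
        # so a fault in one cell affects only the customers assigned to it.
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._cells)
        return self._cells[index]
```

The router itself holds no per-request state, which is what keeps the layer thin: all business logic lives inside the cells.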
Investing in Operational Excellence
Operational excellence is a key feature, not an afterthought, for AWS
It is an intentional, systematic process, not just good intentions
The Operational Excellence Flywheel
Observability:
Instrumenting services to collect metrics, logs, and traces
Standardizing observability through libraries and tools like Embedded Metric Format (EMF)
Using CloudWatch as the primary observability platform
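To make the Embedded Metric Format concrete, the sketch below builds a single EMF log line by hand using only the standard library. The function name emf_log_line and the example namespace and dimension are assumptions for illustration; in practice a team would more likely use a shared library that emits this structure.

```python
import json
import time


def emf_log_line(namespace, dimensions, metric_name, value,
                 unit="Milliseconds", timestamp_ms=None):
    """Return one JSON log line in Embedded Metric Format (hypothetical helper)."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set that contains every supplied dimension key.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value and dimension values sit at the root of the record.
        metric_name: value,
    }
    record.update(dimensions)
    return json.dumps(record)


print(emf_log_line("ExampleService", {"Operation": "GetItem"}, "Latency", 12.5))
```

Because the metric definition travels inside the log line, the same record serves as both a searchable log entry and a metric data point.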
Incident Response:
Maintaining standard operating procedures and runbooks
Automating runbooks and escalation processes
Incorporating AI and ML to aid in incident response
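One way to picture an automated runbook with escalation is the sketch below. Everything in it is hypothetical (RunbookStep, RunbookResult, run_runbook, and the escalate callback); the talk summary does not describe the actual tooling. Steps run in order, the first one that reports success ends the run, and if none succeeds the collected log is handed to an escalation callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunbookStep:
    name: str
    action: Callable[[], bool]  # returns True when the step mitigates the issue


@dataclass
class RunbookResult:
    mitigated: bool
    escalated: bool
    log: List[str] = field(default_factory=list)


def run_runbook(steps, escalate):
    """Try each step in order; escalate with the log if nothing mitigates."""
    log = []
    for step in steps:
        try:
            ok = step.action()
        except Exception as exc:
            # A failing step is recorded and the next step is still attempted.
            log.append(f"{step.name}: error ({exc})")
            continue
        log.append(f"{step.name}: {'mitigated' if ok else 'no effect'}")
        if ok:
            return RunbookResult(mitigated=True, escalated=False, log=log)
    escalate(log)
    return RunbookResult(mitigated=False, escalated=True, log=log)
```

Passing the log to the escalation callback means the on-call engineer who is paged starts with a record of what automation already tried.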
Readiness:
Operational Readiness Reviews with checklists and bar raisers
Extensive testing, including failure scenarios and game days
Change management and release excellence processes
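Testing failure scenarios can be sketched with a small fault injector. The names below (FaultInjector, InjectedFault, get_price_with_fallback) are invented for this example and are not taken from the talk. The wrapper fails on chosen call numbers so a test can confirm that the caller degrades gracefully instead of propagating the error.

```python
class InjectedFault(Exception):
    """Raised by the injector in place of a real dependency failure."""


class FaultInjector:
    """Wraps a callable and fails on selected call numbers (1-based)."""

    def __init__(self, fn, fail_calls):
        self._fn = fn
        self._fail_calls = set(fail_calls)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self._fail_calls:
            raise InjectedFault(f"injected failure on call {self.calls}")
        return self._fn(*args, **kwargs)


def get_price_with_fallback(fetch, item, cached_price):
    """Return the live price, or the cached price if the dependency fails."""
    try:
        return fetch(item)
    except Exception:
        return cached_price


fetch = FaultInjector(lambda item: 10.0, fail_calls=[2])
prices = [get_price_with_fallback(fetch, "widget", 9.0) for _ in range(3)]
print(prices)  # [10.0, 9.0, 10.0]
```

Making the failure deterministic (by call number rather than at random) keeps the test repeatable, which matters when the same scenario is rehearsed in a game day.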
Reviews:
Weekly dashboard reviews to identify anomalies and trends
Reviewing high-severity incidents and tickets for recurring problems
Conducting detailed "Correction of Error" (COE) reports after major incidents
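Spotting anomalies ahead of a dashboard review can be approximated with a simple z-score check against a baseline. This is an assumed technique chosen for illustration, not the method described in the talk; the function name flag_anomalies and the threshold of 3 standard deviations are hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(baseline, current, threshold=3.0):
    """Return indexes in `current` that deviate from the baseline mean
    by more than `threshold` baseline standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # requires at least two baseline points
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as anomalous.
        return [i for i, v in enumerate(current) if v != mu]
    return [i for i, v in enumerate(current) if abs(v - mu) / sigma > threshold]


baseline_latency_ms = [100, 102, 98, 101, 99]
this_week_latency_ms = [100, 101, 140]
print(flag_anomalies(baseline_latency_ms, this_week_latency_ms))  # [2]
```

A check like this only narrows down where to look; the review meeting still decides whether a flagged point reflects a real problem.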
Empowering Developers with Observability Tools
The CloudWatch MCP Server and the Application Signals MCP Server integrate observability directly into IDEs
These servers let developers access SLOs, investigate issues, and get AI-driven root cause analysis without leaving their development environment
Automating Incident Investigation with CloudWatch Investigations
Automatically collects and analyzes data from CloudTrail, CloudWatch, and other sources to identify root causes
Provides a detailed investigation report with hypotheses, timelines, and recommendations for improvement
Can be integrated with ticketing systems to provide real-time updates and insights
Fostering a Culture of Operational Excellence
Encouraging a blame-free, learning-focused approach to incident reviews
Scaling operational excellence through processes like the weekly dashboard review meetings
Continuously improving processes by feeding learnings back into operational readiness reviews and other mechanisms
Key Takeaways
AWS invests heavily in operational excellence as a core feature, not an afterthought
Operational excellence is driven by a systematic, mechanism-based approach, not just good intentions
Observability, incident response, readiness, and reviews are the key drivers of the operational excellence flywheel
Empowering developers with integrated observability tools and automating incident investigation are crucial for scalable operations
Fostering a culture of learning and continuous improvement is essential for sustained operational excellence