AWS re:Invent 2025 - Behind the scenes: How AWS drives operational excellence & reliability (COP415)

Driving Operational Excellence and Reliability at AWS

Defining Operational Excellence

Operational excellence is not about perfection, but about striving for the highest standards

It involves a balance between speed (bias for action) and quality (insisting on high standards)

Operational excellence accepts that mistakes will happen, but focuses on learning from them

Architectural Choices for Reliability

Redundancy and fault isolation at the infrastructure level

Multiple data centers, availability zones, and coffee shops to ensure redundancy

Dependency isolation in API services

Separate thread pools per dependency to limit blast radius
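
A minimal sketch of the per-dependency thread pool idea, assuming a Python service. The `Bulkheads` class, the dependency names, and the pool sizes are hypothetical illustrations, not AWS's implementation. Each dependency gets its own bounded pool, so a hung dependency can only exhaust its own workers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class Bulkheads:
    """One bounded thread pool per downstream dependency (hypothetical helper)."""

    def __init__(self, pool_sizes):
        # pool_sizes maps a dependency name to its maximum number of worker threads.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, dependency, fn, *args):
        # Work is queued only on the pool that belongs to this dependency.
        return self._pools[dependency].submit(fn, *args)

    def call(self, dependency, fn, *args, timeout=1.0):
        # Wait for the result, but give up after `timeout` seconds.
        return self.submit(dependency, fn, *args).result(timeout=timeout)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=False)


def demo():
    release = threading.Event()

    def slow_dependency():
        release.wait()  # simulates a downstream call that hangs
        return "slow-ok"

    def fast_dependency():
        return "fast-ok"

    bulkheads = Bulkheads({"slow": 2, "fast": 2})
    # Saturate the slow dependency's pool with hung calls.
    for _ in range(2):
        bulkheads.submit("slow", slow_dependency)

    # The fast dependency has its own workers, so it still answers.
    fast_result = bulkheads.call("fast", fast_dependency)

    # A further call to the slow dependency queues behind the hung ones and times out.
    try:
        bulkheads.call("slow", slow_dependency, timeout=0.05)
        slow_timed_out = False
    except FutureTimeout:
        slow_timed_out = True

    release.set()  # let the hung calls finish so the process can exit cleanly
    bulkheads.shutdown()
    return fast_result, slow_timed_out
```

With a single shared pool, the two hung calls would have consumed workers needed by every other dependency. Here the damage stays inside the `slow` pool.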

Cellular architecture for the data plane

Multiple copies of stacks routed through a thin layer to reduce impact of issues
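
The thin routing layer can be pictured roughly as follows. This is a sketch under assumptions: the `CellRouter` name, the cell identifiers, and hash-based assignment are illustrative choices, and the notes do not say how AWS maps requests to cells.

```python
import hashlib


class CellRouter:
    """Maps each customer to exactly one cell (hypothetical sketch)."""

    def __init__(self, cells):
        if not cells:
            raise ValueError("at least one cell is required")
        self._cells = list(cells)

    def cell_for(self, customer_id: str) -> str:
        # A stable hash keeps a customer pinned to the same cell across requests,
        # so a problem in one cell affects only the customers mapped to it.
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._cells)
        return self._cells[index]


router = CellRouter(["cell-1", "cell-2", "cell-3", "cell-4"])
assigned = router.cell_for("customer-42")
```

The router deliberately does very little, because it is the one component every request shares. Modulo hashing reshuffles customers when cells are added, so a real router might use an explicit mapping table instead.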

Investing in Operational Excellence

Operational excellence is a key feature, not an afterthought, for AWS

It is an intentional, systematic process, not just good intentions

The Operational Excellence Flywheel

Observability:

Instrumenting services to collect metrics, logs, and traces
Standardizing observability through libraries and tools like Embedded Metric Format (EMF)
Using CloudWatch as the primary observability platform
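
To illustrate what an EMF record looks like, here is a small sketch that builds one as a JSON log line. The `emf_record` helper, the namespace, and the metric and dimension names are assumptions made for illustration. The `_aws` metadata layout follows the published Embedded Metric Format structure.

```python
import json
import time


def emf_record(namespace, dimensions, metric_name, value,
               unit="Milliseconds", timestamp_ms=None):
    """Build one EMF-style JSON log line (hypothetical helper)."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF timestamps are epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set that lists every dimension key.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value sits at the top level, keyed by the metric name.
        metric_name: value,
    }
    # Dimension values also sit at the top level, keyed by dimension name.
    record.update(dimensions)
    return json.dumps(record)


line = emf_record("MyService", {"Operation": "GetItem"}, "Latency", 42.5)
print(line)
```

When a line like this reaches CloudWatch Logs, for example from a Lambda function's standard output, CloudWatch can extract the metric from the log record without a separate metrics API call.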

Incident Response:

Maintaining standard operating procedures and runbooks
Automating runbooks and escalation processes
Incorporating AI and ML to aid in incident response
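
One way to picture an automated runbook with escalation is shown below. This is a sketch under stated assumptions: the step names, the `is_healthy` check, and the `escalate` callback are hypothetical, and the notes do not describe AWS's internal tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunbookStep:
    name: str
    action: Callable[[], bool]  # returns True when the step itself succeeded


@dataclass
class RunbookResult:
    resolved: bool
    log: List[str] = field(default_factory=list)


def run_runbook(steps, is_healthy, escalate):
    """Run remediation steps in order; escalate to a human if none restores health."""
    log = []
    for step in steps:
        try:
            ok = step.action()
        except Exception as exc:  # a failing step must not abort the runbook
            ok = False
            log.append(f"{step.name}: error {exc}")
        else:
            log.append(f"{step.name}: {'ok' if ok else 'failed'}")
        if ok and is_healthy():
            return RunbookResult(True, log)
    # No step fixed the problem: hand over to the on-call engineer with the full log.
    escalate(list(log))
    log.append("escalated")
    return RunbookResult(False, log)
```

The log passed to `escalate` records what automation already tried, so the engineer who picks up the page does not repeat those steps.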

Readiness:

Operational Readiness Reviews with checklists and bar raisers
Extensive testing, including failure scenarios and game days
Change management and release excellence processes

Reviews:

Weekly dashboard reviews to identify anomalies and trends
Reviewing high-severity incidents and tickets for recurring problems
Conducting detailed "Correction of Error" (COE) reports after major incidents

Empowering Developers with Observability Tools

CloudWatch MCP Server and Application Signals MCP Server integrate observability directly into IDEs

Allows developers to access SLOs, investigate issues, and get AI-driven root cause analysis without leaving their development environment

Automating Incident Investigation with CloudWatch Investigations

Automatically collects and analyzes data from CloudTrail, CloudWatch, and other sources to identify root causes

Provides a detailed investigation report with hypotheses, timelines, and recommendations for improvement

Can be integrated with ticketing systems to provide real-time updates and insights

Fostering a Culture of Operational Excellence

Encouraging a blame-free, learning-focused approach to incident reviews

Scaling operational excellence through processes like the weekly dashboard review meetings

Continuously improving processes by feeding learnings back into operational readiness reviews and other mechanisms

Key Takeaways

AWS invests heavily in operational excellence as a core feature, not an afterthought

Operational excellence is driven by a systematic, mechanism-based approach, not just good intentions

Observability, incident response, readiness, and reviews are the key drivers of the operational excellence flywheel

Empowering developers with integrated observability tools and automating incident investigation are crucial for scalable operations

Fostering a culture of learning and continuous improvement is essential for sustained operational excellence
