AWS re:Invent 2025 - Behind the scenes: How AWS drives operational excellence & reliability (COP415)

Driving Operational Excellence and Reliability at AWS

Defining Operational Excellence

  • Operational excellence is not about perfection, but about striving for the highest standards
  • It involves a balance between speed (bias for action) and quality (insisting on high standards)
  • Operational excellence accepts that mistakes will happen, but focuses on learning from them

Architectural Choices for Reliability

  • Redundancy and fault isolation at the infrastructure level
    • Multiple data centers, availability zones, and coffee shops to ensure redundancy
  • Dependency isolation in API services
    • Separate thread pools per dependency to limit blast radius
  • Cellular architecture for the data plane
    • Multiple copies of stacks routed through a thin layer to reduce impact of issues
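
The two isolation ideas above (a separate thread pool per dependency, and a thin routing layer in front of several cells) can be sketched as follows. This is a minimal illustration, not AWS's implementation: the dependency names, pool sizes, cell names, and the hash-based routing rule are all assumptions made for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependency names and pool sizes. Each dependency gets its own
# pool, so a slow dependency can only tie up its own worker threads.
POOLS = {
    "auth": ThreadPoolExecutor(max_workers=4, thread_name_prefix="auth"),
    "storage": ThreadPoolExecutor(max_workers=8, thread_name_prefix="storage"),
}

def call_dependency(name, fn, *args, timeout=1.0):
    """Run fn on the pool reserved for `name`, waiting at most `timeout` seconds."""
    future = POOLS[name].submit(fn, *args)
    # Raises a timeout error if the call is too slow; the caller stops waiting
    # even though the worker thread keeps running until fn returns.
    return future.result(timeout=timeout)

# Hypothetical cell names. Each cell stands for a full copy of the stack.
CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]

def route_to_cell(customer_id: str) -> str:
    """Thin routing layer: deterministically map a customer to one cell."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]

# The same customer always lands in the same cell, so a problem in one cell
# affects only the customers mapped to it.
print(route_to_cell("customer-42"))
print(call_dependency("auth", lambda token: token.upper(), "abc"))
```

If every "auth" worker is stuck on a slow call, requests submitted through the "storage" pool still run, which is the blast-radius limit the bullet describes.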

Investing in Operational Excellence

  • AWS treats operational excellence as a key feature, not an afterthought
  • It is an intentional, systematic process, not just good intentions

The Operational Excellence Flywheel

  1. Observability:
    • Instrumenting services to collect metrics, logs, and traces
    • Standardizing observability through libraries and tools like Embedded Metric Format (EMF)
    • Using CloudWatch as the primary observability platform
  2. Incident Response:
    • Maintaining standard operating procedures and runbooks
    • Automating runbooks and escalation processes
    • Incorporating AI and ML to aid in incident response
  3. Readiness:
    • Operational Readiness Reviews with checklists and bar raisers
    • Extensive testing, including failure scenarios and game days
    • Change management and release excellence processes
  4. Reviews:
    • Weekly dashboard reviews to identify anomalies and trends
    • Reviewing high-severity incidents and tickets for recurring problems
    • Conducting detailed "Correction of Error" (COE) reports after major incidents
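
As an illustration of the Embedded Metric Format mentioned under Observability, the sketch below builds one EMF record: a JSON log line whose `_aws` object tells CloudWatch which top-level fields to extract as metrics. The helper name `emf_record` and the namespace, dimension, and metric names are hypothetical. In practice the line would be written to a log stream that CloudWatch ingests, usually through a library rather than by hand.

```python
import json
import time

def emf_record(namespace, dimensions, metric_name, value, unit="Milliseconds"):
    """Build one Embedded Metric Format log record as a JSON string."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the dimension *names*.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the top level, keyed by the metric name.
        metric_name: value,
    }
    record.update(dimensions)  # dimension values also live at the top level
    return json.dumps(record)

# Hypothetical service and operation names, for illustration only.
line = emf_record("ExampleService", {"Operation": "GetItem"}, "Latency", 42)
print(line)
```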

Empowering Developers with Observability Tools

  • CloudWatch MCP Server and Application Signals MCP Server integrate observability directly into IDEs
  • Allows developers to access SLOs, investigate issues, and get AI-driven root cause analysis without leaving their development environment
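
To make the SLO data mentioned above concrete, here is a small worked calculation of how much error budget remains for a request-based SLO. The function name and the numbers are assumptions for illustration. The talk summary does not specify how AWS tooling computes this.

```python
def error_budget_remaining(slo_target, good_requests, total_requests):
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing spent
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures

# Hypothetical month: 99.9% target, 1,000,000 requests, 500 failures.
# About 1,000 failures are allowed, so roughly half the budget is left.
remaining = error_budget_remaining(0.999, 999_500, 1_000_000)
print(f"{remaining:.2%} of the error budget remains")
```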

Automating Incident Investigation with CloudWatch Investigations

  • Automatically collects and analyzes data from CloudTrail, CloudWatch, and other sources to identify root causes
  • Provides a detailed investigation report with hypotheses, timelines, and recommendations for improvement
  • Can be integrated with ticketing systems to provide real-time updates and insights
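
The core idea of correlating change events with the time an alarm fired can be shown with a toy example. This is not how CloudWatch investigations works internally. It only illustrates one simple heuristic, ranking recent changes as candidate causes. The event shape, the 30-minute lookback, and the function name are assumptions.

```python
from datetime import datetime, timedelta

def candidate_causes(change_events, alarm_time, lookback=timedelta(minutes=30)):
    """Return changes made within `lookback` before the alarm, newest first."""
    window_start = alarm_time - lookback
    recent = [e for e in change_events if window_start <= e["time"] <= alarm_time]
    return sorted(recent, key=lambda e: e["time"], reverse=True)

# Hypothetical events, shaped loosely like audit-log entries.
events = [
    {"time": datetime(2025, 12, 1, 9, 0), "action": "UpdateFunctionConfiguration"},
    {"time": datetime(2025, 12, 1, 9, 50), "action": "PutScalingPolicy"},
    {"time": datetime(2025, 12, 1, 9, 58), "action": "UpdateService"},
]
alarm = datetime(2025, 12, 1, 10, 0)

# Build a simple timeline of suspects for the investigation report.
suspects = candidate_causes(events, alarm)
for event in suspects:
    minutes_before = (alarm - event["time"]).total_seconds() / 60
    print(f'{event["action"]} occurred {minutes_before:.0f} min before the alarm')
```

A real investigation would weigh many more signals (metrics, logs, traces, deployment data), but the timeline of recent changes is a common starting hypothesis.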

Fostering a Culture of Operational Excellence

  • Encouraging a blame-free, learning-focused approach to incident reviews
  • Scaling operational excellence through processes like the weekly dashboard review meetings
  • Continuously improving processes by feeding learnings back into operational readiness reviews and other mechanisms

Key Takeaways

  • AWS invests heavily in operational excellence as a core feature, not an afterthought
  • Operational excellence is driven by a systematic, mechanism-based approach, not just good intentions
  • Observability, incident response, readiness, and reviews are the key drivers of the operational excellence flywheel
  • Empowering developers with integrated observability tools and automating incident investigation are crucial for scalable operations
  • Fostering a culture of learning and continuous improvement is essential for sustained operational excellence
