AWS re:Invent 2025 - Building on AWS resilience: Innovations for critical success (ARC207)

Innovating for Resilience: Behind the Scenes at AWS

Regional Isolation for AWS Services

  • AWS has migrated almost all existing global AWS Security Token Service (STS) traffic to be served locally within each AWS region.
  • This was done to strengthen the regional isolation of customer workloads and improve reliability and performance:
    • STS originally exposed a single global endpoint, hosted in us-east-1, that served requests from every region.
    • This created cross-region dependencies and weakened the regional isolation that AWS aims to provide.
    • By serving STS requests from within each region, AWS now answers them locally and eliminates the cross-region call.
  • Key results:
    • Thousands of customer accounts that previously depended on cross-region STS calls are now served locally with no changes required.
    • Latency for STS calls at the 99th percentile dropped from as much as 230 ms to 20 ms.
  • This change was implemented with near-perfect transparency for customers, using techniques like DNS-based routing and preserving CloudTrail logging behavior.
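
To make "served locally" concrete: a regional STS endpoint is addressed by a hostname that embeds the region code. The helper below is a minimal sketch, not AWS code. The function name `regional_sts_endpoint` and the validation pattern are assumptions made for illustration, and it covers only the standard `amazonaws.com` suffix. Per the talk, the migration required no customer changes, so nothing like this is needed to benefit from it.

```python
import re

# Loose, illustrative pattern for region codes such as "us-east-1" or
# "ap-southeast-2". This is an assumption, not an official grammar.
_REGION_RE = re.compile(r"[a-z]{2}(-[a-z]+)+-\d+")


def regional_sts_endpoint(region: str) -> str:
    """Return the regional STS endpoint URL for a region code.

    Hypothetical helper; it only handles the standard "amazonaws.com" suffix.
    """
    if not _REGION_RE.fullmatch(region):
        raise ValueError(f"not a region code: {region!r}")
    return f"https://sts.{region}.amazonaws.com"


print(regional_sts_endpoint("eu-west-1"))  # https://sts.eu-west-1.amazonaws.com
```

A caller in eu-west-1 that targets this hostname never leaves its own region, which is the isolation property the bullets above describe.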

Resilience to Availability Zone Impairments

  • AWS has built services to detect and automatically respond to "gray failures": situations where a server or availability zone is degraded but has not completely failed.
    • Simple health checks are not sufficient: a degraded host can still pass them, so they cannot tell a healthy host from one that is performing poorly.
    • AWS developed "deep health checks" that analyze host-level metrics and stack-rank instances to identify outliers.
  • The Lambda service implemented a "Fleet Health Service" that:
    • Analyzes CloudWatch logs to detect instances responsible for a disproportionate share of errors.
    • Automatically marks those instances as unhealthy, allowing auto-scaling to replace them.
    • This service has been adopted across many other AWS services.
  • AWS also built a "Zonal Event Detector" to identify when an entire availability zone is experiencing issues.
    • This powers a "Zonal Shift" feature that shifts traffic away from an impaired zone, either automatically or manually.
    • AWS offers "Zonal Shift" to customers at no charge, so they can respond to AZ impairments with the same tools AWS uses.
  • The "Zonal Auto Shift" feature goes a step further, automatically detecting zonal impairments and shifting traffic on behalf of customers.
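
The detection ideas above (stack-ranking hosts by error count, then asking whether the outliers cluster in one zone) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Fleet Health Service or the Zonal Event Detector: the function names, the median-based threshold, the host and zone names, and the `factor`, `min_errors`, and `min_fraction` parameters are all hypothetical.

```python
from collections import Counter
from statistics import median


def rank_outlier_hosts(errors_by_host, factor=5.0, min_errors=20):
    """Return hosts whose error count is far above the fleet median,
    ordered worst first (a simple stack rank)."""
    if not errors_by_host:
        return []
    typical = median(errors_by_host.values())
    threshold = max(min_errors, factor * typical)
    ranked = sorted(errors_by_host.items(), key=lambda kv: kv[1], reverse=True)
    return [host for host, errors in ranked if errors > threshold]


def impaired_zones(outliers, zone_of, min_fraction=0.6):
    """Return zones in which at least min_fraction of hosts are outliers."""
    hosts_per_zone = Counter(zone_of.values())
    bad_per_zone = Counter(zone_of[host] for host in outliers)
    return sorted(
        zone for zone, bad in bad_per_zone.items()
        if bad / hosts_per_zone[zone] >= min_fraction
    )


# Hypothetical fleet: six hosts spread over three zones.
zone_of = {"h1": "az-a", "h2": "az-a", "h3": "az-b",
           "h4": "az-b", "h5": "az-c", "h6": "az-c"}

# One gray host: it is flagged, but the zone as a whole is not.
single = {"h1": 2, "h2": 3, "h3": 1, "h4": 250, "h5": 2, "h6": 4}
bad = rank_outlier_hosts(single)
print(bad, impaired_zones(bad, zone_of))  # ['h4'] []

# Every host in az-b is degraded: the zone itself is flagged.
zonal = {"h1": 2, "h2": 3, "h3": 180, "h4": 250, "h5": 2, "h6": 4}
bad = rank_outlier_hosts(zonal)
print(bad, impaired_zones(bad, zone_of))  # ['h4', 'h3'] ['az-b']
```

Comparing each host against the fleet median rather than a fixed threshold is one way to catch hosts that still answer a shallow health check but carry a disproportionate share of errors.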

Rigorous Testing and "Game Days"

  • AWS has a culture of "if you haven't tested it in the past week, it's probably broken."
  • To validate resilience, AWS runs regular "game day" exercises, simulating various failure scenarios in a dedicated test AWS region.
    • This allows continuous testing without impacting production environments or customers.
    • The test region is treated the same as a production region by AWS service teams.
  • Key elements of the game day testing approach:
    • A mix of standard benchmark tests (e.g. simulating AZ failures) and more novel failure scenarios.
    • Validation that fixes for past incidents work as expected.
    • Hands-on incident response training for on-call engineers.
  • This allows AWS to continuously expand its "competence envelope": the set of known operating conditions the systems can handle reliably.

Protecting Against Metastable Failures

  • AWS has studied a particular type of overload issue called "metastable failures":
    • Systems that appear stable under normal load but can transition into a self-sustaining failure state when hit with a load spike, and stay there even after the spike subsides.
    • Examples often involve queuing systems where retries and wasted work create a feedback loop.
  • To proactively identify systems vulnerable to metastable failures, AWS has developed a multi-step strategy:
    • Statistical modeling to quickly explore the parameter space of queue lengths, retries, etc.
    • Lightweight simulation and emulation to validate potential problem areas.
    • Finally, testing with the real application to confirm issues.
  • This approach, including an open-source "Metaphor" tool, allows AWS to efficiently map out the risk of metastable failures across its services.
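
The retry feedback loop described above can be reproduced with a toy discrete-time queue simulation. This is a minimal sketch in the spirit of the "lightweight simulation" step, not the "Metaphor" tool or any AWS model; the `simulate` function, the `Request` class, and every parameter value are assumptions chosen for illustration. Clients abandon a request after `timeout` ticks and retry, while the server still spends capacity on the abandoned copy (wasted work).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    enqueued_at: int
    attempt: int
    served: bool = False
    abandoned: bool = False


def simulate(base_load=8, capacity=10, timeout=5, max_retries=3,
             spike_load=50, spike_start=20, spike_ticks=5, total_ticks=200):
    """Return the goodput (requests served before their client gave up)
    for each tick of a FIFO queue with client-side timeouts and retries."""
    queue = deque()   # requests waiting for the server, FIFO
    timers = deque()  # the same requests, in order of timeout expiry
    goodput = []
    for tick in range(total_ticks):
        # 1. Clients whose timeout expired give up and, possibly, retry.
        retries = []
        while timers and timers[0].enqueued_at + timeout <= tick:
            req = timers.popleft()
            if not req.served:
                req.abandoned = True  # the stale copy stays in the queue
                if req.attempt < max_retries:
                    retries.append(Request(tick, req.attempt + 1))
        # 2. Fresh arrivals: base load, or the spike while it lasts.
        in_spike = spike_start <= tick < spike_start + spike_ticks
        fresh = spike_load if in_spike else base_load
        arrivals = retries + [Request(tick, 0) for _ in range(fresh)]
        queue.extend(arrivals)
        timers.extend(arrivals)
        # 3. The server handles up to `capacity` requests, stale or not.
        useful = 0
        for _ in range(min(capacity, len(queue))):
            req = queue.popleft()
            req.served = True
            if not req.abandoned:
                useful += 1
        goodput.append(useful)
    return goodput


print(simulate(spike_ticks=0)[-1])   # 8: no spike, steady state
print(simulate(max_retries=0)[-1])   # 8: spike without retries, recovers
print(simulate(max_retries=3)[-1])   # 0: spike with retries, stays failed
```

In all three runs the base load (8 per tick) is below capacity (10 per tick). Only the run that combines a temporary spike with retries ends with zero goodput long after the spike is over, because every timed-out request adds more load than the server can clear.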

Key Takeaways

  • AWS is constantly innovating behind the scenes to improve the resilience of its services and customer workloads.
  • This includes transparently migrating global services like STS to be region-isolated, building automated detection and recovery for "gray failures", and rigorously testing resilience at scale.
  • AWS is also proactively researching and addressing more complex resilience challenges like metastable failures.
  • These innovations help make AWS the best platform to run mission-critical workloads, with higher reliability and less operational burden for customers.
