AWS re:Invent 2025 - Resilience of AWS Cloud: Design patterns for availability (ARC310)

Resilience of the AWS Cloud: Design Patterns for Availability

Understanding Resilience

  • The "resilience equation" encompasses more than just physical infrastructure failures
  • Application architectures have shifted from the traditional "monolithic" model to modern "microservices", and resilience practices have had to evolve with them
  • Failure modes have changed, requiring a more comprehensive view of resilience

AWS Global Infrastructure

  • AWS regions are composed of multiple Availability Zones (AZs) with low-latency interconnects
  • AZs are designed to be isolated from environmental risks like earthquakes, flooding, and power outages
  • Redundant power, networking, and storage provide high availability within each region

Resilience of AWS Services

  • AWS services like EC2 and S3 are themselves highly resilient, multi-AZ distributed applications
  • Customers must architect their applications to leverage the resilience of these AWS services
  • Understanding whether a service is "zonal" (EC2) or "regional" (S3) is key to designing resilient applications
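The zonal versus regional distinction matters most for capacity planning. A regional service such as S3 spreads work across AZs on the customer's behalf, while a zonal resource such as an EC2 instance lives in exactly one AZ, so the customer decides how many to place in each. The Python sketch below illustrates one common way to reason about that, assuming the goal is to keep full capacity after losing one AZ. The function name `per_az_capacity` and the numbers are hypothetical and do not come from the talk.

```python
import math


def per_az_capacity(required_instances: int, az_count: int,
                    az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the surviving AZs still carry the full load.

    Hypothetical helper for illustration only; not an AWS API.
    """
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive the tolerated failures")
    # Round up: a fractional instance cannot be provisioned.
    return math.ceil(required_instances / surviving)


if __name__ == "__main__":
    # Assumed workload: 12 instances are needed to serve peak traffic.
    for az_count in (2, 3):
        per_az = per_az_capacity(required_instances=12, az_count=az_count)
        print(f"{az_count} AZs: {per_az} per AZ, {per_az * az_count} in total")
```

Under these assumptions, spreading across three AZs needs fewer total instances than spreading across two, because each AZ only has to absorb half of a failed AZ's share rather than all of it.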

AWS Service Operations

  • New services undergo a rigorous "Operational Readiness Review" before deployment
  • Automated deployment pipelines with staged rollouts and "bake time" ensure safe changes
  • Correction of Error (CoE) process drives continuous improvement and learning from incidents
  • Weekly "OpsMetrics" calls share best practices and learnings across all service teams
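The staged rollout and "bake time" idea can be sketched as a loop that deploys to one stage, waits, checks health, and rolls back everything deployed so far if the check fails. This is a minimal illustration, not AWS's actual pipeline: the stage names and the `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical placeholders supplied by the caller.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    bake_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy stage by stage; stop and roll back on the first unhealthy stage."""
    completed = []
    for stage in stages:
        deploy(stage)
        completed.append(stage)
        # Bake time: let the change run under real traffic before judging it.
        sleep(bake_seconds)
        if not is_healthy(stage):
            # Undo in reverse order so the widest stage is reverted first.
            for done in reversed(completed):
                rollback(done)
            return False
    return True


if __name__ == "__main__":
    # Hypothetical stages, from smallest blast radius to largest.
    stages = ["one-box", "one-az", "one-region"]
    ok = staged_rollout(
        stages,
        deploy=lambda s: print(f"deploy   -> {s}"),
        is_healthy=lambda s: s != "one-az",  # simulate a failure in stage 2
        rollback=lambda s: print(f"rollback -> {s}"),
        bake_seconds=0.0,
    )
    print("rollout succeeded" if ok else "rollout stopped and rolled back")
```

Ordering the stages from the smallest blast radius to the largest means a bad change is caught while it affects as little traffic as possible.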

Handling Traffic Surges

  • "Load shedding" techniques protect core functionality during periods of extreme traffic
  • Proactive capacity planning and performance testing help avoid overload conditions
  • Retries and timeouts must be carefully managed to avoid compounding the problem
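Two of the ideas above lend themselves to a short sketch: shedding low-priority work once a server is near its limit, and bounding client retries so they do not multiply load during an incident. The Python below is one possible illustration under stated assumptions. `LoadShedder`, `TransientError`, and `call_with_retries` are hypothetical names, the limits are arbitrary, and capped exponential backoff with full jitter is a widely used retry strategy chosen here for illustration rather than one the summary attributes to the talk.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class LoadShedder:
    """Admit requests up to a concurrency limit, reserving room for critical ones."""

    def __init__(self, max_in_flight: int, reserved_for_critical: int) -> None:
        self.max_in_flight = max_in_flight
        self.reserved = reserved_for_critical
        self.in_flight = 0

    def try_acquire(self, critical: bool) -> bool:
        # Non-critical requests may not use the reserved slots.
        limit = self.max_in_flight if critical else self.max_in_flight - self.reserved
        if self.in_flight >= limit:
            return False  # shed: reject fast instead of queueing
        self.in_flight += 1
        return True

    def release(self) -> None:
        if self.in_flight > 0:
            self.in_flight -= 1


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry a bounded number of times with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random delay in [0, cap) spreads retries out in time.
            sleep(rng() * cap)
    raise ValueError("max_attempts must be at least 1")


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=3, reserved_for_critical=1)
    print([shedder.try_acquire(critical=False) for _ in range(3)])
    print(shedder.try_acquire(critical=True))
```

Rejecting excess requests immediately keeps latency predictable for the requests that are admitted, and a hard cap on attempts keeps a struggling dependency from receiving several times its normal traffic.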

Security Resilience

  • AWS has a "culture of security" that shifts security practices as far left as possible
  • Automated deployment pipelines and incident response processes enable rapid mitigation of security issues
  • Continuous monitoring and learning help identify and address emerging security risks

Key Takeaways

  • Resilience in the cloud goes beyond just physical infrastructure, encompassing service design, operations, and security
  • AWS has built-in resilience mechanisms across its global infrastructure, services, and operational practices
  • Customers must architect their applications to leverage these resilience capabilities effectively
  • Detailed operational processes and continuous improvement drive ongoing resilience enhancements

Technical Details

  • 38 AWS regions, 120+ Availability Zones globally
  • 50+ million deployments per year across AWS services
  • Rigorous "Operational Readiness Review" with capacity, monitoring, and recovery requirements
  • Automated deployment pipelines with staged rollouts and "bake time"
  • "Correction of Error" process for incident root cause analysis and continuous improvement
  • Weekly "OpsMetrics" calls to share best practices across all service teams

Business Impact

  • Enables mission-critical workloads to run reliably on the AWS Cloud
  • Provides a resilience framework that customers can apply to their own cloud-based applications
  • Helps organizations meet regulatory and compliance requirements around resilience and availability
  • Allows rapid innovation and change without compromising the stability of production systems

Examples

  • Handling of the Log4j security vulnerability, where AWS was able to remediate the issue globally within 48 hours
  • Designing resilient applications by leveraging multi-AZ deployments of services like EC2 and S3
  • Using "load shedding" techniques to protect core functionality during traffic surges
