AWS re:Invent 2025 - Resilience of AWS Cloud: Design patterns for availability (ARC310)
Understanding Resilience
The "resilience equation" encompasses more than just physical infrastructure failures
Applications have moved from the traditional "monolithic" model to modern "microservices" architectures, and resilience practices have had to evolve with them
Failure modes have changed, requiring a more comprehensive view of resilience
AWS Global Infrastructure
AWS regions are composed of multiple Availability Zones (AZs) with low-latency interconnects
AZs are designed to be isolated from environmental risks like earthquakes, flooding, and power outages
Redundant power, networking, and storage provide high availability within each region
Resilience of AWS Services
AWS services like EC2 and S3 are themselves highly resilient, multi-AZ distributed applications
Customers must architect their applications to leverage the resilience of these AWS services
Understanding whether a service is "zonal" (EC2) or "regional" (S3) is key to designing resilient applications
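The zonal versus regional distinction can be made concrete with a small sketch. With a zonal service such as EC2, the customer decides which AZ each instance lands in, so spreading the fleet is the customer's job; with a regional service such as S3, that placement is handled by the service. The code below is only an illustration under stated assumptions: the zone names, the fleet size, and the helper functions `spread_across_zones` and `surviving_capacity` are hypothetical and do not call any AWS API.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin so the fleet is evenly spread."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_capacity(placement, failed_zone):
    """Count the instances still running if one whole zone is lost."""
    return sum(1 for zone in placement if zone != failed_zone)


# Hypothetical zone names and fleet size, for illustration only.
zones = ["zone-a", "zone-b", "zone-c"]
placement = spread_across_zones(6, zones)
per_zone = Counter(placement)

# Worst case over all single-zone failures.
worst_case = min(surviving_capacity(placement, z) for z in zones)
```

With six instances over three zones, losing any single zone leaves four running. If the workload needs six at all times, the fleet would have to be sized larger up front, which is a design choice this sketch does not make for you.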
AWS Service Operations
New services undergo a rigorous "Operational Readiness Review" before deployment
Automated deployment pipelines use staged rollouts and "bake time" so that problems surface before a change reaches the whole fleet
Correction of Error (CoE) process drives continuous improvement and learning from incidents
Weekly "OpsMetrics" calls share best practices and learnings across all service teams
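The staged rollout with bake time described above can be sketched as a simple loop. This is a minimal illustration, not AWS's actual pipeline: the stage names, the `staged_rollout` function, and the `deploy`, `is_healthy`, and `rollback` callbacks are all hypothetical placeholders that a real system would replace with its own tooling.

```python
import time


def staged_rollout(stages, deploy, is_healthy, rollback,
                   bake_seconds=0.0, sleep=time.sleep):
    """Deploy stage by stage; roll everything back on the first unhealthy stage.

    Returns the list of stages that were deployed and kept (empty on rollback).
    """
    completed = []
    for stage in stages:
        deploy(stage)
        sleep(bake_seconds)  # bake time: let problems surface before widening
        if not is_healthy(stage):
            # Undo the failing stage and every earlier one, newest first.
            for done in reversed(completed + [stage]):
                rollback(done)
            return []
        completed.append(stage)
    return completed


# Hypothetical stages; the second one is simulated as unhealthy.
log = []
result = staged_rollout(
    stages=["one-host", "one-zone", "one-region"],
    deploy=lambda s: log.append(("deploy", s)),
    is_healthy=lambda s: s != "one-zone",
    rollback=lambda s: log.append(("rollback", s)),
    bake_seconds=0.0,
)
```

In this run the change never reaches "one-region": the failure in "one-zone" stops the rollout and both deployed stages are rolled back. Starting with the smallest stage keeps the blast radius of a bad change small.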
Handling Traffic Surges
"Load shedding" techniques protect core functionality during periods of extreme traffic
Proactive capacity planning and performance testing help avoid overload conditions
Retries and timeouts must be carefully managed so that clients do not add more load to a service that is already overloaded
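One common way to keep retries from making an overload worse is to bound them: a small attempt limit, exponentially growing delays with random jitter, and an overall deadline. The sketch below assumes that approach; the talk summary does not say which retry policy AWS uses, and the function name `call_with_retries` and all default values are hypothetical.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      deadline_seconds=5.0, sleep=time.sleep, clock=time.monotonic):
    """Retry a failing call with capped, jittered backoff and an overall deadline."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: a random delay up to the capped exponential bound,
            # so many clients do not retry in lockstep.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0, bound)
            if clock() - start + delay > deadline_seconds:
                raise  # another attempt would exceed the caller's time budget
            sleep(delay)


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


outcome = call_with_retries(flaky, sleep=lambda s: None)
```

The attempt limit and the deadline both cap the extra work a single caller can generate, while the jitter spreads retries out in time.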
Security Resilience
AWS has a "culture of security" that shifts security practices as far left (as early in the development process) as possible
Automated deployment pipelines and incident response processes enable rapid mitigation of security issues
Continuous monitoring and learning help identify and address emerging security risks
Key Takeaways
Resilience in the cloud goes beyond just physical infrastructure, encompassing service design, operations, and security
AWS has built-in resilience mechanisms across its global infrastructure, services, and operational practices
Customers must architect their applications to leverage these resilience capabilities effectively
Detailed operational processes and continuous improvement drive ongoing resilience enhancements
Technical Details
38 AWS regions, 120+ Availability Zones globally
50+ million deployments per year across AWS services
Rigorous "Operational Readiness Review" with capacity, monitoring, and recovery requirements
Automated deployment pipelines with staged rollouts and "bake time"
"Correction of Error" process for incident root cause analysis and continuous improvement
Weekly "OpsMetrics" calls to share best practices across all service teams
Business Impact
Enables mission-critical workloads to run reliably on the AWS Cloud
Provides a resilience framework that customers can apply to their own cloud-based applications
Helps organizations meet regulatory and compliance requirements around resilience and availability
Allows rapid innovation and change without compromising the stability of production systems
Examples
Handling of the Log4j security vulnerability, where AWS was able to remediate the issue globally within 48 hours
Designing resilient applications by leveraging multi-AZ deployments of services like EC2 and S3
Using "load shedding" techniques to protect core functionality during traffic surges
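As a rough illustration of the load shedding example, the sketch below rejects requests once a concurrency limit is reached while keeping some headroom for critical traffic. It is an assumption about one possible design, not a description of how any AWS service implements load shedding; the `LoadShedder` class, its parameters, and the limits used are hypothetical.

```python
import threading


class LoadShedder:
    """Reject work beyond a concurrency limit, reserving headroom for critical requests."""

    def __init__(self, max_in_flight, reserved_for_critical=0):
        self.max_in_flight = max_in_flight
        self.reserved = reserved_for_critical
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical=False):
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            # Non-critical requests see a lower limit than critical ones.
            limit = self.max_in_flight if critical else self.max_in_flight - self.reserved
            if self.in_flight >= limit:
                return False
            self.in_flight += 1
            return True

    def release(self):
        """Mark one in-flight request as finished."""
        with self._lock:
            self.in_flight = max(0, self.in_flight - 1)


# Three slots in total, one of them reserved for critical requests.
shedder = LoadShedder(max_in_flight=3, reserved_for_critical=1)
decisions = [
    shedder.try_acquire(),               # non-critical, admitted
    shedder.try_acquire(),               # non-critical, admitted
    shedder.try_acquire(),               # non-critical, shed
    shedder.try_acquire(critical=True),  # critical, uses the reserved slot
    shedder.try_acquire(critical=True),  # critical, shed: server is full
]
```

Shedding early and cheaply lets the server spend its capacity on requests it can actually finish, which is what protects the core functionality during a surge.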