AWS re:Invent 2025 - Resilience of AWS Cloud: Design patterns for availability (ARC310)

Resilience of the AWS Cloud: Design Patterns for Availability

Understanding Resilience

The "resilience equation" encompasses more than just physical infrastructure failures
Applications have moved from the traditional "monolithic" model to modern "microservices" architectures, and resilience practices have had to evolve with them
Failure modes have changed, requiring a more comprehensive view of resilience

AWS Global Infrastructure

AWS regions are composed of multiple Availability Zones (AZs) with low-latency interconnects
AZs are designed to be isolated from environmental risks like earthquakes, flooding, and power outages
Redundant power, networking, and storage provide high availability within each region
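
To make the value of redundancy concrete, the following is a minimal sketch of the standard parallel-availability calculation. It assumes that failures of redundant components are fully independent, which is a simplification, and the 99% figure in the usage note is purely illustrative rather than a number from the talk. The function names are hypothetical.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumption: components fail independently, so the system is down
    only when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60
```

For example, `parallel_availability(0.99, 2)` models two independent copies of a component that is each available 99% of the time. Correlated failures, which AZ isolation is designed to reduce, would make the real figure lower than this idealized result.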

Resilience of AWS Services

AWS services like EC2 and S3 are themselves highly resilient, multi-AZ distributed applications
Customers must architect their applications to leverage the resilience of these AWS services
Understanding whether a service is "zonal" (EC2) or "regional" (S3) is key to designing resilient applications
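
The zonal versus regional distinction can be illustrated with a small capacity-planning sketch. It is not an AWS API: the function names and AZ identifiers are hypothetical, and it assumes instances are spread evenly and that at most one AZ is lost at a time.

```python
def spread_across_azs(instance_count: int, azs: list[str]) -> dict[str, int]:
    """Place instances round-robin across the given AZs."""
    if not azs:
        raise ValueError("at least one AZ is required")
    placement = {az: 0 for az in azs}
    for i in range(instance_count):
        placement[azs[i % len(azs)]] += 1
    return placement


def surviving_capacity(placement: dict[str, int], failed_az: str) -> int:
    """Instances still running if `failed_az` becomes unavailable."""
    return sum(count for az, count in placement.items() if az != failed_az)


def instances_needed(required: int, az_count: int) -> int:
    """Instances to provision so `required` remain after losing any one AZ."""
    if az_count < 2:
        raise ValueError("surviving an AZ loss needs at least two AZs")
    per_az = -(-required // (az_count - 1))  # ceiling division
    return per_az * az_count
```

With a zonal service such as EC2, this spreading is the customer's job. With a regional service such as S3, the service handles the multi-AZ distribution itself.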

AWS Service Operations

New services undergo a rigorous "Operational Readiness Review" before deployment
Automated deployment pipelines use staged rollouts and "bake time" to limit the impact of a bad change
The Correction of Error (CoE) process drives continuous improvement and learning from incidents
Weekly "OpsMetrics" calls share best practices and learnings across all service teams
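
The staged rollout with bake time described above can be sketched roughly as follows. This is an illustration of the general pattern, not AWS's internal pipeline: the `deploy`, `is_healthy`, and `rollback` callbacks and the stage names in the usage note are hypothetical placeholders that a real pipeline would supply.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    bake_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy stage by stage, waiting and checking health after each one.

    Returns True if every stage stayed healthy. On the first unhealthy
    stage, rolls back all deployed stages in reverse order and returns False.
    """
    completed: list[str] = []
    for stage in stages:
        deploy(stage)
        completed.append(stage)
        sleep(bake_seconds)  # bake time: let problems surface before widening
        if not is_healthy(stage):
            for done in reversed(completed):
                rollback(done)
            return False
    return True
```

A caller might pass stages such as `["one-box", "one-az", "one-region"]` so that each step widens the blast radius only after the previous step has baked without alarms.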

Handling Traffic Surges

"Load shedding" techniques protect core functionality during periods of extreme traffic
Proactive capacity planning and performance testing help avoid overload conditions
Retries and timeouts must be carefully managed so that clients do not amplify load on an already overloaded service
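
One common way to keep retries from compounding an overload is a bounded number of attempts with capped exponential backoff and jitter. The sketch below shows that general technique and is not taken from the talk. The function names and default values are assumptions, and `operation` is expected to enforce its own timeout.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 5.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying failures up to `max_attempts` times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # real code should catch only retryable errors
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The hard cap on attempts bounds the extra load any single client can generate, and the jitter spreads retries out so that many clients do not retry in lockstep.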

Security Resilience

AWS has a "culture of security" that shifts security practices as far left, meaning as early in the development lifecycle, as possible
Automated deployment pipelines and incident response processes enable rapid mitigation of security issues
Continuous monitoring and learning help identify and address emerging security risks

Key Takeaways

Resilience in the cloud goes beyond just physical infrastructure, encompassing service design, operations, and security
AWS has built-in resilience mechanisms across its global infrastructure, services, and operational practices
Customers must architect their applications to leverage these resilience capabilities effectively
Detailed operational processes and continuous improvement drive ongoing resilience enhancements

Technical Details

38 AWS regions, 120+ Availability Zones globally
50+ million deployments per year across AWS services
Rigorous "Operational Readiness Review" with capacity, monitoring, and recovery requirements
Automated deployment pipelines with staged rollouts and "bake time"
"Correction of Error" process for incident root cause analysis and continuous improvement
Weekly "OpsMetrics" calls to share best practices across all service teams

Business Impact

Enables mission-critical workloads to run reliably on the AWS Cloud
Provides a resilience framework that customers can apply to their own cloud-based applications
Helps organizations meet regulatory and compliance requirements around resilience and availability
Allows rapid innovation and change without compromising the stability of production systems

Examples

Handling of the Log4j security vulnerability, where AWS was able to remediate the issue globally within 48 hours
Designing resilient applications by deploying zonal services like EC2 across multiple AZs and relying on regional services like S3
Using "load shedding" techniques to protect core functionality during traffic surges
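
As a rough illustration of load shedding, the sketch below rejects work once a concurrency limit is reached and holds back some capacity for critical requests. The class name, the notion of a "critical" flag, and the limits are all assumptions made for the example. The talk summary does not describe a specific mechanism.

```python
import threading


class LoadShedder:
    """Admit requests up to a concurrency limit and shed the rest."""

    def __init__(self, max_in_flight: int, reserved_for_critical: int = 0):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        if not 0 <= reserved_for_critical < max_in_flight:
            raise ValueError("reserved_for_critical must be in [0, max_in_flight)")
        self._max = max_in_flight
        self._reserved = reserved_for_critical
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical: bool = False) -> bool:
        """Return True if the request is admitted, False if it is shed."""
        # Non-critical requests may not use the reserved slots.
        limit = self._max if critical else self._max - self._reserved
        with self._lock:
            if self._in_flight >= limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

A handler would call `try_acquire()` before doing any work, return a fast rejection when it yields `False`, and call `release()` in a `finally` block once an admitted request completes. Rejecting early is cheap, which is what lets the service keep serving its core functionality under a surge.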
