AWS re:Invent 2025 - Building at global scale: Engineering AWS expansion (ARC312)
AWS Global Infrastructure Overview
At the time of the talk, AWS had launched 38 regions, with more in development (the AWS European Sovereign Cloud, Chile, and the Kingdom of Saudi Arabia)
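Regions are the highest level of abstraction, each comprising three or more Availability Zones
Each Availability Zone consists of one or more discrete data centers with independent power and networking, isolated from the other zones in the region and designed for high resiliency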
Local Zones extend a region into additional locations, providing low-latency compute and storage closer to end users
Resilient Architecture
Understanding the scope of services is key for building resilient systems
Zonal services (e.g. EC2) are scoped to a single Availability Zone, while regional services (e.g. S3, DynamoDB) span the zones of a region
Leveraging multiple availability zones within a region can provide sufficient resiliency for most use cases
Multi-region strategies add complexity but may be required for certain workloads
AWS focuses on building resilience into its services from the ground up
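The claim that multiple Availability Zones are usually sufficient can be made concrete with a little capacity arithmetic. The sketch below assumes a common pre-provisioning rule (size each zone so that the surviving zones can still carry the full load); the function names are hypothetical and the rule is an illustration, not something the talk prescribes.

```python
import math


def instances_per_az(required: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Instances to run in each zone so the surviving zones still meet `required`."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving Availability Zone")
    # Spread the required capacity over the zones expected to survive.
    return math.ceil(required / surviving)


def total_provisioned(required: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Total instances across all zones under the same sizing rule."""
    return instances_per_az(required, az_count, az_failures_tolerated) * az_count


# A workload needing 12 instances, spread over 3 zones, tolerating the loss of 1 zone.
print(instances_per_az(12, 3))   # 6 per zone
print(total_provisioned(12, 3))  # 18 in total
```

With more zones the overhead shrinks: the same workload over four zones needs 4 instances per zone, 16 in total.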
Evolving Region Build Processes
Early region builds involved manually configuring each availability zone, work done by so-called "region bootstrap ninjas"
This was error-prone and difficult to scale as the global footprint expanded
Modern region builds leverage a "bootstrap region" to parallelize the physical and software build processes
The bootstrap region is used to pre-build core services and infrastructure
This allows the new region to be launched more quickly by migrating the pre-built components
Dependency Management Challenges
The AWS service ecosystem has grown extremely complex, with hundreds of interdependent services
Attempting to map and orchestrate all dependencies is impractical due to the dynamic nature of the system
AWS leverages "static stability" to enable services to recover gracefully without relying on perfect dependency resolution
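One way to picture static stability is a component that keeps working from its last known good state when a dependency becomes unavailable. The following is a minimal sketch under that assumption; the class name, the fetch function, and the sample config are all hypothetical and do not represent how any AWS service is implemented.

```python
from typing import Callable, Dict, Optional


class StaticallyStableConfig:
    """Serves the last successfully fetched config when the dependency is down."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]):
        self._fetch = fetch
        self._last_good: Optional[Dict[str, str]] = None

    def get(self) -> Dict[str, str]:
        try:
            # Normal path: refresh the cached copy from the dependency.
            self._last_good = dict(self._fetch())
        except Exception:
            # Dependency unavailable: keep running on the last known good copy.
            if self._last_good is None:
                raise  # nothing cached yet, so there is no state to fall back on
        return self._last_good


# Hypothetical dependency that answers once and then becomes unreachable.
calls = {"n": 0}


def flaky_fetch() -> Dict[str, str]:
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("control plane unreachable")
    return {"endpoint": "10.0.0.5"}


config = StaticallyStableConfig(flaky_fetch)
first = config.get()   # fetched from the dependency
second = config.get()  # dependency is down, served from the cached copy
print(first == second)
```

The design choice here is that the steady state needs no successful call to the dependency, so the order in which dependencies recover matters less.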
Continuous Improvement and Testing
AWS conducts regular "game day" exercises to test failure scenarios and validate resilience
The "Correction of Errors" (COE) process is used to thoroughly investigate incidents and drive systemic improvements
Operational Readiness Reviews (ORRs) ensure services are operationally healthy before launch
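A game day can be thought of as deliberately removing one dependency at a time and observing what breaks. The sketch below simulates that idea in memory; the dependency names, the operations, and the `run_game_day` helper are hypothetical illustrations, not AWS tooling.

```python
from typing import Callable, Dict, List, Set


def run_game_day(
    dependencies: Set[str],
    operations: Dict[str, Callable[[Set[str]], bool]],
) -> Dict[str, List[str]]:
    """For each dependency, simulate its outage and list the operations that break."""
    report: Dict[str, List[str]] = {}
    for failed in sorted(dependencies):
        available = dependencies - {failed}
        report[failed] = [name for name, op in operations.items() if not op(available)]
    return report


# Each hypothetical operation returns True if it can complete with what is available.
def read_object(available: Set[str]) -> bool:
    return "storage" in available


def launch_instance(available: Set[str]) -> bool:
    # A hidden dependency: launching also needs DNS, which may not be documented.
    return {"compute", "dns"} <= available


deps = {"compute", "dns", "storage"}
ops = {"read_object": read_object, "launch_instance": launch_instance}
report = run_game_day(deps, ops)
for failed, broken in report.items():
    print(f"{failed} down -> broken operations: {broken}")
```

In this toy run the DNS outage breaks instance launches, which is the kind of hidden dependency such an exercise is meant to surface.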
Key Takeaways
Architect services to be aware of their zonal/regional scope and dependencies
Leverage infrastructure-as-code to automate configuration and deployment
Embrace a culture of continuous improvement, with rapid feedback loops
Empower engineers to escalate issues and drive systemic changes
Test extensively to validate resilience and uncover hidden dependencies
Additional Resources
Amazon Builders' Library article on "Building Resilient Services"
Amazon Builders' Library article on "Static Stability"