AWS spans 108 availability zones within 34 geographic regions, with plans for 18 more availability zones and 6 new regions.
AWS is divided into separate partitions: the commercial partition, US GovCloud, China, and the upcoming EU Sovereign Cloud.
Each region has multiple availability zones, which are physically separated by a meaningful distance, up to about 60 miles (100 km), to prevent correlated failures.
Availability zones within a region are connected with high-bandwidth fiber connections, and transit centers provide connectivity to the global AWS backbone network.
Designing for Resiliency
AWS follows a shared responsibility model, where customers must understand the zonal/regional nature of services they use and plan their operational response accordingly.
Customers are encouraged to test their disaster recovery plans through game days, as AWS does for its own region launches.
Evolution of Region Building
In the early days, AWS used a "bootstrap ninja" process, where an engineer would manually set up each availability zone by plugging in jump drives with configurations.
This approach was not scalable as AWS continued to innovate and expand its service offerings.
AWS shifted to a parallel approach, where the physical data center construction and software build are done concurrently, using existing regions to bootstrap the new region.
Building Regions Today
The region build process has two main workstreams: the physical data center construction and the software build.
For the software build, AWS uses a "bootstrap region" to kickstart the process, creating a VPC and initial services like authentication.
AWS then brings up services in waves, with each service checking the health of its dependencies before activating.
This ties into what AWS calls static stability: services are designed to keep running, and to recover on their own, when a dependency is impaired, which reduces manual intervention during operational events.
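The wave-based bring-up described above can be illustrated with a small dependency graph. This is a minimal sketch, not AWS's actual tooling: the service names, the DEPENDENCIES map, and the plan_waves and bring_up helpers are all hypothetical, and the health check is a caller-supplied function standing in for whatever real probe a service would use.

```python
from graphlib import TopologicalSorter

# Hypothetical services mapped to the services they depend on (illustration only).
DEPENDENCIES = {
    "network": set(),
    "authentication": {"network"},
    "storage": {"network", "authentication"},
    "compute": {"network", "authentication"},
    "database": {"storage", "compute"},
}


def plan_waves(dependencies):
    """Group services into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


def bring_up(dependencies, is_healthy):
    """Activate services wave by wave, gated on the health of their dependencies."""
    active = []
    for wave in plan_waves(dependencies):
        for service in wave:
            unhealthy = [d for d in sorted(dependencies[service]) if not is_healthy(d)]
            if unhealthy:
                raise RuntimeError(
                    f"{service} blocked by unhealthy dependencies: {unhealthy}"
                )
            active.append(service)
    return active


if __name__ == "__main__":
    for number, wave in enumerate(plan_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Grouping by wave makes the parallelism explicit: every service in a wave can start at the same time, and a circular dependency surfaces as an error at planning time rather than as a stalled build.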
AWS Builders and Region Build Tooling
AWS has mechanisms in place to facilitate collaboration across its global teams, including:
Tickets for self-service problem-solving and issue tracking
Operational Readiness Reviews (ORRs) to assess service readiness
Correction of Error (COE) documents for post-incident learning
AWS also has region build-specific tooling, including project management tools for scheduling and dependency tracking, and internal documentation for service teams.
Key Takeaways
Design for resiliency in your systems, using principles like fault containers and availability zones.
Carefully consider when to centralize versus distribute problem-solving efforts.
Ensure everything is managed as code, including configurations and documentation.
Measure problems and solutions with data to drive decision-making.
Cultivate a culture of leadership engagement and healthy escalation for timely issue resolution.