Anatomy of an AWS Region (ARC204)

Building AWS Regions: An Inside Look

AWS spans 108 availability zones within 34 geographic regions, with plans for 18 more availability zones and 6 new regions.
AWS is divided into separate partitions: the commercial partition, AWS GovCloud (US), China, and the upcoming AWS European Sovereign Cloud.
Each region has multiple availability zones, which are physically separated by a meaningful distance (many kilometers, though all within about 100 km, or 60 miles, of each other) to prevent correlated failures.
Availability zones within a region are connected with high-bandwidth fiber connections, and transit centers provide connectivity to the global AWS backbone network.

AWS follows a shared responsibility model, where customers must understand the zonal/regional nature of services they use and plan their operational response accordingly.
Customers are encouraged to test their disaster recovery plans through game days, as AWS does for its own region launches.

In the early days, AWS used a "bootstrap ninja" process, where an engineer would manually set up each availability zone by plugging in jump drives with configurations.
This approach was not scalable as AWS continued to innovate and expand its service offerings.
AWS shifted to a parallel approach, where the physical data center construction and software build are done concurrently, using existing regions to bootstrap the new region.

The region build process has two main workstreams: the physical data center construction and the software build.
For the software build, AWS uses a "bootstrap region" to kickstart the process, creating a VPC and initial services like authentication.
AWS then brings up services in waves, with each service checking the health of its dependencies before activating.
This "Static Stability" approach allows services to recover themselves during operational events, reducing manual intervention.
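
The wave-based bring-up described above can be sketched as a dependency ordering problem. The following is a minimal illustration, not AWS's actual tooling: the service names, the `DEPENDENCIES` graph, and the `plan_waves` and `can_activate` helpers are all hypothetical.

```python
from typing import Dict, List, Set


def plan_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into waves.

    A service is placed in a wave only after all of its dependencies
    have been placed in earlier waves.
    """
    remaining = {svc: set(deps) for svc, deps in dependencies.items()}
    launched: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        # Services whose dependencies have all launched already.
        ready = sorted(svc for svc, deps in remaining.items() if deps <= launched)
        if not ready:
            # Nothing can make progress: a cycle or an unknown dependency.
            raise ValueError(f"unresolvable dependencies among: {sorted(remaining)}")
        waves.append(ready)
        launched.update(ready)
        for svc in ready:
            del remaining[svc]
    return waves


def can_activate(service: str,
                 dependencies: Dict[str, Set[str]],
                 healthy: Dict[str, bool]) -> bool:
    """Return True only if every dependency currently reports healthy."""
    return all(healthy.get(dep, False) for dep in dependencies[service])


# Hypothetical dependency graph, for illustration only.
DEPENDENCIES: Dict[str, Set[str]] = {
    "auth": set(),
    "networking": set(),
    "storage": {"auth", "networking"},
    "compute": {"auth", "networking", "storage"},
}

if __name__ == "__main__":
    for number, wave in enumerate(plan_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

In this sketch, planning decides the order ahead of time, while `can_activate` is the runtime gate: a service in a later wave still waits if one of its dependencies is unhealthy when its turn comes.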

AWS has mechanisms in place to facilitate collaboration across its global teams, including:
- Tickets for self-service problem-solving and issue tracking
- Operational Readiness Reviews (ORRs) to assess service readiness
- Correction of Error (COE) documents for post-incident learning
AWS also has region build-specific tooling, including project management tools for scheduling and dependency tracking, and internal documentation for service teams.

Design for resiliency in your systems, using principles like fault containers and availability zones.
Carefully consider when to centralize versus distribute problem-solving efforts.
Ensure everything is managed as code, including configurations and documentation.
Measure problems and solutions with data to drive decision-making.
Cultivate a culture of leadership engagement and healthy escalation for timely issue resolution.
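
As one way to make the first principle concrete, treating each availability zone as a fault container means provisioning enough capacity that losing a zone requires no emergency scaling. The sketch below is a common back-of-the-envelope calculation, not something taken from the talk; the function name and the example numbers are assumptions.

```python
import math


def instances_per_az(required: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the surviving AZs still meet `required`.

    `required` is the number of instances needed to serve peak load.
    """
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one availability zone must survive")
    # Surviving AZs must carry the full load on their own.
    return math.ceil(required / surviving)


if __name__ == "__main__":
    # Hypothetical workload: 6 instances needed at peak.
    for az_count in (2, 3):
        per_az = instances_per_az(required=6, az_count=az_count)
        print(f"{az_count} AZs: {per_az} per AZ, {per_az * az_count} total")
```

Under these assumed numbers, spreading across three zones needs 9 instances in total, while two zones need 12, which is one reason to prefer more zones when tolerating the loss of one.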