AWS re:Invent 2025 - Building at global scale: Engineering AWS expansion (ARC312)

Building at Global Scale: Engineering AWS Expansion

AWS Global Infrastructure Overview

  • AWS has launched 38 regions, with more in development (EU sovereign cloud, Chile, Kingdom of Saudi Arabia)
  • Regions are the highest level of abstraction, each comprising three or more availability zones
  • Each availability zone consists of one or more isolated data centers within a region, designed for high resiliency
  • Local zones are extensions of a region, providing low-latency compute and storage closer to end-users

Resilient Architecture

  • Understanding the scope of services is key for building resilient systems
    • Zonal services (e.g. EC2) vs. regional services (e.g. S3, DynamoDB)
    • Leveraging multiple availability zones within a region can provide sufficient resiliency for most use cases
    • Multi-region strategies add complexity but may be required for certain workloads
  • AWS focuses on building resilience into its services from the ground up
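
The multi-AZ point above can be made concrete with a little capacity arithmetic. The following is a minimal sketch, not anything shown in the talk: it assumes a zonal workload whose capacity is measured in identical instances, and the function names and AZ labels are hypothetical.

```python
import math
from typing import Dict


def instances_per_az(required: int, az_count: int) -> int:
    """Instances to run in each AZ so the remaining AZs still
    cover `required` instances after any single AZ is lost."""
    if az_count < 2:
        raise ValueError("need at least two AZs to tolerate losing one")
    return math.ceil(required / (az_count - 1))


def survives_single_az_loss(placement: Dict[str, int], required: int) -> bool:
    """True if removing any one AZ from `placement` still leaves
    at least `required` instances running."""
    if not placement:
        return False
    total = sum(placement.values())
    return all(total - count >= required for count in placement.values())


if __name__ == "__main__":
    # Hypothetical workload: 6 instances needed at peak, 3 AZs available.
    per_az = instances_per_az(required=6, az_count=3)
    placement = {"az-a": per_az, "az-b": per_az, "az-c": per_az}
    print(per_az, survives_single_az_loss(placement, required=6))
```

Under these assumptions, spreading exactly the required capacity evenly (2 instances in each of 3 AZs for a requirement of 6) would not survive an AZ loss, which is why the sketch over-provisions each zone.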

Evolving Region Build Processes

  • Early region builds involved manually configuring each availability zone ("region bootstrap ninjas")
    • This was error-prone and difficult to scale as the global footprint expanded
  • Modern region builds leverage a "bootstrap region" to parallelize the physical and software build processes
    • The bootstrap region is used to pre-build core services and infrastructure
    • This allows the new region to be launched more quickly by migrating the pre-built components

Dependency Management Challenges

  • The AWS service ecosystem has grown extremely complex, with hundreds of interdependent services
  • Attempting to map and orchestrate all dependencies is impractical due to the dynamic nature of the system
  • AWS leverages "static stability" to enable services to recover gracefully without relying on perfect dependency resolution
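
One common way to apply static stability in a client is to keep serving the last known good result when a dependency stops responding. The sketch below illustrates that idea only; the `LastKnownGood` class and the `fetch` callable are hypothetical names, not an AWS API, and a production version would also need to bound staleness.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LastKnownGood(Generic[T]):
    """Wraps a dependency call and falls back to the most recent
    successful result when the dependency fails."""

    def __init__(self, fetch: Callable[[], T]) -> None:
        self._fetch = fetch
        self._cached: Optional[T] = None
        self._has_value = False

    def get(self) -> T:
        try:
            value = self._fetch()
        except Exception:
            # Dependency is impaired: keep operating on existing state.
            if not self._has_value:
                raise  # nothing cached yet, so the failure must surface
            return self._cached  # type: ignore[return-value]
        # Dependency is healthy: refresh the cached state.
        self._cached = value
        self._has_value = True
        return value
```

The design choice here is that the failure path does no extra work: it makes no calls to other services and creates no new resources, so recovery does not depend on anything else being healthy.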

Continuous Improvement and Testing

  • AWS conducts regular "game day" exercises to test failure scenarios and validate resilience
  • The "Correction of Errors" (COE) process is used to thoroughly investigate incidents and drive systemic improvements
  • Operational Readiness Reviews (ORRs) ensure services are operationally healthy before launch
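
A game day can be approximated in miniature by breaking one dependency at a time and recording whether the service stays healthy. This is a rough sketch under stated assumptions: `FaultInjector`, `run_game_day`, and the health-check callable are hypothetical, and real exercises run against live systems rather than in-process stubs.

```python
from typing import Any, Callable, Dict, List, Set


class FaultInjector:
    """Tracks which named dependencies are currently forced to fail."""

    def __init__(self) -> None:
        self._broken: Set[str] = set()

    def break_dependency(self, name: str) -> None:
        self._broken.add(name)

    def restore(self, name: str) -> None:
        self._broken.discard(name)

    def wrap(self, name: str, call: Callable[..., Any]) -> Callable[..., Any]:
        """Return a version of `call` that raises while `name` is broken."""
        def guarded(*args: Any, **kwargs: Any) -> Any:
            if name in self._broken:
                raise ConnectionError(f"injected fault: {name}")
            return call(*args, **kwargs)
        return guarded


def run_game_day(
    injector: FaultInjector,
    dependencies: List[str],
    health_check: Callable[[], bool],
) -> Dict[str, bool]:
    """Break each dependency in turn and record whether the service
    still passes its health check. Faults are always restored."""
    results: Dict[str, bool] = {}
    for name in dependencies:
        injector.break_dependency(name)
        try:
            try:
                results[name] = bool(health_check())
            except Exception:
                results[name] = False
        finally:
            injector.restore(name)
    return results
```

A `False` entry in the result marks a hard dependency. If the team believed that dependency was optional, the exercise has uncovered a hidden one.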

Key Takeaways

  • Architect services to be aware of their zonal/regional scope and dependencies
  • Leverage infrastructure-as-code to automate configuration and deployment
  • Embrace a culture of continuous improvement, with rapid feedback loops
  • Empower engineers to escalate issues and drive systemic changes
  • Test extensively to validate resilience and uncover hidden dependencies

Additional Resources

  • Amazon Builders' Library article on "Building Resilient Services"
  • Amazon Builders' Library article on "Static Stability"
