AWS re:Invent 2025 - Large-scale software deployments: Inside Amazon S3’s release pipeline (STG352)

Comprehensive Summary: Large-scale Software Deployments at Amazon S3

Introduction

  • Presenters: George Lewis and Wana
  • Goal: Discuss Amazon S3's best practices for safe and scalable software deployments
  • Assumptions: Audience has experience as developers and operators, with existing deployment pipelines and infrastructure as code

Deployment Safety Practices

Testing Infrastructure

  • Noodles: Behavior-driven test platform for S3's public APIs
    • Allows distributed teams to write and run tests without implementing low-level step definitions
    • Abstracts and shares test infrastructure (accounts, resources, etc.) across teams
    • Converts tests to run as on-demand tests and continuous canaries in production
  • Hi-Fi: Model-based testing system for comprehensive API coverage
    • Starts with a precise executable model/specification of the API
    • Test generator systematically explores all possible request combinations based on the model and real customer behavior
    • Validator service checks service implementations against the model, reporting any deviations
    • Runs in integration, non-prod regions, and select production regions to catch cross-partition issues
  • Performance Testing
    • Three phases: software feature performance, instance performance, and per-region ratings
    • Measures performance at the micro level, then on representative hardware, and finally in production
    • Provides data for capacity planning and redundancy decisions based on regional traffic profiles
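
The model-based approach described for Hi-Fi can be illustrated with a toy example. The sketch below is not S3's implementation: `ModelStore`, `BuggyStore`, and `find_deviation` are hypothetical names, the "API" is a three-operation key-value store, and the seeded bug exists only for the demonstration. It enumerates every short request sequence, replays each one against the model and the implementation, and reports the first response that deviates from the model.

```python
import itertools


class ModelStore:
    """Executable specification: a plain dict defines the expected behavior."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        return "OK"

    def get(self, key):
        return self.data.get(key, "NoSuchKey")

    def delete(self, key):
        self.data.pop(key, None)
        return "OK"


class BuggyStore(ModelStore):
    """Hypothetical implementation under test, with a seeded bug."""

    def delete(self, key):
        # Seeded bug: the key is silently kept when its value is "b".
        if self.data.get(key) == "b":
            return "OK"
        return super().delete(key)


def call(store, request):
    """Dispatch a request tuple such as ("put", "k", "a") to the store."""
    op, *args = request
    return getattr(store, op)(*args)


def find_deviation(impl_factory, max_len=3):
    """Replay all request sequences up to max_len; return the first mismatch.

    Returns (request_prefix, expected, actual), or None if the implementation
    agrees with the model on every generated sequence.
    """
    requests = [("put", "k", "a"), ("put", "k", "b"), ("get", "k"), ("delete", "k")]
    for length in range(1, max_len + 1):
        for seq in itertools.product(requests, repeat=length):
            model, impl = ModelStore(), impl_factory()
            for i, request in enumerate(seq):
                expected = call(model, request)
                actual = call(impl, request)
                if expected != actual:
                    return seq[: i + 1], expected, actual
    return None
```

Calling `find_deviation(BuggyStore)` surfaces the bug only at sequence length three (put "b", delete, get), which is the kind of multi-step interaction that hand-written tests tend to miss. A real generator would also weight sequences by observed customer behavior, as the summary notes.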

Blast Radius Containment

  • S3 web server deployment pipeline:
    • Pre-production testing (Noodles, Hi-Fi, performance)
    • Validators (non-prod hosts in each region for canary traffic)
    • First region deployment (one-box stage, with extra bake time)
    • us-east-1 deployment (largest region, varied traffic patterns)
    • Exponential fan-out to remaining regions (4 regions, then all remaining)
    • Continuous monitoring of key metrics and alarms during rollout
    • Canary clients to detect silent failures before customer impact
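
The wave ordering in this pipeline can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual pipeline: `plan_waves`, `roll_out`, the `deploy` and `is_healthy` callbacks, and the region names in the usage example are all hypothetical, and the pre-production, validator, and one-box stages are left out.

```python
def plan_waves(first_region, largest_region, other_regions, fan_out=4):
    """Order regions into waves: one region, the largest region,
    then `fan_out` regions, then everything that is left."""
    waves = [[first_region], [largest_region]]
    if other_regions:
        waves.append(list(other_regions[:fan_out]))
    if len(other_regions) > fan_out:
        waves.append(list(other_regions[fan_out:]))
    return waves


def roll_out(waves, deploy, is_healthy):
    """Deploy wave by wave and stop before the next wave if any region
    in the current wave is unhealthy.

    Returns (completed_waves, halted_wave); halted_wave is None when the
    rollout reaches every wave.
    """
    completed = []
    for wave in waves:
        for region in wave:
            deploy(region)
        # Stand-in for monitoring key metrics, alarms, and canary clients.
        if not all(is_healthy(region) for region in wave):
            return completed, wave
        completed.append(wave)
    return completed, None


# Usage with placeholder region names.
others = [f"region-{i}" for i in range(1, 8)]
waves = plan_waves("region-first", "us-east-1", others)
deployed = []
completed, halted = roll_out(waves, deployed.append, lambda r: r != "region-2")
```

In this usage example the unhealthy `region-2` sits in the third wave, so the rollout halts there and the final wave is never touched. Gating each wave on health is what keeps a bad change from reaching the remaining regions.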

Application Feature Controls

  • Feature Flags: Use AWS AppConfig to dynamically enable/disable features
  • Allow/Deny Lists: Selectively enable features for internal testing and gradual rollout
  • Shadow Deployments: Copy production traffic to a parallel service instance for validation without impacting customers
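
A minimal sketch of how these three controls might look in application code follows. It makes several assumptions: the flag is an in-memory dict with a hypothetical schema (not the AWS AppConfig format), deny entries take precedence over allow entries, and `is_enabled` and `handle_with_shadow` are illustrative names rather than anything from the talk.

```python
import hashlib


def is_enabled(flag, account_id):
    """Decide whether a feature is on for one caller.

    Assumed order of evaluation: global switch, deny list, allow list,
    then a stable percentage rollout keyed on the account ID.
    """
    if not flag.get("enabled", False):
        return False
    if account_id in flag.get("deny", ()):
        return False
    if account_id in flag.get("allow", ()):
        return True
    # Hash the ID so the same account always lands in the same bucket (0..99).
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < flag.get("rollout_percent", 0)


def handle_with_shadow(request, primary, shadow, mismatches):
    """Serve the request from `primary`, mirror it to `shadow`, and record
    any difference. The caller only ever sees the primary response."""
    response = primary(request)
    try:
        if shadow(request) != response:
            mismatches.append(request)
    except Exception:
        # A crash in the shadow path is recorded, never surfaced to the caller.
        mismatches.append(request)
    return response
```

Keeping the shadow call inside a `try` block reflects the point of shadowing: the candidate service can fail or disagree without affecting the response customers receive. A production system would typically mirror traffic asynchronously rather than inline as shown here.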

Stateful Deployments for Durable Storage

Data Preservation Considerations

  • Integrity: Ensure data accuracy and trustworthiness over time
  • Consistency: Maintain a single, up-to-date view of data across all clients
  • Resiliency: Ability to recover from data loss or corruption
  • Durability: Protect data from loss or corruption long-term

Durability Threat Modeling

  • Identify all possible ways data could be compromised (hardware failures, software bugs, operator errors)
  • Implement mitigation strategies like redundancy, validation tools, and process automation

Deployment Workflow for Stateful Hosts

  • Host Reservation System: Coordinates all maintenance activities, enforcing safety checks
  • Fleet Updater: Orchestrates deployments, patches, and hardware maintenance
    • Handles both successful and failed reservations, with built-in safety controls
  • Design principle: assume every deployment will fail, and design for resilience and data restoration
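
The interaction between a reservation system and a fleet updater can be sketched as follows. The summary does not say which safety checks S3 enforces, so the check used here (a cap on hosts under maintenance per zone) is an assumption chosen for illustration, as are the names `ReservationSystem` and `update_fleet` and the choice to keep a host reserved after a failed update.

```python
class ReservationSystem:
    """Grants a maintenance reservation only if the zone stays within its cap."""

    def __init__(self, host_zones, max_reserved_per_zone=1):
        self.host_zones = dict(host_zones)  # host name -> zone name
        self.max_reserved_per_zone = max_reserved_per_zone
        self.reserved = set()

    def reserve(self, host):
        zone = self.host_zones[host]
        in_zone = sum(1 for h in self.reserved if self.host_zones[h] == zone)
        if host in self.reserved or in_zone >= self.max_reserved_per_zone:
            return False  # assumed safety check: too many hosts already out
        self.reserved.add(host)
        return True

    def release(self, host):
        self.reserved.discard(host)


def update_fleet(system, hosts, apply_update):
    """Update hosts one at a time, reserving each before touching it.

    Returns (updated, deferred, failed). Denied reservations are deferred
    for a later pass. A host whose update raises stays reserved, so it keeps
    counting against its zone until someone repairs and releases it.
    """
    updated, deferred, failed = [], [], []
    for host in hosts:
        if not system.reserve(host):
            deferred.append(host)
            continue
        try:
            apply_update(host)
        except Exception:
            failed.append(host)
        else:
            updated.append(host)
            system.release(host)
    return updated, deferred, failed
```

Holding the reservation after a failure is one way to express "assume every deployment will fail": a broken host automatically blocks further maintenance in its zone, which limits how much redundancy can be removed at once.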

Results and Impact

  • Operator escalations during deployments fell from 478 in 2022 to fewer than 10 in 2025
  • Techniques enable safe, scalable deployments for S3's massive, globally distributed infrastructure

Key Takeaways

  • Comprehensive testing (functional, model-based, performance) is crucial for deployment safety
  • Containing blast radius through gradual rollouts and canary monitoring is essential
  • Feature controls like flags and shadow deployments provide safety nets for application changes
  • Stateful deployments require additional considerations for data preservation and durability
  • Proactive threat modeling and automated safety checks can dramatically improve operational resilience
