AWS re:Invent 2025 - Large-scale software deployments: Inside Amazon S3's release pipeline (STG352)
Comprehensive Summary: Large-scale Software Deployments at Amazon S3
Introduction
Presenters: George Lewis and Wana
Goal: Discuss Amazon S3's best practices for safe and scalable software deployments
Assumptions: Audience has experience as developers and operators, with existing deployment pipelines and infrastructure as code
Deployment Safety Practices
Testing Infrastructure
Noodles: Behavior-driven test platform for S3's public APIs
Allows distributed teams to write and run tests without implementing low-level step definitions
Abstracts and shares test infrastructure (accounts, resources, etc.) across teams
Converts tests to run as on-demand tests and continuous canaries in production
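Noodles itself is internal to Amazon, so the following is only a minimal sketch of the behavior-driven idea it describes: teams write plain-language steps, while shared step definitions and shared test resources (here, a fake in-memory bucket standing in for real accounts and buckets) handle the low-level work. All names here are invented for illustration.

```python
# Hypothetical behavior-driven test runner in the spirit of Noodles.
STEPS = {}

def step(phrase):
    """Register a shared step definition under a plain-language phrase."""
    def register(fn):
        STEPS[phrase] = fn
        return fn
    return register

class FakeS3:
    # Stand-in for shared, abstracted test infrastructure.
    def __init__(self):
        self.objects = {}
    def put_object(self, key, body):
        self.objects[key] = body
    def get_object(self, key):
        return self.objects[key]

@step("I put an object")
def put_step(ctx):
    ctx["s3"].put_object("greeting.txt", b"hello")

@step("I can read it back")
def get_step(ctx):
    assert ctx["s3"].get_object("greeting.txt") == b"hello"

def run_scenario(lines):
    """Execute a scenario written as plain-language step phrases."""
    ctx = {"s3": FakeS3()}
    for line in lines:
        STEPS[line](ctx)  # look up and run the shared step definition
    return True

assert run_scenario(["I put an object", "I can read it back"])
```

Because the step definitions and the backing resources are shared, a distributed team only writes the scenario lines; the same scenario could then be replayed continuously as a canary.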
Hi-Fi: Model-based testing system for comprehensive API coverage
Starts with a precise executable model/specification of the API
Test generator systematically explores all possible request combinations based on the model and real customer behavior
Validator service checks service implementations against the model, reporting any deviations
Runs in integration, non-prod regions, and select production regions to catch cross-partition issues
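Hi-Fi is likewise internal, but the model-based pattern it follows can be sketched: an executable model of a tiny put/get API serves as the oracle, a generator systematically enumerates request sequences, and a validator reports any deviation between the implementation under test and the model. The API and all names here are invented for illustration.

```python
from itertools import product

def model(store, op, key, value=None):
    # Precise, executable specification of expected behavior (the oracle).
    if op == "put":
        store[key] = value
        return "ok"
    return store.get(key, "NoSuchKey")  # op == "get"

def implementation(store, op, key, value=None):
    # The system under test; imagine this were a real service endpoint.
    if op == "put":
        store[key] = value
        return "ok"
    return store.get(key, "NoSuchKey")

def validate():
    """Run every 3-step request sequence against model and implementation."""
    deviations = []
    requests = list(product(["put", "get"], ["a", "b"]))
    for seq in product(requests, repeat=3):
        m_store, i_store = {}, {}
        for op, key in seq:
            expected = model(m_store, op, key, value="v")
            actual = implementation(i_store, op, key, value="v")
            if expected != actual:
                deviations.append((seq, expected, actual))
    return deviations

assert validate() == []  # no deviations: implementation matches the model
```

In a real system the generator would also weight sequences by observed customer behavior, and the validator would run continuously in non-prod and select production regions, as the talk describes.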
Performance Testing
Three phases: software feature performance, instance performance, and per-region rating
Measures performance at the micro level, then on representative hardware, and finally in production
Provides data for capacity planning and redundancy decisions based on regional traffic profiles
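As an illustration of the micro-level phase (not S3's actual tooling), a benchmark can sample an operation's latency distribution and report percentiles, the kind of data that would feed per-region rating and capacity-planning decisions. The operation here is a placeholder.

```python
import time
import statistics

def operation():
    # Stand-in for the code path under test.
    sum(range(1000))

def benchmark(fn, iterations=2000):
    """Collect per-call latencies and summarize them as percentiles."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)  # microseconds
    samples.sort()
    return {
        "p50": statistics.median(samples),
        "p99": samples[int(len(samples) * 0.99)],
    }

result = benchmark(operation)
assert 0 < result["p50"] <= result["p99"]
```

Percentiles rather than averages matter here because tail latency, not mean latency, usually drives redundancy and capacity decisions.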
Blast Radius Containment
S3 web server deployment pipeline:
Pre-production testing (Noodles, Hi-Fi, performance)
Validators (non-prod hosts in each region for canary traffic)
First region deployment (one box first, with extended bake time)
us-east-1 deployment (largest region, with the most varied traffic patterns)
Exponential fan-out to the remaining regions (a wave of 4 regions, then all the rest)
Continuous monitoring of key metrics and alarms during rollout
Canary clients to detect silent failures before customer impact
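The exponentially widening fan-out stage above can be sketched as a wave planner: after the one-box, first-region, and us-east-1 stages, each wave doubles in size until every region is covered. Region names and the exact doubling schedule are assumptions for illustration.

```python
def rollout_waves(regions, first_wave=4):
    """Split remaining regions into exponentially growing deployment waves."""
    waves, i, size = [], 0, first_wave
    while i < len(regions):
        waves.append(regions[i:i + size])
        i += size
        size *= 2  # exponential fan-out: each wave doubles
        # In a real pipeline, key metrics and canary alarms would be
        # checked between waves, halting the rollout on regression.
    return waves

remaining = [f"region-{n}" for n in range(14)]  # placeholder region names
waves = rollout_waves(remaining)
assert [len(w) for w in waves] == [4, 8, 2]
```

The point of the shape is blast-radius containment: a bad build is caught while it affects one box or one region, long before it reaches the full fleet.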
Application Feature Controls
Feature Flags: Use AWS AppConfig to dynamically enable/disable features
Allow/Deny Lists: Selectively enable features for internal testing and gradual rollout
Shadow Deployments: Copy production traffic to a parallel service instance for validation without impacting customers
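A minimal sketch of the flag-plus-list pattern above: in production the flag document would be fetched dynamically from AWS AppConfig, but here it is a literal dict, and the flag and account names are invented.

```python
# Hypothetical feature-flag document; AppConfig would serve this at runtime.
FLAGS = {
    "new_checksum_path": {
        "enabled": True,
        "allow_list": {"internal-test-account"},   # early access for testing
        "deny_list": {"opted-out-account"},        # always excluded
    }
}

def feature_enabled(flag_name, account_id):
    """Deny list wins, then allow list, then the global flag value."""
    flag = FLAGS.get(flag_name)
    if flag is None:
        return False
    if account_id in flag["deny_list"]:
        return False
    if account_id in flag["allow_list"]:
        return True  # allow-listed accounts see the feature even mid-rollout
    return flag["enabled"]

assert feature_enabled("new_checksum_path", "internal-test-account")
assert not feature_enabled("new_checksum_path", "opted-out-account")
```

Evaluating deny before allow, and both before the global switch, lets a feature be tested internally first, rolled out gradually, and killed instantly without a deployment.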
Stateful Deployments for Durable Storage
Data Preservation Considerations
Integrity: Ensure data accuracy and trustworthiness over time
Consistency: Maintain a single, up-to-date view of data across all clients
Resiliency: Ability to recover from data loss or corruption
Durability: Protect data from loss or corruption long-term
Durability Threat Modeling
Identify all possible ways data could be compromised (hardware failures, software bugs, operator errors)
Implement mitigation strategies like redundancy, validation tools, and process automation
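To make one threat/mitigation pair concrete, here is an illustrative (not S3-specific) defense against silent corruption: store a checksum alongside each object and validate it on every read, so bit rot surfaces as an error instead of bad data.

```python
import hashlib

def put(store, key, body):
    """Write an object together with its SHA-256 digest."""
    store[key] = (body, hashlib.sha256(body).hexdigest())

def get(store, key):
    """Read an object, failing loudly if the stored bytes no longer match."""
    body, expected = store[key]
    if hashlib.sha256(body).hexdigest() != expected:
        raise IOError(f"integrity check failed for {key}")
    return body

store = {}
put(store, "obj", b"important data")
assert get(store, "obj") == b"important data"

# Simulate bit rot: corrupt the stored bytes but keep the old checksum.
body, digest = store["obj"]
store["obj"] = (b"imp0rtant data", digest)
detected = False
try:
    get(store, "obj")
except IOError:
    detected = True  # corruption detected, as the threat model requires
assert detected
```

Detection alone is only half a mitigation; the redundancy mentioned above is what then allows the corrupted copy to be repaired from a healthy replica.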
Deployment Workflow for Stateful Hosts
Host Reservation System: Coordinates all maintenance activities, enforcing safety checks
Fleet Updater: Orchestrates deployments, patches, and hardware maintenance
Handles both successful and failed reservations, with built-in safety controls
Assumes every deployment will fail, designs for resilience and data restoration
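The reservation idea above can be sketched as a safety invariant: before the fleet updater may take hosts offline, the reservation system checks that enough replicas in each placement group would remain available. The group structure, threshold, and names are all assumptions for illustration.

```python
MIN_AVAILABLE_PER_GROUP = 2  # assumed safety threshold

def try_reserve(fleet, hosts_to_reserve):
    """fleet maps host -> (placement_group, available?); reserve atomically.

    Returns False (and changes nothing) if the reservation would leave any
    placement group below the minimum number of available hosts.
    """
    projected = {}
    for host, (group, available) in fleet.items():
        stays_up = available and host not in hosts_to_reserve
        projected[group] = projected.get(group, 0) + (1 if stays_up else 0)
    if any(count < MIN_AVAILABLE_PER_GROUP for count in projected.values()):
        return False  # denied: would violate the safety invariant
    for host in hosts_to_reserve:
        group, _ = fleet[host]
        fleet[host] = (group, False)  # mark reserved hosts unavailable
    return True

fleet = {f"host-{n}": ("group-a", True) for n in range(4)}
assert try_reserve(fleet, {"host-0"})                # 3 of 4 remain: allowed
assert not try_reserve(fleet, {"host-1", "host-2"})  # would leave 1: denied
```

Projecting the post-reservation state before mutating anything is what lets the check assume failure: a denied reservation leaves the fleet exactly as it was.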
Results and Impact
Operator escalations during deployments fell from 478 in 2022 to fewer than 10 in 2025
Techniques enable safe, scalable deployments for S3's massive, globally distributed infrastructure
Key Takeaways
Comprehensive testing (functional, model-based, performance) is crucial for deployment safety
Containing blast radius through gradual rollouts and canary monitoring is essential
Feature controls like flags and shadow deployments provide safety nets for application changes
Stateful deployments require additional considerations for data preservation and durability
Proactive threat modeling and automated safety checks can dramatically improve operational resilience