AWS re:Invent 2025 - Large-scale software deployments: Inside Amazon S3’s release pipeline (STG352)

Comprehensive Summary: Large-scale Software Deployments at Amazon S3

Introduction

Presenters: George Lewis and Wana
Goal: Discuss Amazon S3's best practices for safe and scalable software deployments
Assumptions: Audience has experience as developers and operators, with existing deployment pipelines and infrastructure as code

Deployment Safety Practices

Testing Infrastructure

Noodles: Behavior-driven test platform for S3's public APIs
- Allows distributed teams to write and run tests without implementing low-level step definitions
- Abstracts and shares test infrastructure (accounts, resources, etc.) across teams
- Converts tests to run as on-demand tests and continuous canaries in production
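
To make the idea of shared step definitions concrete, here is a minimal sketch rather than Noodles itself. It assumes a hypothetical step registry (`STEPS`, `step`, `run_scenario`, `run_canary`), and an in-memory dict stands in for S3. Test authors write only the scenario lines; the low-level steps are defined once, and the same scenario can be run on demand or repeatedly as a canary.

```python
import re

# Hypothetical shared step library: (compiled pattern, handler) pairs.
# None of these names come from the talk; they are illustrative only.
STEPS = []


def step(pattern):
    """Register a handler for every scenario line that matches `pattern`."""
    def register(fn):
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register


@step(r'a bucket named "(\w+)"')
def create_bucket(ctx, name):
    # An in-memory dict stands in for the real service in this sketch.
    ctx.setdefault("buckets", {})[name] = {}


@step(r'I put "(\w+)" with body "(\w+)" into "(\w+)"')
def put_object(ctx, key, body, bucket):
    ctx["buckets"][bucket][key] = body


@step(r'getting "(\w+)" from "(\w+)" returns "(\w+)"')
def check_get(ctx, key, bucket, expected):
    assert ctx["buckets"][bucket][key] == expected


def run_scenario(lines, ctx=None):
    """On-demand run: dispatch each line to the first matching shared step."""
    ctx = {} if ctx is None else ctx
    for line in lines:
        for pattern, handler in STEPS:
            match = pattern.search(line)
            if match:
                handler(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step definition for: {line}")
    return ctx


def run_canary(scenario, rounds):
    """Canary run: repeat the same scenario and record pass/fail per round."""
    results = []
    for _ in range(rounds):
        try:
            run_scenario(scenario)
            results.append(True)
        except (AssertionError, LookupError, KeyError):
            results.append(False)
    return results


SCENARIO = [
    'Given a bucket named "demo"',
    'When I put "k1" with body "hello" into "demo"',
    'Then getting "k1" from "demo" returns "hello"',
]
```

Calling `run_scenario(SCENARIO)` runs the test once, and `run_canary(SCENARIO, rounds=3)` reuses the identical scenario in a loop, which is the property the summary attributes to the platform.
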
Hi-Fi: Model-based testing system for comprehensive API coverage
- Starts with a precise executable model/specification of the API
- Test generator systematically explores all possible request combinations based on the model and real customer behavior
- Validator service checks service implementations against the model, reporting any deviations
- Runs in integration, non-prod regions, and select production regions to catch cross-partition issues
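
The model, generator, and validator roles described above can be illustrated with a toy key-value API. This is a sketch under stated assumptions, not the Hi-Fi system: `ModelStore`, `BuggyStore`, `generate_sequences`, and `validate` are hypothetical names, and the generator enumerates a small bounded request space exhaustively instead of drawing on real customer behavior.

```python
from itertools import product


class ModelStore:
    """Executable specification: the simplest correct behavior of the API."""

    def __init__(self):
        self.data = {}

    def apply(self, op, key):
        if op == "put":
            self.data[key] = "v"
            return "ok"
        if op == "get":
            return self.data.get(key, "NoSuchKey")
        if op == "delete":
            self.data.pop(key, None)
            return "ok"
        raise ValueError(f"unknown operation: {op}")


class BuggyStore(ModelStore):
    """Stand-in implementation with a seeded bug: delete does nothing."""

    def apply(self, op, key):
        if op == "delete":
            return "ok"
        return super().apply(op, key)


def generate_sequences(ops, keys, length):
    """Test generator: every sequence of `length` requests over ops x keys."""
    requests = list(product(ops, keys))
    return product(requests, repeat=length)


def validate(impl_factory, ops=("put", "get", "delete"), keys=("a", "b"), length=3):
    """Validator: replay each sequence on model and implementation, report deviations."""
    deviations = []
    for sequence in generate_sequences(ops, keys, length):
        model, impl = ModelStore(), impl_factory()
        for index, (op, key) in enumerate(sequence):
            expected = model.apply(op, key)
            actual = impl.apply(op, key)
            if expected != actual:
                # Keep the shortest prefix that reproduces the deviation.
                deviations.append((sequence[: index + 1], expected, actual))
                break
    return deviations
```

Here `validate(ModelStore)` reports no deviations, while `validate(BuggyStore)` reports every sequence in which a `get` follows a `delete` of a previously written key. The model stays small enough to review by hand, and the generator supplies the coverage.
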
Performance Testing
- Three phases: software feature performance, instance performance, and per-region ratings
- Measures performance at the micro level, then on representative hardware, and finally in production
- Provides data for capacity planning and redundancy decisions based on regional traffic profiles
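
As a worked illustration of how a per-host rating could feed a capacity and redundancy decision, the sketch below sizes a fleet so that it still serves peak traffic after losing a zone. The formula, the function name `hosts_needed`, and the idea of sizing against zone loss are assumptions made for illustration; the talk summary does not give S3's actual method or numbers.

```python
import math


def hosts_needed(peak_rps, rated_rps_per_host, zones=3, zones_lost=1):
    """Hosts to provision so that peak traffic still fits after losing zones.

    All parameters are hypothetical inputs: `rated_rps_per_host` stands for a
    measured per-host rating, `peak_rps` for a regional traffic peak.
    """
    if not 0 <= zones_lost < zones:
        raise ValueError("zones_lost must be in the range [0, zones)")
    # Hosts required to carry the peak with no failures at all.
    base = math.ceil(peak_rps / rated_rps_per_host)
    # The surviving zones must carry the whole peak on their own.
    per_zone = math.ceil(base / (zones - zones_lost))
    return per_zone * zones
```

For example, a 10,000 requests-per-second peak with hosts rated at 450 requests per second needs 23 hosts with no redundancy, and 36 hosts across three zones if any single zone may be lost.
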

Blast Radius Containment

S3 web server deployment pipeline:
- Pre-production testing (Noodles, Hi-Fi, performance)
- Validators (non-prod hosts in each region for canary traffic)
- First region deployment (starts with a single host, the "one box", and bakes for longer before continuing)
- us-east-1 deployment (largest region, varied traffic patterns)
- Exponential fan-out to remaining regions (4 regions, then all remaining)
- Continuous monitoring of key metrics and alarms during rollout
- Canary clients to detect silent failures before customer impact
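
The ordering of the stages above can be sketched as a wave plan plus a rollout loop that stops at the first unhealthy signal. This is a minimal sketch, not S3's pipeline: `plan_waves` and `roll_out` are hypothetical names, the split of the first region into a one-box wave and a full-region wave is an assumption, and `deploy` and `healthy` are callbacks the caller must supply.

```python
def plan_waves(first_region, largest_region, other_regions, fan_out=4):
    """Return (wave name, regions) pairs in deployment order."""
    waves = [
        ("validators", [first_region, largest_region, *other_regions]),
        ("one-box", [first_region]),
        ("first-region", [first_region]),
        ("largest-region", [largest_region]),
        ("fan-out", list(other_regions[:fan_out])),
    ]
    remaining = list(other_regions[fan_out:])
    if remaining:
        waves.append(("remaining", remaining))
    return waves


def roll_out(waves, deploy, healthy):
    """Deploy wave by wave; halt as soon as a health check fails.

    Returns (completed wave names, failing (wave, region) or None).
    """
    completed = []
    for name, regions in waves:
        for region in regions:
            deploy(name, region)
            # `healthy` stands in for metrics, alarms, and canary results.
            if not healthy(name, region):
                return completed, (name, region)
        completed.append(name)
    return completed, None
```

With region names such as `"region-a"` as placeholders, a health check that fails in the largest region leaves the fan-out and remaining waves untouched, which is the containment the staged order is meant to provide.
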

Application Feature Controls

Feature Flags: Use AWS AppConfig to dynamically enable/disable features
Allow/Deny Lists: Selectively enable features for internal testing and gradual rollout
Shadow Deployments: Copy production traffic to a parallel service instance for validation without impacting customers
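
The three controls above can be sketched together in a few lines. This is an illustration under assumptions, not the S3 implementation: `FeatureGate` and `handle_with_shadow` are hypothetical names, the flag value is held in memory here where a real service would fetch it from AWS AppConfig, and the precedence rule (deny beats allow beats the global flag) is a design choice made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureGate:
    """Feature flag combined with allow and deny lists (hypothetical)."""

    enabled: bool = False                     # global flag value
    allow: set = field(default_factory=set)   # accounts opted in early
    deny: set = field(default_factory=set)    # accounts always excluded

    def is_on(self, account_id):
        # Assumed precedence: deny list first, then global flag or allow list.
        if account_id in self.deny:
            return False
        return self.enabled or account_id in self.allow


def handle_with_shadow(request, primary, shadow, mismatches):
    """Serve from `primary`; replay the request on `shadow` and compare.

    The caller always gets the primary response, so a shadow failure or
    disagreement is recorded but never reaches the customer.
    """
    response = primary(request)
    try:
        if shadow(request) != response:
            mismatches.append(request)
    except Exception:
        mismatches.append(request)
    return response
```

A gate created as `FeatureGate(allow={"internal-test"})` is on only for that account until `enabled` is flipped, which matches the internal-testing-then-gradual-rollout sequence. In `handle_with_shadow`, the shadow call runs inline for simplicity; a production system would more likely copy traffic asynchronously.
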

Stateful Deployments for Durable Storage

Data Preservation Considerations

Integrity: Ensure data accuracy and trustworthiness over time
Consistency: Maintain a single, up-to-date view of data across all clients
Resiliency: Ability to recover from data loss or corruption
Durability: Protect data from loss or corruption long-term

Durability Threat Modeling

Identify all possible ways data could be compromised (hardware failures, software bugs, operator errors)
Implement mitigation strategies like redundancy, validation tools, and process automation

Deployment Workflow for Stateful Hosts

Host Reservation System: Coordinates all maintenance activities, enforcing safety checks
Fleet Updater: Orchestrates deployments, patches, and hardware maintenance
- Handles both successful and failed reservations, with built-in safety controls
Design principle: assume every deployment will fail, and design for resilience and data restoration
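
The interaction between a reservation system and a fleet updater can be sketched as follows. This is a minimal sketch under assumptions, not the S3 design: `ReservationSystem` and `update_fleet` are hypothetical names, the only safety check shown is a cap on concurrently reserved hosts per zone, and `deploy` and `rollback` are caller-supplied callbacks. The updater handles both granted and denied reservations and treats a deployment failure as an expected path.

```python
from collections import deque


class ReservationSystem:
    """Grants a host reservation only if the safety check passes (hypothetical)."""

    def __init__(self, max_reserved_per_zone=1):
        self.max_reserved_per_zone = max_reserved_per_zone
        self.reserved = {}  # host -> zone, shared by all maintenance activities

    def reserve(self, host, zone):
        in_zone = sum(1 for z in self.reserved.values() if z == zone)
        if host in self.reserved or in_zone >= self.max_reserved_per_zone:
            return False
        self.reserved[host] = zone
        return True

    def release(self, host):
        self.reserved.pop(host, None)


def update_fleet(hosts, reservations, deploy, rollback, max_attempts=3):
    """Update (host, zone) pairs one at a time; return the outcome per host."""
    queue = deque((host, zone, 1) for host, zone in hosts)
    outcome = {}
    while queue:
        host, zone, attempt = queue.popleft()
        if not reservations.reserve(host, zone):
            # Denied reservation: retry later, or give up after max_attempts.
            if attempt < max_attempts:
                queue.append((host, zone, attempt + 1))
            else:
                outcome[host] = "skipped"
            continue
        try:
            deploy(host)
            outcome[host] = "updated"
        except Exception:
            # Failure is the assumed case: restore the host before moving on.
            rollback(host)
            outcome[host] = "rolled_back"
        finally:
            reservations.release(host)
    return outcome
```

Because every maintenance activity goes through the same `reserve` call, a host already held for hardware work blocks a software deployment in that zone. The updater then skips or retries the host and does not proceed unsafely.
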

Results and Impact

Operator escalations during deployments fell from 478 in 2022 to fewer than 10 in 2025
Techniques enable safe, scalable deployments for S3's massive, globally distributed infrastructure

Key Takeaways

Comprehensive testing (functional, model-based, performance) is crucial for deployment safety
Containing blast radius through gradual rollouts and canary monitoring is essential
Feature controls like flags and shadow deployments provide safety nets for application changes
Stateful deployments require additional considerations for data preservation and durability
Proactive threat modeling and automated safety checks can dramatically improve operational resilience
