AWS re:Invent 2025 - Deep dive on Amazon S3 (STG407)
Designing for Availability in Amazon S3
Defining Availability and Failure
In a storage service like Amazon S3, availability is fundamentally about dealing with failure
Failure can occur at various levels:
Individual hard drive failures
Server failures
Rack failures
Availability zone outages
S3 manages tens of millions of hard drives across millions of servers in 120 availability zones across 38 regions
Failures can be permanent (component loss) or transient (power, networking, overload issues)
S3's Design Goals for Availability
S3 is designed for 99.99% availability and eleven nines (99.999999999%) of durability
Prior to 2020, S3 did not guarantee read-after-write consistency, so returning stale (eventually consistent) reads was an acceptable way to tolerate failures
The key was to have multiple servers that could handle requests, with allowance for some failures
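To make the 99.99% availability target above concrete, a quick back-of-the-envelope calculation (not from the talk) shows how much downtime that budget allows per year:

```python
# Back-of-the-envelope: downtime allowed by 99.99% availability.
minutes_per_year = 365 * 24 * 60          # 525,600 minutes in a year
downtime_minutes = minutes_per_year * (1 - 0.9999)
print(round(downtime_minutes, 1))         # ~52.6 minutes of downtime per year
```

A four-nines service can therefore be unavailable for less than an hour per year in total, which is why tolerating individual component failures without any customer-visible impact is essential.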
Quorum-Based Indexing System
S3's indexing subsystem holds object metadata and is accessed for every data plane request
The indexing data is stored across a set of replicas using a quorum-based algorithm:
Servers are spread across separate availability zones to avoid correlated failures
Reads and writes only require a majority of servers to succeed, allowing for some failures
This quorum-based design provides high availability, as long as a majority of servers are available
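The majority-quorum idea above can be sketched in a few lines of Python. This is an illustrative toy, not S3's implementation; names like `Replica`, `quorum_write`, and `quorum_read` are assumptions made for the example:

```python
class Replica:
    """Toy index replica holding a versioned value (illustrative only)."""
    def __init__(self):
        self.version = 0
        self.value = None
        self.alive = True

    def write(self, version, value):
        if not self.alive:
            raise ConnectionError("replica down")
        if version > self.version:
            self.version, self.value = version, value

    def read(self):
        if not self.alive:
            raise ConnectionError("replica down")
        return self.version, self.value

def quorum_write(replicas, version, value):
    # The write succeeds if a majority of replicas acknowledge it.
    acks = 0
    for r in replicas:
        try:
            r.write(version, value)
            acks += 1
        except ConnectionError:
            pass
    return acks > len(replicas) // 2

def quorum_read(replicas):
    # Read from all reachable replicas; if a majority respond, the
    # highest version among them includes the latest committed write.
    responses = []
    for r in replicas:
        try:
            responses.append(r.read())
        except ConnectionError:
            pass
    if len(responses) <= len(replicas) // 2:
        raise RuntimeError("quorum not reached")
    return max(responses)  # (version, value) with the highest version

replicas = [Replica() for _ in range(5)]
quorum_write(replicas, 1, "metadata-v1")
replicas[0].alive = False   # one replica fails...
replicas[1].alive = False   # ...and a second one fails
print(quorum_read(replicas))  # still returns (1, 'metadata-v1')
```

With five replicas spread across availability zones, any two can fail (even a whole zone) and both reads and writes still reach a majority.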
Caching and Consistency Issues
S3 heavily caches frequently accessed objects in the front-end servers
However, this caching design could produce inconsistent reads, because a read did not necessarily land on the same cache node as the preceding write
A read that hit a different cache node than the one that handled the most recent write could return stale data
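The stale-read problem can be illustrated with a toy model of two independent front-end caches (hypothetical names, not S3's implementation):

```python
storage = {}               # authoritative index store
caches = [dict(), dict()]  # two independent front-end cache nodes

def write(key, value):
    storage[key] = value
    caches[0][key] = value  # only the node that handled the write updates its cache

def read(node, key):
    cache = caches[node]
    if key not in cache:
        cache[key] = storage[key]  # cache miss: fetch from storage
    return cache[key]

write("obj", "v1")
read(1, "obj")          # node 1 now caches "v1"
write("obj", "v2")      # the overwrite lands on node 0
print(read(1, "obj"))   # prints "v1" -- node 1 serves its stale copy
```

Without some signal telling node 1 that its entry is outdated, it has no reason to go back to storage, which is exactly the gap the journal-and-witness design closes.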
Implementing Read-After-Write Consistency
To provide read-after-write consistency, S3 introduced a replicated journal and a witness system:
The journal establishes a well-defined ordering of mutations
The witness tracks the high watermark for writes to the index
This allows the cache nodes to determine if a cached value is stale and needs to be fetched from storage
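The witness check described above can be sketched as follows. This is a simplified illustration under assumed names (`witness`, `journal`, `CacheNode`), not S3's actual code:

```python
witness = {}   # key -> highest sequence number written (the high watermark)
journal = []   # ordered log of (seq, key, value) mutations
storage = {}   # index storage, updated as journal entries are applied

def apply_write(key, value):
    # The journal assigns each mutation a position in a total order.
    seq = len(journal) + 1
    journal.append((seq, key, value))
    witness[key] = seq
    storage[key] = (seq, value)

class CacheNode:
    def __init__(self):
        self.cache = {}  # key -> (seq, value)

    def read(self, key):
        cached = self.cache.get(key)
        watermark = witness.get(key, 0)
        if cached is None or cached[0] < watermark:
            # Cached entry is missing or older than the high watermark:
            # it is stale, so refetch from storage.
            self.cache[key] = storage[key]
        return self.cache[key][1]

node = CacheNode()
apply_write("obj", "v1")
print(node.read("obj"))  # "v1", now cached
apply_write("obj", "v2")
print(node.read("obj"))  # "v2": the witness check forced a refetch
```

Because every read consults the witness's high watermark before trusting a cached entry, a cache node can never serve a value older than the latest acknowledged write, which is what makes read-after-write consistency hold.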
Maintaining Availability with Consistency
The journal-based design initially sacrificed the failure tolerance that the quorum-based system had provided
To regain this, S3 introduced dynamic reconfiguration of the journal nodes:
Nodes monitor each other's availability and trigger reconfigurations when failures are detected
The configuration system itself is quorum-based, ensuring high availability
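A minimal sketch of that reconfiguration loop, under assumed names (`ConfigStore`, `detect_and_reconfigure`) and with a much-simplified failure detector:

```python
class ConfigStore:
    """Stands in for the quorum-backed configuration system: it accepts a
    new (epoch, member set) only if the epoch is higher than the current one."""
    def __init__(self, members):
        self.epoch = 0
        self.members = set(members)

    def propose(self, epoch, members):
        if epoch > self.epoch:
            self.epoch, self.members = epoch, set(members)
            return True
        return False  # stale proposal rejected

def detect_and_reconfigure(store, last_heartbeat, now, timeout=5.0):
    # Journal nodes monitor each other via heartbeats; any member silent
    # for longer than `timeout` is dropped from the next configuration.
    healthy = {m for m, t in last_heartbeat.items() if now - t <= timeout}
    if healthy != store.members:
        store.propose(store.epoch + 1, healthy)

store = ConfigStore({"j1", "j2", "j3"})
heartbeats = {"j1": 99.0, "j2": 99.5, "j3": 90.0}  # j3 silent for 10s
detect_and_reconfigure(store, heartbeats, now=100.0)
print(store.epoch, sorted(store.members))  # epoch advanced; j3 removed
```

The epoch check is the essential detail: because the configuration store only ever moves forward, two nodes racing to reconfigure cannot install conflicting member sets.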
Failure Handling at the Implementation Level
Correlated failures, where multiple components fail together, are a key concern for availability
Examples include entire racks, availability zones, or software deployments failing together
S3 designs for this by exposing workloads to multiple failure domains, ensuring data is replicated across different failure boundaries
Failure modes include fail-stop (complete failure) and "gray" failures (partial failures, overload, etc.)
Techniques used to handle gray failures include retries, timeouts, and queue management to avoid "congestive collapse"
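Two of the defenses listed above, bounded retries with per-attempt timeouts and a bounded work queue that sheds load, can be sketched like this (illustrative only; the helper names are assumptions):

```python
import queue

def call_with_retries(fn, attempts=3, timeout=1.0):
    # Retry a flaky call a bounded number of times; each attempt gets its
    # own timeout so a gray-failing dependency cannot stall the caller.
    last_err = None
    for _ in range(attempts):
        try:
            return fn(timeout=timeout)
        except TimeoutError as err:
            last_err = err  # transient failure: try again
    raise last_err

attempts_seen = {"n": 0}
def flaky(timeout):
    # Simulated gray failure: times out twice, then succeeds.
    attempts_seen["n"] += 1
    if attempts_seen["n"] < 3:
        raise TimeoutError("slow dependency")
    return "ok"

result = call_with_retries(flaky)
print(result)  # "ok" on the third attempt

work = queue.Queue(maxsize=100)  # bounded queue: shed load when full

def enqueue(request):
    try:
        work.put_nowait(request)
        return True
    except queue.Full:
        return False  # reject early rather than build unbounded backlog
```

The bounded queue is what prevents congestive collapse: rejecting excess work immediately keeps latency for accepted requests stable, whereas an unbounded backlog makes every request slow and retries amplify the overload.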
Automatic Healing and Health Monitoring
S3 uses health checks from multiple perspectives (regions, internet) to get a holistic view of system health
Local decisions about component health are avoided, as a global rate limiter coordinates remediation actions
This ensures that a failing health check service itself does not cause widespread damage by making incorrect local decisions
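The global rate limiter's role can be sketched as a simple remediation budget. This is a hypothetical illustration (the class and field names are assumptions), showing how a bounded budget stops a misbehaving health checker from draining the fleet:

```python
class RemediationLimiter:
    """Globally caps how many hosts may be taken out of service per window."""
    def __init__(self, max_actions_per_window):
        self.budget = max_actions_per_window

    def try_remediate(self, host):
        if self.budget <= 0:
            return False  # budget exhausted: no further removals this window
        self.budget -= 1
        return True       # host may be taken out of service

limiter = RemediationLimiter(max_actions_per_window=2)
flagged = ["h1", "h2", "h3", "h4"]  # a faulty checker flags many hosts at once
acted = [h for h in flagged if limiter.try_remediate(h)]
print(acted)  # only the first two are removed; the rest wait for review
```

Even if the health-check service itself fails and reports the whole fleet as unhealthy, the global budget caps the blast radius of any remediation decision.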
Key Takeaways
Availability is designed into S3 at the system architecture and implementation levels
Quorum-based algorithms, replicated journals, and dynamic reconfiguration are used to maintain availability
Handling correlated failures and "gray" failures are key challenges, addressed through replication, retries, timeouts, and global coordination
Automatic healing and health monitoring, avoiding local decisions, are critical for a highly available, self-healing system at scale