AWS re:Invent 2025 - Deep dive on Amazon S3 (STG407)
Designing for Availability in Amazon S3
Defining Availability and Failure
In a storage service like Amazon S3, availability is fundamentally about dealing with failure
Failure can occur at various levels:
Individual hard drive failures
Server failures
Rack failures
Availability zone outages
S3 manages tens of millions of hard drives across millions of servers in 120 availability zones across 38 regions
Failures can be permanent (component loss) or transient (power, networking, overload issues)
S3's Design Goals for Availability
S3 is designed for 99.99% availability and eleven nines (99.999999999%) of durability
Prior to 2020, S3 did not guarantee read-after-write consistency, so returning stale (eventually consistent) reads was an acceptable way to tolerate failures
The key was to have multiple servers that could handle requests, with allowance for some failures
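To make the 99.99% availability target above concrete, a quick back-of-the-envelope calculation (not from the talk) shows how much downtime that budget allows per year:

```python
# Back-of-the-envelope: downtime allowed by 99.99% availability.
minutes_per_year = 365 * 24 * 60          # 525,600 minutes in a year
downtime_minutes = minutes_per_year * (1 - 0.9999)
print(round(downtime_minutes, 1))         # ~52.6 minutes of downtime per year
```

A four-nines service can therefore be unavailable for less than an hour per year in total, which is why tolerating individual component failures without any customer-visible impact is essential.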
Quorum-Based Indexing System
S3's indexing subsystem holds object metadata and is accessed for every data plane request
The indexing data is stored across a set of replicas using a quorum-based algorithm:
Servers are spread across separate availability zones to avoid correlated failures
Reads and writes only require a majority of servers to succeed, allowing for some failures
This quorum-based design provides high availability, as long as a majority of servers are available
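The majority-quorum idea above can be sketched in a few lines of Python. This is an illustrative toy, not S3's implementation; names like `Replica`, `quorum_write`, and `quorum_read` are assumptions made for the example:

```python
class Replica:
    """Toy index replica holding a versioned value (illustrative only)."""
    def __init__(self):
        self.version = 0
        self.value = None
        self.alive = True

    def write(self, version, value):
        if not self.alive:
            raise ConnectionError("replica down")
        if version > self.version:
            self.version, self.value = version, value

    def read(self):
        if not self.alive:
            raise ConnectionError("replica down")
        return self.version, self.value

def quorum_write(replicas, version, value):
    # The write succeeds if a majority of replicas acknowledge it.
    acks = 0
    for r in replicas:
        try:
            r.write(version, value)
            acks += 1
        except ConnectionError:
            pass
    return acks > len(replicas) // 2

def quorum_read(replicas):
    # Read from all reachable replicas; if a majority respond, the
    # highest version among them includes the latest committed write.
    responses = []
    for r in replicas:
        try:
            responses.append(r.read())
        except ConnectionError:
            pass
    if len(responses) <= len(replicas) // 2:
        raise RuntimeError("quorum not reached")
    return max(responses)  # (version, value) with the highest version

replicas = [Replica() for _ in range(5)]
quorum_write(replicas, 1, "metadata-v1")
replicas[0].alive = False   # one replica fails...
replicas[1].alive = False   # ...and a second one fails
print(quorum_read(replicas))  # still returns (1, 'metadata-v1')
```

With five replicas spread across availability zones, any two can fail (even a whole zone) and both reads and writes still reach a majority.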
Caching and Consistency Issues
S3 heavily caches frequently accessed objects in the front-end servers
However, this caching design could produce inconsistent reads, because a read did not necessarily land on the same cache node as the preceding write
A read that hit a different cache node than the one that handled the most recent write could return stale data
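The stale-read problem can be illustrated with a toy model of two independent front-end caches (hypothetical names, not S3's implementation):

```python
storage = {}               # authoritative index store
caches = [dict(), dict()]  # two independent front-end cache nodes

def write(key, value):
    storage[key] = value
    caches[0][key] = value  # only the node that handled the write updates its cache

def read(node, key):
    cache = caches[node]
    if key not in cache:
        cache[key] = storage[key]  # cache miss: fetch from storage
    return cache[key]

write("obj", "v1")
read(1, "obj")          # node 1 now caches "v1"
write("obj", "v2")      # the overwrite lands on node 0
print(read(1, "obj"))   # prints "v1" -- node 1 serves its stale copy
```

Without some signal telling node 1 that its entry is outdated, it has no reason to go back to storage, which is exactly the gap the journal-and-witness design closes.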
Implementing Read-After-Write Consistency
To provide read-after-write consistency, S3 introduced a replicated journal and a witness system:
The journal establishes a well-defined ordering of mutations
The witness tracks the high watermark for writes to the index
This allows the cache nodes to determine if a cached value is stale and needs to be fetched from storage
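The witness check described above can be sketched as follows. This is a simplified illustration under assumed names (`witness`, `journal`, `CacheNode`), not S3's actual code:

```python
witness = {}   # key -> highest sequence number written (the high watermark)
journal = []   # ordered log of (seq, key, value) mutations
storage = {}   # index storage, updated as journal entries are applied

def apply_write(key, value):
    # The journal assigns each mutation a position in a total order.
    seq = len(journal) + 1
    journal.append((seq, key, value))
    witness[key] = seq
    storage[key] = (seq, value)

class CacheNode:
    def __init__(self):
        self.cache = {}  # key -> (seq, value)

    def read(self, key):
        cached = self.cache.get(key)
        watermark = witness.get(key, 0)
        if cached is None or cached[0] < watermark:
            # Cached entry is missing or older than the high watermark:
            # it is stale, so refetch from storage.
            self.cache[key] = storage[key]
        return self.cache[key][1]

node = CacheNode()
apply_write("obj", "v1")
print(node.read("obj"))  # "v1", now cached
apply_write("obj", "v2")
print(node.read("obj"))  # "v2": the witness check forced a refetch
```

Because every read consults the witness's high watermark before trusting a cached entry, a cache node can never serve a value older than the latest acknowledged write, which is what makes read-after-write consistency hold.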
Maintaining Availability with Consistency
The journal-based design initially sacrificed the failure tolerance that the quorum-based system had provided
To regain this, S3 introduced dynamic reconfiguration of the journal nodes:
Nodes monitor each other's availability and trigger reconfigurations when failures are detected
The configuration system itself is quorum-based, ensuring high availability
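A minimal sketch of that reconfiguration loop, under assumed names (`ConfigStore`, `detect_and_reconfigure`) and with a much-simplified failure detector:

```python
class ConfigStore:
    """Stands in for the quorum-backed configuration system: it accepts a
    new (epoch, member set) only if the epoch is higher than the current one."""
    def __init__(self, members):
        self.epoch = 0
        self.members = set(members)

    def propose(self, epoch, members):
        if epoch > self.epoch:
            self.epoch, self.members = epoch, set(members)
            return True
        return False  # stale proposal rejected

def detect_and_reconfigure(store, last_heartbeat, now, timeout=5.0):
    # Journal nodes monitor each other via heartbeats; any member silent
    # for longer than `timeout` is dropped from the next configuration.
    healthy = {m for m, t in last_heartbeat.items() if now - t <= timeout}
    if healthy != store.members:
        store.propose(store.epoch + 1, healthy)

store = ConfigStore({"j1", "j2", "j3"})
heartbeats = {"j1": 99.0, "j2": 99.5, "j3": 90.0}  # j3 silent for 10s
detect_and_reconfigure(store, heartbeats, now=100.0)
print(store.epoch, sorted(store.members))  # epoch advanced; j3 removed
```

The epoch check is the essential detail: because the configuration store only ever moves forward, two nodes racing to reconfigure cannot install conflicting member sets.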
Failure Handling at the Implementation Level
Correlated failures, where multiple components fail together, are a key concern for availability
Examples include entire racks, availability zones, or software deployments failing together
S3 designs for this by exposing workloads to multiple failure domains, ensuring data is replicated across different failure boundaries
Failure modes include fail-stop (complete failure) and "gray" failures (partial failures, overload, etc.)
Techniques used to handle gray failures include retries, timeouts, and queue management to avoid "congestive collapse"
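Two of the defenses listed above, bounded retries with per-attempt timeouts and a bounded work queue that sheds load, can be sketched like this (illustrative only; the helper names are assumptions):

```python
import queue

def call_with_retries(fn, attempts=3, timeout=1.0):
    # Retry a flaky call a bounded number of times; each attempt gets its
    # own timeout so a gray-failing dependency cannot stall the caller.
    last_err = None
    for _ in range(attempts):
        try:
            return fn(timeout=timeout)
        except TimeoutError as err:
            last_err = err  # transient failure: try again
    raise last_err

attempts_seen = {"n": 0}
def flaky(timeout):
    # Simulated gray failure: times out twice, then succeeds.
    attempts_seen["n"] += 1
    if attempts_seen["n"] < 3:
        raise TimeoutError("slow dependency")
    return "ok"

result = call_with_retries(flaky)
print(result)  # "ok" on the third attempt

work = queue.Queue(maxsize=100)  # bounded queue: shed load when full

def enqueue(request):
    try:
        work.put_nowait(request)
        return True
    except queue.Full:
        return False  # reject early rather than build unbounded backlog
```

The bounded queue is what prevents congestive collapse: rejecting excess work immediately keeps latency for accepted requests stable, whereas an unbounded backlog makes every request slow and retries amplify the overload.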
Automatic Healing and Health Monitoring
S3 uses health checks from multiple perspectives (regions, internet) to get a holistic view of system health
Local decisions about component health are avoided, as a global rate limiter coordinates remediation actions
This ensures that a failing health check service itself does not cause widespread damage by making incorrect local decisions
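The global rate limiter's role can be sketched as a simple remediation budget. This is a hypothetical illustration (the class and field names are assumptions), showing how a bounded budget stops a misbehaving health checker from draining the fleet:

```python
class RemediationLimiter:
    """Globally caps how many hosts may be taken out of service per window."""
    def __init__(self, max_actions_per_window):
        self.budget = max_actions_per_window

    def try_remediate(self, host):
        if self.budget <= 0:
            return False  # budget exhausted: no further removals this window
        self.budget -= 1
        return True       # host may be taken out of service

limiter = RemediationLimiter(max_actions_per_window=2)
flagged = ["h1", "h2", "h3", "h4"]  # a faulty checker flags many hosts at once
acted = [h for h in flagged if limiter.try_remediate(h)]
print(acted)  # only the first two are removed; the rest wait for review
```

Even if the health-check service itself fails and reports the whole fleet as unhealthy, the global budget caps the blast radius of any remediation decision.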
Key Takeaways
Availability is designed into S3 at the system architecture and implementation levels
Quorum-based algorithms, replicated journals, and dynamic reconfiguration are used to maintain availability
Handling correlated failures and "gray" failures are key challenges, addressed through replication, retries, timeouts, and global coordination
Automatic healing and health monitoring, avoiding local decisions, are critical for a highly available, self-healing system at scale