AWS re:Invent 2025 - Deep dive on Amazon S3 (STG407)

Designing for Availability in Amazon S3

Defining Availability and Failure

  • Availability is about dealing with failure in a storage service like Amazon S3
  • Failure can occur at various levels:
    • Individual hard drive failures
    • Server failures
    • Rack failures
    • Availability zone outages
  • S3 manages tens of millions of hard drives across millions of servers in 120 availability zones across 38 regions
  • Failures can be permanent (component loss) or transient (power, networking, overload issues)

S3's Design Goals for Availability

  • S3 is designed for 99.99% availability and 99.999999999% (11 nines) durability
  • Until 2020, S3 did not provide strong read-after-write consistency, so relaxing consistency was an acceptable way to deal with failures
  • The key was to have multiple servers that could handle requests, with allowance for some failures
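
To make the availability target concrete, the arithmetic below converts an availability percentage into a yearly unavailability budget. It is only an illustrative sketch: it assumes the target is measured over a 365-day year, and the function name `downtime_minutes_per_year` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly unavailability budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# A 99.99% target leaves a budget of roughly 52.56 minutes per year.
print(round(downtime_minutes_per_year(0.9999), 2))
```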

Quorum-Based Indexing System

  • S3's indexing subsystem holds object metadata and is accessed for every data plane request
  • The indexing data is stored across a set of replicas using a quorum-based algorithm:
    • Servers are spread across separate availability zones to avoid correlated failures
    • Reads and writes only require a majority of servers to succeed, allowing for some failures
  • This quorum-based design provides high availability, as long as a majority of servers are available
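
The majority rule described above can be sketched as follows. This is a minimal single-process illustration, not S3's actual protocol: `Replica`, `QuorumStore`, and `QuorumUnavailable` are hypothetical names, replicas are in-memory objects with an `up` flag that simulates failures, and concurrent writers are ignored.

```python
from dataclasses import dataclass, field


class QuorumUnavailable(Exception):
    """Raised when fewer than a majority of replicas can be reached."""


@dataclass
class Replica:
    """Hypothetical in-memory replica; `up` simulates availability."""
    up: bool = True
    store: dict = field(default_factory=dict)  # key -> (version, value)


class QuorumStore:
    def __init__(self, replicas):
        self.replicas = replicas
        self.majority = len(replicas) // 2 + 1

    def _live(self):
        # Succeed only if at least a majority of replicas respond.
        live = [r for r in self.replicas if r.up]
        if len(live) < self.majority:
            raise QuorumUnavailable(f"only {len(live)} of {len(self.replicas)} replicas up")
        return live

    def write(self, key, value):
        live = self._live()
        # Next version is one more than the highest version seen on the live replicas.
        version = 1 + max(r.store.get(key, (0, None))[0] for r in live)
        for r in live:
            r.store[key] = (version, value)
        return version

    def read(self, key):
        live = self._live()
        # Any two majorities share a replica, so the highest version is the latest write.
        _, value = max((r.store.get(key, (0, None)) for r in live), key=lambda t: t[0])
        return value
```

With three replicas, a write that succeeds while one replica is down is still visible to a later read that reaches a different pair of replicas, because the two majorities share at least one member.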

Caching and Consistency Issues

  • S3 heavily caches frequently accessed objects in the front-end servers
  • However, this caching design led to inconsistent reads, because a read served from a cache was not guaranteed to overlap with the most recent write
  • Reads could return stale data if they hit a different cache node than the most recent write

Implementing Read-After-Write Consistency

  • To provide read-after-write consistency, S3 introduced a replicated journal and a witness system:
    • The journal establishes a well-defined ordering of mutations
    • The witness tracks the high watermark for writes to the index
  • This allows the cache nodes to determine if a cached value is stale and needs to be fetched from storage
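
The staleness check can be sketched as below. This is a simplified model under stated assumptions rather than S3's implementation: the classes `Journal`, `Witness`, `Index`, and `CacheNode` are hypothetical, sequence numbers are plain integers, and the witness is assumed to keep one high watermark per key.

```python
class Journal:
    """Hypothetical journal: gives every mutation the next sequence number."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))
        return len(self.entries)  # 1-based sequence number


class Witness:
    """Tracks the highest sequence number written for each key (assumed per-key)."""

    def __init__(self):
        self.high_watermark = {}

    def record(self, key, seq):
        self.high_watermark[key] = max(seq, self.high_watermark.get(key, 0))

    def latest(self, key):
        return self.high_watermark.get(key, 0)


class Index:
    def __init__(self):
        self.journal = Journal()
        self.witness = Witness()
        self.storage = {}  # key -> (seq, value)

    def put(self, key, value):
        seq = self.journal.append(key, value)  # ordering is fixed by the journal
        self.storage[key] = (seq, value)
        self.witness.record(key, seq)
        return seq


class CacheNode:
    def __init__(self, index):
        self.index = index
        self.cache = {}  # key -> (seq, value)
        self.storage_fetches = 0

    def get(self, key):
        cached = self.cache.get(key)
        # Cached entry is stale if the witness has seen a newer write for this key.
        if cached is None or cached[0] < self.index.witness.latest(key):
            self.storage_fetches += 1
            cached = self.index.storage[key]  # raises KeyError for unknown keys
            self.cache[key] = cached
        return cached[1]
```

In this model a cache node pays one cheap witness lookup per read and goes back to storage only when its cached sequence number is behind the high watermark.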

Maintaining Availability with Consistency

  • The journal-based design initially lost the failure allowance provided by the quorum-based system
  • To regain this, S3 introduced dynamic reconfiguration of the journal nodes:
    • Nodes monitor each other's availability and trigger reconfigurations when failures are detected
    • The configuration system itself is quorum-based, ensuring high availability
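
One way to picture the reconfiguration step is shown below. The source does not describe the mechanism in detail, so everything here is an assumption made for illustration: the `Config` type, the epoch counter, the `reconfigure` function, and the idea of swapping a failed node for a spare once a majority of configuration voters acknowledge the change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical journal configuration: an epoch plus the member nodes."""
    epoch: int
    members: tuple


def reconfigure(current: Config, failed: str, spare: str, acks: int, voters: int) -> Config:
    """Replace `failed` with `spare` if a majority of configuration voters acknowledged."""
    if failed not in current.members:
        return current  # nothing to replace
    if acks < voters // 2 + 1:
        return current  # no quorum: keep the old configuration
    members = tuple(spare if m == failed else m for m in current.members)
    return Config(epoch=current.epoch + 1, members=members)
```

Bumping the epoch on every accepted change gives nodes a simple way to tell an old configuration from a new one in this sketch.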

Failure Handling at the Implementation Level

  • Correlated failures, where multiple components fail together, are a key concern for availability
    • Examples include entire racks, availability zones, or software deployments failing together
  • S3 designs for this by spreading each workload across multiple failure domains, so that data is replicated across different failure boundaries
  • Failure modes include fail-stop (complete failure) and "gray" failures (partial failures, overload, etc.)
  • Techniques used to handle gray failures include retries, timeouts, and queue management to avoid "congestive collapse"
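
The techniques listed above can be sketched generically as follows. This is a common pattern, not S3's code: `call_with_retries` and `BoundedQueue` are hypothetical names, the operation is assumed to enforce its own timeout and raise `TimeoutError`, and the backoff uses capped exponential delays with jitter so that retries do not add to an overload.

```python
import random
import time
from collections import deque


def call_with_retries(op, attempts=3, base_delay=0.05, max_delay=1.0,
                      sleep=time.sleep, rng=random.random):
    """Call op() up to `attempts` times, backing off with jitter between tries."""
    for attempt in range(attempts):
        try:
            return op()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the failure
            # Capped exponential backoff, scaled by a random factor in [0, 1).
            sleep(rng() * min(max_delay, base_delay * 2 ** attempt))


class BoundedQueue:
    """Rejects new work when full instead of letting the backlog grow without bound."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def offer(self, item) -> bool:
        if len(self.items) >= self.capacity:
            return False  # shed load; the caller can fail fast or retry later
        self.items.append(item)
        return True

    def take(self):
        return self.items.popleft()
```

Bounding both the number of retries and the queue length keeps a slow dependency from turning into an ever-growing backlog, which is the failure pattern the term "congestive collapse" refers to.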

Automatic Healing and Health Monitoring

  • S3 uses health checks from multiple perspectives (regions, internet) to get a holistic view of system health
  • Components do not act on purely local health decisions; instead, a global rate limiter coordinates remediation actions
  • This ensures that a failing health check service itself does not cause widespread damage by making incorrect local decisions
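
The effect of a global limit on remediation can be sketched as below. The names `RemediationLimiter` and `remediate`, the sliding-window policy, and the specific limits are assumptions for illustration; the source only states that a global rate limiter coordinates remediation actions.

```python
from collections import deque


class RemediationLimiter:
    """Allows at most `max_actions` remediation actions per sliding time window."""

    def __init__(self, max_actions, window_seconds):
        self.max_actions = max_actions
        self.window = window_seconds
        self.timestamps = deque()

    def try_acquire(self, now):
        # Forget actions that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_actions:
            return False
        self.timestamps.append(now)
        return True


def remediate(unhealthy_hosts, limiter, now):
    """Return the hosts that may actually be taken out of service right now."""
    return [host for host in unhealthy_hosts if limiter.try_acquire(now)]
```

If a faulty health checker reports every host as unhealthy, a limiter like this caps how many hosts are removed in any window, so the bad signal cannot take down the fleet on its own.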

Key Takeaways

  • Availability is designed into S3 at the system architecture and implementation levels
  • Quorum-based algorithms, replicated journals, and dynamic reconfiguration are used to maintain availability
  • Correlated failures and "gray" failures are key challenges, addressed through replication, retries, timeouts, and global coordination
  • Automatic healing and health monitoring that avoid purely local decisions are critical for a highly available, self-healing system at scale
