AWS re:Invent 2025 - DynamoDB: Resilience & Lessons from the Oct 2025 Service Disruption (DAT453)
Priorities for AWS Services
Security, durability, availability, and latency are the priorities for AWS services like DynamoDB, in that order.
The focus is on the things that won't change in the next 10 years, as those are the areas that require the most effort.
The October 2025 DynamoDB Outage
On October 20th, 2025, DynamoDB did not meet its availability standards in the us-east-1 (N. Virginia) Region.
Customers noticed the impact, as DynamoDB returned zero IP addresses from DNS, preventing customers from connecting.
A formal postmortem process, called a Correction of Error (COE), was conducted to learn from the incident.
How DynamoDB Uses DNS
DynamoDB's use of DNS has become more complex as the service has scaled.
Examples of increased complexity include transitioning from a single load balancer to multiple load balancers, and mixing different instance types.
DynamoDB now has hundreds of load balancers per region, each with hundreds of instances, requiring sophisticated DNS management.
DynamoDB's DNS Management System
DynamoDB's DNS management system has two components: a planner and an enactor.
The planner analyzes the state of the region and generates a plan for the desired DNS configuration.
The enactor consumes the plan and configures Route 53 accordingly.
To coordinate the enactors and avoid race conditions, a locking mechanism using Route 53 text records was implemented.
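The talk did not detail how the TXT-record lock works internally, but a common shape for such a lock is a lease: a record holding an owner and an expiry, updated only via a conditional (compare-and-swap) write. The sketch below is a hypothetical toy model of that pattern; the class names and the CAS primitive are illustrative assumptions, not the real Route 53 API (Route 53 has no native compare-and-swap, so a real system must layer that on top).

```python
import time

class TxtLockStore:
    """Toy stand-in for a TXT record that supports conditional
    (compare-and-swap) updates. Purely illustrative."""
    def __init__(self):
        self.value = None  # (owner_id, lease_expiry_epoch) or None

    def compare_and_set(self, expected, new):
        # Succeed only if nobody changed the record since we read it.
        if self.value == expected:
            self.value = new
            return True
        return False

def try_acquire(store, owner_id, ttl_seconds, now=None):
    """Attempt to take the lock: succeed only if it is free or its lease expired."""
    now = time.time() if now is None else now
    current = store.value
    if current is not None and current[1] > now:
        return False  # another enactor holds an unexpired lease
    return store.compare_and_set(current, (owner_id, now + ttl_seconds))

store = TxtLockStore()
assert try_acquire(store, "enactor-a", ttl_seconds=30, now=100.0)      # free: acquired
assert not try_acquire(store, "enactor-b", ttl_seconds=30, now=110.0)  # lease still held
assert try_acquire(store, "enactor-b", ttl_seconds=30, now=200.0)      # lease expired
```

The lease expiry is what makes the scheme tolerate a crashed lock holder, but, as the incident shows, a lock alone does not prevent a slow holder from applying work that has since gone stale.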
The Root Cause of the Outage
The root cause was a race condition between installing a new DNS plan and cleaning up old DNS records.
Due to an "unusual delay" in one of the enactors acquiring the lock, it fell behind and installed an older, stale DNS plan.
This caused an inconsistent state where the active DNS record was deleted, but the rollback record was not updated, preventing the enactor from making progress.
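A standard defense against this class of bug is to version every plan and make applies monotonic: an endpoint refuses any plan older than the one it already carries, so a delayed enactor cannot overwrite newer state. The sketch below is a minimal toy model of that guard, assuming hypothetical names; it is not DynamoDB's actual implementation.

```python
class DnsEndpoint:
    """Toy model of a DNS endpoint whose records are driven by versioned plans."""
    def __init__(self):
        self.applied_version = 0
        self.records = []

    def apply_plan(self, version, records):
        # Monotonicity guard: reject stale plans, only move the version forward.
        if version <= self.applied_version:
            return False
        self.applied_version = version
        self.records = list(records)
        return True

ep = DnsEndpoint()
assert ep.apply_plan(2, ["10.0.0.1", "10.0.0.2"])  # newer plan: applied
assert not ep.apply_plan(1, ["10.0.0.9"])           # stale plan: rejected
assert ep.records == ["10.0.0.1", "10.0.0.2"]       # newer state preserved
```

With this guard, the slow enactor's stale write becomes a harmless no-op instead of a destructive overwrite.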
Mitigating the Outage
The impact began at 6:48 UTC and was resolved by 9:25 UTC, when healthy DNS records were restored.
Individual customers started observing the restored DNS between 9:25 UTC and 9:40 UTC, due to DNS caching.
By 12:38 UTC, the automated DNS management system was disabled across all regions to contain the impact.
The fix was deployed to the first region by October 22nd and to all regions by October 28th.
Key Lessons Learned
Analyzing Complex Events: Using tools like distributed tracing (e.g., X-Ray) can help speed up the analysis of complex, timing-related issues by identifying the sequence of events.
Testing Timing-Related Issues: Traditional automated testing may not be sufficient to catch timing-related bugs. Formal modeling and verification can be more effective.
Cellular Architecture: Implementing a cellular architecture, where each cell has its own independent DNS management, can help reduce the blast radius of regional issues.
Separating Fast and Slow Paths: Distinguishing between fast, emergency-driven DNS updates and slower, routine updates can improve resilience, but requires careful consideration of the tradeoffs.
Freezing the System in Place: Having the ability to freeze the system in a known state, such as by disabling the automated DNS management, can be a valuable tool during outages.
Leveraging Beneficial Races: Not all races are bad; some patterns, like request hedging, can actually improve resilience by smoothing out tail latency.
Dependency Management: Carefully examining dependencies, both strong and weak, and architecting the system accordingly can enhance overall resilience.
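Request hedging, mentioned above as a beneficial race, means firing a backup copy of a request when the first one is slow and taking whichever answer arrives first. A minimal sketch using Python's standard thread pool (the function names here are illustrative, not from any AWS SDK):

```python
import concurrent.futures as cf
import time

def hedged_call(fn, hedge_delay):
    """Fire fn; if no result arrives within hedge_delay, fire a backup copy
    and return whichever finishes first (a deliberately benign race)."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fn)]
        done, _ = cf.wait(futures, timeout=hedge_delay)
        if not done:
            futures.append(pool.submit(fn))  # hedge: second identical request
            done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return done.pop().result()

calls = []
def flaky():
    # Simulate a slow first attempt; the hedged retry responds quickly.
    if not calls:
        calls.append(1)
        time.sleep(0.5)
        return "slow"
    return "fast"

result = hedged_call(flaky, hedge_delay=0.05)
# The hedge (second request) wins the race against the slow first attempt.
```

The tradeoff is extra load: every hedge is a duplicate request, so the delay is usually set near the tail (e.g., the p95 latency) rather than hedging everything.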
Business Impact and Real-World Applications
The October 2025 DynamoDB outage had a significant impact on customers, as DynamoDB is a critical service relied upon by many AWS users.
The lessons learned from this incident can be applied to improve the resilience of other distributed, highly-scalable systems, not just DynamoDB.
Techniques like formal modeling, cellular architecture, and separating fast and slow paths can be valuable in a wide range of enterprise-grade applications that require high availability and low latency.
The importance of comprehensive logging, tracing, and analysis tools, as well as the ability to freeze the system in a known state, are applicable lessons for any complex, mission-critical system.
Specific Examples and Results
DynamoDB handles hundreds of tables driving over 500,000 requests per second, with Amazon's own Prime Day workloads reaching 151 million requests per second.
The root cause analysis identified a race condition between installing a new DNS plan and cleaning up old records, which led to an inconsistent state that required manual intervention to resolve.
The fix for the issue was deployed to the first region within 2 days and to all regions within 1 week, demonstrating the importance of a robust and automated deployment process.
Formal modeling and verification, using tools like TLA+, helped the DynamoDB team validate the correctness of their fixes and gain confidence in the changes before re-enabling the automated DNS management system.
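To give a flavor of what exhaustive exploration buys over example-based tests, here is a deliberately tiny Python model (not the team's actual TLA+ spec) that enumerates every interleaving of two enactors applying plans. Without a monotonic-version guard, one interleaving ends on the stale plan; with the guard, all interleavings converge on the newest plan.

```python
from itertools import permutations

def explore(guard):
    """Enumerate every order in which two enactors can apply plan 1 and plan 2,
    and collect the final applied version under each interleaving."""
    outcomes = set()
    for order in permutations([("apply", 1), ("apply", 2)]):
        applied = 0
        for _, version in order:
            if not guard or version > applied:
                applied = version
        outcomes.add(applied)
    return outcomes

# Without the guard, one interleaving leaves the stale plan 1 in place.
assert explore(guard=False) == {1, 2}
# With the guard, every interleaving converges on the newest plan.
assert explore(guard=True) == {2}
```

Real model checkers like TLC (for TLA+) do this same enumeration over vastly larger state spaces, which is why they catch timing bugs that randomized or example-based tests rarely hit.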