Failing without flailing: Lessons we learned at AWS the hard way (ARC333)

Persistent Connections and Resilience

  • Resilient systems should embrace and expect failures at all layers
  • One common issue is how applications manage connections within their systems
  • Persistent connections can create resilience risks, especially at scale
    • Connection establishment can get stuck, leading to messages in an indeterminate state
    • As the number of components increases, the availability of the system decreases exponentially (each component's availability compounds multiplicatively)
  • The hypothesis: An application's resilience is proportional to how often its recovery workflows are executed
    • This applies to infrastructure replacement, message bus/event bus connections, leader election, and evacuation
  • Solution: Regularly reestablish connections to force the system to get good at replacing them
    • Set a maximum connection lifetime to force periodic reconnections, so setup costs stay amortized while the reconnect path stays well exercised (see the sketch after this list)
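
A minimal sketch of this pattern in Python, assuming a hypothetical `create_connection` factory that returns an object with `send()` and `close()` methods: each connection is proactively closed and rebuilt once a jittered maximum lifetime elapses, so the replacement path is exercised all the time rather than only during an outage.

```python
import random
import time


class RecyclingConnection:
    """Wrap a connection factory and proactively replace the connection
    after a jittered maximum lifetime, so the reconnect path runs routinely."""

    def __init__(self, create_connection, max_lifetime_seconds=300.0, jitter_fraction=0.2):
        # `create_connection` is a hypothetical factory supplied by the caller;
        # it is assumed to return an object with send() and close() methods.
        self._create = create_connection
        self._max_lifetime = max_lifetime_seconds
        self._jitter = jitter_fraction
        self._conn = None
        self._expires_at = 0.0

    def _next_lifetime(self):
        # Jitter the lifetime so a whole fleet does not reconnect in lockstep.
        return self._max_lifetime * (1.0 + random.uniform(-self._jitter, self._jitter))

    def _ensure_connection(self):
        now = time.monotonic()
        if self._conn is None or now >= self._expires_at:
            if self._conn is not None:
                self._conn.close()  # Planned replacement, not a failure path.
            self._conn = self._create()
            self._expires_at = now + self._next_lifetime()

    def send(self, message):
        # Every call re-checks the lifetime, so connection replacement is
        # exercised continuously instead of only when something breaks.
        self._ensure_connection()
        return self._conn.send(message)
```

The jitter matters as much as the lifetime: without it, an entire fleet of clients would recycle connections at the same moment and turn routine replacement into a reconnect storm.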

Scaling Reliably

  • The "wrong" way to scale in the cloud: Leave servers running forever without replacement or scaling
  • The "right" way: Use Auto Scaling, load balancers, and managed storage services
  • However, some services may need to store state on the host, which complicates scaling
    • When scaling up these stateful hosts, the time to load the state can be a bottleneck
  • Story: the DNS service for EC2, roughly ten years ago
    • Propagation of EC2 instance updates to the DNS hosts had a long lag during deployments
    • This made it hard to distinguish deployments from actual problems
  • Optimizations:
    • Focused on minimizing host startup time to meet customer deadlines
    • Used techniques like snapshots, cellular architectures, and synthetic traffic to improve deployments (the snapshot-based warm start is sketched after this list)
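
The talk does not show code for these optimizations, but the snapshot idea can be sketched roughly as follows: instead of replaying the full update history on startup, a new host restores a recent snapshot and replays only the updates recorded after it. The `Snapshot`, `StatefulHost`, and `warm_start` names below are hypothetical illustrations, not an actual AWS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    sequence_number: int
    data: dict


@dataclass
class StatefulHost:
    state: dict = field(default_factory=dict)

    def restore(self, data):
        # Bulk-load the snapshot; typically far faster than replaying history.
        self.state = dict(data)

    def apply(self, update):
        key, value = update
        self.state[key] = value


def warm_start(host, snapshot, update_log):
    """Restore the latest snapshot, then replay only the updates recorded after it.

    `update_log` is a hypothetical append-only list of (sequence_number, (key, value))
    entries; startup time is bounded by the tail since the snapshot, not the full history.
    """
    host.restore(snapshot.data)
    for sequence_number, update in update_log:
        if sequence_number > snapshot.sequence_number:
            host.apply(update)
    return host
```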

Measuring Bottlenecks

  • Every system has scaling bottlenecks that need to be identified and addressed
  • Story: The VPC service and its control plane/data plane architecture
    • Append-only log used to distribute state from control plane to data plane
    • Locking used to ensure consistency when appending updates
    • Discovered that the critical section had grown, leading to a brownout
  • Measuring the "sum of latency" under the lock (the total time the lock is held per measurement interval) provides an early-warning signal
    • Allows proactive mitigation before hitting the scaling cliff
    • Uses Little's Law to relate arrival rate, latency, and concurrency (see the sketch after this list)
  • Other techniques:
    • Posting explicit zero values when a metric is healthy, so that missing data is distinguishable from a quiet period
    • Using dimensional data (e.g., instance IDs) to quickly identify and remove problematic hosts
    • Regularly restarting services as a quick mitigation
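
Little's Law says that average concurrency equals arrival rate times average time in the system. For a mutex, concurrency can never exceed 1, so summing the time the lock is held in each reporting window and dividing by the window length yields a utilization number that climbs toward 1.0 as the system nears its cliff. Below is a minimal sketch of such an instrumented lock in Python; `emit_metric` is a hypothetical metrics publisher, not a specific AWS API.

```python
import threading
import time
from contextlib import contextmanager


class InstrumentedLock:
    """A mutex that reports how much of each reporting window it was held for."""

    def __init__(self, emit_metric, window_seconds=60.0):
        self._lock = threading.Lock()
        self._emit = emit_metric        # hypothetical publisher: emit_metric(name, value)
        self._window = window_seconds
        self._held_seconds = 0.0
        self._window_start = time.monotonic()

    @contextmanager
    def held(self):
        # Time only the critical section: the "sum of latency" under the lock.
        with self._lock:
            start = time.monotonic()
            try:
                yield
            finally:
                self._record(time.monotonic() - start)

    def _record(self, held_seconds):
        # Called while still holding the lock, so these counters are protected.
        self._held_seconds += held_seconds
        now = time.monotonic()
        elapsed = now - self._window_start
        if elapsed >= self._window:
            # Little's Law: average concurrency = arrival rate * average hold time,
            # which over a window is simply (sum of hold time) / (window length).
            # A mutex caps concurrency at 1, so values near 1.0 signal the cliff.
            self._emit("lock_utilization", self._held_seconds / elapsed)
            self._held_seconds = 0.0
            self._window_start = now
```

Posting an explicit zero for windows in which the lock was never taken (the talk's separate point about healthy metrics) would require a periodic flush on a timer, which is omitted here for brevity.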

The key takeaways are:

  1. Embrace failures and regularly exercise recovery workflows to build resilience.
  2. Pay attention to scaling bottlenecks, especially for stateful components, and use metrics to identify them early.
  3. Have a toolkit of quick mitigation techniques (like restarts) to recover from issues fast.
