Failing without flailing: Lessons we learned at AWS the hard way (ARC333)
Persistent Connections and Resilience
Resilient systems should embrace and expect failures at all layers
A common source of problems is how applications manage connections between their components
Persistent connections can create resilience risks, especially at scale
Connection establishment can get stuck, leading to messages in an indeterminate state
Because the availability of a serial dependency chain is the product of each component's availability, system availability falls off exponentially as components are added
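The reason is that a request succeeds only if every serial dependency is available, so the availabilities multiply:

```latex
A_{\text{system}} = \prod_{i=1}^{n} A_i,
\qquad \text{e.g. } 0.999^{10} \approx 0.990,\quad 0.999^{100} \approx 0.905.
```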
The hypothesis: An application's resilience is proportional to how often its recovery workflows are executed
This applies to infrastructure replacement, message bus/event bus connections, leader election, and evacuation
Solution: Regularly reestablish connections to force the system to get good at replacing them
Set a maximum connection lifetime to force periodic reconnection: the establishment cost is still amortized across many requests, while the reconnect path stays well exercised
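As a rough sketch of the idea (the class name, lifetime, and jitter range are illustrative, not AWS's implementation), a client can cap how long any one connection lives and add jitter so a whole fleet does not reconnect in lockstep:

```python
import random
import socket
import time

class BoundedLifetimeConnection:
    """Reconnect once the connection exceeds its maximum lifetime, so the
    reconnect path is exercised continuously instead of only during failures."""

    def __init__(self, host: str, port: int, max_lifetime_s: float = 300.0):
        self.host, self.port = host, port
        # Jitter the lifetime so the fleet's reconnects spread out over time.
        self.max_lifetime_s = max_lifetime_s * random.uniform(0.8, 1.2)
        self._sock = None
        self._established_at = 0.0

    def _reconnect(self) -> None:
        if self._sock is not None:
            self._sock.close()
        self._sock = socket.create_connection((self.host, self.port), timeout=5)
        self._established_at = time.monotonic()

    def send(self, payload: bytes) -> None:
        # Replace the connection if it has never been opened or has aged out.
        expired = time.monotonic() - self._established_at > self.max_lifetime_s
        if self._sock is None or expired:
            self._reconnect()
        self._sock.sendall(payload)
```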
Scaling Reliably
The "wrong" way to scale in the cloud: Leave servers running forever without replacement or scaling
The "right" way: Use Auto Scaling, load balancers, and managed storage services
However, some services may need to store state on the host, which complicates scaling
When adding or replacing these stateful hosts, the time to load state onto them can be a bottleneck
Story: EC2's DNS system, roughly 10 years ago
Propagation of EC2 instance updates to the DNS hosts had a long lag during deployments
This made it hard to distinguish deployments from actual problems
Optimizations:
Focused on minimizing host startup time to meet customer deadlines
Used techniques like snapshots, cellular architectures, and synthetic traffic to improve deployments
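A hedged sketch of the snapshot idea, with made-up types and data: a new host restores a recent snapshot and replays only the updates that arrived after it, so startup time stops growing with the total history:

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    sequence: int                      # last update included in the snapshot
    state: dict = field(default_factory=dict)

def bootstrap_host(snapshot: Snapshot, update_log: list[tuple[int, str, str]]) -> dict:
    """Restore state from the snapshot, then apply only the log entries
    newer than the snapshot's sequence number."""
    state = dict(snapshot.state)
    for sequence, key, value in update_log:
        if sequence > snapshot.sequence:
            state[key] = value
    return state

# Example: only the two post-snapshot entries are applied on startup.
snap = Snapshot(sequence=100, state={"i-0123": "10.0.0.5"})
log = [(99, "i-0456", "10.0.0.9"), (101, "i-0789", "10.0.1.2"), (102, "i-0123", "10.0.0.6")]
print(bootstrap_host(snap, log))   # {'i-0123': '10.0.0.6', 'i-0789': '10.0.1.2'}
```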
Measuring Bottlenecks
Every system has scaling bottlenecks that need to be identified and addressed
Story: The VPC service and its control plane/data plane architecture
Append-only log used to distribute state from control plane to data plane
Locking used to ensure consistency when appending updates
Over time, the work done inside the critical section grew, eventually causing a brownout
Measuring the total time the lock is held per interval (a "sum of latency" metric) provides an early warning signal
Allows proactive mitigation before hitting the scaling cliff
Little's Law (concurrency = arrival rate × average latency) ties these quantities together, so lock hold time translates directly into lock occupancy
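A minimal sketch of this kind of lock instrumentation in Python; the in-process lock and metric plumbing here are stand-ins for whatever the real service uses:

```python
import threading
import time

class InstrumentedLock:
    """Lock wrapper that accumulates total hold time, to be drained once
    per reporting period as a 'sum of latency' metric."""

    def __init__(self):
        self._lock = threading.Lock()
        self._held_seconds = 0.0
        self._acquired_at = 0.0

    def __enter__(self):
        self._lock.acquire()
        self._acquired_at = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Record the hold time before releasing the lock.
        self._held_seconds += time.monotonic() - self._acquired_at
        self._lock.release()

    def drain_metric(self) -> float:
        """Return and reset the accumulated hold time (call once per period)."""
        with self._lock:
            held, self._held_seconds = self._held_seconds, 0.0
        return held
```

Dividing the drained value by the reporting period gives seconds held per second, i.e. average lock occupancy; by Little's Law this equals arrival rate times average hold time, and values creeping toward 1.0 warn that the scaling cliff is near.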
Other techniques:
Publishing explicit zero values when a metric is "healthy", so a missing datapoint is distinguishable from a genuine zero (see the sketch after this list)
Using dimensional data (e.g., instance IDs) to quickly identify and remove problematic hosts
Regularly restarting services as a quick mitigation
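A hedged sketch of the zero-posting and dimension ideas, with made-up metric and dimension names and a plain print standing in for the actual publish call to a metrics service such as CloudWatch:

```python
import time

def emit_error_metric(errors_this_minute: int, instance_id: str) -> dict:
    """Always publish the error count, even when it is zero, and tag it with
    the instance ID. An explicit 0 distinguishes 'healthy' from 'not
    reporting', and the per-instance dimension makes it easy to spot (and
    remove) a single bad host."""
    datum = {
        "metric": "Errors",
        "value": errors_this_minute,        # 0 is published, not suppressed
        "dimensions": {"InstanceId": instance_id},
        "timestamp": time.time(),
    }
    print(datum)                            # stand-in for the publish call
    return datum

emit_error_metric(0, "i-0abc123")           # a healthy minute still reports 0
```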
The key takeaways are:
Embrace failures and regularly exercise recovery workflows to build resilience.
Pay attention to scaling bottlenecks, especially for stateful components, and use metrics to identify them early.
Keep a toolkit of quick mitigation techniques (like restarts) so you can recover from issues fast.