Failing without flailing: Lessons we learned at AWS the hard way (ARC333)

Persistent Connections and Resilience

  • Resilient systems should embrace and expect failures at all layers
  • One common issue is how applications manage connections within their systems
  • Persistent connections can create resilience risks, especially at scale
    • Connection establishment can get stuck, leading to messages in an indeterminate state
    • As the number of components increases, the availability of the system decreases exponentially (each component's availability compounds multiplicatively)
  • The hypothesis: An application's resilience is proportional to how often its recovery workflows are executed
    • This applies to infrastructure replacement, message bus/event bus connections, leader election, and evacuation
  • Solution: Regularly reestablish connections to force the system to get good at replacing them
    • Set a maximum connection lifetime to force periodic reconnections, so setup costs stay amortized while the reconnect path stays well exercised (see the sketch after this list)
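
A minimal sketch of this pattern in Python, assuming a hypothetical `create_connection` factory that returns an object with `send()` and `close()` methods: each connection is proactively closed and rebuilt once a jittered maximum lifetime elapses, so the replacement path is exercised all the time rather than only during an outage.

```python
import random
import time


class RecyclingConnection:
    """Wrap a connection factory and proactively replace the connection
    after a jittered maximum lifetime, so the reconnect path runs routinely."""

    def __init__(self, create_connection, max_lifetime_seconds=300.0, jitter_fraction=0.2):
        # `create_connection` is a hypothetical factory supplied by the caller;
        # it is assumed to return an object with send() and close() methods.
        self._create = create_connection
        self._max_lifetime = max_lifetime_seconds
        self._jitter = jitter_fraction
        self._conn = None
        self._expires_at = 0.0

    def _next_lifetime(self):
        # Jitter the lifetime so a whole fleet does not reconnect in lockstep.
        return self._max_lifetime * (1.0 + random.uniform(-self._jitter, self._jitter))

    def _ensure_connection(self):
        now = time.monotonic()
        if self._conn is None or now >= self._expires_at:
            if self._conn is not None:
                self._conn.close()  # Planned replacement, not a failure path.
            self._conn = self._create()
            self._expires_at = now + self._next_lifetime()

    def send(self, message):
        # Every call re-checks the lifetime, so connection replacement is
        # exercised continuously instead of only when something breaks.
        self._ensure_connection()
        return self._conn.send(message)
```

The jitter matters as much as the lifetime: without it, an entire fleet of clients would recycle connections at the same moment and turn routine replacement into a reconnect storm.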

Scaling Reliably

  • The "wrong" way to scale in the cloud: Leave servers running forever without replacement or scaling
  • The "right" way: Use Auto Scaling, load balancers, and managed storage services
  • However, some services may need to store state on the host, which complicates scaling
    • When scaling up these stateful hosts, the time to load the state can be a bottleneck
  • Story: the DNS service for EC2, roughly ten years ago
    • Propagation of EC2 instance updates to the DNS hosts had a long lag during deployments
    • This made it hard to distinguish deployments from actual problems
  • Optimizations:
    • Focused on minimizing host startup time to meet customer deadlines
    • Used techniques like snapshots, cellular architectures, and synthetic traffic to improve deployments (the snapshot-based warm start is sketched after this list)
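
The talk does not show code for these optimizations, but the snapshot idea can be sketched roughly as follows: instead of replaying the full update history on startup, a new host restores a recent snapshot and replays only the updates recorded after it. The `Snapshot`, `StatefulHost`, and `warm_start` names below are hypothetical illustrations, not an actual AWS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    sequence_number: int
    data: dict


@dataclass
class StatefulHost:
    state: dict = field(default_factory=dict)

    def restore(self, data):
        # Bulk-load the snapshot; typically far faster than replaying history.
        self.state = dict(data)

    def apply(self, update):
        key, value = update
        self.state[key] = value


def warm_start(host, snapshot, update_log):
    """Restore the latest snapshot, then replay only the updates recorded after it.

    `update_log` is a hypothetical append-only list of (sequence_number, (key, value))
    entries; startup time is bounded by the tail since the snapshot, not the full history.
    """
    host.restore(snapshot.data)
    for sequence_number, update in update_log:
        if sequence_number > snapshot.sequence_number:
            host.apply(update)
    return host
```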

Measuring Bottlenecks

  • Every system has scaling bottlenecks that need to be identified and addressed
  • Story: The VPC service and its control plane/data plane architecture
    • Append-only log used to distribute state from control plane to data plane
    • Locking used to ensure consistency when appending updates
    • Discovered that the critical section had grown, leading to a brownout
  • Measuring the "sum of latency" under the lock (the total time the lock is held per measurement interval) provides an early-warning signal
    • Allows proactive mitigation before hitting the scaling cliff
    • Uses Little's Law to relate arrival rate, latency, and concurrency (see the sketch after this list)
  • Other techniques:
    • Posting explicit zero values when a metric is healthy, so that missing data is distinguishable from a quiet period
    • Using dimensional data (e.g., instance IDs) to quickly identify and remove problematic hosts
    • Regularly restarting services as a quick mitigation
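
Little's Law says that average concurrency equals arrival rate times average time in the system. For a mutex, concurrency can never exceed 1, so summing the time the lock is held in each reporting window and dividing by the window length yields a utilization number that climbs toward 1.0 as the system nears its cliff. Below is a minimal sketch of such an instrumented lock in Python; `emit_metric` is a hypothetical metrics publisher, not a specific AWS API.

```python
import threading
import time
from contextlib import contextmanager


class InstrumentedLock:
    """A mutex that reports how much of each reporting window it was held for."""

    def __init__(self, emit_metric, window_seconds=60.0):
        self._lock = threading.Lock()
        self._emit = emit_metric        # hypothetical publisher: emit_metric(name, value)
        self._window = window_seconds
        self._held_seconds = 0.0
        self._window_start = time.monotonic()

    @contextmanager
    def held(self):
        # Time only the critical section: the "sum of latency" under the lock.
        with self._lock:
            start = time.monotonic()
            try:
                yield
            finally:
                self._record(time.monotonic() - start)

    def _record(self, held_seconds):
        # Called while still holding the lock, so these counters are protected.
        self._held_seconds += held_seconds
        now = time.monotonic()
        elapsed = now - self._window_start
        if elapsed >= self._window:
            # Little's Law: average concurrency = arrival rate * average hold time,
            # which over a window is simply (sum of hold time) / (window length).
            # A mutex caps concurrency at 1, so values near 1.0 signal the cliff.
            self._emit("lock_utilization", self._held_seconds / elapsed)
            self._held_seconds = 0.0
            self._window_start = now
```

Posting an explicit zero for windows in which the lock was never taken (the talk's separate point about healthy metrics) would require a periodic flush on a timer, which is omitted here for brevity.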

The key takeaways are:

  1. Embrace failures and regularly exercise recovery workflows to build resilience.
  2. Pay attention to scaling bottlenecks, especially for stateful components, and use metrics to identify them early.
  3. Have a toolkit of quick mitigation techniques (like restarts) to recover from issues fast.
