Try again: The tools and techniques behind resilient systems (ARC403)

Key takeaways from the talk, by section:


Retries and Backoff

  • Jitter (adding randomness to retry delays) is a good practice when implementing retries, as it helps break up spikes in load.
  • Naive retries with exponential backoff can introduce "metastable" (tipping-point) failures, where the system is knocked into a state it cannot recover from on its own.
  • Using a token bucket algorithm, like the one in the AWS SDK, can avoid this failure mode for most systems with just a few lines of client-side code.
  • Backoff is more effective in "closed" systems (fixed set of workers) than "open" systems (random arrivals) in reducing load.
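
The retry points above can be sketched as follows. This is a minimal illustration, not the AWS SDK's actual implementation: the class `RetryTokenBucket`, the helper names, and all parameter values (bucket capacity, retry cost, refill per success, base delay) are assumptions chosen for the example. Each retry spends a token, successes slowly refill the bucket, and each delay is drawn with full jitter.

```python
import random
import time


class RetryTokenBucket:
    """Client-side retry budget (hypothetical sketch, not the AWS SDK code)."""

    def __init__(self, capacity=10.0, retry_cost=1.0, success_refill=0.1):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.success_refill = success_refill

    def try_acquire(self):
        """Spend one retry's worth of tokens; return False if the budget is empty."""
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def record_success(self):
        """Successful calls slowly earn back retry budget."""
        self.tokens = min(self.capacity, self.tokens + self.success_refill)


def full_jitter_delay(attempt, base=0.1, cap=5.0):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, bucket, max_attempts=3, sleep=time.sleep):
    """Call operation(), retrying only while attempts and the token bucket allow."""
    for attempt in range(max_attempts):
        try:
            result = operation()
        except Exception:  # a real client would retry only retryable errors
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not bucket.try_acquire():
                raise
            sleep(full_jitter_delay(attempt))
        else:
            bucket.record_success()
            return result
```

When the downstream service is healthy, the bucket stays full and retries behave as usual. During a sustained outage the bucket drains, so each call makes roughly one attempt instead of `max_attempts`, which limits how much extra load retries add.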

Circuit Breakers

  • Circuit breakers are a powerful tool, but have downsides:
    • Can reduce availability in sharded systems by turning partial failures into full failures.
    • Can expand "blast radius" in microservice architectures by turning off healthy services.
  • Avoid binary on/off circuit breakers; prefer adaptive algorithms such as additive increase/multiplicative decrease (AIMD).
  • Circuit breakers may need to know internal details of the downstream service to make good decisions, breaking layering.
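
As one possible illustration of the adaptive alternative, the sketch below implements an additive increase/multiplicative decrease concurrency limit. The class name `AimdLimiter` and every default value are hypothetical; the talk does not prescribe a specific implementation. The limit grows by a fixed step on each success and is cut by a factor on each failure, but never drops below a floor, so traffic is throttled rather than switched off.

```python
class AimdLimiter:
    """Adaptive concurrency limit (hypothetical sketch with assumed defaults)."""

    def __init__(self, initial=10.0, minimum=1.0, maximum=100.0,
                 increase=1.0, decrease=0.5):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease = decrease
        self.in_flight = 0

    def try_acquire(self):
        """Admit a request only if in-flight work is under the current limit."""
        if self.in_flight < int(self.limit):
            self.in_flight += 1
            return True
        return False

    def release(self, success):
        """Finish a request and adapt the limit based on its outcome."""
        self.in_flight -= 1
        if success:
            # Additive increase: probe for more capacity slowly.
            self.limit = min(self.maximum, self.limit + self.increase)
        else:
            # Multiplicative decrease: back off quickly, but never to zero.
            self.limit = max(self.minimum, self.limit * self.decrease)
```

Because the floor keeps at least one request flowing, the limiter keeps sampling the downstream service and can ramp back up once it recovers, without a separate "half-open" state.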

Tail Latency

  • Techniques like hedging (sending two requests and using the first response), erasure coding, and adaptive retries can shorten the tail of the latency distribution.
  • Erasure coding is a powerful but underutilized technique, providing constant-work resilience to slow or failed requests.
  • Simulation is a valuable tool for understanding and tuning the tail latency behavior of a system, without requiring advanced statistical skills.
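
A small simulation in the spirit of the points above might look like this. The latency model (95% of requests take 5 to 15 ms, 5% take 500 to 1000 ms) is invented for illustration, and the two hedged requests are assumed to have independent latencies, which may not hold in a real system. The script compares the 99th-percentile latency of a single request against a hedged pair.

```python
import random


def sample_latency_ms(rng):
    """Hypothetical latency model: 95% fast requests, 5% slow outliers."""
    if rng.random() < 0.05:
        return rng.uniform(500.0, 1000.0)
    return rng.uniform(5.0, 15.0)


def percentile(values, q):
    """Simple nearest-rank style percentile for q in [0, 1]."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


def simulate_hedging(trials=20000, seed=1):
    """Return (p99 of single requests, p99 of hedged pairs) in milliseconds."""
    rng = random.Random(seed)
    single = [sample_latency_ms(rng) for _ in range(trials)]
    # Hedging: send two requests and keep whichever finishes first.
    hedged = [min(sample_latency_ms(rng), sample_latency_ms(rng))
              for _ in range(trials)]
    return percentile(single, 0.99), percentile(hedged, 0.99)


if __name__ == "__main__":
    single_p99, hedged_p99 = simulate_hedging()
    print(f"single p99: {single_p99:.1f} ms, hedged p99: {hedged_p99:.1f} ms")
```

Under this model a single request is slow 5% of the time, so its p99 lands among the outliers, while both hedged requests are slow only 0.25% of the time. The cost is that this naive form of hedging doubles the request volume sent to the backend.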

System Stability

  • Many systems exhibit "metastable" behavior, where adding load can push the system into a state it cannot recover from.
  • Retries and other "best practices" can exacerbate this, leading to long, painful outages.
  • Simulation is a great way to explore the stability and recovery properties of a system, without relying solely on intuition.
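
The toy model below is one way to explore this behavior. It rests on a loudly stated assumption that is not from the talk: once demand exceeds capacity, goodput falls as `capacity * capacity / demand`, standing in for work wasted on requests that time out. The function name `simulate_recovery` and all numbers are hypothetical. A one-tick load spike is applied, and `retry_fraction` controls how many failed requests are retried on the next tick.

```python
def simulate_recovery(retry_fraction, ticks=50, arrivals=80.0, capacity=100.0,
                      spike_tick=10, spike_extra=220.0):
    """Return per-tick goodput of a toy service hit by a one-tick load spike."""
    retries = 0.0
    goodput_history = []
    for t in range(ticks):
        demand = arrivals + retries
        if t == spike_tick:
            demand += spike_extra
        if demand <= capacity:
            goodput = demand
        else:
            # Assumed congestion model: overload wastes capacity on timed-out work.
            goodput = capacity * capacity / demand
        failed = demand - goodput
        # Only a fraction of failures come back as retries on the next tick.
        retries = retry_fraction * failed
        goodput_history.append(goodput)
    return goodput_history


if __name__ == "__main__":
    for fraction in (1.0, 0.1):
        final = simulate_recovery(fraction)[-1]
        print(f"retry_fraction={fraction}: final goodput {final:.1f}")
```

In this model, retrying every failure keeps demand above capacity long after the spike has passed, so goodput never returns to the arrival rate, even though steady-state load is well under capacity. Limiting retries to a small fraction of failures lets the same system recover within a few ticks.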

Statistics and Simulation

  • The speaker advocates for more use of simulation techniques by engineers, as they fit the mental model of code review and testing better than advanced statistics.
  • Generative AI tools like Claude 3.5 Sonnet can further simplify the process of building and running simulations.
  • The key is to write more simulations to explore the behavior of your systems, rather than relying solely on intuition or complex statistical analysis.
