How Stripe achieves five and a half 9s of availability (FSI321)

Achieving 5 and a Half 9s of Availability at Stripe

Importance of Reliability

  • Reliable systems are critical, as failure can have significant consequences
  • Example: London fire department unable to respond to a fire due to a telephone system failure
  • Reliability is judged by the worst-case scenario, not just the average or best case
  • Stripe targets 5 and a half 9s of availability (about 13 seconds of unavailability per month) because it powers critical financial transactions
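
The 13-second figure above can be checked with simple arithmetic. The sketch below assumes that "5 and a half 9s" means 99.9995% availability and that a month is 30 days; neither assumption is stated in the summary, and the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(availability_percent: float, days: int = 30) -> float:
    """Return the allowed unavailability, in seconds, over a period of `days`."""
    # Fraction of time the service is permitted to be unavailable.
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * days * SECONDS_PER_DAY


# Assumption: "5 and a half 9s" is read as 99.9995%.
budget = downtime_budget_seconds(99.9995)
print(f"{budget:.2f} seconds per 30-day month")
```

Under these assumptions the budget comes to roughly 13 seconds per month, which leaves no room for a human to be paged and respond; this is the constraint behind the automation points later in the summary.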

Technical Strategies for Reliability

  1. Reducing Blast Radius:

    • Failures are inevitable, so focus on reducing the impact of failures
    • Use cell-based architectures to isolate failures within individual cells
    • Roll out changes incrementally to limit the scope of any potential issues
  2. Detecting Gray Failures:

    • Gray failures are partial or intermittent issues that are harder to detect
    • Use anomaly detection to identify nodes with higher-than-normal latency and remove them from serving traffic
    • Avoid relying on nodes to self-report their own health, since a partially failing node may still report itself as healthy
  3. Leveraging Cloud Partner Capabilities:

    • When migrating to EBS, Stripe worked closely with AWS to develop new features and metrics to aid in diagnosis and remediation of issues
  4. Chaos Testing and Fault Injection:

    • Purposefully introduce faults into the system to test automated remediation capabilities
    • Start with a few faults and gradually increase as the system matures
  5. Load Testing:

    • Conduct load tests at many times the peak user traffic to identify weaknesses
    • Use these tests to optimize resource usage and scale efficiently
  6. Continuous Delivery:

    • Automated deployments reduce the chance of human error and enable faster response to issues
    • Frequent, small deployments make it easier to identify the root cause of any problems
  7. Retries:

    • Implement retries at the outermost layer closest to the user
    • Perform retries within Stripe's network to minimize latency impact
  8. Proactive User Alerting:

    • Monitor for spikes in error rates and latency, even in "successful" responses
    • Reach out to users to help optimize their integrations and reduce their load on Stripe's systems
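
As an illustration of the gray-failure strategy (item 2 above), the sketch below flags nodes whose latency is anomalously high relative to the rest of the fleet, using measurements taken by an external observer rather than self-reported health. It is a minimal example under stated assumptions, not Stripe's implementation: the median/MAD scoring, the threshold values, and all names (`find_latency_outliers`, `min_mad_ms`, the node labels) are hypothetical.

```python
from statistics import median


def find_latency_outliers(latency_ms_by_node, threshold=3.0, min_mad_ms=1.0):
    """Return the sorted names of nodes with anomalously high latency.

    A node is flagged when its latency exceeds the fleet median by more than
    `threshold` times the median absolute deviation (MAD). `min_mad_ms` is a
    floor on the MAD so that a very uniform fleet does not flag tiny deviations.
    """
    latencies = list(latency_ms_by_node.values())
    if not latencies:
        return []
    fleet_median = median(latencies)
    mad = median([abs(value - fleet_median) for value in latencies])
    scale = max(mad, min_mad_ms)
    return sorted(
        node
        for node, latency in latency_ms_by_node.items()
        # Only high latency counts as an anomaly; fast nodes are never flagged.
        if (latency - fleet_median) / scale > threshold
    )


# Hypothetical latencies (in ms) measured from outside the nodes.
observed = {
    "node-a": 20.0,
    "node-b": 22.0,
    "node-c": 21.0,
    "node-d": 19.0,
    "node-e": 95.0,
}
to_remove = find_latency_outliers(observed)
print(to_remove)
```

A median-based score is used here because a single slow node barely shifts the median, whereas it would inflate a mean and standard deviation and could hide itself. In a real system, the flagged nodes would then be removed from serving traffic by the load balancer or orchestrator.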

Building a Culture of Reliability

  1. Practice Your Worst Day Every Day:

    • Continuously test systems to their limits and design for failure
    • Invest in observability and self-healing capabilities
  2. Never Send a Human to Do a Machine's Job:

    • Rely on automation to detect and mitigate failures within the tight availability budget
    • Empower machines to react in milliseconds, not minutes
  3. Exercise Extreme Ownership:

    • Hold leaders accountable for the reliability and availability of the systems they own
    • Recognize and reward those who prioritize reliability and drive continuous improvement

Key Takeaways

  1. Practice your worst-case scenarios regularly to build resilient systems.
  2. Leverage automation and machines to detect and respond to failures quickly.
  3. Foster a culture of reliability through accountability, ownership, and continuous improvement.
