How Stripe achieves five and a half 9s of availability (FSI321)

Achieving 5 and a Half 9s of Availability at Stripe

Importance of Reliability

Reliability matters because system failures can have serious real-world consequences

Example: London fire department unable to respond to a fire due to a telephone system failure

Reliability is measured by the worst case, not by the average or best case

Stripe needs to achieve 5 and a half 9s of availability (99.9995%, or about 13 seconds of unavailability per month) because it powers critical financial transactions
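The 13-second figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming "5 and a half 9s" means 99.9995% and a 30-day month (the function name is illustrative, not from the talk):

```python
def downtime_budget_seconds(availability: float, period_seconds: float) -> float:
    """Return the seconds of unavailability allowed over a period."""
    return (1.0 - availability) * period_seconds


# Assumption: a 30-day month.
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

budget = downtime_budget_seconds(0.999995, SECONDS_PER_30_DAY_MONTH)
print(f"{budget:.2f} seconds per 30-day month")  # 12.96 seconds per 30-day month
```

For comparison, the same calculation at five 9s (99.999%) gives roughly 26 seconds per month, so the extra half nine cuts the budget in half.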

Technical Strategies for Reliability

Reducing Blast Radius:

Failures are inevitable, so focus on reducing the impact of failures
Use cell-based architectures to isolate failures within individual cells
Roll out changes incrementally to limit the scope of any potential issues
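The two ideas above can be sketched together: pin each customer to one cell with a stable hash, then release a change to cells in growing waves. This is an illustrative sketch only; `assign_cell`, `rollout_waves`, and the wave sizes are hypothetical and do not describe Stripe's actual routing or deployment tooling.

```python
import hashlib


def assign_cell(customer_id: str, num_cells: int) -> int:
    """Map a customer to a cell with a stable hash, so a failing cell
    affects only the customers pinned to it."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def rollout_waves(num_cells: int, wave_sizes=(1, 2, 4)) -> list[list[int]]:
    """Split cells into deployment waves of increasing size.
    Any cells left after the listed waves form one final wave."""
    cells = list(range(num_cells))
    waves = []
    start = 0
    for size in wave_sizes:
        if start >= num_cells:
            break
        waves.append(cells[start:start + size])
        start += size
    if start < num_cells:
        waves.append(cells[start:])
    return waves


print(rollout_waves(10))  # [[0], [1, 2], [3, 4, 5, 6], [7, 8, 9]]
```

In practice each wave would be followed by a health check before the next wave starts, so a bad change is caught while it is still confined to one or two cells.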

Detecting Gray Failures:

Gray failures are partial or intermittent issues that are harder to detect
Use anomaly detection to identify nodes with higher-than-normal latency and remove them from serving traffic
Avoid letting nodes self-report on their own health
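One way to flag slow nodes without trusting their own health reports is to compare externally measured latencies across the fleet. The sketch below uses the median and the median absolute deviation (MAD), a common robust outlier test; the talk does not say which detection method Stripe uses, and the function name, threshold, and sample data are assumptions.

```python
from statistics import median


def find_latency_outliers(
    p99_ms_by_node: dict[str, float],
    threshold: float = 3.0,
    min_mad_ms: float = 1.0,
) -> list[str]:
    """Return nodes whose latency is far above the fleet median.

    Latencies are expected to come from an external observer (for example
    a load balancer), not from the nodes themselves.
    """
    values = list(p99_ms_by_node.values())
    fleet_median = median(values)
    mad = median([abs(v - fleet_median) for v in values])
    mad = max(mad, min_mad_ms)  # avoid dividing by zero on a uniform fleet
    return sorted(
        node
        for node, value in p99_ms_by_node.items()
        if (value - fleet_median) / mad > threshold
    )


observed = {"node-a": 100.0, "node-b": 102.0, "node-c": 98.0,
            "node-d": 101.0, "node-e": 450.0}
print(find_latency_outliers(observed))  # ['node-e']
```

Only nodes that are slower than the fleet are flagged; an unusually fast node is not treated as a failure here.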

Leveraging Cloud Partner Capabilities:

While migrating to Amazon EBS, Stripe worked closely with AWS to develop new features and metrics that help diagnose and remediate issues

Chaos Testing and Fault Injection:

Purposefully introduce faults into the system to test automated remediation capabilities
Start with a few faults and gradually increase as the system matures

Load Testing:

Conduct load tests at many times the peak user traffic to identify weaknesses
Use these tests to optimize resource usage and scale efficiently

Continuous Delivery:

Automated deployments reduce the chance of human error and enable faster response to issues
Frequent, small deployments make it easier to identify the root cause of any problems

Retries:

Implement retries at the outermost layer closest to the user
Perform retries within Stripe's network to minimize latency impact
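A sketch of a retry wrapper that could sit at that outermost layer. It uses capped exponential backoff with full jitter, which is a common choice but an assumption here; the talk does not specify a backoff policy. `TransientError` and `call_with_retries` are hypothetical names.

```python
import random
import time


class TransientError(Exception):
    """A failure that is safe to retry, such as a timeout from one backend."""


def call_with_retries(
    operation,
    max_attempts: int = 3,
    base_delay_s: float = 0.05,
    max_delay_s: float = 1.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Run operation, retrying on TransientError with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(rng() * cap)  # full jitter: wait a random time up to the cap
```

Retrying in only one layer matters: if every layer in a call chain retries three times, a single failure deep in the stack can multiply into dozens of requests. Retried operations also need to be safe to repeat.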

Proactive User Alerting:

Monitor for spikes in error rates and latency, even in "successful" responses
Reach out to users to help optimize their integrations and reduce their load on Stripe's systems
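The monitoring point above can be sketched as a check that treats slow "successful" responses as degraded alongside server errors. The latency objective, spike factor, and all names below are illustrative assumptions rather than Stripe's actual thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    status: int
    latency_ms: float


def degraded_fraction(responses: list[Response], latency_slo_ms: float = 500.0) -> float:
    """Fraction of responses that are server errors or slower than the objective,
    even when the status code reports success."""
    if not responses:
        return 0.0
    bad = sum(1 for r in responses
              if r.status >= 500 or r.latency_ms > latency_slo_ms)
    return bad / len(responses)


def should_alert(
    responses: list[Response],
    baseline_fraction: float,
    latency_slo_ms: float = 500.0,
    spike_factor: float = 3.0,
    min_samples: int = 20,
) -> bool:
    """Alert when the degraded fraction exceeds a multiple of the baseline."""
    if len(responses) < min_samples:
        return False  # too little data to call it a spike
    return degraded_fraction(responses, latency_slo_ms) > baseline_fraction * spike_factor


window = [Response(200, 80.0)] * 18 + [Response(200, 2400.0)] * 2
print(should_alert(window, baseline_fraction=0.01))  # True
```

In this example every response has status 200, yet the window still triggers an alert because two of the twenty responses far exceed the latency objective.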

Building a Culture of Reliability

Practice Your Worst Day Every Day:

Continuously test systems to their limits and design for failure
Invest in observability and self-healing capabilities

Never Send a Human to Do a Machine's Job:

Rely on automation to detect and mitigate failures within the tight availability budget
Empower machines to react in milliseconds, not minutes

Exercise Extreme Ownership:

Hold leaders accountable for the reliability and availability of the systems they own
Recognize and reward those who prioritize reliability and drive continuous improvement

Key Takeaways:

Practice your worst-case scenarios regularly to build resilient systems.

Leverage automation and machines to detect and respond to failures quickly.

Foster a culture of reliability through accountability, ownership, and continuous improvement.
