How Stripe achieves five and a half 9s of availability (FSI321)
Achieving 5 and a Half 9s of Availability at Stripe
Importance of Reliability
Reliable systems are critical, as failure can have significant consequences
Example: London fire department unable to respond to a fire due to a telephone system failure
Reliability is judged by the worst case, not the average or best case
Stripe targets five and a half 9s of availability (99.9995%, roughly 13 seconds of unavailability per month) because it powers critical financial transactions
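The 13-second figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming a 30-day month (the function and constant names are illustrative, not from the talk):

```python
SECONDS_PER_DAY = 24 * 60 * 60
THIRTY_DAY_MONTH = 30 * SECONDS_PER_DAY  # assumption: a 30-day month


def downtime_budget_seconds(availability: float, period_seconds: float) -> float:
    """Return the unavailability allowed in a period, in seconds."""
    return (1.0 - availability) * period_seconds


# "Five and a half 9s" read as 99.9995% availability.
FIVE_AND_A_HALF_NINES = 0.999995

budget = downtime_budget_seconds(FIVE_AND_A_HALF_NINES, THIRTY_DAY_MONTH)
print(f"{budget:.2f} seconds per 30-day month")  # 12.96 seconds per 30-day month
```

For comparison, the same function gives about 26 seconds per month at five 9s (99.999%), so the extra half 9 halves the budget.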
Technical Strategies for Reliability
Reducing Blast Radius:
Failures are inevitable, so focus on reducing the impact of failures
Use cell-based architectures to isolate failures within individual cells
Roll out changes incrementally to limit the scope of any potential issues
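A minimal sketch of both ideas, assuming accounts are pinned to cells with a stable hash and that changes roll out cell by cell in stages. The cell count, stage percentages, and function names are hypothetical and not taken from the talk:

```python
import hashlib


def stable_bucket(key: str, buckets: int) -> int:
    """Map a key to a bucket in [0, buckets) with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def assign_cell(account_id: str, cell_count: int) -> int:
    """Pin an account to one cell, so a failing cell affects only its own accounts."""
    return stable_bucket(account_id, cell_count)


def cells_for_stage(cell_count: int, percent: int) -> list[int]:
    """Return the cells that receive a change once the rollout covers `percent` of cells."""
    covered = max(1, cell_count * percent // 100)  # always start with at least one cell
    return list(range(min(covered, cell_count)))


# Hypothetical rollout over 20 cells: 5%, then 25%, then everything.
for percent in (5, 25, 100):
    print(percent, cells_for_stage(20, percent))
```

With 20 cells, a bad change caught at the first stage reaches one cell, which is the point of pairing cell isolation with incremental rollout.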
Detecting Gray Failures:
Gray failures are partial or intermittent issues that are harder to detect than outright outages
Use anomaly detection to identify nodes with higher-than-normal latency and remove them from serving traffic
Measure node health from the outside rather than letting nodes self-report, since a degraded node may still report itself as healthy
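One way to sketch the anomaly-detection idea is to compare each node's latency against the fleet median, scaled by the median absolute deviation. The talk summary does not say which statistic Stripe uses, so the method, thresholds, and names below are assumptions. The latencies are assumed to come from an external observer such as a load balancer, not from the nodes themselves:

```python
from statistics import median


def find_latency_outliers(
    p99_ms_by_node: dict[str, float],
    threshold: float = 3.0,
    min_scale_ms: float = 1.0,
    max_eject_fraction: float = 0.25,
) -> list[str]:
    """Return nodes whose externally measured latency is far above the fleet median."""
    if len(p99_ms_by_node) < 3:
        return []  # too few nodes to define "normal"

    values = list(p99_ms_by_node.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that one bad node cannot skew much.
    mad = median([abs(v - center) for v in values])
    scale = max(mad, min_scale_ms)  # floor keeps tiny differences from looking anomalous

    scored = [(node, (latency - center) / scale) for node, latency in p99_ms_by_node.items()]
    # Only unusually slow nodes are candidates; unusually fast ones are ignored.
    slow = sorted(
        (item for item in scored if item[1] > threshold),
        key=lambda item: item[1],
        reverse=True,
    )

    # Cap ejections so a bad measurement cannot drain the whole fleet.
    max_eject = int(len(values) * max_eject_fraction)
    return [node for node, _ in slow[:max_eject]]


fleet = {"node-a": 20.0, "node-b": 22.0, "node-c": 21.0, "node-d": 19.0, "node-e": 95.0}
print(find_latency_outliers(fleet))  # ['node-e']
```

The ejection cap is a design choice in this sketch: it trades slower removal of widespread problems for protection against removing healthy capacity on a bad signal.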
Leveraging Cloud Partner Capabilities:
When migrating to EBS, Stripe worked closely with AWS to develop new features and metrics to aid in diagnosis and remediation of issues
Chaos Testing and Fault Injection:
Purposefully introduce faults into the system to test automated remediation capabilities
Start with a few faults and gradually increase as the system matures
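Fault injection can be sketched as a wrapper that fails a configurable fraction of calls, so the fault rate can start low and be raised as remediation improves. This is an illustrative in-process sketch only. The class, exception, and rates are hypothetical, and real chaos testing typically also injects faults at the infrastructure level:

```python
import random
from typing import Any, Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real call to simulate a failure."""


class FaultInjector:
    def __init__(self, failure_rate: float, seed: Optional[int] = None) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seedable so a test run can be repeated

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Run func, or raise InjectedFault for roughly failure_rate of calls."""
        if self._rng.random() < self.failure_rate:
            raise InjectedFault("fault injected by chaos test")
        return func(*args, **kwargs)


# Hypothetical maturity ramp: begin with 1% of calls failing, then raise the rate.
for rate in (0.01, 0.05, 0.10):
    injector = FaultInjector(failure_rate=rate, seed=42)
    injected = 0
    for _ in range(1000):
        try:
            injector.call(lambda: "ok")
        except InjectedFault:
            injected += 1  # a real test would assert that remediation handled this
    print(f"rate={rate:.2f} injected={injected}")
```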
Load Testing:
Conduct load tests at many times the peak user traffic to identify weaknesses
Use these tests to optimize resource usage and scale efficiently
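A load test at a multiple of peak is usually approached in steps instead of all at once. A minimal sketch of such a ramp, assuming evenly spaced steps; the peak rate, multiplier, and step count are placeholders, since the summary only says "many times" peak:

```python
def ramp_schedule(peak_rps: float, multiplier: float, steps: int) -> list[float]:
    """Return evenly spaced target request rates from peak_rps up to peak_rps * multiplier."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    top = peak_rps * multiplier
    step = (top - peak_rps) / (steps - 1)
    return [peak_rps + i * step for i in range(steps)]


# Hypothetical numbers: observed peak of 1,000 requests/s, tested up to 5x.
print(ramp_schedule(1000, 5, 5))  # [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
```

Holding each step long enough to observe latency and resource usage shows where the first weakness appears, which is the information needed to scale efficiently.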
Continuous Delivery:
Automated deployments reduce the chance of human error and enable faster response to issues
Frequent, small deployments make it easier to identify the root cause of any problems
Retries:
Implement retries at the outermost layer closest to the user
Perform retries within Stripe's network to minimize latency impact
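A minimal sketch of a retry wrapper intended to sit at a single outermost layer. Retrying in one place avoids multiplying attempts when several nested layers each retry on their own. Exponential backoff with jitter is an assumption here and is not stated in the talk summary; the names and defaults are hypothetical, and the wrapped operation is assumed to be safe to repeat:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure that is expected to succeed if tried again."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.05,
    max_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation, retrying on TransientError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure to the caller
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(rng() * cap)  # random delay in [0, cap) spreads out retry bursts
    raise AssertionError("unreachable")  # the loop always returns or raises
```

The `sleep` and `rng` parameters exist so the backoff can be tested without real waiting. Only `TransientError` is retried, so permanent failures surface immediately.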
Proactive User Alerting:
Monitor for spikes in error rates and latency, even in "successful" responses
Reach out to users to help optimize their integrations and reduce their load on Stripe's systems
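Spike monitoring can be sketched as a comparison between the current error rate and a rolling baseline of recent rates. The window size, ratio, and floor below are hypothetical, and what counts as an "error" (including problems carried inside nominally successful responses) is left to the caller:

```python
from collections import deque


class ErrorRateSpikeDetector:
    def __init__(
        self,
        window: int = 12,
        ratio: float = 3.0,
        min_rate: float = 0.01,
        min_samples: int = 3,
    ) -> None:
        self._history: deque = deque(maxlen=window)  # recent non-spike error rates
        self._ratio = ratio
        self._min_rate = min_rate
        self._min_samples = min_samples

    def observe(self, errors: int, total: int) -> bool:
        """Record one interval and return True if its error rate is a spike."""
        rate = errors / total if total else 0.0
        is_spike = False
        if len(self._history) >= self._min_samples:
            baseline = sum(self._history) / len(self._history)
            # Require both an absolute floor and a relative jump over the baseline.
            is_spike = rate >= self._min_rate and rate > self._ratio * baseline
        if not is_spike:
            self._history.append(rate)  # keep spikes out of the baseline
        return is_spike


detector = ErrorRateSpikeDetector()
for _ in range(5):
    detector.observe(errors=1, total=1000)  # steady 0.1% error rate
print(detector.observe(errors=50, total=1000))  # True: 5% is far above the baseline
```

The same shape works for latency by feeding a latency percentile per interval in place of an error rate.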
Building a Culture of Reliability
Practice Your Worst Day Every Day:
Continuously test systems to their limits and design for failure
Invest in observability and self-healing capabilities
Never Send a Human to Do a Machine's Job:
Rely on automation to detect and mitigate failures within the tight availability budget
Empower machines to react in milliseconds, not minutes
Exercise Extreme Ownership:
Hold leaders accountable for the reliability and availability of the systems they own
Recognize and reward those who prioritize reliability and drive continuous improvement
Key Takeaways:
Practice your worst-case scenarios regularly to build resilient systems.
Leverage automation and machines to detect and respond to failures quickly.
Foster a culture of reliability through accountability, ownership, and continuous improvement.