Introduction
- Stephen Clark, Director of Enterprise Support and Technical Account Management at AWS, introduces the session on how Fidelity Investments operates their tier zero trading databases on AWS.
- Many AWS customers are now running their most critical applications on AWS, including real-time applications like stock trading, online banking, and insurance claims processing.
- Disruptions to these critical systems are costly: enterprises face an average of 29 unplanned outages per year, at an average cost of $13.5 million.
- Customers struggle to identify the root cause of issues and resolve them quickly, often taking 4-5 hours before reaching out to AWS for help.
Fidelity Investments Overview
- Fidelity is a financial services company with 51.5 million customers and 28,000 employees, providing a wide range of services beyond just brokerage and 401(k) management.
- Fidelity embarked on a cloud journey 8 years ago, aiming to use cloud-agnostic technologies and focus on open-source.
- Fidelity has moved 6,000 databases to the cloud and invests heavily in employee learning and upskilling.
Resiliency Patterns
- Fidelity measures resiliency through disaster recovery capability: recovery time objective (RTO), recovery point objective (RPO), and availability.
- Fidelity has evaluated different resiliency patterns:
  - Active-passive backup/restore: Not suitable for mission-critical applications due to long RTO and potential for data loss.
  - Active-passive: Provides some availability but still has limitations around RTO and RPO.
  - Active-active: Fidelity's initial approach, with two regions and synchronous replication, which still had potential for data loss.
  - Global active-active: Fidelity's current approach, using a distributed relational database with a minimum of 3 nodes across 3 regions, providing strong consistency and zero RPO.
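The zero-RPO property of the three-region layout can be illustrated with a majority-quorum model. The summary does not say which replication protocol Fidelity's database uses, so the sketch below assumes majority-quorum commits, a common design in distributed relational databases. The region names and function names are hypothetical.

```python
def quorum_size(node_count: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return node_count // 2 + 1


def write_is_durable(acks: set[str], nodes: set[str]) -> bool:
    """Assumption: a write commits only once a majority of nodes acknowledge it."""
    return len(acks & nodes) >= quorum_size(len(nodes))


def survives_region_loss(nodes: set[str], lost: set[str]) -> bool:
    """True if the surviving nodes still form a majority of the full cluster."""
    return len(nodes - lost) >= quorum_size(len(nodes))


# Hypothetical region names, one node per region.
NODES = {"region-a", "region-b", "region-c"}

if __name__ == "__main__":
    print(quorum_size(len(NODES)))                                # 2
    print(survives_region_loss(NODES, {"region-a"}))              # True
    print(survives_region_loss(NODES, {"region-a", "region-b"}))  # False
```

Under this assumption every committed write exists in at least two regions, so losing any single region loses no committed data, and the two surviving regions can keep accepting writes.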
Failure Mode and Effects Analysis (FMEA)
- Fidelity proactively tests their resilient systems by manually injecting failures using AWS Fault Injection Service and a framework called Chaos Buffet.
- This allows Fidelity to understand how their databases and applications behave under different failure scenarios, ensuring no surprises when issues occur in production.
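A failure-injection exercise of this kind amounts to enumerating failure scenarios and recording the observed effect of each. The sketch below does this against a toy in-memory model rather than AWS Fault Injection Service or Chaos Buffet, whose interfaces the summary does not describe. The `Cluster` class, the region names, and the majority rule for write availability are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Cluster:
    """Toy stand-in for a replicated database: one node per region."""
    regions: tuple[str, ...]
    down: set[str] = field(default_factory=set)

    def inject_failure(self, region: str) -> None:
        self.down.add(region)

    def heal(self) -> None:
        self.down.clear()

    def accepts_writes(self) -> bool:
        # Assumption: writes need a strict majority of regions to be healthy.
        healthy = len(self.regions) - len(self.down)
        return healthy >= len(self.regions) // 2 + 1


def run_failure_scenarios(cluster: Cluster, max_failures: int) -> dict[tuple[str, ...], bool]:
    """Inject every combination of up to max_failures region outages and record the effect."""
    results = {}
    for k in range(1, max_failures + 1):
        for scenario in combinations(cluster.regions, k):
            for region in scenario:
                cluster.inject_failure(region)
            results[scenario] = cluster.accepts_writes()
            cluster.heal()  # restore a clean state before the next scenario
    return results


if __name__ == "__main__":
    cluster = Cluster(regions=("region-a", "region-b", "region-c"))
    for scenario, ok in run_failure_scenarios(cluster, max_failures=2).items():
        print(f"{' + '.join(scenario)} down -> writes {'available' if ok else 'unavailable'}")
```

The resulting table of scenario versus effect is the core artifact of an FMEA: each row states in advance what should happen, so a real injection run can be checked against it.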
Incident Detection and Response (IDR)
- Despite Fidelity's robust resiliency measures, they still faced challenges in quickly identifying and resolving issues, especially when engaging with AWS support.
- This led Fidelity to adopt the AWS Incident Detection and Response (IDR) service, which provides:
  - Improved observability by correlating metrics and alarms between the application and underlying AWS services.
  - Faster incident resolution through pre-defined runbooks and single-threaded ownership.
  - Early warning of AWS events that may impact Fidelity's critical workloads.
  - Continuous improvement through post-incident reviews and resilience evaluations.
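Correlating application alarms with alarms from underlying services can be pictured as a time-window join. The summary does not describe how IDR performs this correlation, so the following is only an illustrative sketch: the `Alarm` type, the alarm names, and the five-minute window are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alarm:
    source: str        # e.g. "application" or the name of an underlying service
    name: str
    fired_at: datetime


def correlate(app_alarms: list[Alarm], infra_alarms: list[Alarm],
              window: timedelta) -> dict[str, list[str]]:
    """Map each application alarm to the infrastructure alarms that fired within `window` of it."""
    related = {}
    for app in app_alarms:
        related[app.name] = [
            infra.name
            for infra in infra_alarms
            if abs(infra.fired_at - app.fired_at) <= window
        ]
    return related


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 30)
    app = [Alarm("application", "order-latency-high", t0)]
    infra = [
        Alarm("database", "replication-lag-high", t0 - timedelta(minutes=2)),
        Alarm("network", "packet-loss", t0 + timedelta(hours=3)),
    ]
    print(correlate(app, infra, window=timedelta(minutes=5)))
```

In this example the latency alarm is linked to the replication-lag alarm that fired two minutes earlier, while the unrelated network alarm three hours later is left out.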
Case Studies
- Fidelity shared two case studies of incidents, one before and one after adopting IDR:
  - Pre-IDR: A 9-hour triage process involving multiple teams and AWS support.
  - Post-IDR: A 1-hour triage process, with AWS engaging within 1 minute and identifying the issue in 42 minutes.
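The size of the improvement between the two case studies is simple arithmetic on the reported triage times. A minimal sketch, using only the 9-hour and 1-hour figures above (the function name is hypothetical):

```python
def percent_reduction(before_hours: float, after_hours: float) -> float:
    """Relative reduction in triage time, as a percentage of the original duration."""
    return (before_hours - after_hours) / before_hours * 100


if __name__ == "__main__":
    # Figures from the two case studies: 9 hours before IDR, 1 hour after.
    print(f"{percent_reduction(9, 1):.1f}% shorter triage")  # 88.9% shorter triage
```

These are two individual incidents, so the figure describes this pair of cases rather than an average across all incidents.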
Conclusion
- Fidelity's partnership with AWS and its adoption of IDR have significantly improved its ability to detect, respond to, and resolve issues with mission-critical workloads running on AWS.
- The session emphasizes the importance of proactive testing, observability, and incident management for enterprises running critical applications on the cloud.