Fidelity Investments: Building for mission-critical resilience (FSI318)

Building Resilient Applications at Scale using Chaos Engineering

The Enterprise Resilience Challenge

  • In 2019, Fidelity Investments migrated its core order management system to AWS to address new market dynamics and customer expectations, including:
    • Increased trade value and volume due to fractional trades and zero fees
    • Sophisticated and powerful trading devices for customers
    • Hybrid cloud environment with on-premises and cloud-based systems

Shifting Left in the Resilience Cycle

  • Fidelity faced challenges in managing cloud capacity, defining scaling policies, maintaining low latency, and ensuring rapid development with serverless and microservices.
  • The team realized that testing and operating in production alone was not enough: they needed to shift left and proactively understand their system's failure modes and dependencies.

Chaos Engineering as a Core Practice

  • Fidelity adopted a Failure Mode and Effects Analysis (FMEA) approach to systematically identify potential failure modes in their system and prioritize them based on severity, occurrence, and detection.
  • The FMEA results provided a roadmap for Fidelity's chaos engineering program, allowing them to design and execute experiments to understand their system's behavior under failure conditions.
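
To make the prioritization step concrete, here is a minimal sketch of the conventional FMEA scoring scheme, in which each failure mode receives a Risk Priority Number (RPN) equal to severity × occurrence × detection. The 1 to 10 scales, the `FailureMode` class, and the example failure modes and scores are illustrative assumptions; the talk summary does not say which scales or tooling Fidelity used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of an FMEA worksheet (hypothetical structure)."""
    name: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very rare, 10 = very frequent
    detection: int   # 1 = almost always detected, 10 = very hard to detect

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1..10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity * occurrence * detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative entries only; names and scores are made up for this sketch.
EXAMPLE_MODES = [
    FailureMode("availability zone loss", severity=10, occurrence=2, detection=2),
    FailureMode("downstream service timeout", severity=8, occurrence=5, detection=4),
    FailureMode("function cold-start latency spike", severity=5, occurrence=6, detection=3),
]

if __name__ == "__main__":
    for mode in prioritize(EXAMPLE_MODES):
        print(f"{mode.rpn:4d}  {mode.name}")
```

Under this scheme the highest-RPN failure modes would be the first candidates for chaos experiments. Note that a severe but rare and easily detected failure can rank below a moderate but frequent one, which is why some teams also review high-severity items separately.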

Fidelity's Chaos Engineering Platform: "Chaos Buffet"

  • Fidelity built a platform called "Chaos Buffet" to abstract and automate the execution of chaos engineering experiments across their applications.
  • The platform allowed development teams to easily configure and run chaos experiments, leveraging tools like AWS Fault Injection Service (FIS) and AWS Systems Manager (SSM).
  • Native Lambda fault injection was added to FIS, addressing a key need identified by Fidelity.
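
As a rough illustration of what a platform like this might abstract away, the sketch below builds a dictionary shaped like an FIS experiment template that adds latency to Lambda invocations. It is not Chaos Buffet's actual interface, which the summary does not describe. The function name, target and action labels, and default values are hypothetical, and the action ID and parameter names reflect my understanding of the FIS Lambda actions; they should be verified against the FIS actions reference before use.

```python
from typing import Any


def build_lambda_latency_template(
    function_arn: str,
    role_arn: str,
    stop_alarm_arn: str,
    delay_ms: int = 2000,
    invocation_percentage: int = 50,
    duration: str = "PT5M",  # ISO 8601 duration, here five minutes
) -> dict[str, Any]:
    """Build an FIS-style experiment template (hypothetical helper)."""
    if not 0 < invocation_percentage <= 100:
        raise ValueError("invocation_percentage must be in 1..100")
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")

    target_name = "targetFunctions"  # arbitrary label chosen for this sketch
    return {
        "description": f"Add {delay_ms} ms latency to Lambda invocations",
        "roleArn": role_arn,
        # Halt the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
        "targets": {
            target_name: {
                "resourceType": "aws:lambda:function",
                "resourceArns": [function_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "addLatency": {
                # Assumed action ID and parameter names; verify before use.
                "actionId": "aws:lambda:invocation-add-delay",
                "parameters": {  # FIS parameter values are strings
                    "duration": duration,
                    "invocationPercentage": str(invocation_percentage),
                    "startupDelayMilliseconds": str(delay_ms),
                },
                "targets": {"Functions": target_name},
            }
        },
        "tags": {"managed-by": "chaos-sketch"},
    }
```

A template built this way could in principle be passed to the boto3 FIS client, for example `boto3.client("fis").create_experiment_template(**template)`, and then started with `start_experiment`. Keeping template construction separate from the AWS call makes the configuration easy to review and unit-test without touching a live account.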

Learnings and Outcomes

  • Fidelity's chaos engineering efforts led to improvements in several areas:
    1. Surviving failures: Identifying and addressing gaps in retry logic, timeouts, and scaling policies.
    2. System performance: Optimizing memory allocations and scaling to match demand.
    3. Reducing impact: Improving observability and reducing mean time to resolution (MTTR).
  • The chaos engineering program achieved significant results, with a 100% increase in chaos coverage and a 50% reduction in MTTR.
  • Fidelity is now focused on expanding chaos engineering depth and breadth, as well as democratizing the practice across the organization.
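
The retry and timeout gaps mentioned in the first outcome are the kind of issue a latency or error injection experiment tends to expose. The following is a minimal sketch of one common remedy, bounded retries with exponential backoff and full jitter. It is not Fidelity's implementation; the function name, defaults, and the choice of retryable exceptions are assumptions made for illustration, and the wrapped operation is expected to enforce its own timeout.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.1,   # seconds
    max_delay: float = 2.0,    # cap on any single backoff, in seconds
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient failures with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the last error
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep after an outage.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))

    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_order_lookup() -> str:
        # Simulated dependency that times out twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TimeoutError("simulated timeout")
        return "order accepted"

    print(call_with_retries(flaky_order_lookup), "after", attempts["count"], "attempts")
```

Capping both the attempt count and the per-attempt delay bounds the worst-case added latency, which matters for latency-sensitive paths. Retries are only safe when the wrapped operation is idempotent.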

Conclusion

  • Chaos engineering is not just for the "tech giants"; it is a critical practice for any organization running mission-critical applications, especially in highly regulated industries like finance.
  • By proactively understanding and addressing their system's failure modes, Fidelity has been able to build more resilient applications and improve their overall operational efficiency.
  • The partnership between Fidelity and AWS in developing the Fault Injection Service demonstrates the value of collaboration in advancing chaos engineering practices.
