Fidelity Investments: Building for mission-critical resilience (FSI318)

The Enterprise Resilience Challenge

In 2019, Fidelity Investments migrated its core order management system to AWS to address new market dynamics and customer expectations, including:

Increased trade value and volume due to fractional trades and zero fees
Sophisticated and powerful trading devices for customers
Hybrid cloud environment with on-premises and cloud-based systems

Shifting Left in the Resilience Cycle

Fidelity faced challenges in managing cloud capacity, defining scaling policies, maintaining low latency, and sustaining rapid development with serverless and microservice architectures.

The team realized that testing and operating in production alone was not enough: they needed to shift left and proactively understand their system's failure modes and dependencies.

Chaos Engineering as a Core Practice

Fidelity adopted a Failure Mode and Effects Analysis (FMEA) approach to systematically identify potential failure modes in their system and prioritize them based on severity, occurrence, and detection.

This FMEA analysis provided a roadmap for Fidelity's chaos engineering program, allowing them to design and execute experiments to understand their system's behavior under failure conditions.
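As an illustration of the prioritization step, the sketch below ranks failure modes by risk priority number (RPN), the product of severity, occurrence, and detection scores that FMEA conventionally uses. The 1-10 scales, the failure modes, and their scores are hypothetical examples; the talk summary does not give Fidelity's actual scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost always detected) .. 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes and scores, for illustration only.
modes = [
    FailureMode("Database failover", severity=9, occurrence=2, detection=5),
    FailureMode("Downstream API timeout", severity=7, occurrence=6, detection=4),
    FailureMode("Lambda cold-start latency spike", severity=5, occurrence=8, detection=3),
]

ranked = prioritize(modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

The highest-RPN entries are natural candidates for the first chaos experiments, since they combine high impact with a high likelihood of occurring or of going undetected.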

Fidelity's Chaos Engineering Platform: "Chaos Buffet"

Fidelity built a platform called "Chaos Buffet" to abstract and automate the execution of chaos engineering experiments across their applications.

The platform allowed development teams to easily configure and run chaos experiments, using tools such as AWS Fault Injection Service (FIS) and AWS Systems Manager (SSM).

FIS gained native support for AWS Lambda fault injection, addressing a key need that Fidelity had identified.
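The summary does not describe Chaos Buffet's internals, so the following is only a sketch of what such an abstraction layer could look like: a small self-service spec that a development team fills in, translated into a request shaped like an FIS experiment template. The `LambdaDelayExperiment` class, the `to_fis_template` helper, and all ARNs are hypothetical. The field names and the `aws:lambda:invocation-add-delay` action ID reflect my understanding of the FIS template format and should be verified against the current AWS documentation before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LambdaDelayExperiment:
    """Hypothetical self-service spec a team might fill in."""
    name: str
    function_arn: str
    delay_ms: int
    duration: str        # ISO 8601 duration, e.g. "PT5M"
    stop_alarm_arn: str  # CloudWatch alarm that halts the experiment
    role_arn: str        # IAM role FIS assumes to run the experiment


def to_fis_template(spec: LambdaDelayExperiment) -> dict:
    """Translate the spec into an FIS-style experiment template request.

    Assumption: keys mirror the FIS experiment template shape; check them
    against the AWS documentation before sending to the service.
    """
    return {
        "description": spec.name,
        "roleArn": spec.role_arn,
        # Stop the experiment automatically if the guardrail alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": spec.stop_alarm_arn}
        ],
        "targets": {
            "functions": {
                "resourceType": "aws:lambda:function",
                "resourceArns": [spec.function_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "add-delay": {
                "actionId": "aws:lambda:invocation-add-delay",
                "parameters": {
                    "duration": spec.duration,
                    "invocationPercentage": "100",
                    "startupDelayMilliseconds": str(spec.delay_ms),
                },
                # Maps the action's target key to the target defined above.
                "targets": {"Functions": "functions"},
            }
        },
    }


# Placeholder ARNs, for illustration only.
spec = LambdaDelayExperiment(
    name="order-router latency injection",
    function_arn="arn:aws:lambda:us-east-1:123456789012:function:order-router",
    delay_ms=2000,
    duration="PT5M",
    stop_alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:order-errors",
    role_arn="arn:aws:iam::123456789012:role/chaos-experiment-role",
)
template = to_fis_template(spec)
```

Keeping the spec small and building the template in one place means teams declare only what they want to test, while the platform owns guardrails such as the stop condition. A dictionary like this could then be passed, together with a client token, to the `create_experiment_template` call of the boto3 `fis` client.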

Learnings and Outcomes

Fidelity's chaos engineering efforts led to improvements in several areas:

Surviving failures: Identifying and addressing gaps in retry logic, timeouts, and scaling policies.
System performance: Optimizing memory allocations and scaling to match demand.
Reducing impact: Improving observability and reducing mean time to resolution (MTTR).
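To make the retry and timeout point concrete, here is a minimal sketch of a bounded retry with capped exponential backoff and jitter, the kind of logic a chaos experiment that injects dependency timeouts would exercise. It is a generic pattern rather than Fidelity's implementation; the function name, defaults, and the simulated `flaky` dependency are assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on TimeoutError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure
            # Full jitter: wait a random time up to the capped exponential bound.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))
    raise AssertionError("unreachable")


# Simulated dependency that times out twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated dependency timeout")
    return "ok"


# Inject a no-op sleep so the example runs instantly.
result = call_with_retry(flaky, sleep=lambda _: None)
print(result, calls["count"])
```

Bounding the number of attempts and capping the delay keeps retries from amplifying load on a dependency that is already struggling, which is one of the gaps that fault injection tends to expose.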

The chaos engineering program reported a 100% increase in chaos coverage and a 50% reduction in MTTR.

Fidelity is now focused on expanding chaos engineering depth and breadth, as well as democratizing the practice across the organization.

Conclusion

Chaos engineering is not just for the "tech giants": it is a critical practice for any organization running mission-critical applications, especially in highly regulated industries like finance.

By proactively understanding and addressing its system's failure modes, Fidelity has been able to build more resilient applications and improve its overall operational efficiency.

The partnership between Fidelity and AWS in developing the Fault Injection Service demonstrates the value of collaboration in advancing chaos engineering practices.
