Historically, disaster recovery (DR) testing was done annually with a lot of preparation and effort, yet it often failed to produce true preparedness
With cloud adoption and more frequent application changes, annual DR testing is no longer sufficient
Many organizations lack confidence in their ability to fail over and fail back during an outage, fearing that the recovery itself will cause more disruption
Disaster Preparedness Tenets
Identify Risks: Conduct landscape analysis, failure mode and effects analysis (FMEA), and use AWS Resilience Analysis Framework to understand failure probabilities
Create Disaster Plan: Establish plans at the workload, business unit, and enterprise levels, similar to natural disaster preparedness
Build DR Kits: Leverage tools like Amazon Application Recovery Controller and AWS Elastic Disaster Recovery
Practice Disaster Plan: Regularly test failover and failback scenarios, including non-happy path testing
Monitor and Alert: Set up dashboards, KPIs, SLIs, and SLOs to detect issues and trigger failover
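To make the monitoring tenet concrete, here is a minimal Python sketch of an availability SLI checked against an SLO to decide whether to trigger failover. The `Window` shape, the 99.9% target, and the "three consecutive breaches" policy are illustrative assumptions, not something the talk prescribes.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one monitoring interval (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(windows: list[Window]) -> float:
    """Fraction of successful requests across the given windows."""
    total = sum(w.total for w in windows)
    failed = sum(w.failed for w in windows)
    # With no traffic there is no evidence of failure, so report 1.0.
    return 1.0 if total == 0 else (total - failed) / total


def should_fail_over(windows: list[Window], slo: float = 0.999,
                     consecutive_breaches: int = 3) -> bool:
    """Return True only when the last N windows each breach the SLO.

    Requiring consecutive breaches is an assumed policy that avoids
    failing over on a single noisy interval.
    """
    if len(windows) < consecutive_breaches:
        return False
    recent = windows[-consecutive_breaches:]
    return all(availability_sli([w]) < slo for w in recent)


healthy = [Window(1000, 0), Window(1000, 1), Window(1000, 0)]
degraded = [Window(1000, 0), Window(1000, 50), Window(1000, 80), Window(1000, 120)]
print(should_fail_over(healthy))
print(should_fail_over(degraded))
```

In practice the decision would more likely live in an alarm on the monitoring platform than in application code; the sketch only shows the arithmetic behind such a trigger.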
Fidelity's Disaster Recovery Practices
Fidelity has a long history of technology innovation, with 75% of applications running in the public cloud today
Fidelity takes a structured, multi-tiered approach to resilience testing:
Infrastructure fault testing: Ensure infrastructure can handle faults and respond as expected
Application-level fault testing: Validate applications can handle infrastructure faults
End-to-end application testing: Test complex workflows across business units and dependencies
Fidelity's applications often involve 64 microservices across 10 business units, requiring coordinated testing and recovery
Fidelity uses common recovery paths and tooling, like Amazon Application Recovery Controller (ARC), to enable consistent and measurable recovery
Fidelity regularly practices recovery scenarios, both in planned tests and during real-world events, to continuously improve
Key Learnings from October 2022 Outage
During the October 2022 outage, Fidelity was able to recover 2,000 applications in 9 minutes by having a well-practiced recovery plan
Coordination with third-party providers is crucial: understand their disaster recovery plans and integrate them into your own
Once a regional event is detected, stay in the alternate region until the all-clear is given, even if the primary region appears to be improving
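The "stay until the all-clear" rule can be sketched as a small gate that refuses failback on health signals alone. The `FailbackGate` class and its method names are hypothetical and only illustrate the policy; they are not Fidelity's tooling.

```python
from enum import Enum


class Region(Enum):
    PRIMARY = "primary"
    ALTERNATE = "alternate"


class FailbackGate:
    """Tracks the active region and blocks failback until an all-clear."""

    def __init__(self) -> None:
        self.active = Region.PRIMARY
        self.all_clear = True

    def regional_event_detected(self) -> None:
        # Fail over and withdraw any earlier all-clear.
        self.active = Region.ALTERNATE
        self.all_clear = False

    def declare_all_clear(self) -> None:
        # Assumed to be an explicit, human-approved signal.
        self.all_clear = True

    def request_failback(self, primary_looks_healthy: bool) -> bool:
        """Fail back only if healthy AND the all-clear has been declared."""
        if (self.active is Region.ALTERNATE and self.all_clear
                and primary_looks_healthy):
            self.active = Region.PRIMARY
            return True
        return False


gate = FailbackGate()
gate.regional_event_detected()
print(gate.request_failback(primary_looks_healthy=True))  # blocked: no all-clear yet
gate.declare_all_clear()
print(gate.request_failback(primary_looks_healthy=True))  # allowed
```

Separating "looks healthy" from "all-clear" is the design point: a partially recovered primary region can pass health checks while the underlying event is still in progress.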
AWS Resilience Services
Fault Injection Service (FIS)
Allows injecting controlled failures into applications to test resilience
Supports over 50 actions across compute, storage, networking, and databases
Provides curated failure scenarios, like power interruption and gray failures, to test specific failure modes
Can be used to test single accounts or across multiple accounts
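As an illustration of what a controlled failure looks like, the sketch below builds an FIS experiment template as a plain Python dictionary. It stops one tagged EC2 instance, restarts it after five minutes, and halts the experiment if a CloudWatch alarm fires. The ARNs, tag, and names are placeholders, and the code only prints the template without calling AWS. The field names follow the FIS experiment template format as I understand it and should be checked against the current API reference before use.

```python
import json


def build_stop_instances_template(role_arn: str, alarm_arn: str,
                                  tag_key: str, tag_value: str) -> dict:
    """Return an experiment template that stops one tagged EC2 instance."""
    return {
        "description": "Stop one tagged EC2 instance, restart after 5 minutes",
        "roleArn": role_arn,  # IAM role FIS assumes to run the actions
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # limit the blast radius to one instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        # Guardrail: the experiment stops if this alarm goes into ALARM state.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


# Placeholder identifiers, for illustration only.
template = build_stop_instances_template(
    role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:checkout-error-rate",
    tag_key="resilience-test",
    tag_value="enabled",
)
print(json.dumps(template, indent=2))
```

A dictionary of this shape could then be submitted through the console, the CLI, or an SDK; keeping the stop condition in the template means the guardrail travels with the experiment definition.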
Application Recovery Controller (ARC)
Fully managed multi-region recovery orchestration service
Replicates recovery plans to regional data planes, ensuring no dependencies on the region being left
Supports nested plans to coordinate recovery across multiple microservices and business units
Provides dashboards to monitor and manage recovery executions
Automatically evaluates plans every 30 minutes to ensure they can still be executed as expected
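The idea behind nested plans can be sketched with a small tree model in which a parent plan contains child plans and each child contains concrete steps. This is a hypothetical data model for illustration, not ARC's API or plan format, and the strictly sequential, depth-first ordering is an assumption.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Step:
    """A single recovery action, such as shifting traffic."""
    name: str


@dataclass
class Plan:
    """A recovery plan whose children are steps or nested plans."""
    name: str
    children: list[Union["Plan", Step]] = field(default_factory=list)


def execution_order(plan: Plan, prefix: str = "") -> list[str]:
    """Flatten a nested plan into an ordered list of fully qualified steps."""
    path = f"{prefix}/{plan.name}" if prefix else plan.name
    order: list[str] = []
    for child in plan.children:
        if isinstance(child, Plan):
            # Run the whole child plan before moving to the next sibling.
            order.extend(execution_order(child, path))
        else:
            order.append(f"{path}/{child.name}")
    return order


enterprise = Plan("enterprise", [
    Plan("payments", [Step("scale-up"), Step("shift-traffic")]),
    Plan("trading", [Step("promote-database"), Step("shift-traffic")]),
])
for line in execution_order(enterprise):
    print(line)
```

Each business unit keeps ownership of its own child plan, while the parent plan gives a single place to start and observe the overall recovery.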
Resilience Testing Workflow
Design resilience tests using FIS to inject failures
Identify key metrics and alarms to measure success criteria
Implement recovery procedures, such as using ARC's region switch plans
Execute tests, monitor results, and continuously improve
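The steps above can be tied together with a simple scorecard that turns one test run into measurable results. The sketch below assumes three timestamps are recorded per run (fault injected, alarm fired, recovery completed) and compares them with a detection target and a recovery time objective (RTO). The `GameDayRun` name, the fields, and the example targets are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class GameDayRun:
    """Timestamps, in seconds, recorded during one resilience test."""
    fault_injected_at: float
    alarm_fired_at: float
    recovery_completed_at: float


def score(run: GameDayRun, detect_target_s: float, rto_s: float) -> dict:
    """Compute detection and recovery times and compare them with targets."""
    time_to_detect = run.alarm_fired_at - run.fault_injected_at
    time_to_recover = run.recovery_completed_at - run.fault_injected_at
    return {
        "time_to_detect_s": time_to_detect,
        "time_to_recover_s": time_to_recover,
        "detected_in_time": time_to_detect <= detect_target_s,
        "met_rto": time_to_recover <= rto_s,
    }


# Example: alarm after 45 s, recovery complete after 9 minutes.
run = GameDayRun(fault_injected_at=0.0, alarm_fired_at=45.0,
                 recovery_completed_at=540.0)
result = score(run, detect_target_s=60.0, rto_s=600.0)
print(result)
```

Recording the same numbers for every run makes it possible to see whether practice is actually shortening detection and recovery over time.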
Conclusion
Disaster preparedness and resilience testing are essential for mission-critical applications, especially in the cloud
Fidelity's structured, multi-tiered approach to resilience testing and recovery has enabled them to respond effectively to real-world outages
AWS provides tools like Fault Injection Service and Application Recovery Controller to help organizations build, test, and automate their multi-region disaster recovery strategies