AWS re:Invent 2025 - Multi-Region disaster recovery & resilience testing (feat. Fidelity) (COP358)

Multi-Region Disaster Recovery & Resilience Testing

Disaster Recovery and Preparedness

  • Historically, disaster recovery (DR) testing was an annual exercise that took extensive preparation and effort, yet often did not produce true preparedness
  • With cloud and more frequent application changes, annual DR testing is no longer sufficient
  • Many organizations lack confidence in their ability to fail over and fail back during an outage, fearing that the failover itself will cause more disruption

Disaster Preparedness Tenets

  1. Identify Risks: Conduct landscape analysis and failure mode and effects analysis (FMEA), and use the AWS Resilience Analysis Framework to understand failure probabilities
  2. Create Disaster Plan: Establish plans at the workload, business unit, and enterprise levels, similar to natural disaster preparedness
  3. Build DR Kits: Leverage tools like Amazon Application Recovery Controller and AWS Elastic Disaster Recovery
  4. Practice Disaster Plan: Regularly test failover and failback scenarios, including non-happy-path testing
  5. Monitoring and Alerting: Set up dashboards, KPIs, SLIs, and SLOs to detect issues and trigger failover
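
The risk-identification step can be made concrete with a small ranking exercise. The sketch below uses the conventional FMEA risk priority number (severity × occurrence × detection, each scored 1 to 10). This is a general FMEA convention, not something the talk or the AWS Resilience Analysis Framework is known to prescribe, and the failure modes and scores are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (effectively undetectable)

    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError(f"{self.name}: scores must be in 1..10")
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical failure modes and scores, for illustration only.
EXAMPLE_MODES = [
    FailureMode("Regional control plane impairment", 9, 2, 4),
    FailureMode("Single-AZ power interruption", 6, 3, 2),
    FailureMode("Gray failure in a dependency", 7, 4, 8),
]

if __name__ == "__main__":
    for mode in rank_by_risk(EXAMPLE_MODES):
        print(f"{mode.rpn():4d}  {mode.name}")
```

With these assumed scores, the gray failure ranks first because it is hard to detect, which is one reason to give it priority when choosing which scenarios to practice.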

Fidelity's Disaster Recovery Practices

  • Fidelity has a long history of technology innovation; today, 75% of its applications run in the public cloud
  • Fidelity takes a structured, multi-tiered approach to resilience testing:
    1. Infrastructure fault testing: Ensure infrastructure can handle faults and respond as expected
    2. Application-level fault testing: Validate applications can handle infrastructure faults
    3. End-to-end application testing: Test complex workflows across business units and dependencies
  • A single Fidelity application workflow can involve 64 microservices across 10 business units, which requires coordinated testing and recovery
  • Fidelity uses common recovery paths and tooling, like Amazon Application Recovery Controller (ARC), to enable consistent and measurable recovery
  • Fidelity regularly practices recovery scenarios, both in planned tests and during real-world events, to continuously improve

Key Learnings from October 2022 Outage

  • During the October 2022 outage, Fidelity was able to recover 2,000 applications in 9 minutes thanks to a well-practiced recovery plan
  • Coordination with third-party providers is crucial: understand their disaster recovery plans and integrate them into your own
  • Once a regional event is detected and traffic has moved, stay in the alternate region until the all-clear is given, even if the primary region appears to be improving

AWS Resilience Services

Fault Injection Service (FIS)

  • Injects controlled failures into applications to test their resilience
  • Supports over 50 actions across compute, storage, networking, and databases
  • Provides curated failure scenarios, like power interruption and gray failures, to test specific failure modes
  • Can run experiments within a single account or across multiple accounts
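
As a rough illustration of what a controlled failure looks like in practice, the sketch below builds the body of an FIS experiment template as a plain Python dict. It makes no AWS calls. The field names, the `aws:ec2:stop-instances` action, the `aws:ec2:instance` resource type, and the `PERCENT(n)` selection mode follow the FIS template format as I understand it and should be checked against the current FIS documentation. The function name, tag, and ARNs are hypothetical placeholders.

```python
import json


def build_stop_instances_template(role_arn, alarm_arn, tag_key, tag_value,
                                  percent=50, restart_after="PT5M"):
    """Build an FIS experiment template body that stops a share of tagged
    EC2 instances and restarts them after an ISO 8601 duration.

    The CloudWatch alarm acts as a stop condition: if it fires, the
    experiment is halted.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be in 1..100")
    return {
        "description": f"Stop {percent}% of instances tagged {tag_key}={tag_value}",
        "roleArn": role_arn,
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": restart_after},
                # Maps the action's "Instances" slot to the target defined above.
                "targets": {"Instances": "taggedInstances"},
            }
        },
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs and tag; replace with real values before use.
    template = build_stop_instances_template(
        role_arn="arn:aws:iam::111122223333:role/ExampleFisRole",
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:ExampleErrorRate",
        tag_key="env",
        tag_value="resilience-test",
    )
    print(json.dumps(template, indent=2))
```

A dict of this shape is intended to be usable as keyword arguments to the `create_experiment_template` call on the boto3 `fis` client, but that wiring is left out here so the sketch runs without credentials.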

Application Recovery Controller (ARC)

  • Fully managed multi-region recovery orchestration service
  • Replicates recovery plans to regional data planes, so that executing a recovery does not depend on the region being evacuated
  • Supports nested plans to coordinate recovery across multiple microservices and business units
  • Provides dashboards to monitor and manage recovery executions
  • Automatically evaluates plans every 30 minutes to ensure they can still be executed as expected
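
The idea of nested plans can be modeled in a few lines. The sketch below is a conceptual model only, not the ARC API: the `Plan` and `Step` classes, the plan names, and the stop-at-first-failure behavior are all assumptions made for illustration. It shows a parent plan running child plans in order and collecting a log of each step's result.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success


@dataclass
class Plan:
    name: str
    items: List[Union["Plan", Step]] = field(default_factory=list)

    def execute(self, log=None) -> Tuple[bool, list]:
        """Run steps and child plans in order; stop at the first failure.

        Returns (succeeded, log), where log is a list of
        ("plan/step", succeeded) tuples in execution order.
        """
        log = [] if log is None else log
        for item in self.items:
            if isinstance(item, Plan):
                ok, _ = item.execute(log)  # child appends to the same log
            else:
                ok = item.action()
                log.append((f"{self.name}/{item.name}", ok))
            if not ok:
                return False, log
        return True, log


if __name__ == "__main__":
    # Hypothetical business-unit plans nested under an enterprise plan.
    payments = Plan("payments", [
        Step("scale-up-standby", lambda: True),
        Step("shift-traffic", lambda: True),
    ])
    trading = Plan("trading", [Step("promote-database", lambda: True)])
    enterprise = Plan("enterprise", [payments, trading])

    ok, log = enterprise.execute()
    for entry, succeeded in log:
        print(f"{'OK  ' if succeeded else 'FAIL'} {entry}")
```

Stopping at the first failure is a deliberate simplification; a real orchestration may run children in parallel or allow manual approval steps between them.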

Key Resilience Testing Approach

  1. Define resilience objectives (e.g., validate multi-region RTO)
  2. Design resilience tests using FIS to inject failures
  3. Identify key metrics and alarms to measure success criteria
  4. Implement recovery procedures, such as using ARC's region switch plans
  5. Execute tests, monitor results, and continuously improve
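
To make the success criteria in this approach measurable, a test run needs an objective check at the end. The sketch below compares an observed recovery time against an RTO objective. The function name, the timestamps, and the 15-minute objective are hypothetical; in a real test the two timestamps would come from the fault injection start and from the alarm or metric that marks the workload as healthy in the alternate region.

```python
from datetime import datetime, timedelta, timezone


def evaluate_rto(fault_injected_at, recovered_at, objective):
    """Compare observed recovery time with the RTO objective.

    Returns a dict with the observed duration, the objective, whether the
    objective was met, and the remaining margin (negative if missed).
    """
    if recovered_at < fault_injected_at:
        raise ValueError("recovery cannot precede fault injection")
    observed = recovered_at - fault_injected_at
    return {
        "observed": observed,
        "objective": objective,
        "met": observed <= objective,
        "margin": objective - observed,
    }


if __name__ == "__main__":
    # Hypothetical timestamps for a single test run.
    start = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    end = start + timedelta(minutes=11, seconds=30)
    result = evaluate_rto(start, end, timedelta(minutes=15))
    print("RTO met:", result["met"], "| margin:", result["margin"])
```

Recording the margin for every run, not only pass or fail, makes it easier to see whether recovery time is drifting toward the objective as the application changes.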

Conclusion

  • Disaster preparedness and resilience testing are essential for mission-critical applications, especially in the cloud
  • Fidelity's structured, multi-tiered approach to resilience testing and recovery has enabled it to respond effectively to real-world outages
  • AWS provides tools like Fault Injection Service and Application Recovery Controller to help organizations build, test, and automate their multi-region disaster recovery strategies
