AWS re:Invent 2025 - AI-powered resilience testing and disaster recovery (COP420)

Leveraging AI for Resilience Testing and Disaster Recovery

Resilience Lifecycle Framework

The resilience lifecycle framework consists of the following stages:
- Set Objectives: Define recovery time objective (RTO), recovery point objective (RPO), and service level agreements (SLAs)
- Design & Implement: Use tools like Kiro and the Kiro CLI to build resilient infrastructure and applications
- Evaluate & Test: Discover potential failure scenarios and validate resilience through controlled experiments
- Operate: Use tools like Amazon CloudWatch investigations to monitor and respond to incidents
- Learn & Respond: Analyze past incidents and create automated tests to prevent recurrence
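As a rough illustration of how the objectives from the first stage can be turned into a pass/fail check for the later test stages, here is a minimal Python sketch. The class name, function name, and the 15-minute and 5-minute targets are assumptions made for this example; they do not come from the talk.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ResilienceObjectives:
    """Targets from the Set Objectives stage (values below are illustrative)."""
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


def evaluate_recovery(objectives, downtime, data_loss_window):
    """Return a list of objective violations; an empty list means the test passed."""
    violations = []
    if downtime > objectives.rto:
        violations.append(f"RTO exceeded: {downtime} > {objectives.rto}")
    if data_loss_window > objectives.rpo:
        violations.append(f"RPO exceeded: {data_loss_window} > {objectives.rpo}")
    return violations


# Hypothetical targets and a hypothetical measured recovery from one experiment.
objectives = ResilienceObjectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=5))
result = evaluate_recovery(objectives, timedelta(minutes=22), timedelta(minutes=2))
print(result)  # ['RTO exceeded: 0:22:00 > 0:15:00']
```

Keeping the objectives as data makes it possible to reuse the same check after every experiment instead of judging each result by hand.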

Discovering Failure Scenarios with Generative AI

Inventory Agent

The inventory agent catalogs the software, configurations, services, and versions installed on online EC2 instances that report to AWS Systems Manager.
Key capabilities:
- Focuses only on business-critical applications, ignoring patches and updates
- Avoids disrupting protected services during testing
- Provides a detailed inventory of the application stack, including dependencies

Hypothesis Generation

The inventory agent's findings are used to generate hypotheses about potential failure scenarios.
The agent considers factors like the server's primary role, active services, and dependencies to identify likely points of failure.
The resulting hypotheses tie each part of the application architecture to the weaknesses most likely to affect it.
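In the talk this step is performed by a generative AI agent. The rule-based Python sketch below is only a stand-in that shows the shape of the input (role, active services, dependencies) and of the output (one hypothesis per likely point of failure). All names and the example server are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServerProfile:
    role: str                      # primary role of the server, e.g. "web"
    active_services: list          # names of running services
    dependencies: dict = field(default_factory=dict)  # service -> what it depends on


def generate_hypotheses(profile):
    """Produce one hypothesis per service and one per dependency of that service."""
    hypotheses = []
    for service in profile.active_services:
        hypotheses.append(
            f"If {service} stops on the {profile.role} server, "
            f"the application degrades until {service} is restored."
        )
        for dependency in profile.dependencies.get(service, []):
            hypotheses.append(
                f"If {dependency} becomes unreachable, {service} fails or times out."
            )
    return hypotheses


# Illustrative profile built from inventory findings.
profile = ServerProfile(
    role="web",
    active_services=["nginx"],
    dependencies={"nginx": ["postgres:5432"]},
)
hypotheses = generate_hypotheses(profile)
for line in hypotheses:
    print(line)
```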

Automation Document Generation

Based on the discovered failure scenarios, the automation agent generates AWS Systems Manager documents to facilitate controlled experiments.
Key design principles:
- Ensure state restoration and idempotency to avoid unintended impacts
- Implement preconditions to validate the target environment
- Provide logging and failure handling for observability
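To make these design principles concrete, the Python sketch below builds a Systems Manager Command document (schema version 2.2, `aws:runShellScript` plugin) that stops a service for a fixed time and then restores it. The function name, step name, and the choice of a systemd service fault are assumptions for illustration; the documents produced by the automation agent in the talk may look quite different.

```python
import json
import re


def build_stop_service_document(service_name, duration_seconds):
    """Return an SSM Command document (as a dict) that stops a service temporarily."""
    # Reject names that could inject extra shell commands into the script.
    if not re.fullmatch(r"[A-Za-z0-9_.@-]+", service_name):
        raise ValueError(f"unsafe service name: {service_name!r}")
    duration = int(duration_seconds)

    script = [
        "set -u",
        f"SERVICE='{service_name}'",
        # Precondition and idempotency: do nothing unless the service is running.
        'if ! systemctl is-active --quiet "$SERVICE"; then',
        '  echo "precondition failed: $SERVICE is not active; nothing changed"',
        "  exit 0",
        "fi",
        # State restoration: start the service again however the script exits.
        "trap 'systemctl start \"$SERVICE\"; echo \"restored $SERVICE\"' EXIT",
        # Logging: each action is echoed so it appears in the command output.
        'echo "stopping $SERVICE"',
        'systemctl stop "$SERVICE"',
        f"sleep {duration}",
    ]

    return {
        "schemaVersion": "2.2",
        "description": f"Stop {service_name} for {duration}s, then restore it",
        "mainSteps": [
            {
                "action": "aws:runShellScript",
                "name": "stopServiceTemporarily",
                # Document-level precondition: run this step on Linux targets only.
                "precondition": {"StringEquals": ["platformType", "Linux"]},
                "inputs": {
                    "timeoutSeconds": str(duration + 60),
                    "runCommand": script,
                },
            }
        ],
    }


document = build_stop_service_document("nginx", 30)
print(json.dumps(document, indent=2))
```

The `trap ... EXIT` line is what provides state restoration in this sketch: the service is restarted whether the script finishes normally or fails partway through.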

Validating Known Failure Modes with AI

Leveraging Past Incidents

The team can provide the agent with details of past incidents, including root cause analysis and timelines.
The agent can then use this information to recreate the failure scenarios and validate that the implemented resilience measures are effective.
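One way to picture this, as a sketch only, is to turn each incident record into a repeatable experiment definition. The field names and the pass criteria (a replay must not be detected or recovered more slowly than the original incident) are assumptions for this example, and the incident data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    root_cause: str               # taken from the root cause analysis
    detected_after_minutes: int   # taken from the incident timeline
    recovered_after_minutes: int  # taken from the incident timeline


def to_regression_experiment(incident):
    """Describe an experiment that replays the incident's root cause."""
    return {
        "name": f"replay-{incident.incident_id}",
        "fault_to_inject": incident.root_cause,
        "pass_criteria": {
            "max_detection_minutes": incident.detected_after_minutes,
            "max_recovery_minutes": incident.recovered_after_minutes,
        },
    }


def replay_passed(experiment, detection_minutes, recovery_minutes):
    """True when the replay was handled at least as well as the original incident."""
    criteria = experiment["pass_criteria"]
    return (
        detection_minutes <= criteria["max_detection_minutes"]
        and recovery_minutes <= criteria["max_recovery_minutes"]
    )


# Hypothetical incident and replay measurements.
incident = Incident("2024-017", "primary database disk full", 12, 95)
experiment = to_regression_experiment(incident)
print(experiment["name"], replay_passed(experiment, 3, 20))  # replay-2024-017 True
```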

Comprehensive Testing in Production-like Environments

The agent can be used to test in production-like environments, ensuring that the resilience measures work as expected.
This includes validating disaster recovery (DR) capabilities and the team's ability to observe and respond to the simulated incidents.

Accelerating Resilience Testing with AI

Multi-Agent Chaos Engineering

The presentation outlines a vision for a multi-agent system that can:
- Generate hypotheses about potential failure scenarios
- Prioritize the most likely and impactful scenarios
- Design and execute controlled experiments
- Evaluate the effectiveness of the resilience measures
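The prioritization step can be approximated with a simple risk score. The sketch below ranks scenarios by likelihood multiplied by impact; the 1 to 5 scales, the tie-break on impact, and the example scenarios are assumptions, not the scoring method used in the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent), hypothetical scale
    impact: int      # 1 (minor) to 5 (severe), hypothetical scale


def prioritize(scenarios, limit=None):
    """Order scenarios by likelihood x impact, breaking ties in favor of higher impact."""
    ranked = sorted(
        scenarios,
        key=lambda s: (s.likelihood * s.impact, s.impact),
        reverse=True,
    )
    return ranked if limit is None else ranked[:limit]


# Illustrative scenarios with invented scores.
scenarios = [
    Scenario("database failover", likelihood=2, impact=5),
    Scenario("disk full on web tier", likelihood=4, impact=3),
    Scenario("DNS resolution failure", likelihood=1, impact=4),
]
print([s.name for s in prioritize(scenarios, limit=2)])
# ['disk full on web tier', 'database failover']
```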

Key Takeaways

AI can significantly accelerate resilience testing, reducing the time required from weeks to days.
By automating the discovery, hypothesis generation, and experiment design, teams can focus on validating and iterating on the resilience measures.
The collaboration between AI and human experts ensures that the resilience testing is comprehensive, controlled, and aligned with business objectives.

Implementing AI-Powered Resilience Testing

The presentation mentions a "Multi-Agent Chaos Engineering" solution available from AWS that implements the concepts discussed.
It also points to the AWS Resilience Analysis Framework and AWS Fault Isolation Boundaries as resources for further guidance on resilience planning and design.
