AWS re:Invent 2025 - Chaos & Continuity: Using Gen AI to improve humanitarian workload resilience
Leveraging Generative AI to Improve Workload Resilience in the Cloud
Understanding the Importance of Resilience
The presentation emphasizes the critical need for resilience in cloud-based workloads, especially for humanitarian and public sector organizations.
The speaker cites the famous quote from Amazon CTO Werner Vogels, "Everything fails, all the time," and explains that planning for failure is what keeps workloads operational.
Humanitarian disasters and emergencies are becoming more frequent, and organizations need their systems to be reliable and available when they are needed the most.
The Five Principles of Cloud Resilience
The speaker outlines five key principles to consider when building resilient cloud-based workloads:
Single Points of Failure: Ensuring redundancy and avoiding single points of failure in the architecture.
Excessive Load: Ensuring the system has enough resources to handle expected and unexpected loads.
Excessive Latency: Designing the system to handle latency in dependencies and downstream components.
Misconfiguration and Bugs: Implementing robust CI/CD processes and automation to prevent manual errors and ensure consistent deployments.
Shared Fate: Reducing the blast radius of failures by avoiding tight coupling and shared dependencies between workloads.
The speaker introduces the acronym "SEEMS", formed from the initials of the five principles, to help remember them.
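As an illustration of the Excessive Latency principle above, the following is a minimal sketch of one common mitigation: bounded retries with capped exponential backoff and jitter around a call to a slow dependency. It is not taken from the talk. The function name `call_with_retries` and all of its parameters are hypothetical, and the sketch assumes the wrapped operation enforces its own timeout and raises `TimeoutError` when the dependency is too slow.

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay_s=0.1,
                      max_delay_s=2.0, sleep=time.sleep):
    """Call operation() and retry on TimeoutError.

    Hypothetical helper: waits a random time between 0 and a capped,
    exponentially growing delay before each retry ("full jitter").
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            # Out of attempts: surface the failure to the caller.
            if attempt == attempts - 1:
                raise
            # Delay ceiling doubles each attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

Bounding the number of attempts matters as much as the backoff: unbounded retries against a struggling dependency add load and can turn a latency problem into an Excessive Load problem.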
Leveraging Generative AI for Resilience Assessment
The speaker demonstrates a custom "Agentic Resilience Advisor" built using generative AI and the Strands SDK.
This advisor can analyze a specific workload running in the speaker's AWS account and assess its resilience across the five SEEMS principles.
The advisor uses various tools, including an "AWS Use" tool to inspect the resources, a "Calculate Letter Grade" tool to provide a simple resilience score, and the AWS documentation MCP server to retrieve relevant information.
The advisor can adjust the resilience assessment based on changes to the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for the workload.
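To make the idea of a letter-grade tool that reacts to RTO and RPO more concrete, here is a minimal sketch in plain Python. The talk summary does not describe how the actual "Calculate Letter Grade" tool works, so everything here is an assumption for illustration: the function name, the 0 to 100 per-category scores, the grade thresholds, and the rule that deducts points when an objective is tighter than one hour. Registering such a function as an agent tool in the Strands SDK is left out.

```python
# Hypothetical category keys, one per principle listed in the talk.
CATEGORIES = (
    "single_points_of_failure",
    "excessive_load",
    "excessive_latency",
    "misconfiguration_and_bugs",
    "shared_fate",
)


def calculate_letter_grade(scores: dict, rto_minutes: float,
                           rpo_minutes: float) -> str:
    """Map per-category scores (0-100) to a letter grade from A to F."""
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing category scores: {missing}")
    average = sum(scores[c] for c in CATEGORIES) / len(CATEGORIES)
    # Assumed rule: a tighter objective is harder to meet, so deduct
    # 10 points for each objective (RTO, RPO) shorter than one hour.
    penalty = 10 * sum(1 for m in (rto_minutes, rpo_minutes) if m < 60)
    adjusted = max(0.0, average - penalty)
    for threshold, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if adjusted >= threshold:
            return letter
    return "F"
```

Under these assumed rules, a workload scoring 85 in every category would be graded B with a four-hour RTO and RPO, and the same architecture would drop to C if the RTO were tightened to 30 minutes.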
Improving Observability and Incident Response
The speaker showcases how the Agentic Resilience Advisor can provide specific recommendations to improve the observability of the workload, such as:
Implementing comprehensive CloudWatch alarms and dashboards
Enabling X-Ray for distributed tracing
Setting up enhanced logging and log insights
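As one example of the first recommendation, the sketch below builds the arguments for a CloudWatch alarm on AWS Lambda errors. The summary does not say which services the demonstrated workload uses, so the Lambda metric, the threshold values, and the helper name `build_error_alarm` are assumptions chosen for illustration. The dictionary keys are parameters of the CloudWatch `put_metric_alarm` API.

```python
def build_error_alarm(function_name: str, sns_topic_arn: str) -> dict:
    """Return keyword arguments for CloudWatch put_metric_alarm.

    Hypothetical helper: alarms when a Lambda function reports at least
    one error per minute for three consecutive minutes.
    """
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,             # seconds per evaluation window
        "EvaluationPeriods": 3,   # consecutive windows that must breach
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not an error
        "AlarmActions": [sns_topic_arn],     # notify this SNS topic
    }


# With boto3 installed and AWS credentials configured, the alarm could be
# created like this (function name and topic ARN are placeholders):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **build_error_alarm("my-function", "arn:aws:sns:...:ops-alerts"))
```

Keeping the alarm definition in a plain function makes it easy to review and test before anything is created in the account.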
The advisor can also generate a detailed runbook for recovering from operational incidents, including recovery procedures for specific components and validation steps to ensure the workload is fully restored.
Practical Applications and Next Steps
The speaker provides QR codes for attendees to access the Strands SDK, the Resilience Advisor code, and a session on how to build this solution.
The speaker encourages attendees from nonprofit organizations to scan additional QR codes to connect with their AWS account team and learn about other relevant sessions at the conference.
The speaker emphasizes the importance of completing the session survey to provide feedback and help improve the content.
Key Takeaways
Resilience is critical for cloud-based workloads, especially in the public sector and humanitarian aid domains, where failures can have severe consequences.
The five SEEMS principles provide a comprehensive framework for designing and assessing the resilience of cloud-based systems.
Generative AI can be leveraged to automate the resilience assessment process, identify areas for improvement, and generate actionable runbooks for incident response.
Improving observability through comprehensive monitoring, tracing, and logging is a key step in building resilient systems.
The Agentic Resilience Advisor demonstrated in the presentation is a practical example of how organizations can leverage AI to enhance the resilience of their cloud-based workloads.