AWS re:Invent 2025 - Using AI to improve humanitarian workload resilience (AIM336)

Leveraging AI to Enhance Workload Resilience

Importance of Resilient Humanitarian Workloads

Humanitarian workloads, designed to help people in need during crises, require high reliability and resilience

Failures like timeouts or 404 errors are unacceptable when people are calling for emergency assistance

Building resilient systems is challenging, requiring attention to high availability, disaster recovery, and continuous improvement

Key Resilience Concepts

High Availability: Ability to withstand common failures like server outages or network interruptions

Disaster Recovery: Ability to fail over to a secondary site and recover from major disruptions

Continuous Improvement: Iteratively enhancing resilience over time based on learnings from failures

Resilience Failure Categories

Single Points of Failure: Single components or systems that, if they fail, bring down the entire application

Excessive Load: Inability to scale and handle sudden increases in traffic

Excessive Latency: Timeouts and failures due to unpredictable network latency

Misconfiguration and Bugs: Issues caused by code defects or configuration errors

Shared Fate: Common dependencies that, when they fail, impact multiple applications

Leveraging Agentic AI

Building an AI agent with agency to autonomously investigate and reason about workload resilience

Utilizing Amazon Bedrock foundation model and existing AWS tools like Cloud Control API and AWS documentation

Agent Architecture and Functionality

Foundation Model: Using Amazon Bedrock for reasoning and inference

Cloud Control API: Interacting with AWS resources to assess the current state

AWS Documentation MCP: Retrieving relevant documentation to provide recommendations

Detailed Prompt Engineering

Separating the agent's prompt into a separate file for easier maintenance and iteration

Defining a comprehensive prompt covering:

Agent persona and expertise
Grading philosophy and scale
Resilience definitions and requirements
Specific instructions for tool usage and output formatting

Demonstration and Results

Agent analyzes a "food agent" workload with 24-hour RTO and 12-hour RPO

Provides detailed resilience assessments, grading the workload across key areas

Adjusts recommendations when RTO/RPO requirements change

Identifies observability improvements and generates a recovery runbook

Key Takeaways

Agentic AI can significantly simplify the process of building resilient systems by automating analysis and providing actionable guidance

Comprehensive prompt engineering is crucial to ensure the agent provides consistent, reliable, and contextually appropriate recommendations

Integrating the agent with existing AWS tools and documentation enables it to leverage a wealth of resources to support its analysis

Deploying the agent on a platform like Bedrock Agent Core can make it accessible and usable across an organization

Next Steps and Resources

Explore the Strands Agent documentation for more information on building AI-powered agents

Check out the nonprofit sample code on GitHub to get hands-on with the resilience agent

Reach out to your AWS account team for assistance in implementing similar solutions

Utilize AWS Skill Builder resources for further learning on resilience, AI, and AWS services

AWS re:Invent 2025 - Using AI to improve humanitarian workload resilience (AIM336)

Leveraging AI to Enhance Workload Resilience

Importance of Resilient Humanitarian Workloads

Key Resilience Concepts

Resilience Failure Categories

Leveraging Agentic AI

Agent Architecture and Functionality

Detailed Prompt Engineering

Demonstration and Results

Key Takeaways

Next Steps and Resources

Your Digital Journey deserves a great story.

Build one with us.

Headquarters

Delivery Centre

AWS re:Invent 2025 - Using AI to improve humanitarian workload resilience (AIM336)

Leveraging AI to Enhance Workload Resilience

Importance of Resilient Humanitarian Workloads

Key Resilience Concepts

Resilience Failure Categories

Leveraging Agentic AI

Agent Architecture and Functionality

Detailed Prompt Engineering

Demonstration and Results

Key Takeaways

Next Steps and Resources

Your Digital Journey deserves a great story.

Build one with us.

This website stores cookies on your computer.