AWS re:Invent 2025 - Using AI to improve humanitarian workload resilience (AIM336)
Leveraging AI to Enhance Workload Resilience
Importance of Resilient Humanitarian Workloads
Humanitarian workloads, designed to help people in need during crises, require high reliability and resilience
Failures like timeouts or 404 errors are unacceptable when people are calling for emergency assistance
Building resilient systems is challenging, requiring attention to high availability, disaster recovery, and continuous improvement
Key Resilience Concepts
High Availability: Ability to withstand common failures like server outages or network interruptions
Disaster Recovery: Ability to fail over to a secondary site and recover from major disruptions
Continuous Improvement: Iteratively enhancing resilience over time based on learnings from failures
Resilience Failure Categories
Single Points of Failure: Single components or systems that, if they fail, bring down the entire application
Excessive Load: Inability to scale and handle sudden increases in traffic
Excessive Latency: Timeouts and failures due to unpredictable network latency
Misconfiguration and Bugs: Issues caused by code defects or configuration errors
Shared Fate: Common dependencies that, when they fail, impact multiple applications
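The five categories above lend themselves to a small data model that an assessment can file its findings under. The following is a minimal sketch, not code from the talk: the `Finding` record, its fields, and `group_by_category` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    """The five resilience failure categories listed above."""
    SINGLE_POINT_OF_FAILURE = "Single Points of Failure"
    EXCESSIVE_LOAD = "Excessive Load"
    EXCESSIVE_LATENCY = "Excessive Latency"
    MISCONFIGURATION_AND_BUGS = "Misconfiguration and Bugs"
    SHARED_FATE = "Shared Fate"


@dataclass(frozen=True)
class Finding:
    """One observation about a workload (hypothetical record shape)."""
    category: FailureCategory
    resource: str  # illustrative resource label, not a real identifier
    detail: str


def group_by_category(findings):
    """Return a dict with every category as a key, so empty ones stay visible."""
    grouped = {category: [] for category in FailureCategory}
    for finding in findings:
        grouped[finding.category].append(finding)
    return grouped
```

Keeping empty categories in the result makes it obvious which failure modes were checked and came back clean, rather than silently omitted.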
Leveraging Agentic AI
Building an AI agent that autonomously investigates and reasons about workload resilience
Using an Amazon Bedrock foundation model together with existing AWS tools such as the Cloud Control API and the AWS documentation
Agent Architecture and Functionality
Foundation Model: Using Amazon Bedrock for reasoning and inference
Cloud Control API: Interacting with AWS resources to assess the current state
AWS Documentation MCP: Retrieving relevant documentation to provide recommendations
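To make the "assess the current state" step more concrete, here is a rough sketch of how a tool might read resource state through the Cloud Control API. The summary does not say how the talk's agent invokes the API, so this is an assumption: the helper `list_resource_state` is hypothetical, and it expects a client that behaves like `boto3.client("cloudcontrol")`, whose `list_resources` call returns `ResourceDescriptions` entries with an `Identifier` and a JSON-encoded `Properties` string.

```python
import json


def list_resource_state(client, type_name):
    """Return {identifier: properties} for every resource of one type.

    `client` is assumed to behave like boto3.client("cloudcontrol").
    `type_name` is a CloudFormation type such as "AWS::RDS::DBInstance".
    """
    resources = {}
    kwargs = {"TypeName": type_name}
    while True:
        page = client.list_resources(**kwargs)
        for description in page.get("ResourceDescriptions", []):
            # Cloud Control returns Properties as a JSON-encoded string.
            properties = json.loads(description.get("Properties") or "{}")
            resources[description["Identifier"]] = properties
        token = page.get("NextToken")
        if not token:
            return resources
        # Follow pagination until the service stops returning a token.
        kwargs["NextToken"] = token
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, and leaves region and credential handling to the caller.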
Detailed Prompt Engineering
Keeping the agent's prompt in its own file for easier maintenance and iteration
Defining a comprehensive prompt covering:
Agent persona and expertise
Grading philosophy and scale
Resilience definitions and requirements
Specific instructions for tool usage and output formatting
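One way to keep the prompt in its own file is sketched below. It is an illustration under stated assumptions rather than the talk's implementation: the section headings, the `{rto_hours}` and `{rpo_hours}` placeholders, and the `load_system_prompt` helper are all hypothetical.

```python
from pathlib import Path

# Hypothetical headings mirroring the four prompt areas listed above.
REQUIRED_SECTIONS = (
    "Persona",
    "Grading",
    "Resilience definitions",
    "Tool usage and output",
)


def load_system_prompt(path, rto_hours, rpo_hours):
    """Read the prompt file, check its sections, and fill in the objectives."""
    text = Path(path).read_text(encoding="utf-8")
    missing = [name for name in REQUIRED_SECTIONS if f"## {name}" not in text]
    if missing:
        raise ValueError(f"prompt file is missing sections: {', '.join(missing)}")
    # Plain replace avoids str.format tripping over literal braces,
    # for example JSON output examples embedded in the prompt.
    return (
        text.replace("{rto_hours}", str(rto_hours))
        .replace("{rpo_hours}", str(rpo_hours))
    )
```

Checking for the expected sections at load time catches an accidentally truncated or half-edited prompt before the agent runs with it.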
Demonstration and Results
Agent analyzes a "food agent" workload with 24-hour RTO and 12-hour RPO
Provides detailed resilience assessments, grading the workload across key areas
Adjusts recommendations when RTO/RPO requirements change
Identifies observability improvements and generates a recovery runbook
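The way recommendations shift with RTO and RPO can be illustrated with a simple rule-based sketch. This is not the agent's actual logic, which comes from a foundation model and its prompt. The four strategy names are the disaster recovery options AWS commonly describes; the minute thresholds are illustrative assumptions, and `RecoveryObjectives` and `suggest_dr_strategy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    rto_minutes: float  # maximum tolerable downtime
    rpo_minutes: float  # maximum tolerable data loss window


def suggest_dr_strategy(objectives):
    """Map objectives to a DR strategy using illustrative thresholds."""
    if objectives.rto_minutes < 0 or objectives.rpo_minutes < 0:
        raise ValueError("objectives must be non-negative")
    # The tighter of the two objectives drives the choice.
    tightest = min(objectives.rto_minutes, objectives.rpo_minutes)
    if tightest >= 60:  # hours of tolerance (assumed cutoff)
        return "backup and restore"
    if tightest >= 10:  # tens of minutes (assumed cutoff)
        return "pilot light"
    if tightest >= 1:  # minutes (assumed cutoff)
        return "warm standby"
    return "multi-site active/active"
```

Under these assumed thresholds, the demo workload's 24-hour RTO and 12-hour RPO fall in the "backup and restore" band, while tightening either objective to a few minutes moves the suggestion to "warm standby".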
Key Takeaways
Agentic AI can significantly simplify the process of building resilient systems by automating analysis and providing actionable guidance
Comprehensive prompt engineering is crucial to ensure the agent provides consistent, reliable, and contextually appropriate recommendations
Integrating the agent with existing AWS tools and documentation enables it to leverage a wealth of resources to support its analysis
Deploying the agent on a platform like Amazon Bedrock AgentCore can make it accessible and usable across an organization
Next Steps and Resources
Explore the Strands Agents documentation for more information on building AI-powered agents
Check out the nonprofit sample code on GitHub to get hands-on with the resilience agent
Reach out to your AWS account team for assistance in implementing similar solutions
Utilize AWS Skill Builder resources for further learning on resilience, AI, and AWS services