AWS re:Invent 2025 - Using AI to improve humanitarian workload resilience (AIM336)

Leveraging AI to Enhance Workload Resilience

Importance of Resilient Humanitarian Workloads

  • Humanitarian workloads, designed to help people in need during crises, require high reliability and resilience
  • Failures like timeouts or 404 errors are unacceptable when people are calling for emergency assistance
  • Building resilient systems is challenging, requiring attention to high availability, disaster recovery, and continuous improvement

Key Resilience Concepts

  • High Availability: Ability to withstand common failures like server outages or network interruptions
  • Disaster Recovery: Ability to fail over to a secondary site and recover from major disruptions
  • Continuous Improvement: Iteratively enhancing resilience over time based on learnings from failures

Resilience Failure Categories

  1. Single Points of Failure: Single components or systems that, if they fail, bring down the entire application
  2. Excessive Load: Inability to scale and handle sudden increases in traffic
  3. Excessive Latency: Timeouts and failures due to unpredictable network latency
  4. Misconfiguration and Bugs: Issues caused by code defects or configuration errors
  5. Shared Fate: Common dependencies that, when they fail, impact multiple applications
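
One way to make these categories usable in tooling is to treat them as a fixed vocabulary that assessment findings are tagged with. The following is a minimal sketch, not something shown in the talk: the `Finding` type, the `summarize` helper, and the example findings are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    """The five resilience failure categories listed above."""
    SINGLE_POINT_OF_FAILURE = "Single Points of Failure"
    EXCESSIVE_LOAD = "Excessive Load"
    EXCESSIVE_LATENCY = "Excessive Latency"
    MISCONFIGURATION_AND_BUGS = "Misconfiguration and Bugs"
    SHARED_FATE = "Shared Fate"


@dataclass(frozen=True)
class Finding:
    """A hypothetical assessment finding tagged with exactly one category."""
    resource: str
    category: FailureCategory
    detail: str


def summarize(findings: list[Finding]) -> dict[str, int]:
    """Count findings per category, keeping categories with zero findings."""
    counts = Counter(f.category for f in findings)
    return {c.value: counts.get(c, 0) for c in FailureCategory}


# Made-up findings, for illustration only.
findings = [
    Finding("database", FailureCategory.SINGLE_POINT_OF_FAILURE, "single instance in one Availability Zone"),
    Finding("api", FailureCategory.EXCESSIVE_LOAD, "no auto scaling configured"),
    Finding("cache", FailureCategory.SINGLE_POINT_OF_FAILURE, "single node"),
]
summary = summarize(findings)
```

Reporting zero counts as well makes it visible which categories were checked and came back clean, rather than silently omitted.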

Leveraging Agentic AI

  • Building an AI agent that can autonomously investigate and reason about workload resilience
  • Using an Amazon Bedrock foundation model together with existing AWS tools such as Cloud Control API and AWS documentation

Agent Architecture and Functionality

  1. Foundation Model: Using Amazon Bedrock for reasoning and inference
  2. Cloud Control API: Interacting with AWS resources to assess the current state
  3. AWS Documentation MCP server: Retrieving relevant documentation to provide recommendations
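
The three components can be pictured as a model plus a registry of tools. The sketch below uses only the Python standard library with stand-in functions: it does not call Amazon Bedrock, Cloud Control API, or an MCP server, and every name in it (`ResilienceAgent`, `describe_resources`, `search_docs`, `fake_model`) is hypothetical. A real agent lets the model decide which tools to call, whereas this stub simply calls each tool once.

```python
from dataclasses import dataclass, field
from typing import Callable

Tool = Callable[[str], str]   # takes a workload name, returns an observation
Model = Callable[[str], str]  # takes a prompt, returns the model's answer


@dataclass
class ResilienceAgent:
    """Hypothetical wiring of one model and a set of named tools."""
    model: Model
    tools: dict[str, Tool] = field(default_factory=dict)

    def assess(self, workload: str) -> str:
        # Collect one observation per tool, then hand everything to the model.
        observations = [f"[{name}] {tool(workload)}" for name, tool in self.tools.items()]
        prompt = f"Assess the resilience of {workload}.\n" + "\n".join(observations)
        return self.model(prompt)


def describe_resources(workload: str) -> str:
    """Stand-in for reading current resource state (Cloud Control API role)."""
    return f"{workload}: 1 database instance, 1 Availability Zone"


def search_docs(workload: str) -> str:
    """Stand-in for a documentation lookup (AWS Documentation MCP role)."""
    return "Documentation excerpt: deploy across multiple Availability Zones"


def fake_model(prompt: str) -> str:
    """Stand-in for foundation model inference; reports how much context it saw."""
    seen = sum(line.startswith("[") for line in prompt.splitlines())
    return f"Assessment based on {seen} tool observations"


agent = ResilienceAgent(
    model=fake_model,
    tools={"cloud_control": describe_resources, "aws_docs": search_docs},
)
report = agent.assess("example-workload")
```

Keeping the model and tools as plain callables means each stand-in can be swapped for a real client without changing the `assess` flow.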

Detailed Prompt Engineering

  • Keeping the agent's prompt in its own file for easier maintenance and iteration
  • Defining a comprehensive prompt covering:
    • Agent persona and expertise
    • Grading philosophy and scale
    • Resilience definitions and requirements
    • Specific instructions for tool usage and output formatting
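
Keeping the prompt in its own file might look like the following sketch. The file name `system_prompt.md`, the section wording, the A-to-F scale, and the `$rto_hours`/`$rpo_hours` placeholders are assumptions made for illustration; the talk's actual prompt is not reproduced here.

```python
import tempfile
from pathlib import Path
from string import Template

# Hypothetical prompt content covering the four areas listed above.
PROMPT_TEXT = """\
# Persona
You are a resilience reviewer for AWS workloads.

# Grading scale
Grade each area from A (meets requirements) to F (does not).

# Requirements
Target RTO: $rto_hours hours. Target RPO: $rpo_hours hours.

# Output format
Return one section per resilience area with a grade and a recommendation.
"""


def load_prompt(path: Path, **values: object) -> str:
    """Read the prompt file and fill in its $placeholders."""
    return Template(path.read_text(encoding="utf-8")).substitute(**values)


# Write the prompt to a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "system_prompt.md"
    prompt_path.write_text(PROMPT_TEXT, encoding="utf-8")
    prompt = load_prompt(prompt_path, rto_hours=24, rpo_hours=12)
```

With this layout, the prompt can be revised and reviewed like any other text file, and `substitute` raises a `KeyError` if a placeholder is left unfilled.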

Demonstration and Results

  • Agent analyzes a "food agent" workload with 24-hour RTO and 12-hour RPO
  • Provides detailed resilience assessments, grading the workload across key areas
  • Adjusts recommendations when RTO/RPO requirements change
  • Identifies observability improvements and generates a recovery runbook
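
To see why recommendations shift when RTO/RPO requirements change, consider a toy mapping from recovery targets to the four disaster recovery strategies described in AWS guidance (backup and restore, pilot light, warm standby, multi-site active/active). The hour thresholds and the `suggest_dr_strategy` function are illustrative assumptions, not values from the talk or from AWS, and the agent itself reasons with a foundation model rather than fixed rules.

```python
def suggest_dr_strategy(rto_hours: float, rpo_hours: float) -> str:
    """Pick a DR strategy from the tighter of the two targets.

    Thresholds are hypothetical, chosen only to illustrate the trade-off
    between recovery speed and the cost of standby infrastructure.
    """
    tightest = min(rto_hours, rpo_hours)
    if tightest >= 8:
        return "backup and restore"
    if tightest >= 1:
        return "pilot light"
    if tightest >= 0.25:
        return "warm standby"
    return "multi-site active/active"


# The targets from the demonstration, then a hypothetical tightened pair.
original = suggest_dr_strategy(rto_hours=24, rpo_hours=12)
tightened = suggest_dr_strategy(rto_hours=1, rpo_hours=0.25)
```

Under these assumed thresholds, the 24-hour RTO and 12-hour RPO from the demonstration land on backup and restore, while the tightened targets call for a warm standby.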

Key Takeaways

  • Agentic AI can significantly simplify the process of building resilient systems by automating analysis and providing actionable guidance
  • Comprehensive prompt engineering is crucial to ensure the agent provides consistent, reliable, and contextually appropriate recommendations
  • Integrating the agent with existing AWS tools and documentation enables it to leverage a wealth of resources to support its analysis
  • Deploying the agent on a platform like Amazon Bedrock AgentCore can make it accessible and usable across an organization

Next Steps and Resources

  • Explore the Strands Agents documentation for more information on building AI-powered agents
  • Check out the nonprofit sample code on GitHub to get hands-on with the resilience agent
  • Reach out to your AWS account team for assistance in implementing similar solutions
  • Utilize AWS Skill Builder resources for further learning on resilience, AI, and AWS services
