AWS re:Invent 2025 - Improve agent quality in production with Bedrock AgentCore Evaluations (AIM3348)

Improving Agent Quality in Production with Bedrock AgentCore Evaluations

Overview

  • AWS presented Amazon Bedrock AgentCore Evaluations, a service that helps developers build and deploy trustworthy AI agents at scale
  • The service provides a fully managed, continuous assessment framework for monitoring and improving the quality of AI agents in production

Key Challenges with Operating Agents at Scale

  • Agents are non-deterministic and can reason and act autonomously, which makes it difficult to ensure they perform tasks correctly and consistently
  • Developers lack the right tools to evaluate agent performance in real time and to address quality issues proactively, before they impact customers
  • The process of creating evaluation datasets, selecting metrics, and maintaining infrastructure is time-consuming and complex, delaying agent deployment

Amazon Bedrock AgentCore Evaluations

  • Fully managed service that provides continuous, automated assessment of AI agents across key quality dimensions
  • Offers 13 built-in evaluators covering correctness, helpfulness, stereotyping, tool usage, and more
  • Allows creation of custom evaluators for domain-specific requirements
  • Operates in two modes:
    1. Online evaluations: Continuously monitors a sample of live agent interactions in production
    2. On-demand evaluations: Integrates with CI/CD pipelines to validate agent changes before deployment
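
The talk summary does not say how the online mode chooses its sample of live interactions. As a minimal sketch of one common approach (an assumption, not the service's documented behavior), sampling can be made deterministic by hashing a session identifier, so the same session is always either in or out of the sample. The names `should_sample`, `select_for_evaluation`, and `sample_rate` are hypothetical.

```python
import hashlib


def should_sample(session_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a session belongs to the evaluation sample.

    Hypothetical helper: hashing the session ID keeps the decision stable
    across processes and restarts, unlike a call to random.random().
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare against the matching fraction of the 64-bit range.
    return bucket < sample_rate * 2**64


def select_for_evaluation(session_ids: list[str], sample_rate: float) -> list[str]:
    """Return the subset of sessions that should be sent to evaluators."""
    return [s for s in session_ids if should_sample(s, sample_rate)]
```

With a `sample_rate` of 0.1, roughly one session in ten would be evaluated, and re-running the selection on the same IDs yields the same subset.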

Technical Deep Dive

  • Evaluations use detailed, structured rubrics and provide complete context (conversation history, user intent, tools used, etc.) to the evaluation model
  • Scores are accompanied by explanations to ensure transparency and consistency
  • Supports popular observability instrumentation such as OpenTelemetry
  • Integrates with existing agent deployments without requiring code changes
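
To make the rubric idea concrete, here is a rough sketch of how a rubric, the evaluation context, and a scored result with its explanation might be represented. Every class, field, and function name below is hypothetical and chosen for illustration; none of it is the AgentCore Evaluations schema or API.

```python
from dataclasses import dataclass


@dataclass
class EvaluationContext:
    """Context handed to the evaluation model (hypothetical shape)."""
    user_intent: str
    conversation_history: list[str]
    tools_used: list[str]


@dataclass
class Rubric:
    """A named rubric mapping each allowed label to a numeric score."""
    name: str
    levels: dict[str, float]


@dataclass
class EvaluationResult:
    evaluator: str
    label: str
    score: float
    explanation: str


def build_judge_prompt(rubric: Rubric, context: EvaluationContext) -> str:
    """Assemble the full context and the rubric labels into one prompt string."""
    labels = ", ".join(rubric.levels)
    history = "\n".join(context.conversation_history)
    tools = ", ".join(context.tools_used) or "none"
    return (
        f"Evaluate the agent for: {rubric.name}\n"
        f"Allowed labels: {labels}\n"
        f"User intent: {context.user_intent}\n"
        f"Tools used: {tools}\n"
        f"Conversation:\n{history}\n"
        "Answer with one label and a short explanation."
    )


def to_result(rubric: Rubric, label: str, explanation: str) -> EvaluationResult:
    """Convert a judge's label into a score, rejecting answers without a rationale."""
    if label not in rubric.levels:
        raise ValueError(f"unknown label for rubric {rubric.name!r}: {label!r}")
    if not explanation.strip():
        raise ValueError("an explanation is required alongside every score")
    return EvaluationResult(rubric.name, label, rubric.levels[label], explanation)
```

Restricting the judge to a fixed set of labels and requiring an explanation is one way to get the consistency and transparency described above; the actual prompt and scoring scale used by the service are not given in the summary.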

Improving Agent Quality Lifecycle

  1. Baseline: Establish initial performance metrics for the agent using on-demand evaluations
  2. Iterate: Analyze evaluation results, make improvements to prompts or models, and re-evaluate
  3. Deploy: When the agent meets its success criteria, deploy to production with continuous online monitoring
  4. Monitor: Track agent quality in production in real time, with the ability to quickly diagnose and address issues
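
The "Deploy" step hinges on comparing evaluation scores against success criteria. A minimal sketch of such a gate, as it might run in a CI/CD pipeline, is shown below. It assumes scores and thresholds are plain dictionaries keyed by metric name; the function name and the data shapes are hypothetical and not taken from the talk.

```python
def meets_success_criteria(
    scores: dict[str, float],
    thresholds: dict[str, float],
) -> tuple[bool, list[str]]:
    """Check each required metric against its minimum acceptable score.

    Returns (passed, failing_metrics). A metric that is required but was
    never scored counts as a failure, so gaps in coverage block deployment.
    """
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value < minimum:
            failures.append(metric)
    return (not failures, failures)
```

In a pipeline, a failing result would typically stop the release and send the team back to the "Iterate" step with the list of metrics that fell short.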

Business Impact

  • Reduces the time to detect and diagnose agent quality issues from weeks to minutes or hours
  • Enables proactive monitoring and quality assurance, preventing silent failures that impact customer experience
  • Accelerates agent deployment by automating the evaluation process, freeing up developers to focus on improving agent capabilities

Example Use Case: Wonderless Travel Platform

  • Travel search assistant agent with multiple specialized tools (climate data, flight info, currency conversion, web search)
  • Initially, the agent was selecting the wrong tools, leading to a poor user experience and increased negative feedback
  • With AgentCore Evaluations:
    • Configured tool selection accuracy, parameter correctness, and helpfulness as key metrics
    • Detected a rapid decline in tool selection accuracy while other metrics remained stable
    • Diagnosis revealed that recent prompt changes had removed guidance on tool selection
    • Made prompt updates to restore tool selection guidance, resolving the issue within days
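
The detection step in this use case, one metric dropping while the others hold steady, can be sketched as a comparison of recent scores against a baseline for each metric. This is an illustrative approach only: the function name, the `max_drop` parameter, and the data shapes are hypothetical, and the summary does not describe how the service itself flags a decline.

```python
from statistics import mean


def find_declining_metrics(
    baseline: dict[str, float],
    recent: dict[str, list[float]],
    max_drop: float = 0.1,
) -> dict[str, float]:
    """Return metrics whose recent mean fell more than max_drop below baseline.

    The result maps each flagged metric to the size of its drop, so a metric
    such as tool selection accuracy stands out while stable metrics are omitted.
    """
    declining = {}
    for metric, base_value in baseline.items():
        samples = recent.get(metric, [])
        if not samples:
            # No recent data for this metric: nothing to compare yet.
            continue
        drop = base_value - mean(samples)
        if drop > max_drop:
            declining[metric] = round(drop, 3)
    return declining
```

Evaluating each metric independently is what lets a single failing dimension surface even when aggregate quality looks acceptable.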

Key Takeaways

  • Defining multi-dimensional success criteria, covering both operational and user experience metrics, is crucial
  • Rigorous testing, including baseline establishment and iterative improvements, is essential for building trustworthy agents
  • Continuous monitoring and the ability to quickly diagnose issues are critical for maintaining agent quality in production
  • AgentCore Evaluations automates the evaluation process, allowing developers to focus on improving agent capabilities rather than managing infrastructure
