AWS re:Invent 2025 - Improve agent quality in production with Bedrock AgentCore Evaluations (AIM3348)
Improving Agent Quality in Production with Bedrock AgentCore Evaluations
Overview
AWS presented Amazon Bedrock AgentCore Evaluations, a service that helps developers build and deploy trustworthy AI agents at scale
This service provides a fully managed, continuous assessment framework to monitor and improve the quality of AI agents in production
Key Challenges with Operating Agents at Scale
Agents are non-deterministic and can reason and act autonomously, making it difficult to ensure they perform tasks correctly and consistently
Developers lack the right tools to evaluate agent performance in real-time and proactively address quality issues before they impact customers
The process of creating evaluation datasets, selecting metrics, and maintaining infrastructure is time-consuming and complex, delaying agent deployment
Amazon Bedrock AgentCore Evaluations
Fully managed service that provides continuous, automated assessment of AI agents across key quality dimensions
Offers 13 built-in evaluators covering correctness, helpfulness, stereotyping, tool usage, and more
Allows creation of custom evaluators for domain-specific requirements
Operates in two modes:
Online evaluations: Continuously monitors a sample of live agent interactions in production
On-demand evaluations: Integrates with CI/CD pipelines to validate agent changes before deployment
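The "sample of live agent interactions" idea behind online evaluations can be illustrated with a minimal sketch. This is not the AgentCore Evaluations API, and the talk summary does not say how the service selects its sample; the function name `should_evaluate` and the hash-based approach are assumptions chosen only to show one common way to pick a stable fraction of sessions.

```python
import hashlib


def should_evaluate(session_id: str, sample_rate: float) -> bool:
    """Deterministically select a fraction of sessions for evaluation.

    Hypothetical helper: hashing the session ID means the same session
    is always either in or out of the sample, across processes and restarts.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the matching fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


if __name__ == "__main__":
    # Illustrative session IDs; evaluate roughly 10% of traffic.
    sessions = [f"session-{i}" for i in range(1000)]
    sampled = [s for s in sessions if should_evaluate(s, 0.10)]
    print(f"Selected {len(sampled)} of {len(sessions)} sessions")
```

Hashing rather than calling a random number generator keeps every turn of a sampled conversation together, which matters when an evaluator needs the full conversation history.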
Technical Deep Dive
Evaluations use detailed, structured rubrics and provide complete context (conversation history, user intent, tools used, etc.) to the evaluation model
Scores are accompanied by explanations to ensure transparency and consistency
Supports popular observability instrumentation like OpenTelemetry
Integrates with existing agent deployments without requiring code changes
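To make the rubric-plus-context pattern concrete, here is a minimal sketch of how an evaluation prompt might be assembled and how a scored reply with an explanation might be validated. Everything here is an assumption for illustration: the talk summary does not describe the service's prompt format, and `EvaluationContext`, `EvaluationResult`, `build_judge_prompt`, and `parse_judge_reply` are hypothetical names, not part of any AWS SDK. No model is called; the reply string stands in for an evaluation model's output.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EvaluationContext:
    """Hypothetical container for the context handed to the evaluation model."""
    user_intent: str
    conversation: list[str]
    tools_used: list[str] = field(default_factory=list)


@dataclass
class EvaluationResult:
    evaluator: str
    score: float
    explanation: str


def build_judge_prompt(rubric: dict[int, str], ctx: EvaluationContext) -> str:
    """Combine a structured rubric with the full interaction context."""
    levels = "\n".join(f"{score}: {desc}" for score, desc in sorted(rubric.items()))
    history = "\n".join(ctx.conversation)
    tools = ", ".join(ctx.tools_used) or "none"
    return (
        "Score the agent using this rubric:\n"
        f"{levels}\n\n"
        f"User intent: {ctx.user_intent}\n"
        f"Tools used: {tools}\n"
        f"Conversation:\n{history}\n\n"
        'Reply as JSON: {"score": <number>, "explanation": "<reason>"}'
    )


def parse_judge_reply(evaluator: str, reply: str, rubric: dict[int, str]) -> EvaluationResult:
    """Reject replies whose score is off the rubric or lacks an explanation."""
    data = json.loads(reply)
    score = data.get("score")
    explanation = str(data.get("explanation", "")).strip()
    if not isinstance(score, (int, float)) or score not in rubric:
        raise ValueError(f"score {score!r} is not a level in the rubric")
    if not explanation:
        raise ValueError("an explanation is required alongside the score")
    return EvaluationResult(evaluator, float(score), explanation)


if __name__ == "__main__":
    # Illustrative three-level rubric for tool selection.
    rubric = {1: "Wrong tool chosen", 2: "Acceptable tool, better one available", 3: "Best tool chosen"}
    ctx = EvaluationContext(
        user_intent="Find the weather in Lisbon in May",
        conversation=["user: What is Lisbon like in May?", "agent: Mild, around 20 C."],
        tools_used=["climate_data"],
    )
    print(build_judge_prompt(rubric, ctx))
    print(parse_judge_reply("tool_selection", '{"score": 3, "explanation": "Used climate data."}', rubric))
```

Requiring an explanation with every score is what lets a reviewer check why an interaction was rated the way it was, rather than trusting a bare number.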
Improving Agent Quality Lifecycle
Baseline: Establish initial performance metrics for the agent using on-demand evaluations
Iterate: Analyze evaluation results, make improvements to prompts/models, and re-evaluate
Deploy: When agent meets success criteria, deploy to production with online continuous monitoring
Monitor: Ongoing real-time tracking of agent quality in production, with ability to quickly diagnose and address issues
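The Baseline, Iterate, and Deploy steps imply a gate: an agent change ships only when its on-demand evaluation scores meet the agreed success criteria. A minimal sketch of such a gate for a CI/CD job follows. The metric names, threshold values, and the functions `failing_metrics` and `gate_exit_code` are hypothetical; how scores are retrieved from AgentCore Evaluations is not covered in the summary, so they are passed in as a plain dictionary.

```python
from typing import Optional


def failing_metrics(
    scores: dict[str, float], thresholds: dict[str, float]
) -> dict[str, Optional[float]]:
    """Return every metric that is missing or below its minimum score."""
    failures: dict[str, Optional[float]] = {}
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value < minimum:
            failures[metric] = value
    return failures


def gate_exit_code(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return 0 when all success criteria are met, 1 otherwise."""
    failures = failing_metrics(scores, thresholds)
    for metric, value in failures.items():
        shown = "missing" if value is None else f"{value:.2f}"
        print(f"FAIL {metric}: {shown} (minimum {thresholds[metric]:.2f})")
    return 1 if failures else 0


if __name__ == "__main__":
    # Illustrative values only; real thresholds come from the baseline run.
    thresholds = {"correctness": 0.85, "helpfulness": 0.80, "tool_selection_accuracy": 0.90}
    candidate = {"correctness": 0.88, "helpfulness": 0.83, "tool_selection_accuracy": 0.86}
    raise SystemExit(gate_exit_code(candidate, thresholds))
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a pipeline that silently skips an evaluator would otherwise pass changes that were never measured.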
Business Impact
Reduces time to detect and diagnose agent quality issues from weeks to minutes or hours
Enables proactive monitoring and quality assurance, preventing silent failures that impact customer experience
Accelerates agent deployment by automating the evaluation process, freeing up developers to focus on improving agent capabilities
Example Use Case: Wonderless Travel Platform
Travel search assistant agent with multiple specialized tools (climate data, flight info, currency conversion, web search)
Initially, agent was selecting the wrong tools, leading to poor user experience and increased negative feedback
With AgentCore Evaluations:
Configured tool selection accuracy, parameter correctness, and helpfulness as key metrics
Detected rapid decline in tool selection accuracy while other metrics remained stable
Diagnosis revealed recent prompt changes had removed guidance on tool selection
Made prompt updates to restore tool selection guidance, resolving the issue within days
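The detection step in this use case, one metric dropping while the others hold steady, can be sketched as a comparison between a recent window of scores and the window just before it. The scores below are invented for illustration and are not figures from the talk; `declining_metrics`, `window`, and `max_drop` are hypothetical names, and the service's own alerting logic is not described in the summary.

```python
from statistics import mean


def declining_metrics(
    history: dict[str, list[float]], window: int = 3, max_drop: float = 0.1
) -> list[str]:
    """Flag metrics whose recent average fell by more than max_drop.

    Compares the mean of the last `window` scores with the mean of the
    `window` scores immediately before them.
    """
    flagged = []
    for metric, values in history.items():
        if len(values) < 2 * window:
            continue  # Not enough data to compare two full windows.
        baseline = mean(values[-2 * window:-window])
        recent = mean(values[-window:])
        if baseline - recent > max_drop:
            flagged.append(metric)
    return flagged


if __name__ == "__main__":
    # Invented daily scores: only tool selection degrades.
    history = {
        "tool_selection_accuracy": [0.92, 0.91, 0.93, 0.70, 0.65, 0.60],
        "parameter_correctness": [0.90, 0.90, 0.91, 0.90, 0.89, 0.90],
        "helpfulness": [0.85, 0.86, 0.84, 0.85, 0.86, 0.85],
    }
    print(declining_metrics(history))  # ['tool_selection_accuracy']
```

Tracking each metric separately is what narrows the diagnosis: a drop isolated to tool selection points toward tool guidance in the prompt rather than toward the model or the tools themselves.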
Key Takeaways
Defining multi-dimensional success criteria is crucial, including both operational and user experience metrics
Rigorous testing, including baseline establishment and iterative improvements, is essential for building trustworthy agents
Continuous monitoring and the ability to quickly diagnose issues are critical for maintaining agent quality in production
AgentCore Evaluations automates the evaluation process, allowing developers to focus on improving agent capabilities rather than managing infrastructure