AWS re:Invent 2025 - Improve agent quality in production with Bedrock AgentCore Evaluations (AIM3348)

Improving Agent Quality in Production with Bedrock AgentCore Evaluations

Overview

AWS presented Amazon Bedrock AgentCore Evaluations, a solution to help developers build and deploy trustworthy AI agents at scale

This service provides a fully managed, continuous assessment framework to monitor and improve the quality of AI agents in production

Key Challenges with Operating Agents at Scale

Agents are non-deterministic and can reason/act autonomously, making it difficult to ensure they perform tasks correctly and consistently

Developers lack the right tools to evaluate agent performance in real-time and proactively address quality issues before they impact customers

The process of creating evaluation datasets, selecting metrics, and maintaining infrastructure is time-consuming and complex, delaying agent deployment

Amazon Bedrock AgentCore Evaluations

Fully managed service that provides continuous, automated assessment of AI agents across key quality dimensions

Offers 13 built-in evaluators covering correctness, helpfulness, stereotyping, tool usage, and more

Allows creation of custom evaluators for domain-specific requirements

Operates in two modes:

Online evaluations: Continuously monitors a sample of live agent interactions in production
On-demand evaluations: Integrates with CI/CD pipelines to validate agent changes before deployment
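
One way to picture the "sample of live agent interactions" used by online evaluations is deterministic, hash-based sampling. The sketch below is an illustration only and does not show how AgentCore Evaluations selects sessions; the function name `should_evaluate`, the `session_id` argument, and the `sample_rate` parameter are all hypothetical.

```python
import hashlib


def should_evaluate(session_id: str, sample_rate: float) -> bool:
    """Decide whether a session falls inside the evaluation sample.

    Hypothetical helper: hashing the session ID makes the decision
    repeatable, so the same session is always in or out of the sample.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and
    # compare it against the matching fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


if __name__ == "__main__":
    sessions = [f"session-{i}" for i in range(10_000)]
    sampled = [s for s in sessions if should_evaluate(s, 0.10)]
    print(f"sampled {len(sampled)} of {len(sessions)} sessions")
```

Sampling on a stable key rather than a random draw keeps all turns of one session together, which matters when the evaluator needs the full conversation history.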

Technical Deep Dive

Evaluations use detailed, structured rubrics and provide complete context (conversation history, user intent, tools used, etc.) to the evaluation model

Scores are accompanied by explanations to ensure transparency and consistency
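
To make the idea of "rubric plus full context in, score plus explanation out" concrete, here is a minimal sketch of the data involved. Every name in it (`EvaluationContext`, `EvaluationResult`, `build_judge_prompt`) is hypothetical, and the 0 to 1 score range is an assumption; the session summary does not describe the service's actual schema or scale.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvaluationContext:
    """Context handed to the evaluation model (hypothetical shape)."""
    user_intent: str
    conversation: list[str]
    tools_used: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class EvaluationResult:
    """A score that always travels with its explanation."""
    evaluator: str
    score: float  # assumed to be normalized to the range 0..1
    explanation: str

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0 and 1")
        if not self.explanation.strip():
            raise ValueError("a score needs an explanation")


def build_judge_prompt(rubric: dict[int, str], context: EvaluationContext) -> str:
    """Render the rubric levels and the full context as one prompt string."""
    lines = ["Rubric:"]
    for level in sorted(rubric):
        lines.append(f"  {level}: {rubric[level]}")
    lines.append(f"User intent: {context.user_intent}")
    lines.append("Tools used: " + (", ".join(context.tools_used) or "none"))
    lines.append("Conversation:")
    lines.extend(f"  {turn}" for turn in context.conversation)
    return "\n".join(lines)
```

Rejecting a result that has no explanation enforces, at the data level, the transparency property described above.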

Supports popular observability instrumentation like OpenTelemetry

Integrates with existing agent deployments without requiring code changes

Improving Agent Quality Lifecycle

Baseline: Establish initial performance metrics for the agent using on-demand evaluations

Iterate: Analyze evaluation results, make improvements to prompts/models, and re-evaluate

Deploy: When agent meets success criteria, deploy to production with online continuous monitoring

Monitor: Ongoing real-time tracking of agent quality in production, with ability to quickly diagnose and address issues
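
The baseline, iterate, and deploy steps amount to repeatedly checking evaluation scores against success criteria. A minimal sketch of such a gate follows, assuming scores and thresholds are plain dictionaries keyed by metric name; the function names, metric names, and threshold values are hypothetical and not taken from the session.

```python
from __future__ import annotations


def failing_metrics(
    scores: dict[str, float], thresholds: dict[str, float]
) -> dict[str, tuple[float | None, float]]:
    """Return each metric that is missing or below its minimum.

    The value is an (actual, required) pair so a CI log can show the gap.
    """
    failures: dict[str, tuple[float | None, float]] = {}
    for metric, minimum in thresholds.items():
        actual = scores.get(metric)
        if actual is None or actual < minimum:
            failures[metric] = (actual, minimum)
    return failures


def ready_to_deploy(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """The agent meets its success criteria when nothing fails."""
    return not failing_metrics(scores, thresholds)


if __name__ == "__main__":
    criteria = {"tool_selection_accuracy": 0.90, "helpfulness": 0.80}
    baseline = {"tool_selection_accuracy": 0.72, "helpfulness": 0.85}
    print(failing_metrics(baseline, criteria))
```

In a CI/CD pipeline, a non-empty result from `failing_metrics` would typically fail the build, sending the change back to the iterate step.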

Business Impact

Reduces time to detect and diagnose agent quality issues from weeks to minutes/hours

Enables proactive monitoring and quality assurance, preventing silent failures that impact customer experience

Accelerates agent deployment by automating the evaluation process, freeing up developers to focus on improving agent capabilities

Example Use Case: Wonderless Travel Platform

Travel search assistant agent with multiple specialized tools (climate data, flight info, currency conversion, web search)

Initially, agent was selecting the wrong tools, leading to poor user experience and increased negative feedback

With AgentCore Evaluations:

Configured tool selection accuracy, parameter correctness, and helpfulness as key metrics
Detected rapid decline in tool selection accuracy while other metrics remained stable
Diagnosis revealed recent prompt changes had removed guidance on tool selection
Made prompt updates to restore tool selection guidance, resolving the issue within days
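
The detection step in this example, one metric declining while the others hold steady, can be sketched as a comparison of recent scores against the preceding window for each metric. This is an illustration under stated assumptions, not the service's detection logic: the function name `declining_metrics`, the window size, the drop threshold, and the sample scores are all made up for the example.

```python
from __future__ import annotations

from statistics import mean


def declining_metrics(
    history: dict[str, list[float]], window: int = 3, max_drop: float = 0.1
) -> list[str]:
    """Flag metrics whose recent average fell by more than max_drop.

    Compares the mean of the last `window` scores with the mean of the
    `window` scores before them. Metrics with too little history are skipped.
    """
    flagged = []
    for metric, scores in history.items():
        if len(scores) < 2 * window:
            continue
        previous = mean(scores[-2 * window:-window])
        recent = mean(scores[-window:])
        if previous - recent > max_drop:
            flagged.append(metric)
    return sorted(flagged)


if __name__ == "__main__":
    # Invented daily scores: tool selection drops, the rest stay flat.
    history = {
        "tool_selection_accuracy": [0.90, 0.91, 0.90, 0.70, 0.65, 0.60],
        "parameter_correctness": [0.88, 0.87, 0.88, 0.88, 0.87, 0.88],
        "helpfulness": [0.80, 0.81, 0.80, 0.79, 0.80, 0.81],
    }
    print(declining_metrics(history))
```

Tracking each metric separately is what localizes the problem: a drop confined to tool selection points at tool guidance in the prompt rather than at the model or the tools themselves.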

Key Takeaways

Defining multi-dimensional success criteria is crucial, including both operational and user experience metrics

Rigorous testing, including baseline establishment and iterative improvements, is essential for building trustworthy agents

Continuous monitoring and the ability to quickly diagnose issues are critical for maintaining agent quality in production

AgentCore Evaluations automates the evaluation process, allowing developers to focus on improving agent capabilities rather than managing infrastructure
