AWS re:Invent 2025 - AIOps Revolution: How iHeart slashed incident response time by 60% with Bedrock
Transforming IT Incident Response with Agentic AI: iHeart's Journey
The Challenge of Modern IT Incidents
Large digital organizations like iHeart Media face complex, distributed IT systems that make it difficult to quickly identify and resolve incidents
Incident response often means descending through the "seven circles of on-call hell": logging in, hunting for information, relying on tribal knowledge, diagnosing manually, and more, all of which waste precious time
Traditional monitoring systems generate too much noise, making it hard to pinpoint root causes
iHeart Media's Scale and Operations
iHeart Media is a massive media company with:
850+ AM/FM radio stations
250 million monthly digital users
150 million monthly podcast downloads
5-7 billion monthly digital requests
70+ AWS services powering its architecture
The company's digital platform is mission-critical, requiring 24/7 uptime and fast incident resolution to avoid major business impacts
Introducing Agentic AI for IT Operations (AIOps)
iHeart built a multi-agent AI system to automate incident response and remediation
Key components:
Slack bot interface for human interaction
Orchestrator agent to delegate tasks to specialized sub-agents
Sub-agents for monitoring, logs, Kubernetes, knowledge base, etc.
Amazon Bedrock AgentCore for secure, scalable agent deployment
Agents work together to quickly triage incidents, diagnose root causes, and recommend remediation steps
Benefits of the Agentic AI Approach
60% reduction in incident response time by automating triage and diagnosis
Improved operational efficiency and reduced toil for on-call engineers
Increased consistency and reliability in incident response
Preservation of institutional knowledge for faster future incident resolution
Implementing the Agentic AI Solution
Slack bot interface allows simple, natural language interaction to trigger incident investigation
Orchestrator agent delegates tasks to specialized sub-agents, each with its own context window
Prevents sub-agents from overloading the main context with unnecessary data
Allows parallel, targeted investigations across monitoring, logs, Kubernetes, etc.
Amazon Bedrock AgentCore provides a secure, scalable runtime to deploy and manage the multi-agent system
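The orchestrator and sub-agent split described above can be illustrated with a minimal sketch. This is not iHeart's implementation and it does not call any Bedrock or Slack API; the SubAgent and Orchestrator classes, the fake "raw record" data, and the incident string are all hypothetical stand-ins. It only shows the shape of the pattern: sub-agents run in parallel, keep bulky data in their own context, and hand back a short summary.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SubAgent:
    """Hypothetical sub-agent that keeps its own private context."""
    name: str
    context: list = field(default_factory=list)

    async def investigate(self, incident: str) -> str:
        # Stand-in for real tool calls (metric queries, log searches, cluster reads).
        raw = [f"{self.name} raw record {i} for {incident}" for i in range(1000)]
        self.context.extend(raw)   # bulky data stays inside the sub-agent
        await asyncio.sleep(0)     # yield control, as a real I/O call would
        # Only a compact summary is returned to the orchestrator.
        return f"{self.name}: reviewed {len(raw)} records for '{incident}'"


@dataclass
class Orchestrator:
    """Hypothetical orchestrator that fans work out and collects summaries."""
    sub_agents: list
    context: list = field(default_factory=list)

    async def handle(self, incident: str) -> list:
        # Run every sub-agent concurrently; gather preserves input order.
        summaries = await asyncio.gather(
            *(agent.investigate(incident) for agent in self.sub_agents)
        )
        # The main context grows by one line per sub-agent, not by raw data.
        self.context.extend(summaries)
        return list(summaries)


def run_demo():
    orchestrator = Orchestrator(
        [SubAgent("monitoring"), SubAgent("logs"), SubAgent("kubernetes")]
    )
    summaries = asyncio.run(orchestrator.handle("checkout latency spike"))
    return orchestrator, summaries
```

Calling run_demo() leaves three summary lines in the orchestrator's context while each sub-agent holds its own 1,000 raw records, which is the isolation the talk credits with keeping the main context uncluttered.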
Lessons and Next Steps
Quality of context data is critical - "garbage in, garbage out" for AI agents
Gradual adoption approach: start with read-only, low-risk tasks before expanding to high-stakes actions
Build a robust evaluation environment to continuously test and validate agent performance
Future goals include expanding agent capabilities, integrating more data sources, and enabling proactive incident prevention
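One way to picture the gradual adoption lesson is a small policy gate that lets agents run read-only actions freely and downgrades anything mutating to a recommendation. The sketch below is an assumption for illustration only: the Risk levels, the Action and ActionGate names, and the human-approval flag are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ_ONLY = 1   # e.g. fetching logs or describing a pod
    MUTATING = 2    # e.g. restarting a service or scaling a deployment


@dataclass(frozen=True)
class Action:
    name: str
    risk: Risk


class ActionGate:
    """Hypothetical gate: read-only runs; mutating actions need opt-in and approval."""

    def __init__(self, allow_mutating: bool = False):
        self.allow_mutating = allow_mutating

    def decide(self, action: Action, human_approved: bool = False) -> str:
        if action.risk is Risk.READ_ONLY:
            return "execute"
        # Mutating actions run only once the rollout has enabled them
        # and a human has approved this specific action.
        if self.allow_mutating and human_approved:
            return "execute"
        return "recommend_only"
```

Starting with ActionGate() keeps every mutating action at "recommend_only"; switching to ActionGate(allow_mutating=True) later expands the agent's scope without changing the agents themselves.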
Key Takeaways
Agentic AI can revolutionize IT incident response by automating triage, diagnosis, and remediation