TalksAWS re:Invent 2025 - Reimagining incident response with MCP on AWS (DEV340)
AWS re:Invent 2025 - Reimagining incident response with MCP on AWS (DEV340)
Reimagining Incident Response with MCP on AWS
The Problem: Context Paralysis in Incident Response
Traditional incident response approaches often lead to "context paralysis" - too much data, no shared understanding of the root cause.
During an incident, teams are bombarded with a flood of alerts from various services (Cognito, API Gateway, Lambda, RDS, etc.), but lack the context to piece together the full picture.
This leads to guesswork, finger-pointing, and delays in resolving the issue, as teams struggle to reconstruct the sequence of events and identify the root cause.
The Solution: From Chaos to Platform Intelligence
The solution is not more dashboards, alerts, or metrics - it's about providing the right context and intelligence to respond effectively.
The key is a new approach called "Model Context Protocol" (MCP), which sits between the chaos and the actions, correlating signals, applying AI analysis, and presenting meaningful, actionable intelligence.
The MCP Blueprint: Three Specialized MCP Servers
Oflow Context:
Analyzes the incident, correlating events, assessing authentication health, and detecting bottlenecks.
Uses Amazon Cloudwatch, DynamoDB, and EventBridge data, along with Amazon Bedrock and CloudSoNet 3.5 for complex pattern recognition.
Service Dependency Mapping:
Maps service dependencies to predict cascade failures and suggest isolation strategies.
Leverages CloudHaiku 3 v1 to generate fast, visual dependency maps.
Recovery Coordination:
Suggests remediation actions, categorizing them by risk level (low, medium, high).
Requires human approval for medium and high-risk actions, providing information to help decision-making.
Uses CloudSoNet 3.5 for remediation action generation, risk assessment, and recovery process prediction.
The MCP Architecture
The MCP blueprint is a serverless architecture, with the MCP servers integrating with various AWS services (Cloudwatch, DynamoDB, Lambda, EventBridge) to gather real-time data.
This data is then sent to Amazon Bedrock, which applies the appropriate cloud models to provide the analysis and intelligence.
Beyond MTR: Designing for Calm
The goal is not just faster incident response (MTR), but building "calm" into the cloud - empowering human intelligence through contextual correlation, not just metric multiplication.
This shift in mindset moves away from automating chaos and towards engineering clarity into the system, using AI to amplify human judgment, not replace it.
Key Takeaways
Context matters more than metrics - AI can organize the chaos and provide the necessary context.
AI should be used to amplify human intelligence, not replace it. Humans should remain in the loop to make the final decisions.
Design for calm, not just faster incident response. Engineer clarity into the system to create a consistent, pressure-resistant interface.
Real-world Impact
The MCP approach provides teams with immediate visibility into the impact and cascade of an incident, allowing for faster isolation and remediation.
By predicting the blast radius and suggesting targeted isolation strategies, the service dependency mapping can help prevent cascading failures.
The recovery coordination component empowers teams to execute low-risk actions quickly, while providing the necessary information and approval process for medium and high-risk actions.
Overall, the MCP blueprint helps organizations move from a chaotic, reactive incident response to a more proactive, intelligent, and calm approach, improving mean time to resolution (MTR) and reducing the business impact of incidents.
These cookies are used to collect information about how you interact with this website and allow us to remember you. We use this information to improve and customize your browsing experience, as well as for analytics.
If you decline, your information won’t be tracked when you visit this website. A single cookie will be used in your browser to remember your preference.