AWS re:Invent 2025 - Unlocking Enterprise Resilience: AI and Automation in Action (AIM101)
Evolving the Role of the SRE (Site Reliability Engineer)
SREs are transitioning from reactive incident responders to proactive managers of AI agents
AI agents are becoming the new "employees" that handle day-to-day IT operations
SREs must now onboard, train, monitor, and optimize these AI agents like human workers
Key responsibilities include:
Ensuring AI agents operate within security and compliance policies
Continuously evaluating agent performance and accuracy
Providing feedback to improve agent capabilities over time
Maintaining a registry of available AI agents and their specialties
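As a rough illustration of the responsibilities listed above, the following is a minimal sketch of an agent registry that tracks specialties, compliance approval, and measured accuracy. The talk does not describe an implementation; the class names, fields, and the `min_accuracy` parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class AgentRecord:
    """One registry entry. All field names are illustrative assumptions."""
    name: str
    specialties: Set[str]
    approved: bool = False  # True once security/compliance review has passed
    outcomes: List[bool] = field(default_factory=list)  # True = correct result

    def accuracy(self) -> Optional[float]:
        """Fraction of evaluated tasks answered correctly, or None if unevaluated."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)


class AgentRegistry:
    """In-memory registry of agents and their specialties."""

    def __init__(self) -> None:
        self._agents: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._agents[record.name] = record

    def record_outcome(self, name: str, correct: bool) -> None:
        """Store one evaluation result (the feedback loop for an agent)."""
        self._agents[name].outcomes.append(correct)

    def find(self, specialty: str, min_accuracy: float = 0.0) -> List[str]:
        """Names of approved, evaluated agents that cover the specialty."""
        matches = []
        for agent in self._agents.values():
            acc = agent.accuracy()
            if (agent.approved
                    and specialty in agent.specialties
                    and acc is not None
                    and acc >= min_accuracy):
                matches.append(agent.name)
        return sorted(matches)
```

In this sketch an agent that has not been approved, or has never been evaluated, is never returned by `find`, which mirrors the idea of onboarding an agent before trusting it with work.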
Foundational AIOps Capabilities
Pre-built connectors to core data sources (logs, metrics, alerts, tickets)
Structured and unstructured data processing pipelines
Summarization, root cause analysis, anomaly detection
Extensible blueprints to quickly build new AI agent workflows
Deployment, governance, monitoring, and registry for AI agents
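To make the anomaly detection capability concrete, here is a minimal sketch that flags metric samples deviating sharply from a trailing window. The talk does not say which technique is used; the z-score approach, the function name, and the default `window` and `threshold` values are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates from the trailing window's mean
    by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

A trailing window keeps the baseline local, so a slow drift in a metric is not flagged while a sudden spike is. Production systems typically also account for seasonality, which this sketch ignores.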
Shifting Left with AI
Using AI to shift incident management "left" in the development lifecycle
Embedding AI agents directly into developer workflows to:
Detect potential issues before deployment
Automatically generate test cases and validate changes
Provide contextual guidance on past incidents
The goal is to reduce toil and improve resilience through proactive, automated processes
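One way to picture contextual guidance on past incidents is a lookup that matches the components touched by a change against incident history. This is only a sketch under stated assumptions: the `Incident` structure, the component tags, and the overlap-based ranking are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Incident:
    """A past incident tagged with the components it involved (hypothetical schema)."""
    incident_id: str
    summary: str
    components: Set[str]


def related_incidents(changed: Set[str], history: List[Incident]) -> List[Incident]:
    """Past incidents that touched any changed component, largest overlap first.

    Ties keep their original order in `history` because list.sort is stable.
    """
    scored = [(len(changed & inc.components), inc) for inc in history]
    hits = [(score, inc) for score, inc in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [inc for _, inc in hits]
```

A check like this could run in a pre-merge hook or a review bot so that a developer sees relevant history before deployment rather than during an outage.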
Balancing Human and AI Capabilities
Humans excel at novel problem solving and creative thinking
AI agents excel at pattern recognition, data correlation, and automating repetitive tasks
The key is to leverage the strengths of both, with humans in the loop for high-risk decisions
Importance of "showing work" - AI agents must be transparent about their reasoning
Establishing secure communication and authorization protocols between AI agents
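The human-in-the-loop and "showing work" points above can be sketched as a small gate in front of agent actions. This is an illustrative assumption rather than anything shown in the talk: the `ProposedAction` fields, the risk labels, and the `approve` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProposedAction:
    description: str
    risk: str             # "low" or "high"; labels are hypothetical
    reasoning: List[str]  # the agent's shown work: evidence and steps taken


def execute_with_oversight(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Run an agent action only if it explains itself and, unless it is
    explicitly low risk, a human approves it."""
    if not action.reasoning:
        return "rejected: no reasoning provided"
    # Anything not explicitly labeled "low" is treated as high risk.
    if action.risk != "low" and not approve(action):
        return "blocked: human approval denied"
    return f"executed: {action.description}"
```

Treating unlabeled or unknown risk levels as high risk is a deliberate fail-safe choice in this sketch: an agent cannot bypass review by omitting or misspelling the label.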
Business Impact and Examples
Nvidia IT:
Building an "AI factory" to enable self-service AI agent development
Deploying agents for alert noise reduction, incident summarization, root cause analysis
Okta:
Democratizing LLM usage across the organization, not just for engineers
Creating an "AI Gallery" to share prompts and workflows
Automating scheduling, travel booking, and conference room planning
Key Takeaways
AI is transforming IT operations from reactive to proactive and predictive
SREs must evolve to become managers and trainers of an AI-powered "digital workforce"
Establishing secure, transparent, and collaborative AI agent systems is critical
Leveraging AI to shift left in the development lifecycle can improve resilience
Successful AIOps requires a balance of human and machine capabilities