AWS re:Invent 2025 - Unlocking Enterprise Resilience: AI and Automation in Action (AIM101)

Evolving the Role of the SRE (Site Reliability Engineer)

SREs are transitioning from reactive incident responders to proactive managers of AI agents

AI agents are becoming the new "employees" that handle day-to-day IT operations

SREs must now onboard, train, monitor, and optimize these AI agents like human workers

Key responsibilities include:

Ensuring AI agents operate within security and compliance policies
Continuously evaluating agent performance and accuracy
Providing feedback to improve agent capabilities over time
Maintaining a registry of available AI agents and their specialties
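
The registry and policy responsibilities above can be pictured as a small in-memory catalog. This is a minimal sketch rather than anything shown in the session: the `AgentRecord` and `AgentRegistry` classes, their field names, and the action strings are all hypothetical, and a real deployment would presumably sit on a managed service with proper access control.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentRecord:
    """One registry entry. All field names are illustrative assumptions."""
    name: str
    specialties: set[str]
    allowed_actions: set[str] = field(default_factory=set)
    evaluations: list[bool] = field(default_factory=list)


class AgentRegistry:
    """Hypothetical catalog of agents, their specialties, and their permissions."""

    def __init__(self) -> None:
        self._agents: dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        """Onboard an agent, replacing any earlier entry with the same name."""
        self._agents[record.name] = record

    def find_by_specialty(self, specialty: str) -> list[str]:
        """Return the names of agents that list the given specialty, sorted."""
        return sorted(
            name for name, rec in self._agents.items() if specialty in rec.specialties
        )

    def is_permitted(self, name: str, action: str) -> bool:
        """Policy check: unknown agents and unlisted actions are denied."""
        record = self._agents.get(name)
        return record is not None and action in record.allowed_actions

    def record_evaluation(self, name: str, correct: bool) -> None:
        """Store one reviewed outcome so accuracy can be tracked over time."""
        self._agents[name].evaluations.append(correct)

    def accuracy(self, name: str) -> Optional[float]:
        """Fraction of reviewed outcomes that were correct, or None if none exist."""
        evaluations = self._agents[name].evaluations
        if not evaluations:
            return None
        return sum(evaluations) / len(evaluations)
```

Denying by default in `is_permitted` is a deliberate choice in this sketch: an agent that has not been onboarded, or an action that has not been explicitly granted, is treated as out of policy.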

Foundational AI Ops Capabilities

Pre-built connectors to core data sources (logs, metrics, alerts, tickets)

Structured and unstructured data processing pipelines

Summarization, root cause analysis, anomaly detection

Extensible blueprints to quickly build new AI agent workflows

Deployment, governance, monitoring, and registry for AI agents
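
To make the anomaly detection capability concrete, here is a minimal sketch that flags metric samples deviating sharply from a trailing window. The session summary does not say which technique the platforms use; the rolling z-score, the `find_anomalies` name, and the default `window` and `threshold` values are assumptions chosen only for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values: list[float], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return indices of samples that deviate from the preceding `window`
    samples by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is treated as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds with one obvious spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
spikes = find_anomalies(latency_ms)
```

A limitation worth noting: once a spike enters the trailing window it inflates the standard deviation, so samples shortly after it are less likely to be flagged.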

Shifting Left with AI

Using AI to shift incident management "left" in the development lifecycle

Embedding AI agents directly into developer workflows to:

Detect potential issues before deployment
Automatically generate test cases and validate changes
Provide contextual guidance on past incidents

The goal is to reduce toil and improve resilience through proactive, automated processes

Balancing Human and AI Capabilities

Humans excel at novel problem solving and creative thinking

AI agents excel at pattern recognition, data correlation, and automating repetitive tasks

The key is to leverage the strengths of both, keeping humans in the loop for high-risk decisions

Importance of "showing work" - AI agents must be transparent about their reasoning

Establishing secure communication and authorization protocols between AI agents
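
The human-in-the-loop and "showing work" points above can be sketched as an approval gate. Everything named here is hypothetical: the `ProposedAction` type, the two-level `risk` scale, and the `approve` callback standing in for a human reviewer are assumptions for illustration, not a description of what the speakers built.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """An action an agent wants to take. Field names are illustrative."""
    agent: str
    description: str
    risk: str             # "low" or "high"; the two-level scale is an assumption
    reasoning: list[str]  # the agent's "shown work", one step per entry


def execute_with_oversight(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Decide whether an action runs, and report why."""
    # Transparency rule: an action with no stated reasoning never runs.
    if not action.reasoning:
        return "rejected: no reasoning provided"
    # High-risk actions require an explicit human decision.
    if action.risk == "high":
        if not approve(action):
            return "rejected: human reviewer declined"
        return "executed after human approval"
    # Low-risk actions run without interrupting a human.
    return "executed automatically"


restart = ProposedAction(
    agent="remediation-bot",
    description="Restart the payment service",
    risk="high",
    reasoning=["Error rate rose after the last deploy", "Restart cleared it in a past incident"],
)
outcome = execute_with_oversight(restart, approve=lambda a: False)
```

Passing the reviewer in as a callback keeps the gate independent of how approval is collected, whether that is a chat prompt, a ticket, or a paging tool.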

Business Impact and Examples

Nvidia IT:

Building an "AI factory" to enable self-service AI agent development
Deploying agents for alert noise reduction, incident summarization, root cause analysis

Okta:

Democratizing LLM usage across the organization, not just for engineers
Creating an "AI Gallery" to share prompts and workflows
Automating scheduling, travel booking, and conference room planning

Key Takeaways

AI is transforming IT operations from reactive to proactive and predictive

SREs must evolve to become managers and trainers of an AI-powered "digital workforce"

Establishing secure, transparent, and collaborative AI agent systems is critical

Leveraging AI to shift left in the development lifecycle can improve resilience

Successful AI Ops requires a balance of human and machine capabilities
