AWS re:Invent 2025 - Unlocking Enterprise Resilience: AI and Automation in Action (AIM101)

Unlocking Enterprise Resilience: AI and Automation in Action

Evolving the Role of the SRE (Site Reliability Engineer)

  • SREs are transitioning from reactive incident responders to proactive managers of AI agents
  • AI agents are becoming the new "employees" that handle day-to-day IT operations
  • SREs must now onboard, train, monitor, and optimize these AI agents like human workers
  • Key responsibilities include:
    • Ensuring AI agents operate within security and compliance policies
    • Continuously evaluating agent performance and accuracy
    • Providing feedback to improve agent capabilities over time
    • Maintaining a registry of available AI agents and their specialties
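
The responsibilities above (onboarding, policy boundaries, feedback, and a registry of specialties) can be pictured as a small data structure. The following is only a minimal sketch: the class names, fields, agent names, and action strings are all hypothetical and do not come from the talk or from any specific product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentRecord:
    """One registered agent. All field names are illustrative assumptions."""
    name: str
    specialties: set[str]
    allowed_actions: set[str]  # the policy boundary for this agent
    outcomes: list[bool] = field(default_factory=list)  # True = reviewed as correct

    def accuracy(self) -> Optional[float]:
        """Share of reviewed results that were correct, or None if none reviewed."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)


class AgentRegistry:
    """In-memory registry; a real one would need persistence and access control."""

    def __init__(self) -> None:
        self._agents: dict[str, AgentRecord] = {}

    def onboard(self, record: AgentRecord) -> None:
        self._agents[record.name] = record

    def get(self, name: str) -> AgentRecord:
        return self._agents[name]

    def find_by_specialty(self, specialty: str) -> list[str]:
        return sorted(a.name for a in self._agents.values() if specialty in a.specialties)

    def is_allowed(self, name: str, action: str) -> bool:
        return action in self._agents[name].allowed_actions

    def record_feedback(self, name: str, correct: bool) -> None:
        self._agents[name].outcomes.append(correct)


# Hypothetical agents, used only to exercise the sketch.
registry = AgentRegistry()
registry.onboard(AgentRecord("triage-bot", {"alert-triage"}, {"read_logs", "open_ticket"}))
registry.onboard(AgentRecord("rca-bot", {"root-cause-analysis"}, {"read_logs"}))
registry.record_feedback("rca-bot", True)
registry.record_feedback("rca-bot", False)
```

Keeping the allowed actions and the feedback history on the same record means one lookup answers both "may this agent do this?" and "how well has it been doing?".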

Foundational AIOps Capabilities

  • Pre-built connectors to core data sources (logs, metrics, alerts, tickets)
  • Structured and unstructured data processing pipelines
    • Summarization, root cause analysis, anomaly detection
  • Extensible blueprints to quickly build new AI agent workflows
  • Deployment, governance, monitoring, and registry for AI agents
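
Anomaly detection is listed above without detail. As one simple illustration, and not the method described in the talk, a trailing-window z-score flags metric points that sit far from recent history. The function name, window size, threshold, and sample latencies below are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates from the mean of the preceding
    `window` points by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation (an assumption).
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up latency samples in milliseconds with a single spike at index 6.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
spikes = find_anomalies(latency_ms)
print(spikes)  # [6]
```

Note that the spike itself inflates the window that follows it, so a run of consecutive anomalies can be masked; production detectors usually use more robust statistics.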

Shifting Left with AI

  • Using AI to shift incident management "left" in the development lifecycle
  • Embedding AI agents directly into developer workflows to:
    • Detect potential issues before deployment
    • Automatically generate test cases and validate changes
    • Provide contextual guidance on past incidents
  • The goal is to reduce toil and improve resilience through proactive, automated processes
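
One way to picture "contextual guidance on past incidents" in a developer workflow is a pre-deployment check that matches the files in a proposed change against files implicated in earlier incidents. This is a sketch under stated assumptions: the `Incident` record, the incident IDs, and the file paths are invented for illustration, and a real agent would likely use richer signals than file overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    summary: str
    files: frozenset[str]  # files implicated in the incident


def related_incidents(changed_files, history):
    """Return past incidents that touched any file in the proposed change."""
    changed = set(changed_files)
    return [inc for inc in history if inc.files & changed]


# Invented incident history, for illustration only.
history = [
    Incident("INC-1", "Connection pool exhausted", frozenset({"db/pool.py"})),
    Incident("INC-2", "Bad cache TTL", frozenset({"cache/config.py"})),
]

hits = related_incidents(["db/pool.py", "README.md"], history)
for inc in hits:
    print(f"{inc.incident_id}: {inc.summary}")
```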

Balancing Human and AI Capabilities

  • Humans excel at novel problem solving and creative thinking
  • AI agents excel at pattern recognition, data correlation, and automating repetitive tasks
  • The key is to use the strengths of both, keeping humans in the loop for high-risk decisions
  • "Showing work" matters: AI agents must be transparent about their reasoning
  • Establishing secure communication and authorization protocols between AI agents
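
The human-in-the-loop and "showing work" points can be combined into a single approval gate: an agent's proposal is rejected if it carries no reasoning, runs automatically if it is low risk, and waits for a person otherwise. The sketch below is hypothetical throughout; the risk scale, the 0.5 threshold, and the `approve` callback are assumptions, not something specified in the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str
    risk: float           # 0.0 (harmless) to 1.0 (dangerous); the scale is an assumption
    reasoning: list[str]  # the agent's "shown work"


def decide(action: ProposedAction,
           approve: Callable[[ProposedAction], bool],
           risk_threshold: float = 0.5) -> str:
    """Gate an agent's proposal on transparency first, then on risk."""
    if not action.reasoning:
        return "rejected: no reasoning provided"
    if action.risk >= risk_threshold:
        # High-risk path: a human sees the reasoning and makes the call.
        return "executed after human approval" if approve(action) else "rejected by human"
    return "executed automatically"


restart = ProposedAction("restart stateless worker", 0.1, ["health check failed 3 times"])
failover = ProposedAction("fail over primary database", 0.9, ["replication lag rising"])
opaque = ProposedAction("delete old volumes", 0.2, [])

# Stand-in for a real review step; here the human always declines.
def always_decline(action: ProposedAction) -> bool:
    return False

print(decide(restart, always_decline))   # executed automatically
print(decide(failover, always_decline))  # rejected by human
print(decide(opaque, always_decline))    # rejected: no reasoning provided
```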

Business Impact and Examples

  • NVIDIA IT:
    • Building an "AI factory" to enable self-service AI agent development
    • Deploying agents for alert noise reduction, incident summarization, root cause analysis
  • Okta:
    • Democratizing LLM usage across the organization, not just for engineers
    • Creating an "AI Gallery" to share prompts and workflows
    • Automating scheduling, travel booking, and conference room planning
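
Alert noise reduction, mentioned in the first example above, is not explained in the summary. A common baseline, shown here purely as an illustration and not as the approach either company uses, is to suppress repeats of the same alert within a time window. The alert fields (`ts`, `service`, `check`) and the 300-second window are assumptions.

```python
def suppress_duplicates(alerts, window_seconds=300):
    """Keep the first alert per (service, check) key and drop repeats that
    arrive within `window_seconds` of the last kept alert for that key."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["check"])
        previous = last_kept.get(key)
        if previous is None or alert["ts"] - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert["ts"]
    return kept


# Made-up alerts; "ts" is seconds since an arbitrary start.
alerts = [
    {"ts": 0, "service": "api", "check": "latency"},
    {"ts": 60, "service": "api", "check": "latency"},   # repeat, suppressed
    {"ts": 120, "service": "db", "check": "disk"},
    {"ts": 400, "service": "api", "check": "latency"},  # outside the window, kept
]
kept = suppress_duplicates(alerts)
print([a["ts"] for a in kept])  # [0, 120, 400]
```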

Key Takeaways

  • AI is transforming IT operations from reactive to proactive and predictive
  • SREs must evolve to become managers and trainers of an AI-powered "digital workforce"
  • Establishing secure, transparent, and collaborative AI agent systems is critical
  • Leveraging AI to shift left in the development lifecycle can improve resilience
  • Successful AIOps requires a balance of human and machine capabilities
