AWS re:Invent 2025 - AI-Driven Automation for Modern Operations (AIM219)

Leveraging AI-Driven Automation for Modern Operations

Aligning on Operational Metadata and Critical User Journeys

  • Warner Bros. Discovery (WBD) had to merge two organizations and launch a new streaming application (HBO Max) within nine months.
  • To establish consistency and standardized processes, they developed an "Operational Metadata (OMD) Schema" to catalog services and systems throughout the software development lifecycle.
  • This allowed them to trace incidents back to specific business functions and microservices, providing crucial context during incident response.
  • They also defined "Critical User Journeys" (CUJs): horizontal slices of functionality that matter most to customers. CUJs helped prioritize work and classify incidents.
  • Involving cross-functional stakeholders in defining CUJs and OMD helped create a common language and alignment around reliability priorities.
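
To make the catalog idea concrete, here is a minimal sketch of how an operational metadata record might tie a microservice to a business function and to the CUJs it supports. The talk summary does not describe WBD's actual schema, so every field name, service name, and journey name below is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    """One hypothetical OMD catalog entry (field names are assumptions)."""
    name: str
    business_function: str
    owning_team: str
    critical_user_journeys: tuple = ()
    runbook_url: str = ""


class ServiceCatalog:
    """In-memory lookup from service name to operational metadata."""

    def __init__(self, records):
        self._by_name = {record.name: record for record in records}

    def context_for(self, service_name):
        """Return incident context for a service, or None if it is not cataloged."""
        record = self._by_name.get(service_name)
        if record is None:
            return None
        return {
            "service": record.name,
            "business_function": record.business_function,
            "owning_team": record.owning_team,
            "affected_cujs": list(record.critical_user_journeys),
            # An incident touching any CUJ is a candidate for higher priority.
            "touches_cuj": bool(record.critical_user_journeys),
            "runbook_url": record.runbook_url,
        }

    def services_in_journey(self, cuj):
        """List, in sorted order, the services that participate in a given CUJ."""
        return sorted(
            record.name
            for record in self._by_name.values()
            if cuj in record.critical_user_journeys
        )


# Illustrative data only; none of these names come from the talk.
catalog = ServiceCatalog([
    ServiceRecord("playback-api", "Video Playback", "team-playback",
                  ("start-playback",), "https://example.com/runbooks/playback"),
    ServiceRecord("auth-service", "Identity", "team-identity",
                  ("sign-in", "start-playback")),
    ServiceRecord("recs-batch", "Recommendations", "team-discovery"),
])
```

With a structure like this, an alert naming `playback-api` can be traced to its business function and affected journeys in a single lookup, which is the kind of context the bullets above describe responders needing.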

Evolving the Incident Management Lifecycle with AI

  • As incidents occur, WBD uses PagerDuty capabilities to streamline incident management:
    • Noise Reduction: Automatically merges similar alerts into a single incident, reducing noise by 40-50%.
    • Operations Console: Provides a real-time view of filtered incidents, allowing cross-functional visibility during critical launches.
    • Custom Fields: Capture metadata like blast radius and support ticket counts to enable data-driven decision making.
  • For well-understood incidents, WBD envisions a "fully AI-led incident management lifecycle" with automated detection, status updates, and potentially some mitigations, while maintaining human oversight.
  • Agents are used to:
    • Provide context by referencing the OMD service catalog and runbooks during incidents.
    • Generate incident timelines and summaries for post-incident reviews, surfacing insights not easily captured by humans.
    • Validate and improve agent performance by testing against historical incident data.
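
The noise-reduction step can be illustrated with a deliberately simple sketch: alerts that share a service and a normalized summary, and that arrive within a time window of each other, are folded into one incident. This is not PagerDuty's algorithm, which the summary does not describe; the grouping key, the 300-second window, and all names here are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    summary: str
    timestamp: float  # seconds since an arbitrary epoch


@dataclass
class Incident:
    service: str
    summary: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self):
        return self.alerts[-1].timestamp


def group_alerts(alerts, window_seconds=300.0):
    """Merge alerts with the same (service, summary) key that arrive within
    window_seconds of the previous alert in the same incident."""
    incidents = []
    latest_by_key = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.summary.strip().lower())
        incident = latest_by_key.get(key)
        if incident is None or alert.timestamp - incident.last_seen > window_seconds:
            # No recent incident for this key: open a new one.
            incident = Incident(service=alert.service, summary=alert.summary)
            incidents.append(incident)
            latest_by_key[key] = incident
        incident.alerts.append(alert)
    return incidents


def noise_reduction(alerts, incidents):
    """Fraction of alerts that did not become their own incident."""
    if not alerts:
        return 0.0
    return 1.0 - len(incidents) / len(alerts)
```

Measuring `noise_reduction` over a real alert stream is one way to check a figure like the 40-50% quoted above against your own data, whatever grouping logic is actually in use.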

Building Trust and Improving Agents Over Time

  • WBD onboards "early adopter" teams to pilot new agent-based capabilities, gathering feedback to refine the solutions.
  • They leverage historical incident data to validate agent outputs, ensuring accuracy before expanding usage.
  • Post-incident reviews are used to further improve agent performance, with agents analyzing past incidents to identify systemic issues and opportunities for enhancement.
  • Integrating agent-based capabilities into the operational excellence program ensures continuous improvement and accountability.
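
Validation against historical incidents can be sketched as a small replay harness: feed each past incident to the agent, compare its answer with what the post-incident review confirmed, and only widen the rollout once accuracy clears a bar. The agent interface (a function from alert text to a service name), the 0.9 threshold, and the keyword-matching stub are all hypothetical; the summary does not say how WBD scores its agents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalIncident:
    incident_id: str
    alert_text: str
    confirmed_service: str  # ground truth taken from the post-incident review


def replay(agent, history):
    """Run the agent on each past incident and record whether it matched."""
    results = []
    for incident in history:
        predicted = agent(incident.alert_text)
        results.append(
            (incident.incident_id, predicted, predicted == incident.confirmed_service)
        )
    return results


def accuracy(results):
    """Share of replayed incidents where the agent matched the confirmed answer."""
    if not results:
        return 0.0
    return sum(1 for _, _, correct in results if correct) / len(results)


def misses(results):
    """Incident IDs the agent got wrong, for feeding back into reviews."""
    return [incident_id for incident_id, _, correct in results if not correct]


def ready_to_expand(results, threshold=0.9):
    """Gate wider rollout on replay accuracy (threshold is an assumption)."""
    return accuracy(results) >= threshold


def keyword_agent(alert_text):
    """Stand-in for a real agent: picks a service by naive keyword match."""
    text = alert_text.lower()
    if "login" in text or "token" in text:
        return "auth-service"
    if "stream" in text or "playback" in text:
        return "playback-api"
    return "unknown"
```

The list returned by `misses` is the useful part in practice: those incidents are the ones worth reviewing with the early adopter teams before the agent is offered more widely.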

Key Takeaways

  • Investing in a robust operational metadata schema and critical user journeys provides a foundation for effective AI-driven automation.
  • Automating well-understood incident management tasks (e.g., status updates, timelines) can significantly reduce toil and context switching.
  • Agents can augment human expertise by providing contextual information, surfacing insights from historical data, and automating repetitive processes.
  • Building trust in agent-based capabilities requires validation against real-world incidents and continuous feedback loops to improve performance over time.
  • Aligning cross-functional stakeholders on reliability priorities and processes is crucial for successful adoption of AI-driven automation.
