AWS re:Invent 2025 - Agentic Workflows: How Salesforce Manages 1000+ Clusters (OPN310)

Agentic Workflows: How Salesforce Manages 1000+ Kubernetes Clusters

Kubernetes Operational Challenges

Scaling Kubernetes Operations

  • Salesforce manages over 1,400 Kubernetes clusters across multiple cloud providers
  • They run hundreds of thousands of compute nodes and millions of pods
  • Scaling operations to support 5x growth in the next few years is a key business goal

Operational Complexity and Toil

  • Dealing with constant alerts, metrics, logs, and tracing data across a large fleet
  • Isolating issues, identifying root causes, and applying fixes is extremely time-consuming
  • Engineers spend more time troubleshooting than actually resolving problems

Limitations of Existing Tooling

  • Siloed tools that don't integrate well, requiring manual context switching
  • Steep learning curve for engineers to become proficient with all the different tools
  • Limited feedback loops to continuously improve the tooling and workflows

Introducing Agentic Workflows and AI Ops

What are Agentic Workflows?

  • Agents with specific goals, tools, and memory (short-term and long-term)
  • Agents can invoke actions and have a tight observation loop to monitor performance
  • Agent types range from simple assistants and deterministic agents to autonomous agents and multi-agent collaboration
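
The agent anatomy described above (a goal, tools, short-term and long-term memory, and an observation recorded after each action) can be sketched roughly as follows. This is a minimal illustration, not the framework shown in the talk: the `Agent` class, its field names, and the `fake_pod_status` tool are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Hypothetical agent: a goal, named tools, and two kinds of memory."""
    goal: str
    tools: Dict[str, Callable[[str], str]]
    short_term: List[str] = field(default_factory=list)      # current task only
    long_term: Dict[str, str] = field(default_factory=dict)  # kept across tasks

    def act(self, tool_name: str, argument: str) -> str:
        """Invoke one tool, then record what was observed."""
        observation = self.tools[tool_name](argument)
        self.short_term.append(f"{tool_name}({argument}) -> {observation}")
        return observation

    def finish(self, task: str, conclusion: str) -> None:
        """Persist the conclusion and clear the short-term memory."""
        self.long_term[task] = conclusion
        self.short_term.clear()


def fake_pod_status(pod: str) -> str:
    # Stand-in for a real cluster query; always returns the same string.
    return "CrashLoopBackOff"


agent = Agent(goal="keep pods healthy", tools={"pod_status": fake_pod_status})
status = agent.act("pod_status", "checkout-7d9f")
agent.finish("checkout-7d9f", f"pod was in {status}")
```

Keeping the two memories separate means the per-task trace can be discarded while the conclusion stays available for later tasks.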

Benefits of Agentic Workflows

  • Correlate telemetry signals (metrics, logs, traces) to identify root causes faster
  • Provide intelligent recommendations and remediation steps to resolve issues
  • Automate repetitive tasks and reduce human operational toil

Salesforce's Agentic Workflow Prototype

  • Integrated three agents:
    1. Prometheus agent to fetch metrics and utilization data
    2. K8sGPT agent to analyze Kubernetes events and pod logs
    3. Argo CD agent to perform remediation actions (e.g., scaling, restarting pods)
  • Centralized "collaborator" agent that orchestrates the individual agents
  • Allows operators to ask natural language questions and get automated responses
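
The prototype's shape, a collaborator that routes an operator's question to specialized agents, might look something like the sketch below. The three agent functions are stubs returning canned text, and the keyword routing table is an assumption made for illustration; the real collaborator presumably relies on an LLM to choose agents, and none of these names come from the talk.

```python
from typing import Callable, Dict, List


# Stub agents: each stands in for a real integration and returns canned text.
def metrics_agent(question: str) -> str:
    return "cpu utilization 95% on deployment checkout"


def events_agent(question: str) -> str:
    return "pod checkout-7d9f OOMKilled 3 times"


def remediation_agent(question: str) -> str:
    return "proposed: scale deployment checkout to 5 replicas"


# Hypothetical keyword routing; a real collaborator would likely let an LLM decide.
ROUTES: Dict[str, Callable[[str], str]] = {
    "cpu": metrics_agent,
    "memory": metrics_agent,
    "crash": events_agent,
    "log": events_agent,
    "scale": remediation_agent,
    "restart": remediation_agent,
}


def collaborator(question: str) -> List[str]:
    """Send the question to every agent whose keyword appears in it."""
    lowered = question.lower()
    selected: List[Callable[[str], str]] = []
    for keyword, agent in ROUTES.items():
        if keyword in lowered and agent not in selected:
            selected.append(agent)
    return [agent(question) for agent in selected]


answers = collaborator("Why is checkout crashing, and is CPU high?")
```

Here the question mentions both CPU and crashing, so the metrics and events agents are consulted and the remediation agent is not.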

Salesforce's Journey to AI-Powered Self-Healing

Evolving from Siloed Tools to Agentic Workflows

  • Existing tools (Sloop, Periscope, Kube Magic Mirror) were helpful but had limitations
  • Needed a more integrated solution to correlate data, identify issues, and automate remediation

Building the AI Ops Framework

  • Leveraged an internal agentic framework and an embedded RAG (retrieval-augmented generation) database for data governance
  • Implemented a multi-agent architecture with a "manager" agent orchestrating worker agents
  • Worker agents fetch telemetry data from various sources (logs, metrics, events, traces)
  • Manager agent uses LLMs and runbook knowledge to diagnose issues and generate remediation plans
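
As a rough sketch of the manager/worker split described above: workers each return findings from one telemetry source, and the manager matches the combined findings against runbook entries to produce a plan. Everything here is hypothetical, including the `WORKERS` and `RUNBOOK` tables, and plain substring matching stands in for the LLM-based diagnosis mentioned in the talk.

```python
from typing import Callable, Dict, List

# Hypothetical workers, one per telemetry source; each returns canned findings.
WORKERS: Dict[str, Callable[[str], List[str]]] = {
    "logs": lambda svc: [f"{svc}: OutOfMemoryError in worker thread"],
    "metrics": lambda svc: [f"{svc}: memory usage at 98% of limit"],
    "events": lambda svc: [f"{svc}: container OOMKilled"],
    "traces": lambda svc: [],
}

# Hypothetical runbook: symptom keyword -> ordered remediation steps.
RUNBOOK: Dict[str, List[str]] = {
    "oomkilled": ["raise memory limit", "restart affected pods"],
    "imagepullbackoff": ["verify image tag", "check registry credentials"],
}


def manager(service: str) -> Dict[str, List[str]]:
    """Collect findings from every worker, then match them against the runbook."""
    findings = [f for worker in WORKERS.values() for f in worker(service)]
    text = " ".join(findings).lower()
    plan = [
        step
        for symptom, steps in RUNBOOK.items()
        if symptom in text
        for step in steps
    ]
    return {"findings": findings, "plan": plan}


report = manager("checkout")
```

The plan contains only steps that exist in the runbook, which is one reason well-structured runbooks matter in this kind of design.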

Ensuring Safe Operations

  • Implemented "safe operations" using Argo Workflows with the necessary guardrails
    • Respecting disruption budgets, scaling limits, and other operational constraints
    • Requiring human approval for critical actions before execution in production
  • Established progressive autonomy, starting with full human oversight and gradually increasing AI autonomy
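
The guardrails and progressive autonomy above could be expressed as a gate that every proposed action passes before execution. The sketch below is an assumption-laden illustration, not Salesforce's Argo Workflows setup: the `Autonomy` levels, the `Action` and `Guardrails` fields, and the returned status strings are all invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Autonomy(Enum):
    HUMAN_APPROVES_ALL = 1       # every action needs sign-off
    HUMAN_APPROVES_CRITICAL = 2  # only critical actions need sign-off
    FULLY_AUTONOMOUS = 3


@dataclass
class Action:
    name: str
    pods_disrupted: int
    target_replicas: int
    critical: bool


@dataclass
class Guardrails:
    max_unavailable: int  # stand-in for a disruption budget
    min_replicas: int
    max_replicas: int
    autonomy: Autonomy

    def decide(self, action: Action, approved: bool = False) -> str:
        """Hard limits are checked first; approval cannot override them."""
        if action.pods_disrupted > self.max_unavailable:
            return "rejected: exceeds disruption budget"
        if not self.min_replicas <= action.target_replicas <= self.max_replicas:
            return "rejected: outside scaling limits"
        needs_human = (
            self.autonomy is Autonomy.HUMAN_APPROVES_ALL
            or (self.autonomy is Autonomy.HUMAN_APPROVES_CRITICAL and action.critical)
        )
        if needs_human and not approved:
            return "pending: human approval required"
        return "allowed"


rails = Guardrails(max_unavailable=1, min_replicas=2, max_replicas=10,
                   autonomy=Autonomy.HUMAN_APPROVES_CRITICAL)
restart = Action("restart pod", pods_disrupted=1, target_replicas=3, critical=False)
scale_up = Action("scale up", pods_disrupted=0, target_replicas=50, critical=False)
drain = Action("drain node", pods_disrupted=1, target_replicas=3, critical=True)
```

Raising the autonomy level over time changes only which actions wait for a human; the budget and scaling checks apply at every level.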

Key Outcomes and Learnings

  • 30% improvement in troubleshooting time
  • 150 hours per month saved in operational toil
  • Importance of well-structured runbooks and knowledge graphs for complex issue diagnosis
  • Continuous feedback loops to improve agent performance and runbook accuracy
  • Leveraging existing tools and integrating them into the agentic framework

Future Directions

Connecting the Dots with Knowledge Graphs

  • Capturing infrastructure topology, component relationships, and failure modes in a structured knowledge graph
  • Enabling AI to traverse the knowledge graph and diagnose complex, multi-component issues
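
One way to picture the graph traversal described above: starting from the alerting component, follow dependency edges through unhealthy components and report the deepest unhealthy ones as root-cause candidates. The topology in `DEPENDS_ON` and the component names are hypothetical, and a production knowledge graph would also encode failure modes, which this sketch omits.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout-api": ["payments-svc", "cart-svc"],
    "payments-svc": ["postgres"],
    "cart-svc": ["redis"],
    "postgres": ["ebs-volume"],
    "redis": [],
    "ebs-volume": [],
}


def root_cause_candidates(start: str, unhealthy: Set[str]) -> List[str]:
    """Breadth-first walk from `start` through unhealthy components only.

    Returns the unhealthy components that have no unhealthy dependency.
    """
    candidates: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        bad_deps = [d for d in DEPENDS_ON.get(node, []) if d in unhealthy]
        if node in unhealthy and not bad_deps:
            candidates.append(node)
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates


result = root_cause_candidates(
    "checkout-api", {"checkout-api", "payments-svc", "postgres"}
)
```

In this example the alert fires on `checkout-api`, but the walk ends at `postgres`, the only unhealthy component whose own dependencies are healthy.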

Leveraging Feedback and Historical Data

  • Recording successes and failures to improve the accuracy of root cause analysis and remediation
  • Using historical data to accelerate the diagnosis process and provide more reliable recommendations
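
A simple way to use recorded successes and failures is to rank candidate remediations for a symptom by their historical success rate. The `FeedbackLog` class below is a hypothetical sketch of that idea; the talk does not describe how outcomes are actually stored or scored.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class FeedbackLog:
    """Hypothetical store of remediation outcomes, keyed by symptom."""

    def __init__(self) -> None:
        # (symptom, remediation) -> [successes, attempts]
        self._stats: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])

    def record(self, symptom: str, remediation: str, succeeded: bool) -> None:
        stats = self._stats[(symptom, remediation)]
        stats[0] += int(succeeded)
        stats[1] += 1

    def ranked(self, symptom: str) -> List[Tuple[str, float]]:
        """Remediations tried for this symptom, best success rate first."""
        rates = [
            (remediation, successes / attempts)
            for (sym, remediation), (successes, attempts) in self._stats.items()
            if sym == symptom
        ]
        return sorted(rates, key=lambda item: item[1], reverse=True)


log = FeedbackLog()
log.record("OOMKilled", "restart pod", True)
log.record("OOMKilled", "restart pod", False)
log.record("OOMKilled", "raise memory limit", True)
log.record("OOMKilled", "raise memory limit", True)
ranking = log.ranked("OOMKilled")
```

A raw success rate favors remediations with very few attempts, so a real system would probably weight by sample size as well.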

Exploring AI for Anomaly Detection and Performance Troubleshooting

  • Empowering AI to analyze vast amounts of metrics, logs, and trace data to uncover hidden issues
  • Identifying anomalies and performance problems that are difficult for humans to detect
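
As a baseline for what anomaly detection on a metric can look like, the sketch below flags points that sit far from the mean of a trailing window (a rolling z-score). This is a deliberately simple statistical stand-in, not the AI-based approach the talk anticipates; the function name, window size, threshold, and sample data are all assumptions.

```python
from statistics import mean, stdev
from typing import List


def anomalies(values: List[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return indexes whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    flagged: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # A flat history has sigma == 0 and is skipped.
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


latency_ms = [100, 102, 99, 101, 100, 103, 250, 101, 100]
spikes = anomalies(latency_ms)
```

Only the 250 ms sample at index 6 is flagged; the points after it are not, because the spike inflates the window's standard deviation.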
