AWS re:Invent 2025 - Agentic Workflows: How Salesforce Manages 1000+ Clusters (OPN310)
Agentic Workflows: How Salesforce Manages 1000+ Kubernetes Clusters
Kubernetes Operational Challenges
Scaling Kubernetes Operations
Salesforce manages over 1,400 Kubernetes clusters across multiple cloud providers
They run hundreds of thousands of compute nodes and millions of pods
Scaling operations to support 5x growth in the next few years is a key business goal
Operational Complexity and Toil
Dealing with constant alerts, metrics, logs, and tracing data across a large fleet
Isolating issues, identifying root causes, and applying fixes is extremely time-consuming
Engineers spend more time troubleshooting than actually resolving problems
Limitations of Existing Tooling
Siloed tools that don't integrate well, requiring manual context switching
Steep learning curve for engineers to become proficient with all the different tools
Limited feedback loops to continuously improve the tooling and workflows
Introducing Agentic Workflows and AIOps
What are Agentic Workflows?
Agents with specific goals, tools, and memory (short-term and long-term)
Agents invoke actions and observe the results in a tight loop, so they can monitor how well each action worked
Agent types range from simple assistants and deterministic agents to autonomous agents and multi-agent collaboration
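The agent anatomy described above (a goal, tools, and short-term and long-term memory) can be sketched roughly as follows. This is a minimal illustration, not Salesforce's framework: the `Agent` class, the `fake_pod_status` tool, and the pod name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    goal: str
    tools: dict[str, Callable[[str], str]]
    # Short-term memory: observations gathered while working on the current task.
    short_term: list[str] = field(default_factory=list)
    # Long-term memory: conclusions kept across tasks.
    long_term: dict[str, str] = field(default_factory=dict)

    def act(self, tool_name: str, arg: str) -> str:
        """Invoke a tool and record what was observed (the observation loop)."""
        observation = self.tools[tool_name](arg)
        self.short_term.append(f"{tool_name}({arg}) -> {observation}")
        return observation

    def finish(self, key: str, conclusion: str) -> None:
        """Persist the conclusion and clear the per-task scratchpad."""
        self.long_term[key] = conclusion
        self.short_term.clear()


# Hypothetical tool: a real one would query the cluster instead of returning canned data.
def fake_pod_status(pod: str) -> str:
    return "CrashLoopBackOff" if pod == "checkout-7d9f" else "Running"


agent = Agent(goal="explain why a pod is unhealthy", tools={"pod_status": fake_pod_status})
status = agent.act("pod_status", "checkout-7d9f")
agent.finish("checkout-7d9f", f"last seen in {status}")
```

In a real agent, an LLM would choose which tool to call next based on the goal and the contents of short-term memory. Here the call sequence is hard-coded to keep the sketch short.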
Benefits of Agentic Workflows
Correlate telemetry signals (metrics, logs, traces) to identify root causes faster
Provide intelligent recommendations and remediation steps to resolve issues
Automate repetitive tasks and reduce human operational toil
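One simple way to picture signal correlation is to group metrics, logs, and traces from the same service that occur close together in time. The sketch below assumes a hypothetical `Signal` record and a fixed time window. The talk summary does not describe the actual correlation logic, so treat this as an illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    kind: str         # "metric", "log", or "trace"
    service: str
    timestamp: float  # seconds
    detail: str


def correlate(signals: list[Signal], window_seconds: float = 60.0) -> list[list[Signal]]:
    """Cluster each service's signals by time and keep clusters that mix signal kinds."""
    by_service: dict[str, list[Signal]] = {}
    for s in sorted(signals, key=lambda s: s.timestamp):
        by_service.setdefault(s.service, []).append(s)

    incidents: list[list[Signal]] = []
    for items in by_service.values():
        cluster = [items[0]]
        for s in items[1:]:
            if s.timestamp - cluster[-1].timestamp <= window_seconds:
                cluster.append(s)
                continue
            # Gap too large: close the current cluster and start a new one.
            if len({c.kind for c in cluster}) > 1:
                incidents.append(cluster)
            cluster = [s]
        if len({c.kind for c in cluster}) > 1:
            incidents.append(cluster)
    return incidents


signals = [
    Signal("metric", "checkout", 100.0, "p99 latency spike"),
    Signal("log", "checkout", 130.0, "connection refused to payments"),
    Signal("trace", "checkout", 150.0, "slow span: POST /pay"),
    Signal("log", "search", 900.0, "cache miss"),
]
incidents = correlate(signals)
```

With this sample data, the three `checkout` signals fall into one candidate incident, while the isolated `search` log line is ignored because no other signal kind corroborates it.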
Salesforce's Agentic Workflow Prototype
Integrated three agents:
Prometheus agent to fetch metrics and utilization data
K8sGPT agent to analyze Kubernetes events and pod logs
Argo CD agent to perform remediation actions (e.g., scaling, restarting pods)
Centralized "collaborator" agent that orchestrates the individual agents
Allows operators to ask natural language questions and get automated responses
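The collaborator pattern in this prototype can be sketched as a router that forwards an operator's question to the relevant specialist agents and gathers their answers. The three agents below are stubs with hypothetical names, and keyword matching stands in for the LLM-based routing a real collaborator would presumably use.

```python
from typing import Callable


# Stub agents: real ones would call a metrics store, a cluster analyzer,
# and a deployment tool. Names and return values are hypothetical.
def metrics_agent(question: str) -> str:
    return "metrics: utilization data (stub)"


def events_agent(question: str) -> str:
    return "events: recent pod events and logs (stub)"


def remediation_agent(question: str) -> str:
    return "remediation: proposed action (stub)"


# (agent name, trigger keywords, handler). Keyword routing is an assumption of this sketch.
ROUTES: list[tuple[str, tuple[str, ...], Callable[[str], str]]] = [
    ("metrics", ("cpu", "memory", "utilization"), metrics_agent),
    ("events", ("event", "crash", "pod log"), events_agent),
    ("remediation", ("scale", "restart"), remediation_agent),
]


def collaborator(question: str) -> dict[str, str]:
    """Send the question to every agent whose keywords match and collect the answers."""
    q = question.lower()
    return {
        name: agent(question)
        for name, keywords, agent in ROUTES
        if any(k in q for k in keywords)
    }


answers = collaborator("Why is memory utilization high, and are there crash events?")
```

For this question, only the metrics and events agents are consulted. A request such as "Please restart the checkout pods" would be routed to the remediation agent instead.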
Salesforce's Journey to AI-Powered Self-Healing
Evolving from Siloed Tools to Agentic Workflows
Existing tools (Sloop, Periscope, Kube Magic Mirror) were helpful but had limitations
Needed a more integrated solution to correlate data, identify issues, and automate remediation
Building the AIOps Framework
Leveraged an internal agentic framework and an embedded RAG (retrieval-augmented generation) database for data governance
Implemented a multi-agent architecture with a "manager" agent orchestrating worker agents
Worker agents fetch telemetry data from various sources (logs, metrics, events, traces)
Manager agent uses LLMs and runbook knowledge to diagnose issues and generate remediation plans
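The manager and worker split can be illustrated with a small sketch: workers return observed symptoms, and the manager picks the best-matching runbook to produce a remediation plan. Everything here is hypothetical, including the runbook contents and symptom names. Set overlap is a simple stand-in for the LLM reasoning and RAG retrieval the talk describes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Runbook:
    title: str
    symptoms: frozenset[str]
    steps: tuple[str, ...]


# Hypothetical runbooks, used only for illustration.
RUNBOOKS = [
    Runbook("Pod OOMKilled", frozenset({"OOMKilled", "memory_high"}),
            ("Raise the memory limit", "Restart the pod")),
    Runbook("Node disk pressure", frozenset({"DiskPressure"}),
            ("Cordon the node", "Clean up unused images")),
]


def collect_symptoms(workers: list[Callable[[], set[str]]]) -> set[str]:
    """Manager step 1: ask every worker agent for the symptoms it observed."""
    symptoms: set[str] = set()
    for fetch in workers:
        symptoms |= fetch()
    return symptoms


def plan(symptoms: set[str]) -> Optional[Runbook]:
    """Manager step 2: choose the runbook sharing the most symptoms, if any match."""
    best = max(RUNBOOKS, key=lambda r: len(r.symptoms & symptoms))
    return best if best.symptoms & symptoms else None


# Worker stubs standing in for metrics and event fetchers.
workers = [lambda: {"memory_high"}, lambda: {"OOMKilled"}]
chosen = plan(collect_symptoms(workers))
```

When no runbook matches, `plan` returns `None` rather than guessing, which leaves room to escalate to a human.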
Ensuring Safe Operations
Implemented "safe operations" using Argo Workflows with the necessary guardrails
Respecting disruption budgets, scaling limits, and other operational constraints
Requiring human approval for critical actions before execution in production
Established progressive autonomy, starting with full human oversight and gradually increasing AI autonomy
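The guardrails listed above can be pictured as a check that runs before any remediation action is executed. The sketch below is an assumption-laden illustration: the `Action`, `Guardrails`, and `Autonomy` types and the two autonomy levels are hypothetical, and a real system would enforce these checks inside the workflow engine rather than in a standalone function.

```python
from dataclasses import dataclass
from enum import Enum


class Autonomy(Enum):
    FULL_OVERSIGHT = 1   # every action needs human approval
    CRITICAL_ONLY = 2    # only critical actions need human approval


@dataclass(frozen=True)
class Action:
    name: str
    pods_disrupted: int
    target_replicas: int
    critical: bool


@dataclass(frozen=True)
class Guardrails:
    max_unavailable: int  # disruption budget for the workload
    max_replicas: int     # scaling limit
    autonomy: Autonomy


def evaluate(action: Action, rails: Guardrails, approved_by_human: bool = False) -> str:
    """Return a decision string for a proposed remediation action."""
    if action.pods_disrupted > rails.max_unavailable:
        return "rejected: exceeds disruption budget"
    if action.target_replicas > rails.max_replicas:
        return "rejected: exceeds scaling limit"
    needs_approval = rails.autonomy is Autonomy.FULL_OVERSIGHT or action.critical
    if needs_approval and not approved_by_human:
        return "pending: waiting for human approval"
    return "allowed"


rails = Guardrails(max_unavailable=1, max_replicas=10, autonomy=Autonomy.CRITICAL_ONLY)
restart = Action("restart pod", pods_disrupted=1, target_replicas=3, critical=False)
drain = Action("drain node", pods_disrupted=4, target_replicas=3, critical=True)
scale_out = Action("scale out", pods_disrupted=0, target_replicas=8, critical=True)
```

Progressive autonomy then amounts to moving a workload from `FULL_OVERSIGHT` to `CRITICAL_ONLY` once the agent has earned trust, without changing the hard limits.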
Key Outcomes and Learnings
30% improvement in troubleshooting time
150 hours per month saved in operational toil
Importance of well-structured runbooks and knowledge graphs for complex issue diagnosis
Continuous feedback loops to improve agent performance and runbook accuracy
Leveraging existing tools and integrating them into the agentic framework
Future Directions
Connecting the Dots with Knowledge Graphs
Capturing infrastructure topology, component relationships, and failure modes in a structured knowledge graph
Enabling AI to traverse the knowledge graph and diagnose complex, multi-component issues
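As a rough illustration of knowledge graph traversal, the sketch below walks a dependency graph from a failing service and reports the unhealthy components whose own dependencies are all healthy, which makes them likely root causes. The topology, component names, and health data are invented for the example.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "checkout-service": ["payments-service", "cart-db"],
    "payments-service": ["payments-db"],
    "cart-db": [],
    "payments-db": ["storage-node-7"],
    "storage-node-7": [],
}

# Hypothetical health snapshot.
UNHEALTHY = {"checkout-service", "payments-service", "payments-db", "storage-node-7"}


def probable_root_causes(start: str) -> list[str]:
    """Breadth-first walk along unhealthy dependencies, starting from a failing component."""
    seen = {start}
    queue = deque([start])
    roots = []
    while queue:
        node = queue.popleft()
        unhealthy_deps = [d for d in DEPENDS_ON.get(node, []) if d in UNHEALTHY]
        # An unhealthy node with no unhealthy dependencies is a candidate root cause.
        if node in UNHEALTHY and not unhealthy_deps:
            roots.append(node)
        for dep in unhealthy_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return roots


roots = probable_root_causes("checkout-service")
```

Here the walk follows checkout-service to payments-service to payments-db and stops at storage-node-7, skipping the healthy cart-db branch entirely.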
Leveraging Feedback and Historical Data
Recording successes and failures to improve the accuracy of root cause analysis and remediation
Using historical data to accelerate the diagnosis process and provide more reliable recommendations
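A feedback loop of this kind could be as simple as recording whether each remediation worked and ranking options by their historical success rate. The `FeedbackLog` class and the sample outcomes below are hypothetical. The talk summary does not say how feedback is actually stored or scored.

```python
from collections import defaultdict


class FeedbackLog:
    """Tracks remediation outcomes per symptom (illustrative sketch only)."""

    def __init__(self) -> None:
        # (symptom, remediation) -> [successes, attempts]
        self._outcomes: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

    def record(self, symptom: str, remediation: str, success: bool) -> None:
        stats = self._outcomes[(symptom, remediation)]
        stats[0] += int(success)
        stats[1] += 1

    def rank(self, symptom: str) -> list[tuple[str, float]]:
        """Remediations tried for this symptom, best historical success rate first."""
        rates = [
            (remediation, successes / attempts)
            for (sym, remediation), (successes, attempts) in self._outcomes.items()
            if sym == symptom
        ]
        return sorted(rates, key=lambda item: item[1], reverse=True)


log = FeedbackLog()
log.record("OOMKilled", "restart pod", success=True)
log.record("OOMKilled", "restart pod", success=False)
log.record("OOMKilled", "raise memory limit", success=True)
ranking = log.rank("OOMKilled")
```

A production version would likely weight recent outcomes more heavily and require a minimum number of attempts before trusting a rate.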
Exploring AI for Anomaly Detection and Performance Troubleshooting
Empowering AI to analyze vast amounts of metrics, logs, and trace data to uncover hidden issues
Identifying anomalies and performance problems that are difficult for humans to detect
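As a baseline for what anomaly detection on metrics can look like, the sketch below flags points that sit far from the mean in units of standard deviation (a z-score test). This is a generic statistical technique chosen for illustration, not something the talk attributes to Salesforce, and the latency values are made up.

```python
from statistics import mean, stdev


def zscore_anomalies(values: list[float], threshold: float = 2.5) -> list[int]:
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs at least two values
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
anomalous = zscore_anomalies(latencies, threshold=2.5)
```

One caveat: the outlier itself inflates the standard deviation, so in small samples a single spike can never score much above 2.8. Robust alternatives such as median-based scores, or a baseline computed from a trailing window, handle this better.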