Modern applications are highly federated and hybrid, with components spread across on-premises, cloud, and SaaS providers.
This leads to complex incident triage processes, often involving 140-170 specialists to diagnose and resolve issues.
IT spending is projected to increase by 9.8% in the next year, much of which is likely fueling LLM development.
The adoption of agentic AI and LLMs is rapidly increasing, with a 400% increase in the use of Amazon Bedrock this year alone.
As LLMs become more prevalent in application development, the velocity of new systems and metrics entering the operations domain will accelerate.
Leveraging LLMs for Proactive Operational Insights
Instead of handling incidents reactively and in silos, teams can use LLMs to proactively analyze logs and metrics for patterns and anomalies.
LLMs can be integrated with tools like ServiceNow to automatically raise incidents with relevant data, preparing the right teams to triage the issue.
Feedback loops can be established to continuously improve the LLM's understanding of the environment and common failure modes.
Action groups in Amazon Bedrock Agents can aggregate data from various sources (load balancers, AWS Health, support APIs, CloudWatch) and feed it into the LLM-powered incident management process.
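The flow above (collect metrics from several sources, spot an anomaly, prepare an incident with the relevant data) can be sketched in a few lines. This is a minimal illustration, not the approach described in the talk: a simple z-score check stands in for the LLM analysis, and the `MetricSeries` type, the source labels, and the payload field names are all assumptions made for the example. Nothing here calls ServiceNow or Bedrock.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Optional


@dataclass
class MetricSeries:
    source: str          # hypothetical label, e.g. "load-balancer"
    name: str            # hypothetical metric name, e.g. "5xx_rate"
    values: List[float]  # oldest first; the last value is the newest sample


def is_anomalous(series: MetricSeries, threshold: float = 3.0) -> bool:
    """Flag the newest sample if it lies more than `threshold` standard
    deviations from the mean of the earlier samples."""
    history, latest = series.values[:-1], series.values[-1]
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


def build_incident(series_list: List[MetricSeries]) -> Optional[dict]:
    """Return an incident payload covering every anomalous series, or None."""
    anomalies = [s for s in series_list if is_anomalous(s)]
    if not anomalies:
        return None
    return {
        # Field names are illustrative, not a confirmed ticketing schema.
        "short_description": f"Anomaly detected in {len(anomalies)} metric(s)",
        "description": "\n".join(
            f"{s.source}/{s.name}: latest={s.values[-1]}" for s in anomalies
        ),
        "sources": sorted({s.source for s in anomalies}),
    }
```

In a real pipeline the payload would be handed to the ticketing integration, and the outcome of the triage would feed back into the detection step.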
HSBC's Adoption of LLMs in Change Management
HSBC faced challenges in adopting LLMs for highly regulated processes like change management, due to the need for reliability and the risk of false positives/negatives.
They developed a "Service Management Quality Assurance" (SMQA) application that uses LLMs to automate the review of change requests, incident records, and problem tasks.
SMQA analyzes change details, testing outputs, implementation plans, and other metadata to flag changes that require manual review.
The solution was integrated with ServiceNow, allowing for automated quality assurance within the change management pipeline.
Parallel testing showed the LLM-powered solution was more effective at identifying anomalies than manual review, enabling automated review of 100% of changes and a reduction in disruptive changes.
Considerations for Operationalizing LLMs
Agent and MCP Server Management:
Decide whether to build custom agents, use vendor-provided agents, or allow decentralized agent development.
Establish architectural principles and governance around agent and MCP server deployment.
Data Sensitivity and Security:
Operational data (e.g., logs) may contain sensitive information not originally intended for LLM exposure.
Use tools like Amazon Bedrock Guardrails to control data flows and obfuscate sensitive information.
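As a rough illustration of obfuscating log data before it reaches a model, the sketch below masks a few common patterns with regular expressions. It is a local pre-filter written for this example, not a call to Amazon Bedrock's guardrail feature; the pattern set and the `redact` helper are assumptions, and a real deployment would need a much broader and tested set of rules.

```python
import re

# Illustrative patterns only; real logs need a broader, tested rule set.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "AWS_KEY_ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact(line: str) -> str:
    """Replace every match of each pattern with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        line = pattern.sub(f"<{label}>", line)
    return line
```

For example, `redact("login failed for alice@example.com from 10.0.0.12")` returns `"login failed for <EMAIL> from <IPV4>"`.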
Model and Agent Access Management:
Carefully manage access permissions, security groups, and firewalls to control what data the LLMs and agents can access.
Consider granular access controls to limit the scope of what individual agents and models can retrieve.
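One way to picture granular scoping is a per-agent allowlist that is checked before any data source is queried. The sketch below is an application-level illustration under assumed names (`AgentPolicy`, `fetch_for_agent`, and the source labels are hypothetical); it would complement, not replace, the network and identity controls mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class AgentPolicy:
    agent_id: str
    # Names of the data sources this agent may read (hypothetical labels).
    allowed_sources: FrozenSet[str] = field(default_factory=frozenset)


def authorize(policy: AgentPolicy, source: str) -> bool:
    """Deny by default: only explicitly listed sources are allowed."""
    return source in policy.allowed_sources


def fetch_for_agent(
    policy: AgentPolicy,
    source: str,
    fetchers: Dict[str, Callable[[], str]],
) -> str:
    """Run the fetcher for `source` only if the agent's policy permits it."""
    if not authorize(policy, source):
        raise PermissionError(f"{policy.agent_id} may not read {source}")
    return fetchers[source]()
```

Keeping the check in one function means every retrieval path goes through the same deny-by-default rule, which also gives a single place to log access decisions.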
Monitoring and Observability:
Monitor the performance and cost of LLM usage, including token consumption and response times.
Ensure you can track model upgrades and understand their impact on performance.
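A small aggregation over per-call records shows what tracking tokens, latency, and model versions might look like. This is a sketch under assumptions: the `CallRecord` fields and the per-1,000-token prices passed to `summarize` are hypothetical inputs, not figures from any provider.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallRecord:
    model_version: str
    input_tokens: int
    output_tokens: int
    latency_s: float


def summarize(
    records: List[CallRecord],
    price_per_1k_input: float,   # hypothetical price, supplied by the caller
    price_per_1k_output: float,  # hypothetical price, supplied by the caller
) -> Dict[str, Dict[str, float]]:
    """Group calls by model version and report volume, latency, and cost."""
    grouped: Dict[str, List[CallRecord]] = defaultdict(list)
    for record in records:
        grouped[record.model_version].append(record)

    summary: Dict[str, Dict[str, float]] = {}
    for version, calls in grouped.items():
        in_tok = sum(c.input_tokens for c in calls)
        out_tok = sum(c.output_tokens for c in calls)
        summary[version] = {
            "calls": len(calls),
            "input_tokens": in_tok,
            "output_tokens": out_tok,
            "avg_latency_s": sum(c.latency_s for c in calls) / len(calls),
            "est_cost": in_tok / 1000 * price_per_1k_input
            + out_tok / 1000 * price_per_1k_output,
        }
    return summary
```

Grouping by model version makes it straightforward to compare latency and cost before and after an upgrade.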
Traceability and Auditability:
Establish robust traceability mechanisms to log what data was accessed by the LLMs, what outputs were generated, and who consumed that information.
Ensure the integrity of this audit trail to support forensic analysis and legal requirements.
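One common technique for making an audit trail tamper-evident is a hash chain, where each record's hash covers the previous record's hash. The sketch below illustrates that idea only; the entry fields are hypothetical, and the talk does not state which mechanism was used. A hash chain detects modification but does not prevent it, so the latest hash still needs to be stored somewhere trusted.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def _digest(prev_hash: str, entry: Dict) -> str:
    """Hash the previous record's hash together with this entry."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append_entry(log: List[Dict], entry: Dict) -> None:
    """Append `entry`, linking it to the hash of the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"entry": entry, "prev_hash": prev_hash, "hash": _digest(prev_hash, entry)}
    )


def verify(log: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True
```

Each entry could record what data was accessed, what output was produced, and who consumed it, matching the three questions the audit trail has to answer.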
Key Takeaways
LLMs can transform operational intelligence by enabling proactive detection of issues, reducing manual triage efforts, and improving incident response.
Careful planning and governance around agent/MCP management, data security, access controls, and observability are critical for successful LLM operationalization.
HSBC's SMQA solution demonstrates the potential for LLMs to enhance strategic processes like change management, even in highly regulated industries.
By establishing feedback loops and continuously improving the LLM's understanding of the environment, organizations can unlock significant productivity gains and operational resilience.