AWS re:Invent 2025 - AI-powered SaaS Observability using OpenSearch (ISV313)

AI-powered SaaS Observability using OpenSearch

Overview

The session showed how AI can improve observability and troubleshooting in a multi-tenant SaaS application running on AWS.

The key focus areas were:

Natural language query generation in OpenSearch Dashboards
Semantic log search using vector embeddings
AI-powered root cause analysis and mitigation through the Model Context Protocol (MCP)

Multi-Tenant SaaS Demo Application

The presenters built a demo application based on the open-source OpenTelemetry demo, modified to run as a multi-tenant SaaS on Amazon EKS.

The application consists of several microservices (e.g. shipping, billing, checkout) running in separate namespaces for each tenant.

Observability data (logs, metrics, traces) from the application is collected by an OpenTelemetry collector and ingested into Amazon OpenSearch Service.
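With one namespace per tenant, a log record can be attributed to a tenant from its Kubernetes namespace. The following is a minimal sketch of that idea, not the presenters' code. It assumes the collector attaches the OpenTelemetry `k8s.namespace.name` resource attribute and that tenant namespaces follow a hypothetical `tenant-<id>` naming scheme; the record layout is also an assumption.

```python
from typing import Optional

# Hypothetical naming scheme: per-tenant namespaces look like "tenant-<id>".
TENANT_PREFIX = "tenant-"


def tenant_from_record(record: dict) -> Optional[str]:
    """Return the tenant id for a log record, or None if it has no tenant namespace."""
    # Assumed record layout: resource attributes live under the "resource" key.
    namespace = record.get("resource", {}).get("k8s.namespace.name", "")
    if namespace.startswith(TENANT_PREFIX):
        return namespace[len(TENANT_PREFIX):]
    return None
```

Records from services that do not run in a tenant namespace return `None`, so they can be handled separately from tenant traffic.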

Natural Language Query Generation

When faced with a 504 Gateway Timeout issue, the presenters first tried to troubleshoot it in OpenSearch Dashboards.

However, sifting through the large volume of raw logs was challenging, so they leveraged AI to generate a more targeted query.

Given a natural language prompt describing the issue, the AI identified relevant log entries mentioning "rate limit exceeded" in the shared shipping service.

The AI also generated a corresponding OpenSearch query in Piped Processing Language (PPL) to investigate the issue further.
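To make the idea of a generated query concrete, here is a sketch of what a targeted PPL query for this incident might look like, wrapped in the JSON body that the OpenSearch `_plugins/_ppl` endpoint accepts. The index name `app_logs` and the field name `body` are assumptions for illustration; the actual query produced in the demo is not reproduced in this summary.

```python
import json


def build_ppl_request(index: str, phrase: str, limit: int = 20) -> str:
    """Build the JSON body for a PPL query that finds logs containing a phrase."""
    if "'" in phrase:
        # Keep the sketch simple: refuse input that would break the string literal.
        raise ValueError("phrase must not contain single quotes")
    query = (
        f"source = {index} "
        f"| where like(body, '%{phrase}%') "  # 'body' is an assumed field name
        f"| head {limit}"
    )
    return json.dumps({"query": query})


# Hypothetical index name; send the result with POST _plugins/_ppl.
request_body = build_ppl_request("app_logs", "rate limit exceeded", limit=5)
```

The same query text can be pasted into the query editor in OpenSearch Dashboards instead of being sent over the API.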

Semantic Log Search

To address the challenge of not knowing what to search for, the presenters implemented semantic log search using vector embeddings.

They set up an Amazon OpenSearch Ingestion pipeline that samples log data and automatically generates vector embeddings using the Titan Text Embeddings V2 model on Amazon Bedrock.
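In the demo the pipeline generates the embeddings, so no application code is needed. For readers who want to see what the embedding step amounts to, the sketch below calls the same model directly through the Bedrock runtime API. It assumes a client created with `boto3.client("bedrock-runtime")` and is an illustration only, not the presenters' pipeline configuration.

```python
import json

# Bedrock model id for Titan Text Embeddings V2.
MODEL_ID = "amazon.titan-embed-text-v2:0"


def embed_log_line(client, text: str, dimensions: int = 1024) -> list:
    """Return the embedding vector for one log line.

    `client` is expected to behave like boto3.client("bedrock-runtime").
    """
    request = {"inputText": text, "dimensions": dimensions, "normalize": True}
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(request),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

Passing the client in as a parameter keeps the function easy to test with a stub and leaves credential and region handling to the caller.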

This allowed them to perform semantic searches on the log data, finding relevant entries even when the wording didn't exactly match their query (e.g. "something is taking too long" matched the "rate limit exceeded" log entry).
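A semantic search of this kind is typically a k-nearest-neighbor query: the search phrase is embedded with the same model, and OpenSearch returns the log entries whose vectors lie closest to it. The sketch below builds such a query body. The vector field name `log_embedding` is a hypothetical choice, and the index mapping used in the demo is not given in this summary.

```python
def build_knn_query(query_vector: list, k: int = 5, field: str = "log_embedding") -> dict:
    """Build an OpenSearch k-NN search body for a query embedding."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": query_vector, "k": k}}},
        # Leave the raw vector out of the hits to keep responses small.
        "_source": {"excludes": [field]},
    }
```

The query vector would come from embedding a phrase such as "something is taking too long"; the nearest stored vectors then surface entries like "rate limit exceeded" even though no words overlap.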

AI-powered Root Cause Analysis and Mitigation

To bring everything together, the presenters integrated an AI agent (using Kiro CLI) that could use the semantic search capabilities to identify the root cause.

The agent confirmed that the issue was a "noisy neighbor" problem in the shared shipping service, with one tenant overloading it.
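One simple way to check for a noisy neighbor is to count requests to the shared service per tenant and see whether a single tenant dominates. The sketch below illustrates that check under the assumption that each record carries a `tenant` field; it is not a description of how the agent in the demo reasoned.

```python
from collections import Counter


def busiest_tenant(records: list) -> tuple:
    """Return (tenant, share) for the tenant that sent the most requests."""
    # Assumed record shape: a dict with a "tenant" key; untagged records are skipped.
    counts = Counter(r["tenant"] for r in records if r.get("tenant"))
    if not counts:
        raise ValueError("no tenant-tagged records")
    tenant, hits = counts.most_common(1)[0]
    return tenant, hits / sum(counts.values())
```

A share close to 1.0 for one tenant, combined with "rate limit exceeded" errors on the shared service, points to that tenant as the likely source of the overload.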

The agent also recommended scaling the shipping service deployment as a mitigation, which the presenters then applied successfully.
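On Kubernetes, scaling a deployment comes down to a single `kubectl scale` call. The helper below builds that command as an argument list. The deployment name, namespace, and replica count in the example are hypothetical; the values used in the demo are not given in this summary.

```python
def scale_command(deployment: str, namespace: str, replicas: int) -> list:
    """Build the kubectl argument list that scales a deployment."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}",
        "--namespace", namespace,
    ]


# Hypothetical names; run with subprocess.run(cmd, check=True) to apply.
cmd = scale_command("shipping", "shared-services", 3)
```

Building the command as a list rather than a shell string avoids quoting problems and lets a human review it before it is executed.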

Key Takeaways

AI can significantly improve observability and troubleshooting in complex, distributed SaaS applications.

Techniques like natural language query generation, semantic log search, and AI-powered root cause analysis can help teams quickly identify and resolve issues.

Integrating these AI capabilities into the observability workflow, as demonstrated with the MCP-connected agent, can further streamline troubleshooting.

The presenters provided a comprehensive GitHub gist with all the code, configurations, and references needed to implement these AI-powered observability capabilities.
