AWS re:Invent 2025 - AI-powered SaaS Observability using OpenSearch (ISV313)

AI-powered SaaS Observability using OpenSearch

Overview

  • The presentation showed how AI can improve observability and troubleshooting in a multi-tenant SaaS application running on AWS.
  • The key focus areas were:
    1. Natural language query generation in OpenSearch Dashboards
    2. Semantic log search using vector embeddings
    3. AI-powered root cause analysis and mitigation using the Model Context Protocol (MCP)

Multi-Tenant SaaS Demo Application

  • The presenters built a demo application based on the open-source OpenTelemetry demo, modified to run as a multi-tenant SaaS on Amazon EKS.
  • The application consists of several microservices (e.g. shipping, billing, checkout) running in separate namespaces for each tenant.
  • Observability data (logs, metrics, traces) from the application is collected by an OpenTelemetry collector and ingested into Amazon OpenSearch Service.
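
Because each tenant runs in its own namespace, telemetry can be attributed to a tenant from the Kubernetes namespace recorded on each record. The following is a minimal sketch of that idea, not the presenters' code. It assumes the records carry the OpenTelemetry resource attribute k8s.namespace.name and that tenant namespaces follow a hypothetical "tenant-<id>" naming scheme; the talk summary does not state the actual convention.

```python
from typing import Optional

# Hypothetical naming convention for tenant namespaces (an assumption).
TENANT_PREFIX = "tenant-"


def tenant_from_resource(resource_attrs: dict) -> Optional[str]:
    """Return the tenant ID for a telemetry record, or None if the record
    comes from a namespace that is not tenant-specific (e.g. a shared service)."""
    namespace = resource_attrs.get("k8s.namespace.name", "")
    if namespace.startswith(TENANT_PREFIX):
        return namespace[len(TENANT_PREFIX):]
    return None
```

Records for which this returns None would belong to shared components, such as the shared shipping service discussed later in the talk.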

Natural Language Query Generation

  • When faced with a 504 Gateway Timeout issue, the presenters first tried to troubleshoot using OpenSearch Dashboards.
  • However, sifting through the large volume of raw logs was challenging, so they used AI to generate a more targeted query.
  • Given a natural language prompt about the issue, the AI identified relevant log entries mentioning "rate limit exceeded" in the shared shipping service.
  • The AI also generated a corresponding OpenSearch query in Piped Processing Language (PPL) to investigate the issue further.
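
To give a sense of what such a generated query might look like, the sketch below builds a PPL query that filters log entries containing a phrase and wraps it in the JSON body accepted by the OpenSearch PPL endpoint (POST _plugins/_ppl). The index name tenant_logs and the field name body are assumptions for illustration; the summary does not show the query the AI actually produced.

```python
import json


def build_ppl_query(index: str, phrase: str, limit: int = 20) -> str:
    """Build a PPL query returning up to `limit` log entries whose `body`
    field (a hypothetical field name) contains `phrase`."""
    if "'" in phrase or "%" in phrase:
        # Keep the sketch simple: refuse input that would need escaping.
        raise ValueError("phrase must not contain quotes or wildcards")
    return f"source = {index} | where like(body, '%{phrase}%') | head {limit}"


def ppl_request_body(ppl: str) -> str:
    """JSON payload for the PPL endpoint (POST _plugins/_ppl)."""
    return json.dumps({"query": ppl})


query = build_ppl_query("tenant_logs", "rate limit exceeded", limit=5)
payload = ppl_request_body(query)
```

The payload can then be sent with any HTTP client that is authorized against the OpenSearch domain.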

Semantic Log Search

  • To address the challenge of not knowing what to search for, the presenters implemented semantic log search using vector embeddings.
  • They set up an ingestion pipeline in OpenSearch Ingestion to sample log data and automatically generate vector embeddings using Amazon Bedrock's Titan Text Embeddings v2 model.
  • This allowed them to perform semantic searches on the log data, finding relevant entries even when the wording didn't exactly match their query (e.g. "something is taking too long" matched the "rate limit exceeded" log entry).
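
The query side of semantic search can be sketched in two steps: embed the search phrase with the same model used at ingestion, then run a k-NN query against the stored vectors. The functions below only build the two request bodies, so they run without AWS credentials. The vector field name log_embedding and the choice of 1024 dimensions are assumptions, and the presenters' actual pipeline configuration may differ.

```python
# Model ID for Amazon Titan Text Embeddings v2 on Amazon Bedrock.
TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"


def titan_embedding_request(text: str, dimensions: int = 1024) -> dict:
    """Request body for a Titan Text Embeddings v2 InvokeModel call."""
    if dimensions not in (256, 512, 1024):
        raise ValueError("Titan v2 supports 256, 512, or 1024 dimensions")
    return {"inputText": text, "dimensions": dimensions, "normalize": True}


def knn_search_body(query_vector: list, k: int = 5,
                    field: str = "log_embedding") -> dict:
    """OpenSearch k-NN query returning the k log entries whose stored
    embedding is nearest to `query_vector`. `field` is a hypothetical name."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": query_vector, "k": k}}},
    }


embed_request = titan_embedding_request("something is taking too long")
# In a real run, the vector would come from the "embedding" key of the
# Bedrock response; a zero vector stands in here as a placeholder.
search_body = knn_search_body([0.0] * 1024, k=3)
```

Because matching happens on vector proximity rather than on shared keywords, a phrase such as "something is taking too long" can retrieve a "rate limit exceeded" entry.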

AI-powered Root Cause Analysis and Mitigation

  • To bring everything together, the presenters integrated an AI agent (using Kiro CLI) that could use the semantic search capabilities to identify the root cause.
  • The agent confirmed that the issue was a "noisy neighbor" problem in the shared shipping service, with one tenant overloading the service.
  • Furthermore, the agent provided a recommended mitigation action to scale the shipping service deployment, which the presenters then implemented successfully.
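
The kind of check behind a "noisy neighbor" diagnosis can be illustrated with a simple heuristic: count requests to the shared service per tenant and flag a tenant whose share of traffic exceeds a threshold. This is a sketch under stated assumptions, not the agent's actual reasoning. The record shape (a dict with a tenant key) and the 50% threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def find_noisy_neighbor(requests: Iterable[dict],
                        share_threshold: float = 0.5) -> Optional[str]:
    """Return the tenant responsible for at least `share_threshold` of the
    requests to a shared service, or None if no tenant dominates."""
    counts = Counter(r["tenant"] for r in requests)
    total = sum(counts.values())
    if total == 0:
        return None
    tenant, count = counts.most_common(1)[0]
    return tenant if count / total >= share_threshold else None


# Made-up sample traffic: tenant-a sends 8 of 10 requests.
sample = [{"tenant": "tenant-a"}] * 8 + [{"tenant": "tenant-b"}, {"tenant": "tenant-c"}]
noisy = find_noisy_neighbor(sample)
```

A share-based threshold is only one possible signal; per-tenant rate limits or latency percentiles would serve the same purpose.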

Key Takeaways

  • Leveraging AI can significantly improve observability and troubleshooting in complex, distributed SaaS applications.
  • Techniques like natural language query generation, semantic log search, and AI-powered root cause analysis can help teams quickly identify and resolve issues.
  • Integrating these AI capabilities into the observability workflow, as demonstrated with the MCP agent, can further streamline the troubleshooting process.
  • The presenters provided a comprehensive GitHub gist with all the code, configurations, and references needed to implement these AI-powered observability capabilities.
