AWS re:Invent 2025 - Scaling Observability for the AI Era: From GPUs to LLMs (AIM121)

Overview of AI Observability Challenges

  • The observability challenges of the AI era go beyond those of traditional large-scale, cloud-native workloads
  • Key challenges include:
    • Ensuring model behavior and accuracy
    • Managing token economics for AI use cases
    • Understanding complex dependencies in AI architectures
    • Monitoring GPU infrastructure for training and inference

Model Training Observability

  • Efficient model training is a competitive advantage, requiring observability into:
    • Data set quality and ingestion
    • Training job performance and GPU utilization
  • Chronosphere provides visibility into training metrics, GPU telemetry, and distributed tracing to identify and resolve issues
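One common training-observability signal is per-GPU utilization imbalance: a device running well below the fleet average usually points at a data-loading or network bottleneck. The sketch below shows the idea with plain Python; the function name, thresholds, and sample data are illustrative, not a Chronosphere API.

```python
# Hypothetical sketch: flag straggler GPUs in a distributed training job
# from per-device utilization samples (0.0-1.0). Thresholds are illustrative.
from statistics import mean

def find_stragglers(gpu_util: dict[str, list[float]], threshold: float = 0.75) -> list[str]:
    """Return GPU ids whose average utilization falls below `threshold`
    times the fleet-wide average utilization."""
    averages = {gpu: mean(samples) for gpu, samples in gpu_util.items()}
    fleet_avg = mean(averages.values())
    return sorted(g for g, u in averages.items() if u < threshold * fleet_avg)

samples = {
    "gpu-0": [0.92, 0.95, 0.93],
    "gpu-1": [0.91, 0.94, 0.92],
    "gpu-2": [0.40, 0.35, 0.42],  # straggler: likely an input-pipeline stall
}
print(find_stragglers(samples))  # -> ['gpu-2']
```

In practice the utilization samples would come from GPU telemetry such as NVIDIA DCGM metrics rather than hard-coded lists.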

Inference Hosting Observability

  • Hosting AI inference models requires observing service reliability, performance, and scalability
  • Monitoring inference-specific metrics such as hallucination rate, bias, and toxic-response rate is crucial
  • Chronosphere's anomaly detection and root cause analysis help identify and resolve inference issues
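The simplest form of anomaly detection on a metric like hallucination rate is a trailing z-score: flag any observation that deviates sharply from its recent history. The sketch below is a minimal illustration of that idea, not Chronosphere's actual detection algorithm.

```python
# Minimal sketch of trailing z-score anomaly detection, assuming an
# hourly hallucination-rate series; window and z are illustrative.
from statistics import mean, stdev

def zscore_anomalies(series: list[float], window: int = 5, z: float = 3.0) -> list[int]:
    """Return indices whose value deviates more than `z` standard
    deviations from the trailing `window` of prior observations."""
    flagged = []
    for i in range(window, len(series)):
        prior = series[i - window:i]
        mu, sigma = mean(prior), stdev(prior)
        if sigma > 0 and abs(series[i] - mu) > z * sigma:
            flagged.append(i)
    return flagged

rates = [0.02, 0.03, 0.02, 0.025, 0.03, 0.02, 0.025, 0.15]
print(zscore_anomalies(rates, window=5))  # -> [7], the spike
```

A flagged spike would then feed root cause analysis: correlating the anomaly window with recent deploys, prompt changes, or model-version rollouts.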

AI-Native Product Observability

  • AI-native products leverage large language models (LLMs) and retrieval-augmented generation (RAG) for dynamic functionality
  • Key observability needs include:
    • Weighing model accuracy and performance against token cost (token economics)
    • Identifying and mitigating issues like hallucinations, bias, and excess token consumption
  • Chronosphere's open-source instrumentation and tracing provide visibility into LLM-specific attributes and evaluation metrics
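The LLM-specific attributes mentioned above are typically recorded on each model-call span. The attribute names below follow the OpenTelemetry GenAI semantic conventions; the tracer itself is stubbed out as a plain dict for illustration, so this is a sketch of the data shape rather than real instrumentation code.

```python
# Sketch of LLM-specific span attributes, using OpenTelemetry GenAI
# semantic-convention names; the real values would be set on an
# OpenTelemetry span rather than returned as a dict.
def llm_span_attributes(model: str, prompt_tokens: int,
                        completion_tokens: int, temperature: float) -> dict:
    return {
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.usage.input_tokens": prompt_tokens,
        "gen_ai.usage.output_tokens": completion_tokens,
    }

attrs = llm_span_attributes("example-llm", 412, 128, 0.2)
print(attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"])  # -> 540 total tokens
```

Recording token counts on every span is what makes token-economics dashboards and per-feature cost attribution possible downstream.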

Technical Details and Capabilities

  • Chronosphere leverages open-source technologies like OpenTelemetry, NVIDIA DCGM, and the Open Inference SDK
  • Provides detailed tracing and metrics for training, inference, and AI-native workflows
  • Enables anomaly detection, root cause analysis, and custom evaluations to proactively identify and resolve AI observability issues

Business Impact

  • Efficient model training and reliable inference hosting are critical for staying competitive in the AI era
  • Observability is key to ensuring AI models behave as expected, avoid harmful biases, and provide a positive user experience
  • AI-native products can differentiate through the effective use of AI, but require observability to manage token economics and model performance

Examples and Use Cases

  • Hallucinations: Observing and alerting on instances where the AI model provides incorrect or nonsensical responses
  • Bias: Monitoring for biases in AI-powered hiring or other high-impact workflows
  • Excess token consumption: Identifying and optimizing prompts that lead to unnecessary token usage and cost
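The excess-token-consumption use case above reduces to attributing spend to prompt templates and ranking them. A minimal sketch, assuming aggregated input/output token counts per template; the template names and per-1K-token prices are placeholders, not real model pricing.

```python
# Hypothetical sketch: estimate spend per prompt template from aggregated
# token counts. Prices per 1K tokens are illustrative placeholders.
def prompt_costs(usage: dict[str, tuple[int, int]],
                 price_per_1k_input: float,
                 price_per_1k_output: float) -> dict[str, float]:
    """usage maps template name -> (input_tokens, output_tokens)."""
    return {
        name: round((inp * price_per_1k_input + out * price_per_1k_output) / 1000, 4)
        for name, (inp, out) in usage.items()
    }

usage = {
    "summarize_ticket": (1_200_000, 300_000),
    "classify_intent": (80_000, 5_000),
}
costs = prompt_costs(usage, price_per_1k_input=0.01, price_per_1k_output=0.03)
print(max(costs, key=costs.get))  # -> 'summarize_ticket', the template to optimize first
```

Ranking templates by cost this way turns "excess token consumption" from an aggregate bill into a concrete optimization target.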
