AWS re:Invent 2025 - Scaling Observability for the AI Era: From GPUs to LLMs (AIM121)
Overview of AI Observability Challenges
The observability challenges of the AI era go beyond those of traditional large-scale cloud-native workloads
Key challenges include:
Ensuring model behavior and accuracy
Managing token economics for AI use cases
Understanding complex dependencies in AI architectures
Monitoring GPU infrastructure for training and inference
Model Training Observability
Efficient model training is a competitive advantage, requiring observability into:
Dataset quality and ingestion
Training job performance and GPU utilization
Chronosphere provides visibility into training metrics, GPU telemetry, and distributed tracing to identify and resolve issues
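As a sketch of the kind of check this GPU telemetry enables, consider flagging training jobs whose utilization drops, which often signals a data-loading or sharding bottleneck. The job names, utilization samples, and 70% threshold below are illustrative assumptions, not Chronosphere or DCGM APIs:

```python
# Sketch: flag training jobs whose average GPU utilization falls below a
# threshold, using utilization samples like those DCGM exports.
# Job names, samples, and the 70% threshold are illustrative assumptions.

def underutilized_jobs(samples: dict[str, list[float]],
                       threshold: float = 70.0) -> list[str]:
    """Return job names whose mean GPU utilization (%) is below threshold."""
    flagged = []
    for job, utils in samples.items():
        if utils and sum(utils) / len(utils) < threshold:
            flagged.append(job)
    return flagged

samples = {
    "llm-pretrain-a": [92.0, 88.5, 95.0],   # healthy utilization
    "llm-finetune-b": [35.0, 42.0, 38.0],   # likely data-loading bottleneck
}
print(underutilized_jobs(samples))  # → ['llm-finetune-b']
```

In practice the samples would come from a metrics backend rather than an in-memory dict, but the threshold-over-a-window shape of the check is the same.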
Inference Hosting Observability
Hosting AI inference models requires observing service reliability, performance, and scalability
Monitoring inference-specific metrics like hallucination rate, bias, and toxic-response rate is crucial
Chronosphere's anomaly detection and root cause analysis help identify and resolve inference issues
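One minimal way to turn a quality metric like hallucination rate into an alert is a rolling-window threshold check. The window size and 20% threshold below are illustrative assumptions, not a Chronosphere feature:

```python
from collections import deque

# Sketch: rolling-window alert on an inference quality metric such as
# hallucination rate. Window size and threshold are illustrative.

class RateAlert:
    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.events = deque(maxlen=window)  # recent pass/fail evaluations
        self.threshold = threshold

    def record(self, is_bad: bool) -> bool:
        """Record one evaluated response; True if the window rate breaches the threshold."""
        self.events.append(is_bad)
        return sum(self.events) / len(self.events) > self.threshold

alert = RateAlert(window=10, threshold=0.2)
fired = [alert.record(flag) for flag in [False] * 8 + [True] * 3]
print(fired[-1])  # → True (3 of the last 10 responses were bad)
```

A production system would evaluate "is_bad" with an automated evaluator or human review pipeline; the alerting logic itself stays this simple.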
AI-Native Product Observability
AI-native products leverage large language models (LLMs) and retrieval-augmented generation (RAG) for dynamic functionality
Key observability needs include:
Monitoring model accuracy and performance against token economics
Identifying and mitigating issues like hallucinations, bias, and excess token consumption
Chronosphere's open-source instrumentation and tracing provide visibility into LLM-specific attributes and evaluation metrics
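To make "LLM-specific attributes" concrete, the sketch below records per-call attributes using keys in the style of the OpenInference semantic conventions; the dict stands in for a real tracing SDK's span, and the model name and token counts are illustrative:

```python
# Sketch: LLM-specific attributes to attach to a trace span, with keys
# in the style of OpenInference semantic conventions. The plain dict
# stands in for a real tracing SDK; all values are illustrative.

def llm_span_attributes(model: str, prompt_tokens: int,
                        completion_tokens: int) -> dict[str, object]:
    return {
        "llm.model_name": model,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }

attrs = llm_span_attributes("example-model",
                            prompt_tokens=512, completion_tokens=128)
print(attrs["llm.token_count.total"])  # → 640
```

Recording token counts on every span is what makes per-request token economics queryable alongside latency and errors in the same traces.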
Technical Details and Capabilities
Chronosphere leverages open-source technologies like OpenTelemetry, NVIDIA DCGM, and the OpenInference SDK
Provides detailed tracing and metrics for training, inference, and AI-native workflows
Enables anomaly detection, root cause analysis, and custom evaluations to proactively identify and resolve AI observability issues
Business Impact
Efficient model training and reliable inference hosting are critical for staying competitive in the AI era
Observability is key to ensuring AI models behave as expected, avoid harmful biases, and provide a positive user experience
AI-native products can differentiate through the effective use of AI, but require observability to manage token economics and model performance
Examples and Use Cases
Hallucinations: Observing and alerting on instances where the AI model provides incorrect or nonsensical responses
Bias: Monitoring for biases in AI-powered hiring or other high-impact workflows
Excess token consumption: Identifying and optimizing prompts that lead to unnecessary token usage and cost
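The excess-token-consumption case above can be sketched as a per-request cost estimate; the per-1K-token prices are hypothetical placeholders, not any provider's actual pricing:

```python
# Sketch: estimate per-request token cost to surface prompts that drive
# excess consumption. Prices per 1K tokens are hypothetical placeholders.

def request_cost(prompt_tokens: int, completion_tokens: int,
                 prompt_price_per_1k: float = 0.003,
                 completion_price_per_1k: float = 0.006) -> float:
    return (prompt_tokens / 1000 * prompt_price_per_1k
            + completion_tokens / 1000 * completion_price_per_1k)

# A bloated 8,000-token prompt costs roughly 9x a trimmed 500-token one:
print(round(request_cost(8000, 200), 4))  # → 0.0252
print(round(request_cost(500, 200), 4))   # → 0.0027
```

Aggregating this estimate by prompt template or feature is what turns raw token counts into an optimization target.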