AWS re:Invent 2025 - Scaling Observability for the AI Era: From GPUs to LLMs (AIM121)

Overview of AI Observability Challenges

  • The observability challenges of the AI era go beyond those of traditional large-scale, cloud-native workloads
  • Key challenges include:
    • Ensuring model behavior and accuracy
    • Managing token economics for AI use cases
    • Understanding complex dependencies in AI architectures
    • Monitoring GPU infrastructure for training and inference

Model Training Observability

  • Efficient model training is a competitive advantage, requiring observability into:
    • Data set quality and ingestion
    • Training job performance and GPU utilization
  • Chronosphere provides visibility into training metrics, GPU telemetry, and distributed tracing to identify and resolve issues
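One common training-observability signal is per-GPU utilization imbalance: a device running well below the fleet average usually points at a data-loading or network bottleneck. The sketch below shows the idea with plain Python; the function name, thresholds, and sample data are illustrative, not a Chronosphere API.

```python
# Hypothetical sketch: flag straggler GPUs in a distributed training job
# from per-device utilization samples (0.0-1.0). Thresholds are illustrative.
from statistics import mean

def find_stragglers(gpu_util: dict[str, list[float]], threshold: float = 0.75) -> list[str]:
    """Return GPU ids whose average utilization falls below `threshold`
    times the fleet-wide average utilization."""
    averages = {gpu: mean(samples) for gpu, samples in gpu_util.items()}
    fleet_avg = mean(averages.values())
    return sorted(g for g, u in averages.items() if u < threshold * fleet_avg)

samples = {
    "gpu-0": [0.92, 0.95, 0.93],
    "gpu-1": [0.91, 0.94, 0.92],
    "gpu-2": [0.40, 0.35, 0.42],  # straggler: likely an input-pipeline stall
}
print(find_stragglers(samples))  # -> ['gpu-2']
```

In practice the utilization samples would come from GPU telemetry such as NVIDIA DCGM metrics rather than hard-coded lists.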

Inference Hosting Observability

  • Hosting AI inference models requires observing service reliability, performance, and scalability
  • Monitoring inference-specific metrics such as hallucination rate, bias, and toxic-response rate is crucial
  • Chronosphere's anomaly detection and root cause analysis help identify and resolve inference issues
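The simplest form of anomaly detection on a metric like hallucination rate is a trailing z-score: flag any observation that deviates sharply from its recent history. The sketch below is a minimal illustration of that idea, not Chronosphere's actual detection algorithm.

```python
# Minimal sketch of trailing z-score anomaly detection, assuming an
# hourly hallucination-rate series; window and z are illustrative.
from statistics import mean, stdev

def zscore_anomalies(series: list[float], window: int = 5, z: float = 3.0) -> list[int]:
    """Return indices whose value deviates more than `z` standard
    deviations from the trailing `window` of prior observations."""
    flagged = []
    for i in range(window, len(series)):
        prior = series[i - window:i]
        mu, sigma = mean(prior), stdev(prior)
        if sigma > 0 and abs(series[i] - mu) > z * sigma:
            flagged.append(i)
    return flagged

rates = [0.02, 0.03, 0.02, 0.025, 0.03, 0.02, 0.025, 0.15]
print(zscore_anomalies(rates, window=5))  # -> [7], the spike
```

A flagged spike would then feed root cause analysis: correlating the anomaly window with recent deploys, prompt changes, or model-version rollouts.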

AI-Native Product Observability

  • AI-native products leverage large language models (LLMs) and retrieval-augmented generation (RAG) for dynamic functionality
  • Key observability needs include:
    • Weighing model accuracy and performance against token cost (token economics)
    • Identifying and mitigating issues like hallucinations, bias, and excess token consumption
  • Chronosphere's open-source instrumentation and tracing provide visibility into LLM-specific attributes and evaluation metrics
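The LLM-specific attributes mentioned above are typically recorded on each model-call span. The attribute names below follow the OpenTelemetry GenAI semantic conventions; the tracer itself is stubbed out as a plain dict for illustration, so this is a sketch of the data shape rather than real instrumentation code.

```python
# Sketch of LLM-specific span attributes, using OpenTelemetry GenAI
# semantic-convention names; the real values would be set on an
# OpenTelemetry span rather than returned as a dict.
def llm_span_attributes(model: str, prompt_tokens: int,
                        completion_tokens: int, temperature: float) -> dict:
    return {
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.usage.input_tokens": prompt_tokens,
        "gen_ai.usage.output_tokens": completion_tokens,
    }

attrs = llm_span_attributes("example-llm", 412, 128, 0.2)
print(attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"])  # -> 540 total tokens
```

Recording token counts on every span is what makes token-economics dashboards and per-feature cost attribution possible downstream.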

Technical Details and Capabilities

  • Chronosphere leverages open-source technologies like OpenTelemetry, NVIDIA DCGM, and the Open Inference SDK
  • Provides detailed tracing and metrics for training, inference, and AI-native workflows
  • Enables anomaly detection, root cause analysis, and custom evaluations to proactively identify and resolve AI observability issues

Business Impact

  • Efficient model training and reliable inference hosting are critical for staying competitive in the AI era
  • Observability is key to ensuring AI models behave as expected, avoid harmful biases, and provide a positive user experience
  • AI-native products can differentiate through the effective use of AI, but require observability to manage token economics and model performance

Examples and Use Cases

  • Hallucinations: Observing and alerting on instances where the AI model provides incorrect or nonsensical responses
  • Bias: Monitoring for biases in AI-powered hiring or other high-impact workflows
  • Excess token consumption: Identifying and optimizing prompts that lead to unnecessary token usage and cost
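The excess-token-consumption use case above reduces to attributing spend to prompt templates and ranking them. A minimal sketch, assuming aggregated input/output token counts per template; the template names and per-1K-token prices are placeholders, not real model pricing.

```python
# Hypothetical sketch: estimate spend per prompt template from aggregated
# token counts. Prices per 1K tokens are illustrative placeholders.
def prompt_costs(usage: dict[str, tuple[int, int]],
                 price_per_1k_input: float,
                 price_per_1k_output: float) -> dict[str, float]:
    """usage maps template name -> (input_tokens, output_tokens)."""
    return {
        name: round((inp * price_per_1k_input + out * price_per_1k_output) / 1000, 4)
        for name, (inp, out) in usage.items()
    }

usage = {
    "summarize_ticket": (1_200_000, 300_000),
    "classify_intent": (80_000, 5_000),
}
costs = prompt_costs(usage, price_per_1k_input=0.01, price_per_1k_output=0.03)
print(max(costs, key=costs.get))  # -> 'summarize_ticket', the template to optimize first
```

Ranking templates by cost this way turns "excess token consumption" from an aggregate bill into a concrete optimization target.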
