AWS re:Invent 2025 - Implementing observability at scale: A blueprint for success (COP328)
Enterprise Observability Challenges
Modern enterprises often have:
Hundreds or thousands of accounts across multiple regions
Thousands or even millions of microservices and services
Petabytes of telemetry data daily
Common challenges include:
Siloed tools for logs, traces, and metrics
Alert fatigue and missed critical issues
Increased mean time to resolution (MTTR)
Inconsistent coverage and data silos across teams
The Business Case for Observability
Downtime can cost enterprises $300,000 to $5 million per hour
Observability investments can provide a strong ROI by preventing downtime
Downtime also leads to lost productivity, customer experience issues, and delayed feature delivery
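As a rough illustration of the ROI argument above, the arithmetic can be sketched as follows. The hours of downtime avoided and the investment figure are hypothetical inputs chosen for the example; only the $300,000 per hour figure comes from the lower bound quoted above.

```python
def downtime_cost_avoided(hours_avoided, cost_per_hour):
    """Cost of the downtime that did not happen."""
    return hours_avoided * cost_per_hour


def simple_roi(benefit, investment):
    """Net benefit divided by the investment (1.0 means 100% return)."""
    return (benefit - investment) / investment


# Hypothetical inputs: 4 hours of downtime avoided per year, $400,000 spent.
avoided = downtime_cost_avoided(hours_avoided=4, cost_per_hour=300_000)
roi = simple_roi(avoided, investment=400_000)
print(f"avoided=${avoided:,} roi={roi:.0%}")
```

This ignores the softer costs listed above (lost productivity, customer experience, delayed features), so it is a floor rather than a full estimate.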
Shifting to Business Outcome Metrics
Customers care about business outcomes, not just technical metrics
Measuring business metrics (e.g. orders per minute) can provide early warning of issues
Business metrics also quantify the impact of technical problems
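One way to get an orders-per-minute signal is to publish an order count as a custom CloudWatch metric once a minute. The sketch below only builds the request payload for the `put_metric_data` API; the namespace, metric name, and dimension are hypothetical names chosen for illustration, not values from the talk.

```python
from datetime import datetime, timezone


def orders_metric_payload(order_count, service, timestamp):
    """Build keyword arguments for CloudWatch put_metric_data.

    "Example/Business", "OrdersPlaced", and the "Service" dimension are
    assumed names for this sketch.
    """
    return {
        "Namespace": "Example/Business",
        "MetricData": [
            {
                "MetricName": "OrdersPlaced",
                "Dimensions": [{"Name": "Service", "Value": service}],
                "Timestamp": timestamp,
                "Value": float(order_count),
                "Unit": "Count",
            }
        ],
    }


payload = orders_metric_payload(
    42, "checkout", datetime(2025, 12, 1, 12, 0, tzinfo=timezone.utc)
)
# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**payload)
```

Graphing or alarming on the Sum statistic with a 60-second period then gives orders per minute, and a sudden drop can flag a problem before any technical metric crosses a threshold.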
Centralized Logging and Alarm Management
A new centralized logging feature aggregates logs across accounts and regions
Multi-resource alarms enable creating a single alarm to monitor thousands of resources
Reduces alarm sprawl and enables consistent threshold management
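A single alarm over a whole fleet is typically expressed as a Metrics Insights query rather than one alarm per resource. The sketch below builds a `put_metric_alarm` request of that shape. The alarm name, threshold, and period are hypothetical, and whether a `GROUP BY` query is evaluated per resource depends on the multi-resource alarm feature described in the talk, so the exact request shape should be checked against the current CloudWatch documentation.

```python
def fleet_cpu_alarm(threshold_percent):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    One Metrics Insights query covers every EC2 instance that reports
    CPUUtilization. All names and values here are assumptions for the sketch.
    """
    query = (
        'SELECT MAX(CPUUtilization) '
        'FROM SCHEMA("AWS/EC2", InstanceId) '
        'GROUP BY InstanceId'
    )
    return {
        "AlarmName": "example-fleet-high-cpu",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_percent,
        "EvaluationPeriods": 3,
        "Metrics": [
            {"Id": "q1", "Expression": query, "Period": 60, "ReturnData": True}
        ],
    }


alarm = fleet_cpu_alarm(90.0)
# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Because the threshold lives in one place, changing it for the whole fleet is a single update instead of thousands.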
Tracing and Transaction Search
Transaction search allows querying 100% of traces in real time
Custom attributes can be added to traces for business context
Enables correlation of traces, logs, and metrics without sampling gaps
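To show why custom attributes matter, the sketch below groups a handful of spans by a business attribute and computes an error rate per value. The span dictionaries and the `customer.tier` attribute are hypothetical stand-ins for real trace data; in an OpenTelemetry-instrumented service, such attributes are usually attached with `span.set_attribute`.

```python
from collections import defaultdict

# Hypothetical spans, shaped loosely like exported trace data.
spans = [
    {"trace_id": "t1", "error": False, "attributes": {"customer.tier": "gold"}},
    {"trace_id": "t2", "error": True, "attributes": {"customer.tier": "gold"}},
    {"trace_id": "t3", "error": False, "attributes": {"customer.tier": "free"}},
    {"trace_id": "t4", "error": False, "attributes": {"customer.tier": "free"}},
]


def error_rate_by_attribute(spans, key):
    """Return the fraction of failed spans for each value of one attribute."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for span in spans:
        value = span["attributes"].get(key, "unknown")
        totals[value] += 1
        if span["error"]:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


rates = error_rate_by_attribute(spans, "customer.tier")
print(rates)
```

With every trace available rather than a sample, a breakdown like this is not skewed by which requests happened to be kept.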
Anomaly Detection for Metrics, Logs, and Traces
Metrics anomaly detection uses baselines that account for seasonality
Log anomaly detection identifies changes in frequency, new patterns, and disappeared patterns
Trace anomaly detection in X-Ray analyzes latency, errors, and service dependencies
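The three kinds of log anomaly listed above (frequency changes, new patterns, disappeared patterns) can be illustrated with a small comparison of two time windows. This is a simplified sketch, not the algorithm CloudWatch uses: it assumes patterns can be derived by masking digits, and the 2x ratio for a frequency change is an arbitrary choice.

```python
import re
from collections import Counter


def to_pattern(line):
    """Collapse variable numeric tokens so similar lines share one pattern."""
    return re.sub(r"\d+", "<*>", line)


def compare_windows(baseline, current, ratio=2.0):
    """Classify patterns as new, disappeared, or changed in frequency."""
    base = Counter(to_pattern(line) for line in baseline)
    cur = Counter(to_pattern(line) for line in current)
    return {
        "new": sorted(p for p in cur if p not in base),
        "disappeared": sorted(p for p in base if p not in cur),
        "frequency_changed": sorted(
            p for p in cur
            if p in base
            and (cur[p] >= ratio * base[p] or cur[p] * ratio <= base[p])
        ),
    }


baseline = [
    "order 101 placed", "order 102 placed",
    "cache refresh took 12 ms", "heartbeat ok",
]
current = [
    "order 103 placed", "order 104 placed", "order 105 placed",
    "order 106 placed", "payment 9 declined", "heartbeat ok",
]
report = compare_windows(baseline, current)
print(report)
```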
Automated Observability with Application Signals
Automatic instrumentation for Python, Java, .NET, and Node.js applications
Collects traces, metrics, and logs without manual code changes
Provides service-level observability and SLO management out-of-the-box
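The arithmetic behind SLO management is small enough to sketch directly. The function below assumes a request-based SLO (good requests over total requests) and is a generic illustration of attainment and error budget, not the Application Signals implementation; the request counts are made up.

```python
def slo_status(good, total, target):
    """Compute attainment and remaining error budget for a request-based SLO.

    target is a fraction strictly between 0 and 1, for example 0.99.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    attainment = good / total
    allowed_bad = (1 - target) * total   # failures the SLO tolerates
    actual_bad = total - good
    return {
        "attainment": attainment,
        "met": attainment >= target,
        # 1.0 means the budget is untouched; 0 or below means it is spent.
        "budget_remaining": 1 - actual_bad / allowed_bad,
    }


# Hypothetical month: 10,000 requests, 50 failures, 99% target.
status = slo_status(good=9_950, total=10_000, target=0.99)
print(status)
```

Tracking the remaining budget, rather than only whether the target is met, shows how much room is left before the SLO is breached.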
Specialized Observability Insights
Container Insights for EKS and ECS monitoring
Database Insights for RDS performance analysis and recommendations
Lambda Insights for function-level metrics and cold start analysis
AI-Driven Investigations and Runbook Automation
CloudWatch Investigations uses AI to identify root causes across metrics, logs, and traces
Integrates with Systems Manager for automated remediation runbooks
Enables rapid incident response and proactive issue resolution
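The hand-off from a finding to a remediation runbook can be sketched as a lookup followed by a Systems Manager `start_automation_execution` request. The finding shape and the mapping below are hypothetical and are not the output format of CloudWatch Investigations; `AWS-RestartEC2Instance` is an AWS-provided Automation runbook that takes an `InstanceId` parameter.

```python
from typing import Optional

# Hypothetical mapping from a finding type to an Automation runbook.
RUNBOOK_BY_FINDING = {"instance-unresponsive": "AWS-RestartEC2Instance"}


def automation_request(finding: dict) -> Optional[dict]:
    """Build keyword arguments for SSM start_automation_execution.

    Returns None when no runbook is mapped, so unknown findings go to a human.
    """
    document = RUNBOOK_BY_FINDING.get(finding["type"])
    if document is None:
        return None
    return {
        "DocumentName": document,
        # Automation parameters are passed as lists of strings.
        "Parameters": {"InstanceId": [finding["instance_id"]]},
    }


request = automation_request(
    {"type": "instance-unresponsive", "instance_id": "i-0123456789abcdef0"}
)
unmapped = automation_request(
    {"type": "disk-full", "instance_id": "i-0123456789abcdef0"}
)
# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("ssm").start_automation_execution(**request)
```

Returning `None` for unmapped findings keeps automation limited to cases that have an agreed runbook.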
Implementing a Comprehensive Observability Strategy
Set up a centralized monitoring account and configure log centralization
Deploy alerting frameworks and standardize logging/metrics
Enable out-of-the-box insights like Container Insights and Database Insights
Leverage Application Signals for automatic application instrumentation
Use advanced capabilities like Contributor Insights, Anomaly Detection, and CloudWatch Investigations
Integrate with incident management workflows and automate remediation
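For the "standardize logging" step, one common approach is to have every service emit JSON logs with the same fields so that centralized queries work across teams. The sketch below uses only the Python standard library; the field names and the `service` attribute are assumptions for illustration, not a schema from the talk.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            # "service" is a hypothetical field supplied through `extra`.
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


# Write to an in-memory stream here; a real service would log to stdout.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("example.orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("order placed", extra={"service": "checkout"})
print(stream.getvalue().strip())
```

Consistent field names are what make cross-account log queries and metric filters reusable without per-team variations.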
Key Takeaways
Enterprises must address observability challenges at scale to reduce downtime costs and improve reliability
Measuring business outcomes is critical for early issue detection and quantifying impact
AWS provides a comprehensive set of observability tools to centralize, analyze, and automate monitoring
A phased approach starting with foundational capabilities and progressing to advanced AI-driven insights is recommended
Integrating observability into incident management and remediation workflows maximizes the business value