AWS re:Invent 2025 - Implementing observability at scale: A blueprint for success (COP328)

Implementing Observability at Scale: A Blueprint for Success

Enterprise Observability Challenges

Modern enterprises often have:
- Hundreds or thousands of accounts across multiple regions
- Thousands, or even millions, of services and microservices
- Petabytes of telemetry data daily

Common challenges include:
- Siloed tools for logs, traces, and metrics
- Alert fatigue and missed critical issues
- Increased mean time to resolution (MTTR)
- Inconsistent coverage and data silos across teams

The Business Case for Observability

- Downtime can cost enterprises $300,000 to $5 million per hour
- Observability investments can provide a strong ROI by preventing downtime
- Downtime also leads to lost productivity, customer experience issues, and delayed feature delivery
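
The arithmetic behind this business case can be sketched in a few lines. This is a minimal illustration only: the outage count and the before/after outage durations are hypothetical inputs, not figures from the talk, and only the $300,000 hourly cost comes from the range quoted above.

```python
def downtime_cost(hourly_cost_usd: float, outage_minutes: float) -> float:
    """Cost of one outage, prorated from an hourly downtime cost."""
    return hourly_cost_usd * outage_minutes / 60.0


def annual_savings(hourly_cost_usd: float, outages_per_year: int,
                   minutes_before: float, minutes_after: float) -> float:
    """Yearly downtime cost avoided when the average outage gets shorter."""
    before = downtime_cost(hourly_cost_usd, minutes_before) * outages_per_year
    after = downtime_cost(hourly_cost_usd, minutes_after) * outages_per_year
    return before - after


# Hypothetical scenario: low end of the quoted hourly cost, 6 outages a year,
# average outage shortened from 90 to 30 minutes by faster detection.
savings = annual_savings(300_000, 6, 90, 30)
print(f"Estimated annual savings: ${savings:,.0f}")
```

Comparing a figure like this with the yearly cost of the observability tooling gives a rough ROI estimate; lost productivity and delayed features would add to it but are harder to quantify.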

Shifting to Business Outcome Metrics

- Customers care about business outcomes, not just technical metrics
- Measuring business metrics (e.g., orders per minute) can provide early warning of issues
- Business metrics also quantify the impact of technical problems
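
One possible way to publish a business metric such as orders per minute is to write it as a structured log line in the CloudWatch embedded metric format, which CloudWatch can turn into a metric. The sketch below only builds the JSON document. The namespace `Shop/Business`, the metric name `OrdersPlaced`, and the `Service` dimension are hypothetical names chosen for illustration, and the talk summary does not say which publishing mechanism the speakers used.

```python
import json
import time


def orders_metric_event(order_count, service, timestamp_ms=None):
    """Build one embedded-metric-format log event for an order count."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": "Shop/Business",    # hypothetical namespace
                    "Dimensions": [["Service"]],     # hypothetical dimension
                    "Metrics": [{"Name": "OrdersPlaced", "Unit": "Count"}],
                }
            ],
        },
        # Dimension and metric values live at the top level of the event.
        "Service": service,
        "OrdersPlaced": order_count,
    }


# Emit once per minute with the number of orders seen in that minute.
line = json.dumps(orders_metric_event(42, "checkout", timestamp_ms=1_700_000_000_000))
print(line)
```

An alarm on a drop in such a metric can fire even when every technical metric still looks healthy, which is the early-warning effect described above.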

Centralized Logging and Alarm Management

- The new centralized logging feature aggregates logs across accounts and regions
- Multi-resource alarms let a single alarm monitor thousands of resources
- This reduces alarm sprawl and enables consistent threshold management

Tracing and Transaction Search

- Transaction search allows querying 100% of traces in real time
- Custom attributes can be added to traces for business context
- Traces, logs, and metrics can be correlated without sampling gaps

Anomaly Detection for Metrics, Logs, and Traces

- Metrics anomaly detection uses baselines that account for seasonality
- Log anomaly detection identifies changes in frequency, new patterns, and disappeared patterns
- Trace anomaly detection in X-Ray analyzes latency, errors, and service dependencies
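
The three kinds of log anomaly listed above (frequency changes, new patterns, disappeared patterns) can be illustrated with a small comparison of two time windows. This is a simplified sketch of the idea, not the algorithm CloudWatch uses: the digit-masking rule in `to_pattern` and the `ratio` threshold are assumptions made for the example.

```python
import re
from collections import Counter


def to_pattern(line):
    """Collapse numeric tokens so similar log lines share one pattern."""
    return re.sub(r"\d+", "<*>", line)


def compare_windows(baseline_lines, current_lines, ratio=2.0):
    """Classify patterns as new, disappeared, or changed in frequency."""
    base = Counter(to_pattern(line) for line in baseline_lines)
    curr = Counter(to_pattern(line) for line in current_lines)
    new = sorted(p for p in curr if p not in base)
    disappeared = sorted(p for p in base if p not in curr)
    # A pattern counts as changed when its count grew or shrank by `ratio`.
    changed = sorted(
        p for p in curr
        if p in base and (curr[p] >= ratio * base[p] or curr[p] * ratio <= base[p])
    )
    return {"new": new, "disappeared": disappeared, "frequency_changed": changed}


baseline = ["order 101 ok", "order 102 ok", "cache warm"]
current = ["order 201 ok", "order 202 ok", "order 203 ok",
           "order 204 ok", "timeout after 30 ms"]
report = compare_windows(baseline, current)
print(report)
```

A production detector would also need to handle seasonality and noisy low-volume patterns, which this sketch ignores.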

Automated Observability with Application Signals

- Automatic instrumentation for Python, Java, .NET, and Node.js applications
- Collects traces, metrics, and logs without manual code changes
- Provides service-level observability and SLO management out of the box
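
To make the SLO part concrete, the sketch below shows the generic arithmetic behind an availability SLO and its error budget. It is not the Application Signals API; the function name, the 99.9% target, and the request counts are hypothetical values chosen for illustration.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize attainment and remaining error budget for one SLO period."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Number of failures the SLO tolerates over the period.
    allowed_failures = (1.0 - slo_target) * total_requests
    attainment = (total_requests - failed_requests) / total_requests
    if allowed_failures > 0:
        budget_used = failed_requests / allowed_failures
    else:
        # A 100% target leaves no budget; any failure exhausts it.
        budget_used = 0.0 if failed_requests == 0 else float("inf")
    return {
        "attainment": attainment,
        "allowed_failures": allowed_failures,
        "budget_remaining": 1.0 - budget_used,
        "slo_met": attainment >= slo_target,
    }


# Hypothetical period: 99.9% target, one million requests, 250 failures.
report = error_budget_report(0.999, 1_000_000, 250)
print(report)
```

In this scenario roughly a quarter of the error budget is spent, so the SLO is still met with room to spare.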

Specialized Observability Insights

- Container Insights for EKS and ECS monitoring
- Database Insights for RDS performance analysis and recommendations
- Lambda Insights for function-level metrics and cold start analysis

AI-Driven Investigations and Runbook Automation

- CloudWatch Investigations uses AI to identify root causes across metrics, logs, and traces
- Integrates with Systems Manager for automated remediation runbooks
- Enables rapid incident response and proactive issue resolution

Implementing a Comprehensive Observability Strategy

- Set up a centralized monitoring account and configure log centralization
- Deploy alerting frameworks and standardize logging and metrics
- Enable out-of-the-box insights like Container Insights and Database Insights
- Leverage Application Signals for automatic application instrumentation
- Use advanced capabilities like Contributor Insights, Anomaly Detection, and CloudWatch Investigations
- Integrate with incident management workflows and automate remediation

Key Takeaways

- Enterprises must address observability challenges at scale to reduce downtime costs and improve reliability
- Measuring business outcomes is critical for early issue detection and quantifying impact
- AWS provides a comprehensive set of observability tools to centralize, analyze, and automate monitoring
- A phased approach starting with foundational capabilities and progressing to advanced AI-driven insights is recommended
- Integrating observability into incident management and remediation workflows maximizes the business value
