AWS re:Invent 2025 - Implementing observability at scale: A blueprint for success (COP328)

Implementing Observability at Scale: A Blueprint for Success

Enterprise Observability Challenges

  • Modern enterprises often have:
    • Hundreds or thousands of accounts across multiple regions
    • Thousands, or even millions, of services and microservices
    • Petabytes of telemetry data generated every day
  • Common challenges include:
    • Siloed tools for logs, traces, and metrics
    • Alert fatigue and missed critical issues
    • Increased mean time to resolution (MTTR)
    • Inconsistent coverage and data silos across teams

The Business Case for Observability

  • Downtime can cost enterprises $300,000 to $5 million per hour
  • Observability investments can deliver a strong ROI by preventing or shortening downtime
  • Downtime also leads to lost productivity, a degraded customer experience, and delayed feature delivery

Shifting to Business Outcome Metrics

  • Customers care about business outcomes, not just technical metrics
  • Measuring business metrics (e.g., orders per minute) can provide an early warning of issues
  • Business metrics also quantify the impact of technical problems
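
As a rough illustration of publishing a business metric such as orders per minute, the sketch below builds one log line in the CloudWatch Embedded Metric Format (EMF), which CloudWatch can turn into a metric when the line is written to a log group. The talk summary does not say how the metric was published, so EMF is an assumption here, and the function name, the `Shop/Business` namespace, and the `Service` dimension are all hypothetical.

```python
import json
import time
from typing import Optional


def build_emf_record(order_count: int, service: str,
                     timestamp_ms: Optional[int] = None) -> str:
    """Return one EMF-formatted JSON log line for an orders-per-minute metric."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF expects epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": "Shop/Business",  # hypothetical namespace
                    "Dimensions": [["Service"]],   # names of dimension keys below
                    "Metrics": [{"Name": "OrdersPerMinute", "Unit": "Count"}],
                }
            ],
        },
        # Dimension value and metric value live at the top level of the record.
        "Service": service,
        "OrdersPerMinute": order_count,
    }
    return json.dumps(record)


# Fixed timestamp so the example is reproducible.
line = build_emf_record(order_count=412, service="checkout",
                        timestamp_ms=1_700_000_000_000)
print(line)
```

An alarm or anomaly detector on `OrdersPerMinute` would then react to a drop in orders even when every technical metric still looks healthy.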

Centralized Logging and Alarm Management

  • A new centralized logging feature aggregates logs across accounts and regions
  • Multi-resource alarms let a single alarm monitor thousands of resources
  • This reduces alarm sprawl and keeps thresholds consistent across resources
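
The idea behind a multi-resource alarm can be sketched in plain Python: one rule, defined once, is evaluated against the metric series of every resource. This is only a conceptual model, not the CloudWatch alarm API; the `ThresholdRule` class, the `breaching_resources` function, and the instance IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ThresholdRule:
    metric: str
    threshold: float
    datapoints_to_alarm: int  # consecutive breaching datapoints required


def breaching_resources(rule: ThresholdRule,
                        series: Dict[str, List[float]]) -> List[str]:
    """Return resources whose most recent datapoints all exceed the threshold."""
    breaching = []
    for resource_id, values in series.items():
        recent = values[-rule.datapoints_to_alarm:]
        # Resources with too little data are skipped rather than alarmed on.
        if (len(recent) == rule.datapoints_to_alarm
                and all(v > rule.threshold for v in recent)):
            breaching.append(resource_id)
    return sorted(breaching)


# One rule covers every instance; changing the threshold is a one-line edit.
rule = ThresholdRule(metric="CPUUtilization", threshold=80.0, datapoints_to_alarm=3)
cpu = {
    "i-aaa": [40.0, 85.0, 91.0, 88.0],
    "i-bbb": [82.0, 79.0, 90.0, 95.0],
    "i-ccc": [10.0, 12.0],
}
print(breaching_resources(rule, cpu))  # ['i-aaa']
```

Because the threshold lives in a single rule rather than in one alarm per resource, new resources are covered automatically and thresholds cannot drift apart.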

Tracing and Transaction Search

  • Transaction search queries 100% of traces in real time
  • Custom attributes can be added to traces to carry business context
  • Traces, logs, and metrics can be correlated without sampling gaps
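
To show why business attributes on spans are useful, here is a minimal sketch that filters a complete (unsampled) set of spans by a custom attribute. It models spans as plain dictionaries and is not the Transaction Search query language; the attribute names `customer.tier` and `order.id`, the span layout, and the `search_spans` helper are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List

Span = Dict[str, Any]


def search_spans(spans: Iterable[Span], filters: Dict[str, Any]) -> List[Span]:
    """Return every span whose attributes match all key/value pairs in filters."""
    return [
        span for span in spans
        if all(span.get("attributes", {}).get(key) == value
               for key, value in filters.items())
    ]


# Hypothetical spans; the attribute names are illustrative, not a required schema.
spans = [
    {"trace_id": "t1", "name": "POST /checkout", "status": "ERROR",
     "attributes": {"customer.tier": "premium", "order.id": "o-1001"}},
    {"trace_id": "t2", "name": "POST /checkout", "status": "OK",
     "attributes": {"customer.tier": "standard", "order.id": "o-1002"}},
    {"trace_id": "t3", "name": "GET /cart", "status": "ERROR",
     "attributes": {"customer.tier": "premium"}},
]

premium = search_spans(spans, {"customer.tier": "premium"})
failed_premium = [s["trace_id"] for s in premium if s["status"] == "ERROR"]
print(failed_premium)  # ['t1', 't3']
```

With sampling, either failing trace could have been dropped before the query ran; searching all spans is what makes a question like "which premium customers saw errors" answerable with confidence.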

Anomaly Detection for Metrics, Logs, and Traces

  • Metrics anomaly detection uses baselines that account for seasonality
  • Log anomaly detection identifies changes in pattern frequency, new patterns, and patterns that have disappeared
  • Trace anomaly detection in X-Ray analyzes latency, errors, and service dependencies
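
A seasonality-aware baseline can be illustrated with a deliberately simple model: keep a separate mean and standard deviation for each hour of the day, and flag values that fall outside a band around that hour's mean. This is a simplified stand-in for the idea, not the model CloudWatch uses; the function names, the band width of 2, and the sample data are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Baseline = Dict[int, Tuple[float, float]]


def hourly_baseline(history: List[Tuple[int, float]]) -> Baseline:
    """Map hour-of-day to (mean, standard deviation) from (hour, value) samples."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_anomalous(baseline: Baseline, hour: int, value: float,
                 band_width: float = 2.0) -> bool:
    """True if value is outside mean +/- band_width * stddev for that hour."""
    if hour not in baseline:
        return False  # no history for this hour, so make no judgement
    mu, sigma = baseline[hour]
    return abs(value - mu) > band_width * sigma


# Requests per minute: quiet at 03:00, busy at 12:00.
history = [(3, 20.0), (3, 22.0), (3, 18.0),
           (12, 400.0), (12, 420.0), (12, 380.0)]
baseline = hourly_baseline(history)

print(is_anomalous(baseline, 12, 400.0))  # False: normal midday load
print(is_anomalous(baseline, 3, 400.0))   # True: same value is abnormal at 03:00
```

A single static threshold cannot separate these two cases, which is the motivation for baselines that account for time of day or day of week.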

Automated Observability with Application Signals

  • Automatic instrumentation for Python, Java, .NET, and Node.js applications
  • Collects traces, metrics, and logs without manual code changes
  • Provides service-level observability and SLO management out-of-the-box
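
SLO management usually comes down to error-budget arithmetic. The sketch below shows the general calculation for a request-based SLO; it is not how Application Signals computes its figures internally, and the function name and the sample numbers are assumptions.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the required success ratio, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

With 250 failures against roughly 1,000 allowed, about three quarters of the budget is left; a negative result would mean the SLO has already been missed for the period.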

Specialized Observability Insights

  • Container Insights for EKS and ECS monitoring
  • Database Insights for RDS performance analysis and recommendations
  • Lambda Insights for function-level metrics and cold start analysis

AI-Driven Investigations and Runbook Automation

  • CloudWatch Investigations uses AI to identify root causes across metrics, logs, and traces
  • Integrates with Systems Manager for automated remediation runbooks
  • Enables rapid incident response and proactive issue resolution

Implementing a Comprehensive Observability Strategy

  1. Set up a centralized monitoring account and configure log centralization
  2. Deploy alerting frameworks and standardize logging/metrics
  3. Enable out-of-the-box insights like Container Insights and Database Insights
  4. Use Application Signals for automatic application instrumentation
  5. Use advanced capabilities like Contributor Insights, Anomaly Detection, and CloudWatch Investigations
  6. Integrate with incident management workflows and automate remediation

Key Takeaways

  • Enterprises must address observability challenges at scale to reduce downtime costs and improve reliability
  • Measuring business outcomes is critical for early issue detection and quantifying impact
  • AWS provides a comprehensive set of observability tools to centralize, analyze, and automate monitoring
  • A phased approach is recommended: start with foundational capabilities, then progress to advanced AI-driven insights
  • Integrating observability into incident management and remediation workflows maximizes business value
