AWS re:Invent 2025 - Driving Resilience with Assurance and Visibility from Edge to Cloud (COP101)

Overview

This presentation from AWS re:Invent 2025 focuses on how organizations can achieve resilience and reliability across their distributed cloud and edge infrastructure. The key topics covered include:

  • Ensuring end-to-end visibility and observability from the edge to the cloud
  • Implementing proactive assurance and automated remediation to maintain system health
  • Leveraging AI/ML-powered analytics to predict and prevent issues before they impact customers
  • Aligning operational and business metrics to drive resilience and business continuity

Achieving End-to-End Visibility

  • Importance of having a unified view across all cloud, on-premises, and edge environments
  • Leveraging AWS services like Amazon CloudWatch, AWS X-Ray, and AWS IoT Core to collect telemetry data
  • Integrating with third-party monitoring and observability tools for comprehensive visibility
  • Applying AI/ML-powered analytics to detect anomalies, identify root causes, and predict issues
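
The anomaly-detection idea in the last point can be illustrated with a very small statistical baseline. The sketch below is not the method presented in the talk and does not use any AWS API; it is a plain rolling z-score over a trailing window, and the function name, window size, threshold, and sample latencies are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the trailing window.

    Hypothetical helper: a rolling z-score, not a managed ML detector.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma == 0:
                # A perfectly flat baseline: any change counts as anomalous.
                is_anomaly = value != mu
            else:
                is_anomaly = abs(value - mu) / sigma > threshold
            if is_anomaly:
                anomalies.append(index)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return anomalies


# Illustrative latency samples in milliseconds (synthetic values).
latencies_ms = [20, 21, 19, 20, 22, 21, 250, 20, 19, 21]
print(detect_anomalies(latencies_ms))  # prints [6]
```

Excluding flagged samples from the baseline keeps a single spike from inflating the window's standard deviation and masking the next one. A production detector would also need to handle seasonality and slow drift, which this sketch ignores.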

Proactive Assurance and Automated Remediation

  • Implementing automated health checks, anomaly detection, and self-healing capabilities
  • Leveraging AWS Systems Manager, AWS Lambda, and AWS Step Functions for automated remediation
  • Defining custom health models and service level indicators and objectives (SLIs/SLOs) to measure and maintain system reliability
  • Automating the deployment of resilience patterns like circuit breakers, retries, and failovers
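
Of the resilience patterns named above, the circuit breaker is the easiest to show in a few lines. What follows is a minimal sketch under stated assumptions, not an implementation from the talk or from any AWS service: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all hypothetical choices made so the example is self-contained.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be simulated
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller wraps each dependency call in `breaker.call(...)`. Once the threshold is reached, further calls fail fast with `CircuitOpenError` instead of waiting on a dependency that is already known to be unhealthy, and a single trial call after `reset_timeout` decides whether the circuit closes again.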

Aligning Operational and Business Metrics

  • Linking technical metrics (e.g., latency, error rates, resource utilization) to business KPIs
  • Using AI/ML to correlate operational data with customer experience and revenue impact
  • Establishing real-time dashboards to provide visibility into the business impact of system health
  • Empowering cross-functional teams to make data-driven decisions that optimize for resilience
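
One simple way to start linking a technical metric to a business KPI is to check how strongly the two move together. The sketch below computes a Pearson correlation coefficient between latency and conversion rate. It is an illustration only: the talk summary does not specify a method, the function name is hypothetical, and the numbers are synthetic values invented for the example, not results from the presentation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Synthetic hourly aggregates (assumed values for illustration only).
p95_latency_ms = [100, 200, 300, 400, 500]
conversion_pct = [5.0, 4.5, 4.1, 3.4, 3.0]

r = pearson(p95_latency_ms, conversion_pct)
print(f"latency vs. conversion: r = {r:.2f}")
```

A strongly negative coefficient suggests that conversion falls as latency rises, which is a prompt for further investigation rather than proof of causation.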

Edge-to-Cloud Resilience Use Cases

  • Maintaining reliable connectivity and data processing at the edge during network disruptions
  • Ensuring consistent user experience and low latency for mobile and IoT applications
  • Dynamically scaling and load-balancing workloads across cloud and edge environments
  • Securely managing and updating edge devices with minimal manual intervention

Key Takeaways

  • Achieving end-to-end visibility and observability is crucial for maintaining resilience across distributed systems
  • Proactive assurance and automated remediation can help organizations prevent issues and maintain high availability
  • Aligning operational and business metrics enables data-driven decision-making to optimize for resilience and business continuity
  • Leveraging edge-to-cloud resilience patterns can help organizations deliver reliable, low-latency experiences for customers
