AWS re:Invent 2025 - Driving Resilience with Assurance and Visibility from Edge to Cloud (COP101)
Driving Resilience with Assurance and Visibility from Edge to Cloud
Overview
This presentation from AWS re:Invent 2025 covers how organizations can keep distributed cloud and edge infrastructure resilient and reliable. The key topics include:
Ensuring end-to-end visibility and observability from the edge to the cloud
Implementing proactive assurance and automated remediation to maintain system health
Leveraging AI/ML-powered analytics to predict and prevent issues before they impact customers
Aligning operational and business metrics to drive resilience and business continuity
Achieving End-to-End Visibility
Importance of having a unified view across all cloud, on-premises, and edge environments
Leveraging AWS services like Amazon CloudWatch, AWS X-Ray, and AWS IoT Core to collect telemetry data
Integrating with third-party monitoring and observability tools for comprehensive visibility
Applying AI/ML-powered analytics to detect anomalies, identify root causes, and predict issues
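As a rough illustration of the anomaly-detection idea above, the sketch below flags telemetry samples that deviate sharply from a trailing window. It uses a simple rolling z-score, not the AI/ML analytics the talk refers to. The function name `detect_anomalies`, the window size, the threshold, and the sample latencies are all assumptions made for this example.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the trailing window.

    Hypothetical helper: a rolling z-score check, used here as a simple
    stand-in for a managed anomaly-detection service.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        # Only score a sample once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip the check when the window is flat to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Made-up latency samples in milliseconds; index 6 is an obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 450, 101, 99]
print(detect_anomalies(latencies_ms))
```

One limitation of this approach: a spike enters the trailing window and inflates its spread, so samples right after an anomaly are scored less strictly until the spike ages out.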
Proactive Assurance and Automated Remediation
Implementing automated health checks, anomaly detection, and self-healing capabilities
Leveraging AWS Systems Manager, AWS Lambda, and AWS Step Functions for automated remediation
Defining custom health models and SLIs/SLOs to measure and maintain system reliability
Automating the deployment of resilience patterns like circuit breakers, retries, and failovers
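To make the circuit-breaker pattern mentioned above concrete, here is a minimal in-process sketch. It is not tied to any AWS service, and the class names, default thresholds, and injectable `clock` parameter are assumptions chosen for illustration rather than anything shown in the talk.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (closed, open, half-open)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or fail fast instead of waiting on a dependency that is already known to be unhealthy.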
Aligning Operational and Business Metrics
Linking technical metrics (e.g., latency, error rates, resource utilization) to business KPIs
Using AI/ML to correlate operational data with customer experience and revenue impact
Establishing real-time dashboards to provide visibility into the business impact of system health
Empowering cross-functional teams to make data-driven decisions that optimize for resilience
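One simple way to start linking a technical metric to a business KPI is to measure how strongly the two move together. The sketch below computes a Pearson correlation coefficient between latency and conversion rate. It is a plain statistical stand-in, not the AI/ML correlation the talk describes, and the function name, metric names, and sample values are invented for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up hourly samples: p95 latency (ms) and checkout conversion rate (%).
p95_latency_ms = [120, 150, 200, 320, 500]
conversion_pct = [3.1, 3.0, 2.7, 2.2, 1.5]

r = pearson(p95_latency_ms, conversion_pct)
print(f"latency vs. conversion: r = {r:.3f}")
```

A coefficient close to -1 in data like this would suggest that conversion falls as latency rises. Correlation alone does not establish cause, so a result of this kind is a prompt for investigation rather than a conclusion.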
Edge-to-Cloud Resilience Use Cases
Maintaining reliable connectivity and data processing at the edge during network disruptions
Ensuring consistent user experience and low latency for mobile and IoT applications
Dynamically scaling and load-balancing workloads across cloud and edge environments
Securely managing and updating edge devices with minimal manual intervention
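For the first use case above, processing data at the edge during network disruptions, a common approach is store-and-forward: buffer readings locally while the uplink is down and drain the buffer in order once it returns. The sketch below assumes a hypothetical `send` callable that raises `ConnectionError` when offline. The class name and buffer size are likewise illustrative and do not come from the talk.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Hypothetical edge-side buffer that holds readings while offline."""

    def __init__(self, send, max_items=1000):
        self.send = send  # callable that raises ConnectionError when offline
        # Bounded queue: when full, the oldest readings are dropped first.
        self.pending = deque(maxlen=max_items)

    def record(self, reading):
        """Queue a reading, then try to deliver everything that is pending."""
        self.pending.append(reading)
        return self.flush()

    def flush(self):
        """Send pending readings in order; return how many were delivered."""
        sent = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break  # still offline; keep the reading for the next attempt
            # Remove a reading only after it has been sent successfully.
            self.pending.popleft()
            sent += 1
        return sent
```

Dropping the oldest readings when the buffer fills is one possible policy. A device that must not lose data would instead persist the queue to local storage or apply backpressure to the producer.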
Key Takeaways
Achieving end-to-end visibility and observability is crucial for maintaining resilience across distributed systems
Proactive assurance and automated remediation can help organizations prevent issues and maintain high availability
Aligning operational and business metrics enables data-driven decision-making to optimize for resilience and business continuity
Leveraging edge-to-cloud resilience patterns can help organizations deliver reliable, low-latency experiences for customers