AWS re:Invent 2025 - Elevating application reliability (COP336)

Elevating Application Reliability: Insights from AWS re:Invent 2025

The Cost of Downtime

Downtime carries a significant financial cost: more than 90% of enterprises reportedly lose over $300,000 per hour of downtime, and 41% lose $1-5 million per hour.
Downtime also leads to missed opportunities, damaged customer relationships, loss of productivity, and harm to brand reputation.
Reliability is not just an IT problem, but a critical business survival issue.

Resilience Foundations and Best Practices

Redundancy Across Multiple Availability Zones

An Availability Zone (AZ) can fail, so design for redundancy across multiple AZs.
Services like Elastic Load Balancing and RDS Multi-AZ can help provide failover and load balancing across AZs.
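To make the benefit of multi-AZ redundancy concrete, here is a rough sketch of the arithmetic. It assumes each AZ deployment fails independently and has the same availability, which real outages only approximate; the 99.5% figure and the function name are illustrative and do not come from the talk.

```python
def combined_availability(per_zone_availability: float, zones: int) -> float:
    """Availability of a service that stays up while at least one zone is up.

    Assumes zone failures are independent, so the service is down only when
    every zone is down at the same time.
    """
    if not 0.0 <= per_zone_availability <= 1.0:
        raise ValueError("per_zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return 1.0 - (1.0 - per_zone_availability) ** zones


if __name__ == "__main__":
    # Illustrative per-zone availability, not an AWS-published number.
    for zones in (1, 2, 3):
        print(zones, f"{combined_availability(0.995, zones):.6f}")
```

Under these assumptions each extra zone multiplies the chance of a full outage by the per-zone failure probability, which is why the second AZ gives the largest improvement.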

Infrastructure as Code

Treating infrastructure like application code, using tools like the AWS CDK, can help prevent issues from manual "click-ops" and make it easier to roll back changes.

Alarms and Automation

Alarms can trigger recovery actions before customers notice a problem.
Autoscaling, both at the infrastructure and application level, can help handle spikes in demand.
Managed services like Lambda and DynamoDB can abstract away infrastructure management.
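One way to picture how an alarm decides when to fire is the "M out of N datapoints" rule. The sketch below is a simplified model written for illustration, not CloudWatch's implementation; the parameter names loosely mirror the alarm settings for evaluation periods and datapoints to alarm, and the latency values are made up.

```python
from collections import deque
from typing import Deque, Iterable, List


def alarm_states(values: Iterable[float], threshold: float,
                 datapoints_to_alarm: int, evaluation_periods: int) -> List[str]:
    """Return "ALARM" or "OK" after each incoming datapoint.

    The state is "ALARM" when at least `datapoints_to_alarm` of the most
    recent `evaluation_periods` datapoints are above `threshold`.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    # Sliding window of breach flags; old flags drop off automatically.
    window: Deque[bool] = deque(maxlen=evaluation_periods)
    states: List[str] = []
    for value in values:
        window.append(value > threshold)
        breaching = sum(window)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


if __name__ == "__main__":
    latencies_ms = [120, 480, 130, 140, 510, 520]  # made-up sample data
    print(alarm_states(latencies_ms, threshold=400,
                       datapoints_to_alarm=2, evaluation_periods=3))
```

Requiring two breaches out of three keeps a single latency spike from triggering a recovery action, while a sustained breach still does.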

Game Days and Fault Injection

Regular "game days" to test failure scenarios can help prepare teams for real incidents.
AWS Fault Injection Service and Resilience Hub can be used to inject faults and analyze application resilience.
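The idea behind fault injection can be tried locally before running a full game day. The following is a minimal sketch in plain Python and does not use the AWS Fault Injection Service API; `InjectedFault`, `with_faults`, and `call_with_retries` are hypothetical names introduced here for illustration.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised in place of a real dependency failure during an experiment."""


def with_faults(func: Callable[[], T], failure_rate: float,
                rng: random.Random) -> Callable[[], T]:
    """Wrap `func` so that roughly `failure_rate` of calls fail before running it."""
    def wrapper() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return func()
    return wrapper


def call_with_retries(func: Callable[[], T], attempts: int) -> T:
    """Call `func` up to `attempts` times and re-raise the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[InjectedFault] = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    assert last_error is not None
    raise last_error


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so a run can be repeated
    flaky = with_faults(lambda: "ok", failure_rate=0.3, rng=rng)
    succeeded = 0
    for _ in range(1000):
        try:
            call_with_retries(flaky, attempts=3)
            succeeded += 1
        except InjectedFault:
            pass
    print(f"{succeeded} of 1000 requests succeeded with retries")
```

Running the same experiment with `attempts=1` shows how much the retry policy contributes, which is the kind of question a game day is meant to answer.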

Balancing Resilience and Cost

Highly resilient architectures may not always be cost-effective for non-critical applications.
It's important to understand the trade-offs and risks when designing for resilience.
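The trade-off can be framed as simple arithmetic: compare the downtime cost a higher availability target would avoid with the extra spend needed to reach it. This is a sketch under stated assumptions; the availability targets and the extra annual spend are hypothetical, and only the $300,000 per hour figure echoes the downtime numbers quoted earlier.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_hours(availability: float) -> float:
    """Hours of downtime per year implied by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at the given availability."""
    return expected_downtime_hours(availability) * cost_per_hour


def worth_upgrading(current_availability: float, target_availability: float,
                    cost_per_hour: float, extra_annual_spend: float) -> bool:
    """True when the avoided downtime cost exceeds the extra resilience spend."""
    saved = (annual_downtime_cost(current_availability, cost_per_hour)
             - annual_downtime_cost(target_availability, cost_per_hour))
    return saved > extra_annual_spend


if __name__ == "__main__":
    # A critical application losing $300,000 per hour of downtime.
    print(worth_upgrading(0.99, 0.999, 300_000, 500_000))
    # An internal tool losing a hypothetical $500 per hour.
    print(worth_upgrading(0.99, 0.999, 500, 500_000))
```

The same upgrade that pays for itself many times over on a critical application can cost far more than it saves on a low-impact one.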

Observability and Reliability Improvement Cycle

Measuring What Matters

Track business metrics (e.g. revenue impact, transactions), user experience metrics (e.g. Core Web Vitals), and service health metrics (e.g. latency, errors).
Define Service Level Objectives (SLOs) to make reliability measurable and set targets.
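An SLO becomes actionable once it is turned into an error budget. The sketch below shows that calculation for a request-based SLO; the `Slo` class and the request counts are hypothetical and are not taken from the talk or from any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A request-based objective, e.g. target=0.999 for 99.9% success."""
    target: float

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def error_budget(self, total_requests: int) -> float:
        """Number of failed requests the target tolerates in the period."""
        return (1.0 - self.target) * total_requests

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the error budget still unspent; negative means breached."""
        if total_requests <= 0:
            return 1.0  # no traffic yet, so nothing has been spent
        return 1.0 - failed_requests / self.error_budget(total_requests)


if __name__ == "__main__":
    slo = Slo(target=0.999)
    print(round(slo.error_budget(1_000_000)))
    print(round(slo.budget_remaining(1_000_000, 250), 2))
```

A team can use the remaining budget to decide whether to keep shipping changes or to pause and invest in reliability work.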

Amazon CloudWatch Application Signals

Automatically instruments applications to collect metrics, traces, and logs without code changes.
Provides service topology visualization, pre-built dashboards, and native SLO tracking.

Detecting Failures: Lagging vs. Leading Indicators

Lagging indicators (e.g. error spikes) show problems that have already occurred.
Leading indicators (e.g. gradually increasing latency) can help detect issues before they impact customers.
CloudWatch features like Logs Anomaly Detection and Metrics Anomaly Detection can help surface both types of indicators.
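As a rough illustration of what anomaly detection on a metric involves, the sketch below flags datapoints that fall outside a band built from recent history. It is a deliberately simple rolling mean and standard deviation check, not the model CloudWatch uses, and the window size, band width, and latency values are assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], window: int = 5,
                   band: float = 3.0) -> List[int]:
    """Return indexes of values outside mean +/- band * stdev of the prior window.

    Each datapoint is compared only with the `window` datapoints before it,
    so the first `window` datapoints are never flagged.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    anomalies: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        centre = mean(history)
        spread = stdev(history)
        if abs(values[i] - centre) > band * spread:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    latencies_ms = [100, 102, 99, 101, 100, 103, 98, 180, 101]  # made-up data
    print(find_anomalies(latencies_ms))
```

A band like this catches abrupt deviations; a slow upward drift in latency, the leading indicator described above, would also need a comparison against a longer baseline because a short rolling window adapts to the drift.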

Investigating Incidents with Amazon CloudWatch Investigations

Automatically correlates metrics, logs, traces, and other telemetry data to identify root causes.
Generates AI-powered findings, hypotheses, and suggested actions to resolve issues quickly.
Provides a standardized incident report to share with stakeholders.

Resolving Issues with Generative AI

Simulating User Interactions with Kiro CLI

The Kiro CLI can interact with an application programmatically and simulate user actions.
When a simulated action fails, it can investigate the underlying AWS infrastructure to identify the root cause.
Once the issue is diagnosed, the Kiro CLI can take remediation actions, such as scaling up resources.

Balancing Criticality and Resilience

When designing resilient systems, it's important to consider the business criticality and balance the cost of resilience accordingly.
Providing a simpler fallback (much as candles provide light during a power cut) can be a cost-effective way to maintain the user experience during outages.
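In software, a fallback often means serving a last known good value or a static default when the primary dependency is down. The sketch below shows one way to do that; `FallbackReader` and the sample recommendation lists are hypothetical names used only for illustration.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class FallbackReader(Generic[T]):
    """Serve the primary result when possible, otherwise degrade gracefully.

    Order of preference: fresh value from `primary`, then the last value that
    `primary` returned successfully, then the static `default`.
    """

    def __init__(self, primary: Callable[[], T], default: T) -> None:
        self._primary = primary
        self._default = default
        self._last_good: Optional[T] = None
        self._has_last_good = False

    def read(self) -> T:
        try:
            value = self._primary()
        except Exception:
            # Primary is unavailable: fall back instead of failing the request.
            if self._has_last_good:
                return self._last_good  # type: ignore[return-value]
            return self._default
        # Remember the fresh value so it can be served during a later outage.
        self._last_good = value
        self._has_last_good = True
        return value


if __name__ == "__main__":
    def personalised_recommendations() -> list:
        raise ConnectionError("recommendation service is down")

    reader = FallbackReader(personalised_recommendations, ["bestsellers"])
    print(reader.read())
```

The fallback is cheaper than making the primary dependency fully redundant, at the price of a less tailored response while the outage lasts.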

Key Takeaways

Redundancy, infrastructure as code, alarms, and automation are critical for building resilient systems.
Observability, with tools like CloudWatch Application Signals, is key for detecting and investigating issues.
Generative AI can be leveraged to quickly simulate, diagnose, and resolve application problems.
Balancing resilience with cost and criticality is important when designing reliable architectures.

