Elevating Application Reliability: Insights from AWS re:Invent 2025
The Cost of Downtime
Downtime is expensive: over 90% of enterprises report losing more than $300,000 per hour of downtime, and 41% report losing $1-5 million per hour.
Downtime also leads to missed opportunities, damaged customer relationships, loss of productivity, and harm to brand reputation.
Reliability is not just an IT problem, but a critical business survival issue.
Resilience Foundations and Best Practices
Redundancy Across Multiple Availability Zones
Availability zones can fail, so it's important to design for redundancy across multiple AZs.
Services like Elastic Load Balancing and RDS Multi-AZ can help provide failover and load balancing across AZs.
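To make the benefit of spreading across AZs concrete, here is a minimal sketch of the arithmetic. It assumes AZ failures are independent and that the service stays up while at least one AZ is healthy; both are idealizations, and the 0.99 per-AZ figure is a made-up illustration, not an AWS number.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single_az: float, az_count: int) -> float:
    """Availability of a service that stays up while at least one AZ is up.

    Assumes AZ failures are independent, which is an idealization.
    """
    if not 0.0 <= single_az <= 1.0:
        raise ValueError("single_az must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The service is down only when every AZ is down at the same time.
    return 1.0 - (1.0 - single_az) ** az_count


for azs in (1, 2, 3):
    availability = combined_availability(0.99, azs)  # 0.99 is a hypothetical figure
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{azs} AZ(s): {availability:.6f} available, about {downtime_hours:.2f} h/year down")
```

Under these assumptions each extra AZ cuts expected downtime by the per-AZ failure probability, which is why the second AZ matters far more than the fourth.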
Infrastructure as Code
Treating infrastructure like application code, using tools like the AWS CDK, can help prevent issues from manual "click-ops" and make it easier to roll back changes.
Alarms and Automation
Alarms can trigger recovery actions automatically, often before anyone notices the problem.
Autoscaling, both at the infrastructure and application level, can help handle spikes in demand.
Managed services like Lambda and DynamoDB can abstract away infrastructure management.
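The idea of an alarm driving a recovery action can be sketched in plain Python. The class below is a hypothetical illustration, not CloudWatch's implementation: it fires a callback when M of the last N datapoints breach a threshold, loosely mirroring the "datapoints to alarm" setting on CloudWatch alarms. The name ThresholdAlarm and the "add capacity" action are assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque


class ThresholdAlarm:
    """Hypothetical alarm: fires when M of the last N datapoints breach a threshold."""

    def __init__(self, threshold: float, datapoints_to_alarm: int,
                 evaluation_periods: int, on_alarm: Callable[[], None]) -> None:
        if not 1 <= datapoints_to_alarm <= evaluation_periods:
            raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
        self.threshold = threshold
        self.datapoints_to_alarm = datapoints_to_alarm
        # Sliding window of breach flags for the last N datapoints.
        self.window: Deque[bool] = deque(maxlen=evaluation_periods)
        self.on_alarm = on_alarm
        self.in_alarm = False

    def record(self, value: float) -> None:
        self.window.append(value > self.threshold)
        breaching = sum(self.window) >= self.datapoints_to_alarm
        if breaching and not self.in_alarm:
            self.on_alarm()  # run the recovery action once per transition into alarm
        self.in_alarm = breaching


actions = []
alarm = ThresholdAlarm(threshold=80.0, datapoints_to_alarm=2, evaluation_periods=3,
                       on_alarm=lambda: actions.append("add capacity"))
for cpu_percent in [50, 85, 60, 90, 95, 40]:
    alarm.record(cpu_percent)
print(actions)
```

Requiring two breaches out of three keeps a single noisy datapoint from triggering a scale-out, at the cost of reacting one period later.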
Game Days and Fault Injection
Regular "game days" to test failure scenarios can help prepare teams for real incidents.
AWS Fault Injection Service and Resilience Hub can be used to inject faults and analyze application resilience.
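A game day can also be rehearsed at the code level before involving real infrastructure. The sketch below is not AWS Fault Injection Service; it is a small, hypothetical decorator (inject_faults) that makes a dependency fail at a chosen rate so that retry logic can be exercised. The function fetch_order and its return value are invented for the example.

```python
import random
from functools import wraps


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def inject_faults(failure_rate: float, rng: random.Random):
    """Hypothetical decorator that fails a call with the given probability."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_retries(func, attempts: int):
    """Retry on injected faults; re-raise if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except InjectedFault:
            if attempt == attempts:
                raise


rng = random.Random(42)  # seeded so a rehearsal run is repeatable


@inject_faults(failure_rate=0.5, rng=rng)
def fetch_order():
    # Stand-in for a call to a real dependency.
    return {"order_id": 1, "status": "shipped"}


successes = 0
for _ in range(100):
    try:
        call_with_retries(fetch_order, attempts=3)
        successes += 1
    except InjectedFault:
        pass
print(f"{successes}/100 requests survived a 50% injected failure rate with 3 attempts")
```

Seeding the random generator is a deliberate choice here: a failure scenario that can be replayed exactly is easier to discuss in the post-game-day review.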
Balancing Resilience and Cost
Highly resilient architectures may not always be cost-effective for non-critical applications.
It's important to understand the trade-offs and risks when designing for resilience.
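One rough way to frame the trade-off is to compare the downtime loss a more resilient design would avoid against what that design costs. The sketch below assumes downtime cost scales linearly with hours down; the availability targets, the $500 per hour internal-tool figure, and the extra yearly spend are hypothetical, while $300,000 per hour is the figure quoted earlier.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly loss from downtime at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR * cost_per_hour


def worth_upgrading(current: float, improved: float,
                    cost_per_hour: float, extra_yearly_spend: float) -> bool:
    """True when the downtime loss avoided exceeds the added yearly spend."""
    avoided = (expected_downtime_cost(current, cost_per_hour)
               - expected_downtime_cost(improved, cost_per_hour))
    return avoided > extra_yearly_spend


# Business-critical app: downtime costs $300,000 per hour.
critical = worth_upgrading(0.999, 0.9999, cost_per_hour=300_000,
                           extra_yearly_spend=500_000)
# Internal tool: hypothetical $500 per hour, same upgrade and same spend.
internal = worth_upgrading(0.999, 0.9999, cost_per_hour=500,
                           extra_yearly_spend=500_000)
print(f"critical app: upgrade pays off = {critical}")
print(f"internal tool: upgrade pays off = {internal}")
```

The same upgrade pays for itself several times over on the critical application and is hard to justify on the internal tool, which is the point of matching resilience to criticality.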
Observability and Reliability Improvement Cycle
Measuring What Matters
Track business metrics (e.g. revenue impact, transactions), user experience metrics (e.g. Core Web Vitals), and service health metrics (e.g. latency, errors).
Define Service Level Objectives (SLOs) to make reliability measurable and set targets.
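An SLO becomes actionable once it is expressed as an error budget: the number of failures the target allows over a window. The sketch below is a generic calculation, not how CloudWatch Application Signals computes it; the 99.9% target and the request counts are hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.0%}")
```

A team might slow feature releases when the remaining budget gets low and spend it on riskier changes when plenty is left.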
Amazon CloudWatch Application Signals
Automatically instruments applications to collect metrics, traces, and logs without code changes.
Provides service topology visualization, pre-built dashboards, and native SLO tracking.
Detecting Failures: Lagging vs. Leading Indicators
Lagging indicators (e.g. error spikes) show problems that have already occurred.
Leading indicators (e.g. gradually increasing latency) can help detect issues before they impact customers.
CloudWatch features like Logs Anomaly Detection and Metrics Anomaly Detection can help surface both types of indicators.
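To illustrate what a leading indicator looks like in practice, the sketch below flags latency that has drifted above its recent baseline even though no errors have occurred yet. It is a deliberately simple mean-plus-standard-deviation check, not the model CloudWatch anomaly detection uses; the function name, window sizes, and sample values are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def latency_drift(samples: Sequence[float], baseline_size: int,
                  recent_size: int, sigmas: float = 3.0) -> bool:
    """True when the recent mean exceeds the baseline mean by `sigmas` std devs."""
    if baseline_size < 2 or recent_size < 1:
        raise ValueError("need baseline_size >= 2 and recent_size >= 1")
    if len(samples) < baseline_size + recent_size:
        raise ValueError("not enough samples for both windows")
    baseline = samples[:baseline_size]
    recent = samples[-recent_size:]
    limit = mean(baseline) + sigmas * stdev(baseline)
    return mean(recent) > limit


# Hypothetical latency samples in milliseconds.
steady = [100, 102, 98, 101, 99, 100, 103, 97]
creeping = steady + [104, 108, 112]  # slow upward drift, no errors yet
stable = steady + [99, 101, 100]

print(latency_drift(creeping, baseline_size=8, recent_size=3))
print(latency_drift(stable, baseline_size=8, recent_size=3))
```

The drifting series trips the check while the stable one does not, which is the kind of early signal that leaves time to act before customers see failures.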
Investigating Incidents with Amazon CloudWatch investigations
Automatically correlates metrics, logs, traces, and other telemetry data to identify root causes.
Generates AI-powered findings, hypotheses, and suggested actions to resolve issues quickly.
Provides a standardized incident report to share with stakeholders.
Resolving Issues with Generative AI
Simulating User Interactions with Kiro CLI
The Kiro CLI can be used to programmatically interact with applications and simulate user actions.
When a simulated action fails, it can investigate the underlying AWS infrastructure to identify the root cause.
Once the issue is diagnosed, Kiro CLI can take remediation actions, such as scaling up resources.
Balancing Criticality and Resilience
When designing resilient systems, it's important to consider the business criticality and balance the cost of resilience accordingly.
Providing a simpler fallback (the equivalent of keeping candles on hand for a power cut) can be a cost-effective way to maintain an acceptable user experience during outages.
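In code, a fallback of this kind often amounts to catching a dependency failure and serving a cheaper, degraded response. The sketch below is a minimal illustration; get_recommendations, the personalization call, and the static best-seller list are all hypothetical names chosen for the example.

```python
from typing import Callable, List, Sequence, Tuple


def get_recommendations(user_id: str,
                        fetch_personalized: Callable[[str], List[str]],
                        fallback_items: Sequence[str]) -> Tuple[List[str], str]:
    """Return personalized items, or a static list if the dependency fails."""
    try:
        return fetch_personalized(user_id), "personalized"
    except Exception:
        # Degraded but still useful: the page renders with generic content.
        return list(fallback_items), "fallback"


def broken_service(user_id: str) -> List[str]:
    # Stand-in for a recommendation service that is currently down.
    raise ConnectionError("recommendation service unavailable")


best_sellers = ["item-1", "item-2", "item-3"]  # hypothetical static list
items, source = get_recommendations("user-42", broken_service, best_sellers)
print(source, items)
```

Returning the source alongside the items lets the caller record how often the fallback path is used, which is itself a useful reliability metric.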
Key Takeaways
Redundancy, infrastructure as code, alarms, and automation are critical for building resilient systems.
Observability, with tools like CloudWatch Application Signals, is key for detecting and investigating issues.
Generative AI can be leveraged to quickly simulate, diagnose, and resolve application problems.
Balancing resilience with cost and criticality is important when designing reliable architectures.