AWS re:Invent 2025 - Global Resilient Apps: Guide to Multi-AZ/Region Architecture with ELB (NET311)
Resilience and Availability
Resilience is the ability of a workload to recover from infrastructure or service disruptions
Resilience has technical drivers (less downtime, higher availability, lower latency) and business drivers (revenue, customer trust, public image)
Common failure scenarios include disasters, data corruption, core infrastructure issues, and configuration/deployment problems
Disaster recovery focuses on backups and restoring service after an outage, which is usually slower, while high availability aims to keep systems running across multiple sites
Multi-AZ Resilience with ELB
ELB provides resilience by scaling, distributing traffic, and performing health checks
ELB publishes healthy zone IPs in DNS, using a 60-second TTL to enable fast failover
ELB uses a two-tier health check system:
Route 53 checks the load balancer nodes
Load balancer nodes check the targets
This allows ELB to reroute traffic from unhealthy targets or zones during failures
Configuring an early-intervention threshold (e.g. treating a zone as failed once 30% of its targets are unhealthy) can trigger failover before the zone fails completely
Cross-Zone Load Balancing
Cross-zone load balancing distributes traffic across all targets, even in different zones
This helps avoid disproportionate load on zones with fewer targets
However, keeping an equal number of healthy targets in each zone is still recommended for static stability
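The effect of uneven zones can be checked with a little arithmetic. The sketch below is illustrative only: it assumes DNS splits client traffic evenly across zones and that each load balancer node spreads its share evenly over the targets it can reach. The function name `per_target_share` and the zone names are hypothetical.

```python
def per_target_share(targets_per_zone, cross_zone):
    """Return the fraction of total traffic one target in each zone receives.

    Assumption: DNS sends an equal share of clients to every zone that has targets.
    """
    zones = {zone: count for zone, count in targets_per_zone.items() if count > 0}
    total_targets = sum(zones.values())
    shares = {}
    for zone, count in zones.items():
        if cross_zone:
            # Every node can reach every target, so all targets share the load equally.
            shares[zone] = 1 / total_targets
        else:
            # Each zone's slice of traffic is split only among that zone's targets.
            shares[zone] = (1 / len(zones)) / count
    return shares


fleet = {"az-a": 8, "az-b": 2}
shares_off = per_target_share(fleet, cross_zone=False)
shares_on = per_target_share(fleet, cross_zone=True)
print(shares_off)  # targets in az-b carry four times the load of targets in az-a
print(shares_on)   # every target carries the same share
```

With cross-zone load balancing off, the two targets in the smaller zone each take a quarter of all traffic. With it on, each of the ten targets takes a tenth.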
DNS Failover Mechanisms
ELB removes unhealthy zone IPs from DNS to route traffic away from failed zones
If every zone is unhealthy, ELB fails open and returns all IPs anyway, to avoid a complete outage
This can mask other failures, so monitoring for "fail open" mode is important
Configurable health thresholds (e.g. 30% unhealthy) can trigger earlier failover
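The interaction between the health threshold and fail-open behavior can be modeled in a few lines. This is a simplified sketch, not ELB's actual implementation: `published_zones` and its parameters are hypothetical names, and the rule "withdraw a zone once its unhealthy fraction reaches the threshold" is an assumption made for illustration.

```python
def published_zones(zone_health, max_unhealthy_fraction=0.3):
    """Decide which zones stay in DNS.

    zone_health maps zone name -> (healthy_targets, total_targets).
    Returns (zones_in_dns, failed_open).
    """
    kept = []
    for zone, (healthy, total) in zone_health.items():
        if total == 0:
            continue  # a zone with no targets has nothing to serve
        unhealthy_fraction = (total - healthy) / total
        if unhealthy_fraction < max_unhealthy_fraction:
            kept.append(zone)
    if kept:
        return sorted(kept), False
    # Fail open: no zone passes, so publish every zone instead of returning nothing.
    return sorted(zone_health), True


print(published_zones({"az-a": (10, 10), "az-b": (6, 10), "az-c": (9, 10)}))
print(published_zones({"az-a": (0, 10), "az-b": (5, 10)}))
```

The second return value is the signal worth alarming on: when it is true, DNS looks normal from the outside even though no zone passes its health threshold.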
Observability and Monitoring
Monitor both negative (errors) and positive (successful requests, healthy hosts) metrics
Use CloudWatch composite alarms to correlate issues across the stack
Analyze single-zone vs multi-zone failures to identify root causes
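A first pass at the single-zone versus multi-zone question can be automated from per-zone error rates. The sketch below is an assumption-laden illustration: the 5% threshold, the function name `classify_failure`, and the interpretation of each outcome are hypothetical, not taken from the talk.

```python
def classify_failure(zone_error_rates, threshold=0.05):
    """Classify an incident from per-zone error rates (fraction of failed requests).

    Returns (classification, impaired_zones).
    """
    impaired = sorted(
        zone for zone, rate in zone_error_rates.items() if rate > threshold
    )
    if not impaired:
        return "healthy", impaired
    if len(impaired) == 1 and len(zone_error_rates) > 1:
        # Only one zone is affected: a zonal infrastructure problem is a likely suspect.
        return "single-zone", impaired
    # Several zones are affected: look at deployments, configuration, or shared dependencies.
    return "multi-zone", impaired


print(classify_failure({"az-a": 0.01, "az-b": 0.40, "az-c": 0.02}))
print(classify_failure({"az-a": 0.30, "az-b": 0.40, "az-c": 0.02}))
```

A single-zone result points toward shifting traffic away from that zone, while a multi-zone result suggests that moving traffic between zones will not help.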
Client Best Practices
Clients should properly handle connection management, retries, and exponential backoff
This enables balanced connections, graceful failure handling, and faster recovery
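A minimal client-side retry loop with exponential backoff might look like the following. It is a sketch under stated assumptions: the helper name `call_with_retries` and its defaults are hypothetical, only `ConnectionError` is treated as retryable, and the random "full jitter" on each delay is an addition not mentioned in the talk, included so that many clients do not retry in lockstep.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ConnectionError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount between 0 and the ceiling.
            sleep(rng() * ceiling)


# Usage: an operation that fails twice before succeeding.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("connection reset")
    return "ok"

print(call_with_retries(flaky, base_delay=0.01))
```

The `sleep` and `rng` parameters exist only to make the function easy to test without real waiting.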
Multi-Region Resilience
Multi-region provides an additional layer of blast radius isolation
Key challenges are aligning stakeholders and keeping the architecture simple
Leverage global services like Route 53, CloudFront, and Global Accelerator for multi-region traffic routing
Use Route 53 failover records to shift traffic between primary and backup regions
Implement DNS load shedding to gracefully degrade during overload
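As an illustration of the failover-record approach, the sketch below builds a Route 53 change batch with one PRIMARY and one SECONDARY alias record. It only constructs the data structure and makes no AWS calls. The helper name `failover_change_batch`, the domain, the load balancer DNS names, and the hosted zone IDs are all placeholders invented for this example.

```python
def failover_change_batch(record_name, primary, secondary):
    """Build a change batch containing a PRIMARY/SECONDARY failover record pair."""
    def record(role, target):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                # SetIdentifier must be unique among records sharing a name and type.
                "SetIdentifier": f"{role.lower()}-{target['region']}",
                "Failover": role,
                "AliasTarget": {
                    "HostedZoneId": target["lb_hosted_zone_id"],
                    "DNSName": target["lb_dns_name"],
                    # Let Route 53 use the load balancer's health to decide on failover.
                    "EvaluateTargetHealth": True,
                },
            },
        }

    return {"Changes": [record("PRIMARY", primary), record("SECONDARY", secondary)]}


batch = failover_change_batch(
    "app.example.com.",
    primary={"region": "us-east-1", "lb_dns_name": "primary-lb.example.com.",
             "lb_hosted_zone_id": "ZPLACEHOLDER1"},
    secondary={"region": "us-west-2", "lb_dns_name": "backup-lb.example.com.",
               "lb_hosted_zone_id": "ZPLACEHOLDER2"},
)
print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```

A batch shaped like this could be passed as the `ChangeBatch` argument to the boto3 Route 53 client's `change_resource_record_sets` call, along with the real hosted zone ID of the domain.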
Deployment Strategies
Automate testing and change management to mitigate human-caused failures
Use progressive deployment strategies (e.g. one-box, zonal rollouts) to detect issues early
Implement graceful degradation to preserve core functionality during failures
Leverage feature toggles, caching, and load shedding to maintain service during disruptions
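The three techniques named above (feature toggles, caching, load shedding) can be combined in one request handler. The following is a hypothetical sketch, not a pattern prescribed by the talk: `handle_request`, the `recommendations` feature, and the in-flight limit are invented for illustration, with recommendations standing in for any non-core feature.

```python
class Overloaded(Exception):
    """Raised when a request is shed to protect the service."""


def handle_request(user_id, load_recommendations, cache, flags,
                   in_flight, max_in_flight=100):
    # Load shedding: reject early instead of queueing when over capacity.
    if in_flight >= max_in_flight:
        raise Overloaded("too many requests in flight")

    # Core content is always served; recommendations are optional.
    page = {"user": user_id, "recommendations": []}

    # Feature toggle: operators can switch the optional feature off during an incident.
    if not flags.get("recommendations", True):
        return page

    try:
        recommendations = load_recommendations(user_id)
        cache[user_id] = recommendations  # remember the last good answer
        page["recommendations"] = recommendations
    except Exception:
        # Dependency failed: fall back to the cached value, or to an empty list.
        page["recommendations"] = cache.get(user_id, [])
    return page


cache = {}
print(handle_request("u1", lambda user: ["book", "lamp"], cache, {}, in_flight=3))
```

The core page is returned in every non-shed path, so a failing recommendation backend degrades the experience without taking the page down.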
Key Takeaways
Leverage ELB's multi-AZ resilience features, including health checks and DNS failover
Maintain static stability by pre-provisioning capacity and evenly distributing targets
Monitor both positive and negative metrics to quickly identify the root cause of issues
Design clients to handle failures gracefully through connection management and retries
Adopt a multi-region architecture to increase blast radius isolation and enable faster recovery
Automate testing and deployment to mitigate human-caused failures
Implement graceful degradation strategies to preserve core functionality during disruptions