AWS re:Invent 2025 - Global Resilient Apps: Guide to Multi-AZ/Region Architecture with ELB (NET311)

Resilience and Availability

Resilience is the ability of a workload to recover from infrastructure or service disruptions

Resilience has technical drivers (less downtime, higher availability, lower latency) and business drivers (revenue, customer trust, public image)

Common failure scenarios include disasters, data corruption, core infrastructure issues, and configuration/deployment problems

Disaster recovery focuses on restoring service from backups after an outage, which is typically slower, while high availability aims to keep systems running across multiple sites

Multi-AZ Resilience with ELB

Elastic Load Balancing (ELB) provides resilience by scaling, distributing traffic across targets, and performing health checks

ELB publishes healthy zone IPs in DNS, using a 60-second TTL to enable fast failover

ELB uses a two-tier health check system:

Route 53 checks the load balancer nodes
Load balancer nodes check the targets

This allows ELB to reroute traffic from unhealthy targets or zones during failures

Configuring early intervention thresholds (e.g. 30% of targets unhealthy) can trigger failover before a zone fails completely
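The threshold idea can be illustrated with a small sketch. It is not ELB's implementation: the function name `zone_is_healthy` and the `max_unhealthy_percent` parameter are hypothetical, and the 30% default simply mirrors the example figure above.

```python
def zone_is_healthy(healthy: int, total: int, max_unhealthy_percent: int = 30) -> bool:
    """Return False once the unhealthy share of targets reaches the threshold.

    Hypothetical helper: integer arithmetic avoids float rounding at the boundary.
    """
    if total <= 0:
        return False  # a zone with no registered targets cannot serve traffic
    unhealthy = total - healthy
    return unhealthy * 100 < max_unhealthy_percent * total


if __name__ == "__main__":
    # 2 of 10 targets down: still below the threshold, zone stays in service.
    print(zone_is_healthy(healthy=8, total=10))
    # 3 of 10 targets down: threshold reached, zone is treated as unhealthy
    # even though 7 targets are still passing health checks.
    print(zone_is_healthy(healthy=7, total=10))
```

The point of the early threshold is visible in the second call: the zone is taken out while most of its targets still work, instead of waiting until none do.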

Cross-Zone Load Balancing

Cross-zone load balancing distributes traffic across all targets, even in different zones

This helps avoid disproportionate load on zones with fewer targets
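A short calculation shows the effect. This sketch assumes clients are spread evenly across zones by DNS and that each load balancer node spreads its requests evenly across the targets it can reach; the function name `per_target_share` and the zone names are hypothetical.

```python
from __future__ import annotations


def per_target_share(targets_per_zone: dict[str, int], cross_zone: bool) -> dict[str, float]:
    """Fraction of total traffic that a single target in each zone receives."""
    active = {zone: n for zone, n in targets_per_zone.items() if n > 0}
    if not active:
        return {}
    if cross_zone:
        # Every node can reach every target, so all targets share traffic equally.
        total_targets = sum(active.values())
        return {zone: 1 / total_targets for zone in active}
    # Each zone receives an equal slice, split only among its own targets.
    zone_share = 1 / len(active)
    return {zone: zone_share / n for zone, n in active.items()}


if __name__ == "__main__":
    fleet = {"az-a": 2, "az-b": 8}
    print(per_target_share(fleet, cross_zone=False))
    print(per_target_share(fleet, cross_zone=True))
```

With two targets in one zone and eight in the other, each target in the smaller zone carries four times the load of a target in the larger zone when cross-zone is off, and the same load when it is on.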

However, keeping an equal number of healthy targets in each zone is still recommended for static stability
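One way to reason about static stability is to size each zone so that the remaining zones can carry peak load without scaling. The sketch below is an illustration under that assumption, not a formula from the talk; `targets_per_zone` and its parameters are hypothetical names.

```python
import math


def targets_per_zone(peak_targets_needed: int, zones: int, zones_lost: int = 1) -> int:
    """Targets to pre-provision in each zone so the surviving zones cover peak load."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_targets_needed / surviving)


if __name__ == "__main__":
    per_zone = targets_per_zone(peak_targets_needed=12, zones=3)
    # Total fleet size is per_zone * 3; the headroom is the price of not
    # having to scale up during a zonal failure.
    print(per_zone, per_zone * 3)
```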

DNS Failover Mechanisms

ELB removes unhealthy zone IPs from DNS to route traffic away from failed zones

When no zone is healthy, ELB fails open: it returns the IPs of all zones, healthy or not, rather than returning nothing and guaranteeing a complete outage

This can mask other failures, so monitoring for "fail open" mode is important

Configurable health thresholds (e.g. 30% unhealthy) can trigger earlier failover
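The DNS behaviour described in this section can be modelled in a few lines. This is a simplified sketch, not ELB's actual logic: `dns_answer` is a hypothetical function, and the addresses come from the documentation range 192.0.2.0/24.

```python
from __future__ import annotations


def dns_answer(zone_ips: dict[str, str], zone_healthy: dict[str, bool]) -> tuple[list[str], bool]:
    """Return (IPs to publish, failed_open flag).

    Healthy zones are published; if none are healthy, all IPs are returned
    (fail open) and the flag is set so that monitoring can alert on it.
    """
    healthy_ips = [ip for zone, ip in zone_ips.items() if zone_healthy.get(zone, False)]
    if healthy_ips:
        return healthy_ips, False
    return list(zone_ips.values()), True


ZONE_IPS = {"az-a": "192.0.2.10", "az-b": "192.0.2.20", "az-c": "192.0.2.30"}

if __name__ == "__main__":
    # One failed zone: its IP is withdrawn from the answer.
    print(dns_answer(ZONE_IPS, {"az-a": True, "az-b": False, "az-c": True}))
    # Every zone failing: all IPs are returned and the fail-open flag is set.
    print(dns_answer(ZONE_IPS, {"az-a": False, "az-b": False, "az-c": False}))
```

Returning the flag alongside the answer reflects the monitoring advice above: a fail-open answer looks normal to clients, so it needs its own signal.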

Observability and Monitoring

Monitor both negative (errors) and positive (successful requests, healthy hosts) metrics

Use CloudWatch composite alarms to correlate issues across the stack

Analyze single-zone vs multi-zone failures to identify root causes
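A first triage step might look like the sketch below. The function name, the 5% threshold, and the interpretation of the results are assumptions for illustration: a single failing zone often points at zonal infrastructure, while several failing zones suggest a shared cause such as a deployment or a dependency.

```python
from __future__ import annotations


def classify_failure(error_rate_by_zone: dict[str, float], threshold: float = 0.05) -> str:
    """Label an incident as healthy, single-zone, or multi-zone by per-zone error rate."""
    failing = sorted(zone for zone, rate in error_rate_by_zone.items() if rate > threshold)
    if not failing:
        return "healthy"
    if len(failing) == 1:
        return f"single-zone:{failing[0]}"
    return "multi-zone"


if __name__ == "__main__":
    print(classify_failure({"az-a": 0.01, "az-b": 0.20, "az-c": 0.02}))
    print(classify_failure({"az-a": 0.30, "az-b": 0.25, "az-c": 0.02}))
```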

Client Best Practices

Clients should properly handle connection management, retries, and exponential backoff

This enables balanced connections, graceful failure handling, and faster recovery
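A minimal client-side sketch of retries with exponential backoff follows. The random jitter is a common addition that is assumed here rather than taken from the talk, and `call_with_retries` with its parameters is a hypothetical helper, not part of any AWS SDK.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with capped exponential backoff.

    Each wait is a random value between 0 and the current ceiling so that many
    clients recovering at once do not retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("connection reset")
        return "ok"

    print(call_with_retries(flaky), "after", attempts["count"], "attempts")
```

The `sleep` parameter is injectable so the behaviour can be tested without real delays.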

Multi-Region Resilience

A multi-region architecture provides an additional layer of blast radius isolation

Key challenges are aligning stakeholders and keeping the architecture simple

Leverage global services like Route 53, CloudFront, and Global Accelerator for multi-region traffic routing

Use Route 53 failover records to shift traffic between primary and backup regions
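The decision a failover record pair encodes can be sketched as a small function. This models the idea only and does not call Route 53; the hostnames are hypothetical, and returning the primary when neither endpoint is healthy is an assumption of this sketch.

```python
def resolve_failover(primary: str, secondary: str,
                     primary_healthy: bool, secondary_healthy: bool = True) -> str:
    """Pick the endpoint to answer with for a primary/backup record pair."""
    if primary_healthy:
        return primary
    if secondary_healthy:
        return secondary
    # Assumption: with nothing healthy, answer with the primary rather than nothing.
    return primary


if __name__ == "__main__":
    primary = "app.us-east-1.example.com"    # hypothetical endpoint
    secondary = "app.us-west-2.example.com"  # hypothetical endpoint
    print(resolve_failover(primary, secondary, primary_healthy=True))
    print(resolve_failover(primary, secondary, primary_healthy=False))
```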

Implement DNS load shedding to gracefully degrade during overload

Deployment Strategies

Automate testing and change management to mitigate human-caused failures

Use progressive deployment strategies (e.g. one-box, zonal rollouts) to detect issues early
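A progressive rollout can be sketched as a loop that deploys one stage at a time and stops at the first unhealthy stage. The stage names and the `deploy` and `is_healthy` callables are hypothetical placeholders for real deployment and monitoring hooks.

```python
from __future__ import annotations

from typing import Callable


def progressive_rollout(stages: list[str],
                        deploy: Callable[[str], None],
                        is_healthy: Callable[[str], bool]) -> tuple[list[str], str | None]:
    """Deploy stage by stage; return (completed stages, failed stage or None)."""
    completed = []
    for stage in stages:
        deploy(stage)
        if not is_healthy(stage):
            # Halt here so later stages never receive the bad change.
            return completed, stage
        completed.append(stage)
    return completed, None


if __name__ == "__main__":
    health = {"one-box": True, "az-a": True, "az-b": False, "az-c": True}
    deployed = []
    result = progressive_rollout(["one-box", "az-a", "az-b", "az-c"],
                                 deployed.append, lambda stage: health[stage])
    print(result, deployed)
```

In this run the rollout halts at the third stage, so the last zone is never touched and keeps serving the previous version.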

Implement graceful degradation to preserve core functionality during failures

Leverage feature toggles, caching, and load shedding to maintain service during disruptions
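As one illustration of graceful degradation, the sketch below serves a cached or default response when the live dependency fails. The function name `fetch_with_fallback` and its arguments are hypothetical, and catching every exception is a simplification that a real service would narrow.

```python
def fetch_with_fallback(fetch_live, cache, key, default=()):
    """Return (value, source), where source is 'live', 'cached', or 'default'."""
    try:
        value = fetch_live(key)
    except Exception:  # simplification: a real service would catch specific errors
        if key in cache:
            return cache[key], "cached"  # stale but useful
        return list(default), "default"  # core page still renders
    cache[key] = value  # refresh the cache on every successful call
    return value, "live"


if __name__ == "__main__":
    cache = {}

    def dependency_down(key):
        raise TimeoutError("dependency unavailable")

    print(fetch_with_fallback(lambda key: ["item-1"], cache, "user-1"))
    print(fetch_with_fallback(dependency_down, cache, "user-1"))
    print(fetch_with_fallback(dependency_down, cache, "user-2"))
```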

Key Takeaways

Leverage ELB's multi-AZ resilience features, including health checks and DNS failover

Maintain static stability by pre-provisioning capacity and evenly distributing targets

Monitor both positive and negative metrics to quickly identify the root cause of issues

Design clients to handle failures gracefully through connection management and retries

Adopt a multi-region architecture to increase blast radius isolation and enable faster recovery

Automate testing and deployment to mitigate human-caused failures

Implement graceful degradation strategies to preserve core functionality during disruptions
