AWS re:Invent 2025 - Global Resilient Apps: Guide to Multi-AZ/Region Architecture with ELB (NET311)

Global Resilient Apps: Guide to Multi-AZ/Region Architecture with ELB

Resilience and Availability

  • Resilience is the ability of a workload to recover from infrastructure or service disruptions
  • Resilience has technical drivers (less downtime, higher availability, lower latency) and business drivers (revenue, customer trust, public image)
  • Common failure scenarios include disasters, data corruption, core infrastructure issues, and configuration/deployment problems
  • Disaster recovery focuses on restoring from backups, which usually means longer recovery times, while high availability aims to keep systems running across multiple sites

Multi-AZ Resilience with ELB

  • ELB provides resilience by scaling, distributing traffic, and performing health checks
  • ELB publishes healthy zone IPs in DNS, using a 60-second TTL to enable fast failover
  • ELB uses a two-tier health check system:
    • Route 53 checks the load balancer nodes
    • Load balancer nodes check the targets
  • This allows ELB to reroute traffic from unhealthy targets or zones during failures
  • Configuring early intervention thresholds (e.g. 30% unhealthy) can trigger failover away from a degraded zone before it fails completely

Cross-Zone Load Balancing

  • Cross-zone load balancing distributes traffic across all targets, even in different zones
  • This helps avoid disproportionate load on zones with fewer targets
  • However, keeping an equal number of healthy targets in each zone is still recommended for static stability
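
To make the effect concrete, the sketch below computes each target's share of traffic with cross-zone load balancing off and on, and then sizes a fleet for static stability. It is a simplified model, not ELB's actual algorithm: it assumes DNS splits clients evenly across zones and that each load balancer node spreads its share evenly over the targets it can reach. The zone names, target counts, and function names are hypothetical.

```python
import math


def per_target_share(targets_per_zone, cross_zone):
    """Return the fraction of total traffic one target in each zone receives.

    Assumes every zone has at least one target and that DNS splits
    clients evenly across zones (an idealization).
    """
    zones = len(targets_per_zone)
    total_targets = sum(targets_per_zone.values())
    shares = {}
    for zone, count in targets_per_zone.items():
        if cross_zone:
            # Every node can reach every target, so load is uniform.
            shares[zone] = 1 / total_targets
        else:
            # A zone's node sends its 1/zones share to local targets only.
            shares[zone] = (1 / zones) / count
    return shares


def targets_per_zone_for_static_stability(peak_targets, zones):
    """Targets to pre-provision per zone so the fleet survives losing one zone."""
    if zones < 2:
        raise ValueError("need at least two zones to tolerate a zone loss")
    return math.ceil(peak_targets / (zones - 1))


fleet = {"az-a": 2, "az-b": 8}  # hypothetical, deliberately unbalanced
off = per_target_share(fleet, cross_zone=False)
on = per_target_share(fleet, cross_zone=True)
print(off)  # {'az-a': 0.25, 'az-b': 0.0625}
print(on)   # {'az-a': 0.1, 'az-b': 0.1}
print(targets_per_zone_for_static_stability(peak_targets=6, zones=3))  # 3
```

With cross-zone off, the two targets in the smaller zone each carry four times the load of a target in the larger zone. The last line shows the pre-provisioning arithmetic: if peak load needs 6 targets, running 3 in each of 3 zones leaves 6 in service after a zone is lost, with no scaling action required.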

DNS Failover Mechanisms

  • ELB removes unhealthy zone IPs from DNS to route traffic away from failed zones
  • If every zone is unhealthy, ELB fails open: it returns all IPs, healthy or not, rather than returning nothing and causing a complete outage
  • This can mask other failures, so monitoring for "fail open" mode is important
  • Configurable health thresholds (e.g. 30% unhealthy) can trigger earlier failover
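
The failover and fail-open behavior described above can be modeled in a few lines. This is a rough sketch rather than ELB's internal logic: the function name, the input shape, and the reading of the 30% figure as "a zone is failed once at least 30% of its targets are unhealthy" are all assumptions made for illustration.

```python
def published_zones(zone_health, max_unhealthy_fraction=0.3):
    """Pick the zones whose IPs would be returned in DNS.

    zone_health maps zone name -> (healthy_targets, total_targets).
    A zone counts as failed once its unhealthy fraction reaches the
    threshold (assumed semantics). Returns (zones, failed_open).
    """
    healthy = []
    for zone, (ok, total) in zone_health.items():
        # A zone with no registered targets is treated as fully unhealthy.
        unhealthy_fraction = (total - ok) / total if total else 1.0
        if unhealthy_fraction < max_unhealthy_fraction:
            healthy.append(zone)
    if healthy:
        return sorted(healthy), False
    # Fail open: every zone looks bad, so publish all of them anyway.
    return sorted(zone_health), True


# One degraded zone is pulled from DNS before it fails completely.
partial = published_zones({"az-a": (10, 10), "az-b": (6, 10), "az-c": (9, 10)})
print(partial)  # (['az-a', 'az-c'], False)

# Every zone is below the bar, so all IPs are returned (fail open).
total_outage = published_zones({"az-a": (0, 10), "az-b": (1, 10)})
print(total_outage)  # (['az-a', 'az-b'], True)
```

The second return value is the signal worth alarming on: traffic still flows in fail-open mode, so without an explicit check the underlying failure is easy to miss.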

Observability and Monitoring

  • Monitor both negative (errors) and positive (successful requests, healthy hosts) metrics
  • Use CloudWatch composite alarms to correlate issues across the stack
  • Analyze single-zone vs multi-zone failures to identify root causes
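
As an illustration of the single-zone versus multi-zone analysis, the sketch below combines one negative signal (error rate) and one positive signal (healthy host count) per zone and labels the incident by how many zones are impaired. The `ZoneMetrics` type, the thresholds, and the zone names are hypothetical; in practice the inputs would come from per-zone metrics.

```python
from dataclasses import dataclass


@dataclass
class ZoneMetrics:
    error_rate: float    # negative signal: fraction of failed requests
    healthy_hosts: int   # positive signal: targets passing health checks


def classify_failure(metrics_by_zone, max_error_rate=0.05, min_healthy_hosts=1):
    """Label an incident by the number of impaired zones.

    A zone is impaired if its error rate is too high OR it has too few
    healthy hosts. Returns (scope, impaired_zones).
    """
    impaired = sorted(
        zone
        for zone, m in metrics_by_zone.items()
        if m.error_rate > max_error_rate or m.healthy_hosts < min_healthy_hosts
    )
    if not impaired:
        scope = "healthy"
    elif len(impaired) == 1:
        scope = "single-zone"
    else:
        scope = "multi-zone"
    return scope, impaired


snapshot = {
    "az-a": ZoneMetrics(error_rate=0.01, healthy_hosts=4),
    "az-b": ZoneMetrics(error_rate=0.40, healthy_hosts=4),
    "az-c": ZoneMetrics(error_rate=0.02, healthy_hosts=4),
}
result = classify_failure(snapshot)
print(result)  # ('single-zone', ['az-b'])
```

One common reading of the result: a single-zone label points toward a zonal infrastructure problem that shifting traffic can mitigate, while a multi-zone label suggests a shared cause such as a bad deployment or a failing dependency.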

Client Best Practices

  • Clients should properly handle connection management, retries, and exponential backoff
  • This enables balanced connections, graceful failure handling, and faster recovery
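
A minimal client-side sketch of retries with capped exponential backoff follows. It uses "full jitter" (sleeping a random amount up to the current ceiling), which is one common variant and not necessarily the one the talk had in mind. The function name, defaults, and the choice to retry only on `ConnectionError` are assumptions.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ConnectionError with jittered backoff.

    sleep and rng are injectable so the behavior can be tested without
    real waiting or randomness.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Ceiling doubles each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(ceiling * rng())
```

Capping the number of attempts matters as much as the backoff itself: unbounded retries from many clients can keep a recovering service overloaded.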

Multi-Region Resilience

  • Multi-region provides an additional layer of blast radius isolation
  • Key challenges are aligning stakeholders and keeping the architecture simple
  • Leverage global services like Route 53, CloudFront, and Global Accelerator for multi-region traffic routing
  • Use Route 53 failover records to shift traffic between primary and backup regions
  • Implement DNS load shedding to gracefully degrade during overload
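
For the failover records mentioned above, the sketch below builds the change batch for a primary/secondary pair of alias records pointing at a load balancer in each Region. It only constructs the dictionary in the shape the Route 53 `ChangeResourceRecordSets` API expects and does not call AWS. The record name, DNS names, and hosted zone IDs are placeholders, and the helper name is hypothetical.

```python
def failover_change_batch(record_name, primary, secondary):
    """Build a Route 53 change batch with PRIMARY and SECONDARY alias records.

    primary and secondary are dicts holding each load balancer's
    "dns_name" and "hosted_zone_id" (placeholder values in this sketch).
    """
    def record(role, endpoint):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "SetIdentifier": f"{role.lower()}-region",
                "Failover": role,
                "AliasTarget": {
                    "HostedZoneId": endpoint["hosted_zone_id"],
                    "DNSName": endpoint["dns_name"],
                    # Let Route 53 follow the load balancer's own health.
                    "EvaluateTargetHealth": True,
                },
            },
        }

    return {"Changes": [record("PRIMARY", primary), record("SECONDARY", secondary)]}


batch = failover_change_batch(
    "app.example.com.",
    primary={"dns_name": "primary-lb.example.", "hosted_zone_id": "ZPRIMARYPLACEHOLDER"},
    secondary={"dns_name": "backup-lb.example.", "hosted_zone_id": "ZBACKUPPLACEHOLDER"},
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
# ['PRIMARY', 'SECONDARY']
```

With `EvaluateTargetHealth` enabled, Route 53 answers with the secondary record only when the primary alias target is considered unhealthy, which ties the Regional failover to the same health signals the load balancer already produces.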

Deployment Strategies

  • Automate testing and change management to mitigate human-caused failures
  • Use progressive deployment strategies (e.g. one-box, zonal rollouts) to detect issues early
  • Implement graceful degradation to preserve core functionality during failures
  • Leverage feature toggles, caching, and load shedding to maintain service during disruptions
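
A progressive rollout can be sketched as a loop that deploys one stage at a time and stops at the first unhealthy one. The stage names and the `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical stand-ins for a real deployment system, and rolling back every stage on failure is one possible policy, not the only one.

```python
def progressive_rollout(stages, deploy, is_healthy, rollback):
    """Deploy stage by stage; roll everything back at the first unhealthy stage.

    Returns the list of stages left deployed (empty if rolled back).
    """
    done = []
    for stage in stages:
        deploy(stage)
        if not is_healthy(stage):
            # Undo the bad stage and everything before it, newest first.
            for deployed in reversed(done + [stage]):
                rollback(deployed)
            return []
        done.append(stage)
    return done


# Example: the change looks fine on one box and one zone, then fails in az-b.
log = []
outcome = progressive_rollout(
    stages=["one-box", "az-a", "az-b", "az-c"],
    deploy=lambda s: log.append(("deploy", s)),
    is_healthy=lambda s: s != "az-b",
    rollback=lambda s: log.append(("rollback", s)),
)
print(outcome)  # []
print(log[-3:])  # [('rollback', 'az-b'), ('rollback', 'az-a'), ('rollback', 'one-box')]
```

Because the rollout stopped at az-b, az-c never received the change, which is the point of ordering stages from smallest to largest blast radius.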

Key Takeaways

  • Leverage ELB's multi-AZ resilience features, including health checks and DNS failover
  • Maintain static stability by pre-provisioning capacity and evenly distributing targets
  • Monitor both positive and negative metrics to quickly identify the root cause of issues
  • Design clients to handle failures gracefully through connection management and retries
  • Adopt a multi-region architecture to increase blast radius isolation and enable faster recovery
  • Automate testing and deployment to mitigate human-caused failures
  • Implement graceful degradation strategies to preserve core functionality during disruptions
